
Performance improvements for training large datasets

@minimaxir released this 02 Jun 04:22

The TokenDataset is now backed by numpy, which means that aitextgen is now compatible with larger (>1GB) datasets without going OOM on the host machine!

  • Loading the dataset uses preallocated numpy arrays populated by tokenized minibatches, ensuring constant memory usage.
  • Training also has constant memory usage on the host system, thanks to native numpy/PyTorch integration that avoids creating copied arrays.
  • Loading the dataset now has a progress bar!
  • For single texts, aitextgen uses a trick to parse the text as multiple texts (delimited by newlines), which allows multithreaded tokenization at the minor cost of slightly different tokenization at newline boundaries. (To disable this behavior and parse the text as a single text, set text_delim = "\r"; see the sketch after this list.)
  • Smaller file sizes when compressing TokenDatasets.
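Below is a minimal sketch of what loading and training on a large dataset might look like with this release. It assumes the standard TokenDataset/aitextgen workflow; the file name is hypothetical and parameter values are illustrative, not prescribed by these notes.

```python
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset

# Tokenize a large plain-text corpus; tokens are written into a preallocated
# numpy array in minibatches, so host memory usage stays roughly constant.
data = TokenDataset("large_corpus.txt", block_size=64)  # hypothetical file name

# To parse the file as one continuous text instead of newline-delimited texts,
# pass a delimiter that never appears in the corpus:
# data = TokenDataset("large_corpus.txt", block_size=64, text_delim="\r")

ai = aitextgen()  # default GPT-2 (124M) model
ai.train(data, batch_size=16, num_steps=5000)
```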

Additionally, the progress bars (for both train() and dataset loading) now refresh every 10 steps by default, which avoids extra bandwidth usage when using a cloud-based notebook (e.g. Colab). Unexpectedly, this change also results in ~25% faster training speeds when using a GPU.
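If you want more frequent updates when training locally, the refresh interval can presumably be overridden at call time; the progress_bar_refresh_rate parameter name below is an assumption based on the underlying pytorch-lightning Trainer, not confirmed by these notes.

```python
# Assumed parameter: refresh the training progress bar every step instead of
# every 10 steps (useful locally, wasteful on a cloud notebook).
ai.train(data, batch_size=16, num_steps=5000, progress_bar_refresh_rate=1)
```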

Breaking Changes

  • Datasets saved with previous versions cannot be loaded. This is a side effect of the package being in beta, and not something I intend to break often.
  • The shuffle/seed options on TokenDatasets no longer work; shuffle your data before loading the dataset, as shown below.
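Since shuffling is no longer handled by TokenDataset, one simple approach is to shuffle the source file yourself before building the dataset. A minimal sketch with plain Python (file names are hypothetical):

```python
import random

random.seed(42)  # reproducible shuffle, replacing the old seed parameter

with open("large_corpus.txt", encoding="utf-8") as f:
    lines = f.readlines()
random.shuffle(lines)

with open("large_corpus_shuffled.txt", "w", encoding="utf-8") as f:
    f.writelines(lines)
```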