
Performance improvements for training large datasets

@minimaxir released this 02 Jun 04:22

The TokenDataset is now backed by numpy, which means that aitextgen is now compatible with larger (>1GB) datasets without going OOM on the host machine!

  • Loading the dataset uses preallocated numpy arrays populated by tokenized minibatches, ensuring constant memory usage.
  • Training also has constant memory usage on the host system, thanks to native numpy/PyTorch integration that avoids creating copied arrays.
  • Loading the dataset now has a progress bar!
  • For single texts, aitextgen uses a trick to parse the text as multiple texts (delimited by newlines), which allows multithreaded tokenization at the minor cost of slightly different tokenization at newline boundaries. (To disable this behavior and parse the text as a single text, set text_delim = "\r"; see the sketch after this list.)
  • Smaller file sizes when compressing TokenDatasets.
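Below is a minimal sketch of what loading and training on a large dataset might look like with this release. It assumes the standard TokenDataset/aitextgen workflow; the file name is hypothetical and parameter values are illustrative, not prescribed by these notes.

```python
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset

# Tokenize a large plain-text corpus; tokens are written into a preallocated
# numpy array in minibatches, so host memory usage stays roughly constant.
data = TokenDataset("large_corpus.txt", block_size=64)  # hypothetical file name

# To parse the file as one continuous text instead of newline-delimited texts,
# pass a delimiter that never appears in the corpus:
# data = TokenDataset("large_corpus.txt", block_size=64, text_delim="\r")

ai = aitextgen()  # default GPT-2 (124M) model
ai.train(data, batch_size=16, num_steps=5000)
```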

Additionally, the progress bars (for both train() and dataset loading) now refresh every 10 steps by default, which avoids extra bandwidth usage when using a cloud-based notebook (e.g. Colab). Unexpectedly, this change also results in ~25% faster training speeds when using a GPU.
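If you want more frequent updates when training locally, the refresh interval can presumably be overridden at call time; the progress_bar_refresh_rate parameter name below is an assumption based on the underlying pytorch-lightning Trainer, not confirmed by these notes.

```python
# Assumed parameter: refresh the training progress bar every step instead of
# every 10 steps (useful locally, wasteful on a cloud notebook).
ai.train(data, batch_size=16, num_steps=5000, progress_bar_refresh_rate=1)
```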

Breaking Changes

  • Datasets saved with previous versions cannot be loaded. This is a side effect of the package being in beta, and not something I intend to break often.
  • The shuffle/seed options on TokenDatasets no longer work; shuffle your data before loading the dataset, as shown below.
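Since shuffling is no longer handled by TokenDataset, one simple approach is to shuffle the source file yourself before building the dataset. A minimal sketch with plain Python (file names are hypothetical):

```python
import random

random.seed(42)  # reproducible shuffle, replacing the old seed parameter

with open("large_corpus.txt", encoding="utf-8") as f:
    lines = f.readlines()
random.shuffle(lines)

with open("large_corpus_shuffled.txt", "w", encoding="utf-8") as f:
    f.writelines(lines)
```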