LLM pre-training using Lance text dataset

Overview

Using a Lance text dataset for pre-training or fine-tuning a Large Language Model is straightforward and memory-efficient. We'll use the wikitext_100K.lance dataset that we created in the Creating text dataset for LLM pre-training example to train a basic GPT2 model from scratch with 🤗 transformers on a 1x A100 GPU. The wikitext dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
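The key to the memory efficiency is that training samples are read as fixed-size token windows on demand, rather than loading the whole corpus into RAM. As a minimal sketch of that windowing logic (in the real example the tokens would be fetched lazily from wikitext_100K.lance via the Lance dataset's random-access reads; here a plain Python list stands in so the sketch is self-contained, and the function names are illustrative, not from the example's code):

```python
# Sketch: fixed-size token windows for causal-LM pre-training.
# `tokens` stands in for the flat token column of the Lance dataset;
# in practice each window would be fetched on demand instead of
# materializing the full corpus in memory.

def sample_window(tokens, start, block_size):
    """Return (input_ids, labels) for one training window.

    Causal LM training predicts the next token, so labels are the
    inputs shifted by one position: we read block_size + 1 tokens
    and split them into two overlapping views.
    """
    chunk = tokens[start : start + block_size + 1]
    return chunk[:-1], chunk[1:]


def num_windows(total_tokens, block_size):
    # Number of non-overlapping windows; the trailing remainder
    # that cannot fill a full window is dropped.
    return (total_tokens - 1) // block_size
```

For example, `sample_window(list(range(10)), 0, 4)` returns `([0, 1, 2, 3], [1, 2, 3, 4])`: the labels are the inputs shifted one token to the right, which is exactly what a GPT2-style model trains against.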

Code and Blog

Open In Colab