This is an experiment built on a fork of smol-gpt that trains a GPT to generate text by predicting the *previous* word/token instead of the next one.
We use the Fineweb 100BT sample for pre-training our base model.
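Mechanically, previous-token prediction is just next-token prediction on reversed sequences: reverse each training stream during preprocessing and train a standard GPT on it. A minimal sketch of that idea (illustrative, not the actual preprocessing code):

```python
import torch

def make_reverse_lm_pair(tokens: torch.Tensor, block_size: int = 1024):
    """Build one (input, target) pair for a previous-token model.

    Reversing the token stream turns 'predict the previous token' into
    the ordinary 'predict the next token' objective.
    """
    rev = tokens.flip(0)            # read the stream back-to-front
    x = rev[:block_size]            # context, in reversed order
    y = rev[1:block_size + 1]      # next token in reversed order
    return x, y                     # == previous token in the original order
```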
- Prepare Dataset
```bash
# This will:
# 1. Download the Fineweb 100BT sample from HuggingFace
# 2. Train a tokenizer (vocab size 8888)
# 3. Preprocess and tokenize the data
python preprocess_xcoax.py --vocab-size 8888 --num-chunks 1000
```
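For reference, the Fineweb 100BT sample can be streamed from HuggingFace like this; the dataset name is the public repo, while the rest is an illustrative sketch rather than what preprocess_xcoax.py actually does:

```python
from datasets import load_dataset

# Stream the 100BT sample instead of downloading ~100B tokens up front.
ds = load_dataset("HuggingFaceFW/fineweb", name="sample-100BT",
                  split="train", streaming=True)

for i, row in enumerate(ds):
    print(row["text"][:80])  # each row carries one raw web document
    if i == 2:
        break
```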
- Train Model
```bash
# Train on Fineweb 100BT
python train_xcoax.py
```
- Sample from Base Model
```bash
python sample_xcoax.py
```
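Note that the base model writes back-to-front, so the sampler flips the generated text before printing it. Conceptually (word-level for illustration; the real script works on tokens):

```python
def unreverse(generated: str) -> str:
    """Flip model output for display: the model emits the last word first."""
    return " ".join(generated.split()[::-1])

print(unreverse("sleeping is cat The"))  # -> "The cat is sleeping"
```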
After pre-training, we fine-tune the model on Open Instruct V1 to create BackChat, an instruction-following model that works in reverse: given a response, it generates an instruction that could have led to that response.
- Prepare Instruction Dataset
```bash
# Process and tokenize the instruction dataset
python preprocess_instruct.py --vocab-size 8888
```
- Finetune Model
```bash
# Finetune the pre-trained model on instruction data
python finetune_xcoax.py --model-path out/xcoax/best_checkpoint.pt
```
- 8888-token vocabulary
- 16 attention heads
- 12-layer transformer
- 1024 embedding dimension
- Training hyperparameters (collected into a config sketch after this list):
  - Batch size: 64
  - Gradient accumulation steps: 4
  - Learning rate: 3e-4 with cosine decay
  - Block size: 1024
  - Mixed precision: bfloat16
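As a config object, the setup above looks roughly like this (class and field names are illustrative, not the actual ones in train_xcoax.py):

```python
from dataclasses import dataclass

@dataclass
class XCoaxConfig:
    # Architecture
    vocab_size: int = 8888
    n_layer: int = 12
    n_head: int = 16
    n_embd: int = 1024
    block_size: int = 1024           # context length in tokens
    # Training
    batch_size: int = 64
    grad_accum_steps: int = 4        # effective batch size: 64 * 4 = 256
    learning_rate: float = 3e-4      # decayed with a cosine schedule
    dtype: str = "bfloat16"          # mixed-precision training
```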
The instruction-tuned model maintains the same architecture as the base model but is fine-tuned on the Open Instruct V1 dataset in a unique way:
- Given a response, it generates the instruction that could have led to that response
- Both response and instruction are processed backwards (word by word)
- Uses special tokens to mark response and instruction sections
- Dataset includes:
  - 51,759 samples from Alpaca
  - 82,599 samples from Self Instruct
  - 18,194 samples from GPT-4 Instruct
  - ...and more instruction-following data
```bash
# Interactive sampling with adjustable parameters
python sample_xcoax.py

# Parameters:
# - temp=X: Set temperature (default 0.8)
# - top_k=X: Set top-k sampling (default 200)
# - tokens=X: Set max tokens to generate (default 500)
```
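Temperature and top-k act on the logits in the usual way; a minimal sketch of that sampling step (not the exact code in sample_xcoax.py):

```python
import torch
import torch.nn.functional as F

def sample_token(logits: torch.Tensor, temp: float = 0.8, top_k: int = 200) -> int:
    """Draw one token id from a 1-D logits vector with temperature and top-k."""
    logits = logits / temp                       # <1 sharpens, >1 flattens
    k = min(top_k, logits.size(-1))
    v, _ = torch.topk(logits, k)
    logits = logits.masked_fill(logits < v[-1], -float("inf"))  # keep top k only
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```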
```bash
# Interactive sampling: provide a response, get an instruction
python sample_xcoax_instruct.py
```
Example interaction:

```text
Response: The cat is sleeping.
Generated Instruction: What is the cat doing?

Response: Python is a high-level programming language.
Generated Instruction: Define what Python is.
```
Sampling parameters are the same as for the base model.
```text
# Original data (one Alpaca sample; the typo "Identity" is in the dataset):
instruction = "Identity the odd one out."
input_text  = "Twitter, Instagram, Telegram"
output      = "Telegram"

# Training format (both sections word-reversed):
<|im_start|><|response|>Telegram<|im_end|>
<|im_start|><|instruction|>Telegram Instagram, Twitter, out one odd the Identity<|im_end|>
```
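A sketch of how such an example could be assembled; `reverse_words` and `build_example` are illustrative helpers, and folding `input_text` into the instruction is an assumption based on the format above:

```python
def reverse_words(text: str) -> str:
    """'The cat sleeps' -> 'sleeps cat The' (real preprocessing may also
    strip trailing punctuation, as in the examples above)."""
    return " ".join(text.split()[::-1])

def build_example(instruction: str, input_text: str, output: str) -> str:
    prompt = f"{instruction} {input_text}".strip()  # assumption: input folded into instruction
    return (
        f"<|im_start|><|response|>{reverse_words(output)}<|im_end|>\n"
        f"<|im_start|><|instruction|>{reverse_words(prompt)}<|im_end|>"
    )
```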
During inference (sketched in code below):

1. The user provides a response.
2. We reverse it word by word: "The cat is sleeping." becomes "sleeping is cat The".
3. We format it as `<|im_start|><|response|>sleeping is cat The<|im_end|>`.
4. The model generates the reversed instruction.
5. We un-reverse the instruction for display.
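End to end, the loop above amounts to a few lines; `generate_reversed` stands in for the actual model call and is illustrative:

```python
def reverse_words(text: str) -> str:
    return " ".join(text.split()[::-1])

def instruction_for(response: str, generate_reversed) -> str:
    """Given a response, ask the reverse model which instruction produced it."""
    prompt = (f"<|im_start|><|response|>{reverse_words(response)}<|im_end|>"
              f"<|im_start|><|instruction|>")
    reversed_instruction = generate_reversed(prompt)  # model writes back-to-front
    return reverse_words(reversed_instruction)        # un-reverse for display
```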
- Project setup
- Base model architecture
- Fineweb 100BT preprocessing
- Pre-training on Fineweb 100BT
- Instruction tuning setup
- BackChat instruction tuning
- Model evaluation and benchmarks