BackGPT & BackChat

This is an experiment built on a fork of smol-gpt that trains a GPT to predict the previous word/token instead of the next one, i.e. to write text backwards.
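The core trick is that a standard next-token objective becomes previous-token prediction if the training sequences are reversed. A minimal sketch of that idea (the function name is illustrative, not the repository's actual preprocessing code):

# Minimal sketch: reversing a tokenized document turns ordinary next-token
# training into previous-token training. Illustrative only; the repository's
# preprocessing may differ in the details.
def make_backward_example(token_ids):
    rev = token_ids[::-1]   # read the document right-to-left
    x = rev[:-1]            # context, in reverse order
    y = rev[1:]             # target: the token that came *before* each context position
    return x, y

# For "the cat sleeps" tokenized as [10, 11, 12], the model trains on
# x = [12, 11] -> y = [11, 10], i.e. it learns to write backwards.
print(make_backward_example([10, 11, 12]))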

BackChat

Training Plan

Phase 1: Pre-training on Fineweb 100BT

We use the Fineweb 100BT sample for pre-training our base model.

  1. Prepare Dataset
# This will:
# 1. Download Fineweb 100BT sample from HuggingFace
# 2. Train tokenizer (vocab size 8888)
# 3. Preprocess and tokenize the data
python preprocess_xcoax.py --vocab-size 8888 --num-chunks 1000
  2. Train Model
# Train on Fineweb 100BT
python train_xcoax.py
  3. Sample from Base Model
python sample_xcoax.py

Phase 2: Instruction Tuning

After pre-training, we fine-tune the model on Open Instruct V1 to create BackChat, an instruction-following model that works in reverse - given a response, it generates the instruction that could have led to that response.

  1. Prepare Instruction Dataset
# Process and tokenize the instruction dataset
python preprocess_instruct.py --vocab-size 8888
  2. Finetune Model
# Finetune the pre-trained model on instruction data
python finetune_xcoax.py --model-path out/xcoax/best_checkpoint.pt

Model Architecture

XCOAX Model (Pre-trained)

  • 8888 token vocabulary
  • 16 attention heads
  • 12-layer transformer
  • 1024 embedding dimension
  • Training hyperparameters (collected into the config sketch after this list):
    • Batch size: 64
    • Gradient accumulation steps: 4
    • Learning rate: 3e-4 with cosine decay
    • Block size: 1024
    • Mixed precision: bfloat16
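
The same settings, gathered into a single config sketch for reference (field names are illustrative and may not match the actual arguments in train_xcoax.py):

# Architecture and training settings from this README as one config dict.
# Field names are illustrative; train_xcoax.py may name them differently.
xcoax_config = dict(
    vocab_size=8888,
    n_layer=12,
    n_head=16,
    n_embd=1024,
    block_size=1024,
    batch_size=64,
    gradient_accumulation_steps=4,
    learning_rate=3e-4,    # with cosine decay
    dtype="bfloat16",      # mixed precision
)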

BackChat Model (Instruction-tuned)

The instruction-tuned model maintains the same architecture as the base model but is fine-tuned on the Open Instruct V1 dataset with a reversed objective (sketched after this list):

  • Given a response, it generates the instruction that could have led to that response
  • Both response and instruction are processed backwards (word by word)
  • Uses special tokens to mark response and instruction sections
  • Dataset includes:
    • 51,759 samples from Alpaca
    • 82,599 samples from Self Instruct
    • 18,194 samples from GPT-4 Instruct
    • And more instruction-following data
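
A rough illustration of how one training example could be assembled (the exact preprocessing in preprocess_instruct.py may differ, especially around punctuation and special-token placement):

# Rough illustration of assembling one BackChat training example.
# The real preprocessing in preprocess_instruct.py may handle details differently.
def reverse_words(text):
    return " ".join(reversed(text.split()))

def build_training_example(instruction, response):
    # Response first, instruction second: the model learns to produce the
    # (reversed) instruction conditioned on the (reversed) response.
    return (
        f"<|im_start|><|response|>{reverse_words(response)}<|im_end|>\n"
        f"<|im_start|><|instruction|>{reverse_words(instruction)}<|im_end|>"
    )

print(build_training_example("What is the cat doing?", "The cat is sleeping."))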

Usage Examples

Base Model (Pre-trained)

# Interactive sampling with adjustable parameters
python sample_xcoax.py

# Parameters:
# - temp=X: Set temperature (default 0.8)
# - top_k=X: Set top-k sampling (default 200)
# - tokens=X: Set max tokens to generate (default 500)
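
For context, temperature and top-k interact roughly like this at each generation step (a generic nanoGPT-style sketch, not necessarily the exact sampler used in sample_xcoax.py):

# Generic sketch of one temperature + top-k sampling step, using the
# defaults listed above. Not necessarily the exact code in sample_xcoax.py.
import torch

def sample_next_token(logits, temp=0.8, top_k=200):
    logits = logits / temp                      # <1 sharpens, >1 flattens the distribution
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
    logits[logits < v[-1]] = -float("inf")      # keep only the top-k candidates
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()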

BackChat (Instruction-tuned)

# Interactive sampling - provide a response, get an instruction
python sample_xcoax_instruct.py

Example interaction:
Response: The cat is sleeping.
Generated Instruction: What is the cat doing?

Response: Python is a high-level programming language.
Generated Instruction: Define what Python is.

# Parameters same as base model

How It Works

Backwards Instruction Format

# Original data:
instruction = "Identify the odd one out."
input_text = "Twitter, Instagram, Telegram"
output = "Telegram"

# Training format:
<|im_start|><|response|>Telegram<|im_end|>
<|im_start|><|instruction|>Twitter Instagram, Telegram out odd the Identify<|im_end|>

# During inference:
1. User provides response
2. We reverse it: "sleeping is cat The"
3. Format: <|im_start|><|response|>sleeping is cat The<|im_end|>
4. Model generates reversed instruction
5. We un-reverse the instruction for display
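
Putting those inference steps together, the wrapper around the model looks roughly like this (illustrative; the actual logic lives in sample_xcoax_instruct.py and may differ):

# Sketch of the reverse/format/un-reverse wrapping described above.
# Illustrative only; sample_xcoax_instruct.py may handle details differently.
def reverse_words(text):
    return " ".join(reversed(text.split()))

def build_prompt(response):
    # Steps 2-3: reverse the user's response, wrap it in special tokens,
    # and leave the instruction section open for the model to fill in.
    return (f"<|im_start|><|response|>{reverse_words(response)}<|im_end|>"
            f"<|im_start|><|instruction|>")

def decode_instruction(generated_text):
    # Step 5: cut at the end marker and un-reverse the generated words.
    return reverse_words(generated_text.split("<|im_end|>")[0])

print(build_prompt("The cat is sleeping"))
print(decode_instruction("doing? cat the is What<|im_end|>"))  # -> "What is the cat doing?"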

Training Progress

  • Project setup
  • Base model architecture
  • Fineweb 100BT preprocessing
  • Pre-training on Fineweb 100BT
  • Instruction tuning setup
  • BackChat instruction tuning
  • Model evaluation and benchmarks
