🧠 TinyLLM — Build an LLM From Scratch

A minimal, heavily-commented GPT-style language model (~10M parameters in the default configuration) for learning purposes. Every component is implemented from scratch: no HuggingFace, no pre-built transformers.

📚 What You'll Learn

🔤 1. BPE Tokenization (`tokenizer.py`)

Byte Pair Encoding (BPE) is the subword tokenization scheme that models such as GPT-2 and GPT-4 use to turn raw text into integer token IDs.

The algorithm:

1. Start with individual characters as the vocabulary
2. Count all adjacent token pairs in the corpus
3. Merge the most frequent pair into a new token
4. Repeat steps 2 and 3 until the target vocab size is reached

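💡 Key insight: "lower" → ["▁low", "er"], "lowest" → ["▁low", "est"]. Frequent subwords are merged into single tokens, while rare words are split into smaller pieces, down to single characters if necessary. This lets the tokenizer handle out-of-vocabulary (OOV) words gracefully.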
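
The four steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the code in `tokenizer.py`: the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical, the corpus is a list of words, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    # Each word becomes a tuple of characters; "▁" marks the start of a word.
    corpus = Counter(tuple("▁" + w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to the whole corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Split a word into subword tokens by replaying the merges in learned order."""
    symbols = tuple("▁" + word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

With a corpus of five copies of "low" and two of "lower", three merges are enough to build the token "▁low", while an unseen word such as "slow" still encodes as individual characters.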

Special tokens:

Token	ID	Purpose
`<PAD>`	0	Padding
`<UNK>`	1	Unknown token
`<BOS>`	2	Beginning of sequence
`<EOS>`	3	End of sequence ← model stops here

🏗️ 2. Transformer Architecture (`model.py`)

🔍 Multi-Head Self-Attention

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Symbol	Role
Q (Query)	"What am I looking for?"
K (Key)	"What information do I have?"
V (Value)	"What do I actually return?"

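- Dividing by sqrt(d_k) keeps the scale of the dot products independent of d_k, so the softmax does not saturate (and gradients do not vanish) in high dimensions
- Causal mask: scores for future tokens are set to -inf, so they receive 0 probability after the softmax (autoregressive)
- Multiple heads: each head can learn a different type of relationship (syntax, semantics, coreference…)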
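
To make the formula and the causal mask concrete, here is a dependency-free sketch of single-head attention on plain Python lists. It is an illustration only, not the `model.py` implementation; the function name `causal_attention` is hypothetical, and instead of writing -inf into the score matrix it simply skips future positions, which gives the same result after the softmax.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of T vectors (lists of floats). Returns the output
    vectors and the T x T attention weight matrix.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Position i may only attend to positions 0..i (the causal mask).
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        # Future positions get exactly zero weight.
        weights.append(w + [0.0] * (len(K) - i - 1))
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(i + 1)) for c in range(len(V[0]))]
        )
    return outputs, weights
```

The first token can only attend to itself, so its attention row is always `[1.0, 0.0, ...]` and its output equals its own value vector.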

⚡ Feed-Forward Network

FFN(x) = GELU(xW₁ + b₁)W₂ + b₂

Applied position-wise: the same two-layer network processes each token independently. It is often interpreted as a key-value "memory" that stores factual associations.
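
A minimal sketch of the FFN formula for a single token vector, in plain Python. It assumes the exact (erf-based) GELU; some models, GPT-2 included, use a tanh approximation instead. The helper names `gelu`, `linear`, and `ffn` are hypothetical and not taken from `model.py`.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, W, b):
    """Compute xW + b for a vector x, a d_in x d_out matrix W, and a bias b."""
    return [
        sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
        for j in range(len(b))
    ]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = GELU(x W1 + b1) W2 + b2, applied to one position."""
    hidden = [gelu(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)
```

To process a whole sequence, call `ffn` once per token with the same weights; that is what "position-wise" means.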

🔧 Other Key Components

Component	What it does
Pre-norm (`x + sublayer(LayerNorm(x))`)	More stable gradients than post-norm; this is the arrangement GPT-2 uses
Residual connections (`x = x + sublayer(x)`)	Prevent vanishing gradients in deep networks
Weight tying	Embedding matrix and LM head share weights → fewer params

🚀 3. Training (`train.py`)

Objective: Causal Language Modeling — given [t1, t2, t3, t4], predict [t2, t3, t4, t5].
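
The shift-by-one relationship between inputs and targets can be sketched as follows. This is an assumption about how a token stream might be chunked, not the logic in `data_utils.py`; the function name `make_training_pairs` is hypothetical.

```python
def make_training_pairs(token_ids, context_length):
    """Cut a token stream into (input, target) pairs for next-token prediction.

    Each target sequence is the input sequence shifted left by one token,
    so position i of the input is trained to predict position i + 1.
    """
    pairs = []
    # Each chunk needs context_length + 1 tokens: the inputs plus one extra target.
    for start in range(0, len(token_ids) - context_length, context_length):
        chunk = token_ids[start:start + context_length + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs
```

For the stream `[1, 2, ..., 9]` and `context_length=4`, this yields `([1, 2, 3, 4], [2, 3, 4, 5])` and `([5, 6, 7, 8], [6, 7, 8, 9])`.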

📈 Learning Rate Schedule

Warmup  →  lr = max_lr × (iter / warmup_iters)
Cosine  →  lr = min_lr + (max_lr - min_lr) × 0.5 × (1 + cos(π × progress))
           where progress goes from 0 at the end of warmup to 1 at the final iteration
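
Both phases fit in one small function. This sketch assumes that `progress` is the fraction of post-warmup iterations completed, `(iter - warmup_iters) / (max_iters - warmup_iters)`, and that the rate stays at `min_lr` after `max_iters`. The function name `get_lr` is hypothetical and may not match `train.py`.

```python
import math


def get_lr(it, max_lr, min_lr, warmup_iters, max_iters):
    """Linear warmup followed by cosine decay from max_lr down to min_lr."""
    if it < warmup_iters:
        # Warmup: ramp linearly from 0 to max_lr.
        return max_lr * it / warmup_iters
    if it >= max_iters:
        # Assumption: hold the learning rate at min_lr once the schedule ends.
        return min_lr
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    return min_lr + (max_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With `max_lr=3e-4`, `min_lr=3e-5`, `warmup_iters=100`, and `max_iters=1000`, the rate reaches its peak at iteration 100 and is halfway between the two bounds at iteration 550.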

🛡️ Gradient Clipping

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Rescales all gradients whenever their combined L2 norm exceeds `max_norm`, which limits the sudden loss spikes that a few very large gradients can cause.
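
What norm-based clipping does can be shown without PyTorch. The sketch below works on plain lists of floats and is only an illustration of the idea; the function name `clip_by_global_norm` is hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their global L2 norm is at most max_norm.

    `grads` is a list of flat gradient lists, one per parameter tensor.
    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # The factor is capped at 1.0, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length shrinks.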

🗜️ Gradient Accumulation

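Simulates a larger batch without the extra activation memory by summing gradients from several micro-batches before a single optimizer step:

optimizer.zero_grad(set_to_none=True)              # start from clean gradients
for micro_step in range(grad_accumulation_steps):
    # `batch` stands for a fresh micro-batch on every iteration
    loss = model(batch) / grad_accumulation_steps  # scale so the summed gradients form an average
    loss.backward()                                # gradients accumulate in .grad
optimizer.step()  # Only step once per "logical" batch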
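
Why dividing each micro-batch loss by the number of steps works can be checked numerically. The sketch below uses a hypothetical one-parameter model with a mean-squared-error loss and equally sized micro-batches; the function names are illustrative only.

```python
def mse_grad(w, batch):
    """Gradient with respect to w of mean((w * x - y) ** 2) over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, accumulation_steps):
    """Accumulate scaled micro-batch gradients, as the training loop does."""
    micro_size = len(batch) // accumulation_steps
    grad = 0.0
    for step in range(accumulation_steps):
        micro = batch[step * micro_size:(step + 1) * micro_size]
        # Dividing by accumulation_steps mirrors `loss / grad_accumulation_steps`.
        grad += mse_grad(w, micro) / accumulation_steps
    return grad
```

For a batch of 8 examples split into 4 micro-batches of 2, the accumulated gradient matches the full-batch gradient up to floating-point rounding.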

⚡ Mixed Precision (bfloat16)

Format	Bits	Range	Precision	Benefit
float32	32	±3.4×10³⁸	~7 digits	Training stable
bfloat16	16	±3.4×10³⁸	~3 digits	2-4× faster, ~50% less VRAM

bfloat16 is preferable to float16 for training: it has the same dynamic range as float32, so loss scaling is usually not needed.
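
The range and precision trade-off follows from the bit layout: bfloat16 keeps the sign and 8 exponent bits of float32 but only 7 of its 23 mantissa bits. The sketch below emulates the conversion by truncating a float32 to its top 16 bits. This is a simplification, since real hardware typically rounds to nearest rather than truncating, and the function name `to_bfloat16` is hypothetical.

```python
import struct


def to_bfloat16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFF0000  # drop the low 16 mantissa bits (truncation, not rounding)
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

For example, `to_bfloat16(3.14159)` returns `3.140625` (about three significant digits survive), while a very large value such as `1e38` stays finite because the exponent bits are untouched.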

🖥️ 4. Multi-GPU Training with DDP (`train.py`)

GPU 0: model copy → forward(batch_shard_0) → backward → gradients ─┐
GPU 1: model copy → forward(batch_shard_1) → backward → gradients ─┤
                                                                     ↓
                                             All-Reduce (NCCL): avg gradients
                                                                     ↓
                                         Both GPUs update weights identically
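
The All-Reduce step in the diagram can be simulated on plain lists. This is a single-process illustration of the averaging, not how NCCL or `train.py` performs it; the function name `all_reduce_mean` is hypothetical.

```python
def all_reduce_mean(per_rank_grads):
    """Average gradients across ranks and hand every rank the same result.

    `per_rank_grads` holds one flat gradient list per GPU.
    """
    world_size = len(per_rank_grads)
    n = len(per_rank_grads[0])
    avg = [sum(g[i] for g in per_rank_grads) / world_size for i in range(n)]
    # Every rank receives its own copy of the averaged gradients.
    return [list(avg) for _ in range(world_size)]
```

Since all ranks start from identical weights and apply identical averaged gradients, their model copies stay in sync after every step.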

torchrun sets these environment variables automatically:

Variable	Meaning
`RANK`	Global process index (0 = master)
`LOCAL_RANK`	GPU index on this node
`WORLD_SIZE`	Total number of processes
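
A training script can read these variables to decide whether it is running under `torchrun`. The sketch below is one possible approach, not necessarily what `train.py` does; the function name `read_dist_env` and the single-process defaults are assumptions.

```python
import os


def read_dist_env(env=None):
    """Read the torchrun variables, falling back to single-process defaults."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return {
        "rank": rank,
        "local_rank": local_rank,
        "world_size": world_size,
        "is_master": rank == 0,   # rank 0 usually handles logging and checkpoints
        "ddp": world_size > 1,    # assumption: more than one process means DDP
    }
```

Passing a dictionary instead of relying on `os.environ` makes the function easy to test, for example `read_dist_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"})`.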

💬 5. Text Generation (`generate.py`)

Autoregressive loop: feed tokens → sample next token → append → repeat → stop at <EOS>

🎲 Sampling Strategies

Strategy	How	Effect
Greedy (temp=0)	Always pick argmax	Deterministic, can be repetitive
Temperature `T < 1`	Sharpen distribution	More confident, less creative
Temperature `T > 1`	Flatten distribution	More random, more creative
Top-k	Sample from top-k tokens only	Blocks very unlikely tokens
Top-p (nucleus)	Sample from smallest set with cumulative prob ≥ p	Adaptive to model certainty
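
The strategies in the table can be combined in one sampling function. This is a plain-Python sketch over a list of logits, not the code in `generate.py`; the function name `sample_next` and its parameters are hypothetical, and top-p is applied to whatever survives the top-k filter.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick the next token index from raw logits."""
    if temperature == 0:
        # Greedy decoding: always take the argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    # T < 1 sharpens the distribution, T > 1 flattens it.
    probs = softmax([x / temperature for x in logits])
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break  # smallest set whose cumulative probability reaches p
        order = kept
    # random.choices renormalizes the remaining weights.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes sampling reproducible, which is handy when comparing settings.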

📦 Custom Q&A Dataset

You can train TinyLLM on your own question-answer data using a simple JSON format.

Format

[
  {
    "id": 1,
    "category": "Identity",
    "question": "Who are you?",
    "answer": "I am TinyLLM, a small but capable language model here to help you!"
  },
  {
    "id": 2,
    "category": "Identity",
    "question": "What is your name?",
    "answer": "My name is TinyLLM."
  }
]

Field	Required	Description
`id`	No	Unique identifier (ignored during training)
`category`	No	Grouping label (ignored during training)
`question`	✅ Yes	The input question text
`answer`	✅ Yes	The expected answer text

How It Works

Each Q&A pair is automatically wrapped with boundary tokens before training:

<BOS> Question: Who are you?
Answer: I am TinyLLM, a small but capable language model here to help you! <EOS>

This teaches the model where an answer ends. Without the <EOS> boundary, the model never learns a stopping point: after answering, it tends to invent another question and keep generating until it hits the token limit.
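
The wrapping step can be sketched as follows. This is an illustration of the format shown above, not the actual code behind `prepare_custom_data`; the function names `format_example` and `load_examples` are hypothetical.

```python
import json


def format_example(record, bos="<BOS>", eos="<EOS>"):
    """Wrap one Q&A record in boundary tokens, using the layout shown above."""
    question = record["question"].strip()
    answer = record["answer"].strip()
    # `id` and `category` are optional and deliberately ignored.
    return f"{bos} Question: {question}\nAnswer: {answer} {eos}"


def load_examples(json_text):
    """Parse a JSON array of Q&A records into training strings."""
    return [format_example(record) for record in json.loads(json_text)]
```

Records that lack `question` or `answer` raise a `KeyError`, which surfaces malformed data early instead of silently training on incomplete pairs.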

Training on Custom Data

from data_utils import prepare_custom_data, create_dataloader

train_ds, val_ds, tokenizer = prepare_custom_data(
    json_path="data/my_dataset.json",
    vocab_size=2000,
    context_length=128,
    force_retrain_tokenizer=True,  # retrain so BOS/EOS appear in the corpus
)

loader = create_dataloader(train_ds, batch_size=8)

Stopping at `<EOS>` During Inference

Your generation loop must honour the <EOS> token:

for _ in range(max_new_tokens):
    logits = model(input_ids)
    next_token = logits[:, -1, :].argmax(dim=-1)
    if next_token.item() == tokenizer.eos_id:
        break                                        # ← stop here!
    input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)

⚡ Quick Start

Install

pip install torch --index-url https://download.pytorch.org/whl/cu121

Train (single GPU)

python train.py

Train (multi-GPU)

torchrun --nproc_per_node=2 train.py

Train with custom settings

torchrun --nproc_per_node=2 train.py \
    --batch_size 32 \
    --max_iters 5000 \
    --d_model 512 \
    --n_layers 8

Generate text

python generate.py \
    --checkpoint checkpoints/latest.pt \
    --prompt "Who are you?" \
    --temperature 0.8 \
    --top_k 50 \
    --max_tokens 300

📐 Model Size Reference

Config	d_model	n_layers	n_heads	vocab_size	context_length	d_ff	Params
🐭 Tiny (default)	384	6	6	10000	256	512	~10M
🐱 Small	512	8	8	10000	256	1024	~22M
🐻 Medium	768	12	12	10000	256	1536	~65M
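
The parameter counts in the table can be approximated from the configuration alone. The sketch below assumes tied embeddings, learned positional embeddings, biases on every linear layer, and two LayerNorms per block plus a final one. These are assumptions about the architecture rather than details confirmed from `model.py`, and the function name `count_params` is hypothetical.

```python
def count_params(d_model, n_layers, vocab_size, context_length, d_ff,
                 tied=True, learned_pos=True):
    """Estimate the parameter count of a GPT-style decoder."""
    token_emb = vocab_size * d_model
    pos_emb = context_length * d_model if learned_pos else 0
    # Q, K, V, and output projections, each d_model x d_model with a bias.
    attn = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff -> d_model, with biases.
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    final_norm = 2 * d_model
    # With weight tying the LM head reuses the token embedding matrix.
    lm_head = 0 if tied else vocab_size * d_model
    return token_emb + pos_emb + n_layers * (attn + ffn + norms) + final_norm + lm_head
```

Under these assumptions the three configurations come out at roughly 9.9M, 22.1M, and 64.6M parameters, in line with the rounded figures in the table.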

🔬 Experiment Ideas

Once the base model is training, try these:

#	Experiment	Where
1	Sinusoidal vs learned positional embeddings	`--use_learned_pos_emb False`
2	Swap GELU for ReLU or SiLU	`model.py` FFN block
3	RoPE positional encoding (used in LLaMA)	Add to `model.py`
4	Flash Attention (drop-in, better memory)	Replace `MultiHeadAttention`
5	Different datasets — Bible, Project Gutenberg	`data_utils.py`
6	Scaling laws — train on 10× more data	Watch val loss curve
7	SwiGLU activation (`SwiGLU(x) = SiLU(xW+b) ⊙ (xV+c)`)	`model.py` FFN block

📖 Key Papers

Paper	Authors	Why read it
Attention Is All You Need	Vaswani et al., 2017	Original Transformer
Language Models are Unsupervised Multitask Learners	Radford et al., 2019	GPT-2
An Image is Worth 16×16 Words	Dosovitskiy et al., 2020	ViT — shows transformers work everywhere
Training Compute-Optimal Large Language Models	Hoffmann et al., 2022	Chinchilla scaling laws
