Training GPT-2 language models from scratch for learning purposes.
- File: Gpt_2_variations.ipynb
  - Dataset: 10 classic books from Project Gutenberg (~1.25M tokens)
  - Models: comparison of 3 positional encoding strategies (see the sinusoidal sketch after this list)
    - Learnable positional embeddings
    - Sinusoidal positional encoding
    - No positional encoding (baseline)
- File: gpt2_ptbr_wikipedia.ipynb
  - Dataset: Portuguese Wikipedia (5-20%, ~15-60M tokens)
  - Model: GPT-2 with learnable positional embeddings (~30M params)
┌─────────────────────────────────────────┐
│ Token Embedding (vocab_size × emb_dim) │
│ + Positional Embedding │
├─────────────────────────────────────────┤
│ Transformer Block × n_layers │
│ ├── LayerNorm │
│ ├── Multi-Head Self-Attention │
│ ├── Residual Connection │
│ ├── LayerNorm │
│ ├── Feed-Forward (4× expansion) │
│ └── Residual Connection │
├─────────────────────────────────────────┤
│ Final LayerNorm │
│ Output Head (weight tied with tok_emb) │
└─────────────────────────────────────────┘
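The diagram maps onto a compact PyTorch module roughly like the sketch below (pre-LayerNorm blocks, residual connections, 4× feed-forward expansion, weight-tied output head). Class names and the use of nn.MultiheadAttention are assumptions for illustration; the notebooks may implement attention by hand.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.norm1 = nn.LayerNorm(cfg["emb_dim"])
        self.attn = nn.MultiheadAttention(cfg["emb_dim"], cfg["n_heads"],
                                          dropout=cfg["drop_rate"], batch_first=True)
        self.norm2 = nn.LayerNorm(cfg["emb_dim"])
        self.ff = nn.Sequential(                                  # feed-forward, 4x expansion
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            nn.GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )
        self.drop = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + self.drop(attn_out)                               # residual connection
        x = x + self.drop(self.ff(self.norm2(x)))                 # residual connection
        return x

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])  # learnable positions
        self.drop = nn.Dropout(cfg["drop_rate"])
        self.blocks = nn.Sequential(*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = nn.LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
        self.out_head.weight = self.tok_emb.weight                # weight tying with token embedding

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))
        x = self.blocks(x)
        return self.out_head(self.final_norm(x))                  # (B, T, vocab_size) logits
```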
GPT_CONFIG = {              # Gpt_2_variations.ipynb (English books)
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 256,
    "n_heads": 4,
    "n_layers": 4,
    "drop_rate": 0.1,
}

GPT_CONFIG = {              # gpt2_ptbr_wikipedia.ipynb (Portuguese, ~30M params)
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 384,
    "n_heads": 6,
    "n_layers": 6,
    "drop_rate": 0.1,
}
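Assuming a GPTModel class like the sketch above, the larger config can be checked against the stated ~30M parameter count:

```python
# Hypothetical check, assuming the GPTModel sketch above and the 384-dim config.
model = GPTModel(GPT_CONFIG)
n_params = sum(p.numel() for p in model.parameters())   # tied weights are counted once
print(f"{n_params / 1e6:.1f}M parameters")              # ~30M for the 6-layer, 384-dim config
```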
- Auto-detect platform: Colab, Kaggle, or local
- Checkpoints: Saved every epoch (resume if disconnected)
- DataParallel: Automatic multi-GPU support
- Text generation: Top-k, top-p, temperature, repetition penalty
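Below is a minimal sketch of how the four generation controls can be combined at each decoding step; the function name, argument defaults, and the penalty formulation are illustrative assumptions, not the notebooks' exact code.

```python
import torch

@torch.no_grad()
def sample_next_token(logits, generated_ids, temperature=1.0, top_k=50,
                      top_p=0.9, repetition_penalty=1.2):
    """Pick the next token id from the last position's (vocab_size,) logits.

    generated_ids: 1D LongTensor of token ids produced so far.
    """
    logits = logits.clone()
    # Repetition penalty: push down logits of tokens already generated.
    for tok in set(generated_ids.tolist()):
        logits[tok] = (logits[tok] / repetition_penalty if logits[tok] > 0
                       else logits[tok] * repetition_penalty)
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    logits = logits / temperature
    # Top-k: keep only the k highest logits.
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative prob exceeds top_p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    cutoff = cum_probs > top_p
    cutoff[1:] = cutoff[:-1].clone()   # shift right so the crossing token is kept
    cutoff[0] = False                  # always keep the highest-probability token
    logits[sorted_idx[cutoff]] = float("-inf")
    # Sample from the filtered distribution.
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```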
gpt-2/
├── README.md
├── Gpt_2_variations.ipynb # English model experiments
├── gpt2_ptbr_wikipedia.ipynb # Portuguese model
├── results.md # Training results log
└── [downloaded books].txt # Gutenberg books (auto-downloaded)
torch>=2.0
tiktoken
datasets
matplotlib