GPT-2 from Scratch

Training GPT-2 language models from scratch for learning purposes.

Projects

1. GPT-2 English (Classic Literature)

  • File: Gpt_2_variations.ipynb
  • Dataset: 10 classic books from Project Gutenberg (~1.25M tokens)
  • Models: comparison of three positional encoding strategies (the sinusoidal variant is sketched after this list)
    • Learnable positional embeddings
    • Sinusoidal positional encoding
    • No positional encoding (baseline)
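
For reference, the sinusoidal variant is a fixed table added to the token embeddings. The sketch below is a minimal PyTorch version using the context_length and emb_dim names from the configs further down; it is illustrative, not the exact notebook code.

import math
import torch

def sinusoidal_positional_encoding(context_length: int, emb_dim: int) -> torch.Tensor:
    """Fixed (non-trainable) sine/cosine position table, added to token embeddings."""
    position = torch.arange(context_length, dtype=torch.float32).unsqueeze(1)           # (T, 1)
    div_term = torch.exp(torch.arange(0, emb_dim, 2).float() * (-math.log(10000.0) / emb_dim))
    pe = torch.zeros(context_length, emb_dim)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe                                      # shape: (context_length, emb_dim)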

2. GPT-2 Portuguese (Wikipedia)

  • File: gpt2_ptbr_wikipedia.ipynb
  • Dataset: Portuguese Wikipedia (5-20% slice, ~15-60M tokens); a loading sketch follows this list
  • Model: GPT-2 with learnable positional embeddings (~30M params)
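
The Wikipedia slice can be pulled with the datasets library listed in the requirements. The call below is only a sketch: the exact dataset id, dump date, and slice size used in the notebook may differ.

from datasets import load_dataset

# Illustrative dataset id and dump; adjust to match the notebook.
wiki_pt = load_dataset("wikimedia/wikipedia", "20231101.pt", split="train[:5%]")
print(wiki_pt[0]["text"][:200])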

Architecture

┌─────────────────────────────────────────┐
│  Token Embedding (vocab_size × emb_dim) │
│  + Positional Embedding                 │
├─────────────────────────────────────────┤
│  Transformer Block × n_layers           │
│  ├── LayerNorm                          │
│  ├── Multi-Head Self-Attention          │
│  ├── Residual Connection                │
│  ├── LayerNorm                          │
│  ├── Feed-Forward (4× expansion)        │
│  └── Residual Connection                │
├─────────────────────────────────────────┤
│  Final LayerNorm                        │
│  Output Head (weight tied with tok_emb) │
└─────────────────────────────────────────┘
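
A pre-norm block matching this diagram looks roughly like the following. It is a sketch built on torch.nn.MultiheadAttention with a causal mask; the notebooks' own attention implementation may differ in detail.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm GPT-2 block: LN -> masked self-attention -> residual, LN -> MLP -> residual."""
    def __init__(self, emb_dim: int, n_heads: int, drop_rate: float):
        super().__init__()
        self.ln1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, dropout=drop_rate, batch_first=True)
        self.ln2 = nn.LayerNorm(emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),   # 4x expansion
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
            nn.Dropout(drop_rate),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                       # residual connection around attention
        x = x + self.mlp(self.ln2(x))          # residual connection around feed-forward
        return x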

Model Configurations

English Model (~16M params)

GPT_CONFIG = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 256,
    "n_heads": 4,
    "n_layers": 4,
    "drop_rate": 0.1,
}

Portuguese Model (~30M params)

GPT_CONFIG = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 384,
    "n_heads": 6,
    "n_layers": 6,
    "drop_rate": 0.1,
}
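
The approximate parameter counts in the headings can be reproduced from the configs, assuming the output head is weight-tied to the token embedding as shown in the architecture diagram (the helper below is illustrative, not notebook code):

def count_params(cfg: dict) -> int:
    """Rough estimate; the tied output head adds no extra weights."""
    V, T, D, L = cfg["vocab_size"], cfg["context_length"], cfg["emb_dim"], cfg["n_layers"]
    tok_emb = V * D                      # also serves as the output head (weight tying)
    pos_emb = T * D
    attn = 4 * D * D + 4 * D             # Q, K, V, output projections + biases
    mlp = 8 * D * D + 5 * D              # 4x expansion + biases
    norms = 2 * 2 * D                    # two LayerNorms per block (scale + shift)
    return tok_emb + pos_emb + L * (attn + mlp + norms) + 2 * D   # + final LayerNorm

# English config    -> ~16.1M parameters
# Portuguese config -> ~30.0M parameters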

Features

  • Auto-detect platform: Colab, Kaggle, or local
  • Checkpoints: Saved every epoch (resume if disconnected)
  • DataParallel: Automatic multi-GPU support
  • Text generation: top-k, top-p, temperature, and repetition penalty (combined in the sketch below)
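
The decoding step below sketches how those sampling controls combine for a single next-token draw; the function name and exact repetition-penalty formulation are illustrative, not necessarily what the notebooks use.

import torch

@torch.no_grad()
def sample_next_token(logits, generated, temperature=1.0, top_k=50, top_p=0.9, repetition_penalty=1.2):
    """One decoding step over a 1-D logits vector; `generated` holds prior token ids."""
    logits = logits.clone()
    # Repetition penalty: discourage tokens already present in the output
    for tok in set(generated.tolist()):
        logits[tok] = logits[tok] / repetition_penalty if logits[tok] > 0 else logits[tok] * repetition_penalty
    logits = logits / temperature
    # Top-k: keep only the k largest logits
    if top_k:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_value] = float("-inf")
    # Top-p (nucleus): keep the smallest set of tokens whose probabilities sum to p
    if top_p:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        remove[1:] = remove[:-1].clone()   # shift so the first token past the threshold is kept
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)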

Project Structure

gpt-2/
├── README.md
├── Gpt_2_variations.ipynb     # English model experiments
├── gpt2_ptbr_wikipedia.ipynb  # Portuguese model
├── results.md                 # Training results log
└── [downloaded books].txt     # Gutenberg books (auto-downloaded)

Requirements

torch>=2.0
tiktoken
datasets
matplotlib
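
Both configs use the 50,257-token GPT-2 BPE vocabulary, which tiktoken exposes as the gpt2 encoding; a quick sanity check:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")      # GPT-2 BPE, 50,257 tokens
ids = tokenizer.encode("Hello, world!")
print(ids, tokenizer.decode(ids))
assert tokenizer.n_vocab == 50257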
