A small Transformer-based language model built from scratch in PyTorch and trained on CPU-only hardware.
This project was created to understand how modern language models work at a fundamental level. Instead of using pre-trained models or external APIs, every major component was implemented manually, including tokenization, self-attention, Transformer blocks, training, and text generation.
The model was trained on the Tiny Shakespeare dataset and generates Shakespeare-like text at the character level.
I wanted to understand how language models work internally rather than relying on pre-trained models or APIs. My goal was to build a complete Transformer from scratch that could be trained on modest hardware and still demonstrate the core principles used in modern language models.
- Character-level language modeling
- Transformer architecture with causal self-attention
- Multi-head attention
- Positional embeddings
- Autoregressive text generation
- CPU-only training and inference
- Implemented entirely in PyTorch
| Parameter | Value |
|---|---|
| Embedding Size | 32 |
| Attention Heads | 2 |
| Transformer Layers | 2 |
| Context Length | 64 |
| Vocabulary Type | Character-level |
| Parameters | 31,553 |
Input Characters
↓
Token Embeddings
+
Positional Embeddings
↓
2 Transformer Blocks
├─ Multi-Head Self-Attention
├─ Feed Forward Network
├─ Residual Connections
└─ Layer Normalization
↓
Language Modeling Head
↓
Predicted Next Character
- CPU: Intel Core i5-3320M
- Cores / Threads: 2 Cores, 4 Threads
- RAM: 8 GB
- Operating System: Debian 13 (Trixie)
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 1e-3 |
| Batch Size | 16 |
| Training Steps | 3000 |
| Device | CPU |
| Metric | Value |
|---|---|
| Initial Loss | 4.41 |
| Final Loss | 2.11 |
| Parameters | 31,553 |
The loss decreased steadily during training, indicating that the model successfully learned character patterns, word boundaries, punctuation usage, and text structure from the training corpus.
Walive neall tas o how no ger shat,
Yors lag ce sveflid nasthbl prou, at
am knoffet toue as arve tre, hrom Anjandke ass
Whainener can ate allles ther ireg; ofe is,
That iver, mel: pre thabl boourds:
Although the generated text is not fully coherent, it demonstrates that the model learned aspects of English spelling, formatting, capitalization, and Shakespeare-like structure.
tinygpt/
│
├── data/
│ └── corpus.txt
│
├── dataset.py
├── model.py
├── train.py
├── generate.py
│
├── checkpoints/
│
└── README.md
Through this project, I gained practical experience with:
- Neural network fundamentals
- Transformer architecture
- Self-attention mechanisms
- Tokenization and vocabulary construction
- Language model training
- Loss functions and optimization
- Text generation
- PyTorch development
- Running machine learning workloads on limited hardware
Potential improvements include:
- Weight tying
- Dropout regularization
- Checkpoint resume support
- Better datasets
- Subword tokenization
- Larger model sizes
- Validation loss tracking
- Training visualizations
This project demonstrates that even a small Transformer with only 31,553 parameters can learn meaningful patterns from text. Building and training the model from scratch provided a practical understanding of the core ideas behind modern language models while remaining lightweight enough to run on older consumer hardware.