---
title: MiniGPT-from-Scratch
emoji: 🚀
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.16.2
app_file: app.py
pinned: false
---
A from-scratch implementation of a modern, decoder-only Transformer language model, built in 30 days. This project is a deep dive into state-of-the-art LLM engineering, moving from the classic GPT-2 design to a modern Llama-style architecture.
## Features

- Custom BPE Tokenizer: fully trained on the dataset using Hugging Face `tokenizers`.
- Modern Architecture: multi-head causal self-attention, SwiGLU MLPs, RMSNorm (Pre-Norm), and Rotary Positional Embeddings (RoPE).
- Optimized Training: PyTorch Flash Attention (A100/V100), AMP (mixed precision), cosine decay with warmup, and gradient accumulation.
- Interactive Demo: built-in Gradio web app for real-time text generation.
- Evaluation: integrated perplexity calculation for model benchmarking.
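The evaluation metric above reduces to a simple formula: perplexity is the exponential of the mean per-token cross-entropy loss (in nats). A minimal sketch in plain Python (the repo's actual implementation presumably operates on PyTorch loss tensors):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_nlls` holds the cross-entropy loss (in nats) for each
    token in the evaluation set.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning every token probability 1/e has NLL 1.0 per token,
# so its perplexity is e ≈ 2.718.
print(perplexity([1.0, 1.0, 1.0]))  # → 2.718281828459045
```

A perfect model (probability 1 on every token, NLL 0) scores the minimum perplexity of 1.0.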
## Tech Stack

- Language: Python 3.10+
- Deep Learning: PyTorch
- Tokenization: Hugging Face Tokenizers
- Interface: Gradio
- Data: FineWeb-Edu (sample)
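The tokenizer is trained with the Hugging Face `tokenizers` library, but the BPE algorithm underneath is simple: repeatedly fuse the most frequent adjacent symbol pair. A toy pure-Python sketch of that training loop (illustrative only, not the library's implementation):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Fuse every occurrence of `pair` into a single new symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: character-level words with frequencies.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
for _ in range(2):  # learn two merges: ("l","o") then ("lo","w")
    vocab = merge_pair(vocab, most_frequent_pair(vocab))
print(vocab)  # "low" is now a single symbol in both words
```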
## Project Structure

```
MiniGPT/
├── data/              # Raw and processed datasets
├── notebooks/         # Colab/Kaggle training templates
├── src/
│   ├── datasets/      # Data loading and preprocessing logic
│   ├── model/         # Transformer architecture (GPT, Attention, Blocks)
│   ├── tokenizer/     # BPE training and wrapper
│   ├── train/         # Training loop with AMP and validation
│   └── app.py         # Gradio web demo
├── checkpoints/       # Saved model weights (.pt)
└── requirements.txt   # Project dependencies
```
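Two of the Llama-style pieces that live under `src/model/` — RMSNorm and the SwiGLU MLP — are small enough to sketch in a few lines of PyTorch. Module and dimension names here are illustrative, not the repo's actual ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: no mean subtraction, no bias (vs. LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Llama-style gated MLP: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 8, 256)             # (batch, seq, d_model)
y = SwiGLU(256, 682)(RMSNorm(256)(x))  # pre-norm, then gated MLP
print(y.shape)  # torch.Size([2, 8, 256])
```

Pre-Norm means each sublayer sees normalized input and its output is added back to the residual stream, which is what makes deeper stacks train stably.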
## Quickstart

Clone the repository and install dependencies:

```bash
git clone https://github.com/mrshibly/MiniGPT-from-Scratch.git
cd MiniGPT-from-Scratch
pip install -r requirements.txt
```

Download and prepare the data, then train the tokenizer:

```bash
python src/datasets/download_fineweb.py
python src/datasets/clean_text.py
python src/tokenizer/train_tokenizer.py
python src/datasets/prepare_data.py
```

To train locally (CPU/GPU):

```bash
python src/train/train.py
```

Note: For full 50M training, use the Colab Template.
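The cosine-decay-with-warmup schedule used during training can be sketched as a plain function of the step count. The hyperparameter values here are placeholders, not the repo's actual settings:

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    if step >= total:
        return min_lr
    t = (step - warmup) / (total - warmup)       # 0 → 1 over the decay phase
    coeff = 0.5 * (1.0 + math.cos(math.pi * t))  # 1 → 0, cosine-shaped
    return min_lr + coeff * (max_lr - min_lr)

print(lr_at(0))     # 3e-06  (start of warmup)
print(lr_at(99))    # 0.0003 (peak, end of warmup)
print(lr_at(1000))  # 3e-05  (floor after decay)
```

In practice the training loop calls this once per optimizer step and writes the result into each param group's `lr` before `optimizer.step()`.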
## Run the Demo

Once you have a checkpoint at `checkpoints/ckpt.pt`:

```bash
python src/app.py
```

## Model Configurations

| Config | Params | Layers | Heads | d_model | Target Device |
|---|---|---|---|---|---|
| Tiny | ~7M | 4 | 4 | 256 | CPU/Laptop |
| Standard | ~50M | 6 | 8 | 512 | Free GPU |
| Stretch | ~124M | 12 | 12 | 768 | Colab Pro (A100) |
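Parameter counts like these can be sanity-checked with the standard rough formula for a GPT-2-style decoder: about 12·L·d² for the transformer blocks plus V·d for tied token embeddings. The vocabulary size V = 50,257 (GPT-2's) and the 4× MLP assumed here are this sketch's assumptions — the repo's BPE vocab and SwiGLU expansion factor may differ, shifting the estimate:

```python
def approx_params(n_layers: int, d_model: int, vocab: int = 50_257) -> int:
    """Rough decoder-only parameter count.

    Per block: attention QKV + output proj ≈ 4*d^2, a 4x MLP ≈ 8*d^2,
    giving 12*d^2. Plus tied token embeddings: vocab * d.
    Norm weights and biases are ignored.
    """
    return 12 * n_layers * d_model**2 + vocab * d_model

# Stretch config from the table: 12 layers, d_model 768.
print(f"{approx_params(12, 768) / 1e6:.1f}M")  # 123.5M, matching the ~124M row
```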
## Stretch Goal

- Dataset: 10 GB of FineWeb-Edu (target)
- Parameters: 124 million
- Optimization: FlashAttention-2, enabled via Colab Pro
- Status: scaling up toward GPT-2 Small-equivalent capability
## License

MIT