Status: Phase 9 - Architecture v2.0 Training 🚀 In Progress
Latest Achievement: v2.0 Architecture implemented & verified (Input-Dependent Memory, Soft Patching, ACT)
A research implementation of a byte-level language model featuring:
- 🧠 Hebbian Memory with Input-Dependent Decay (Selective Forgetting)
- 📚 Curriculum Learning (3-stage developmental approach)
- 🔄 System 2 Reasoning with Adaptive Computation Time (ACT)
- 🚀 Linear Complexity attention mechanism
- ⚡ Parallel MLP Decoder (No more GRU bottleneck)
```bash
pip install torch datasets tqdm

python train_scaled.py                    # 50K steps, 129M params
python generate.py best_model_scaled.pth
python test_recall_fixed.py               # Memory test (Needle in Haystack)
python overfit_test.py                    # Stability verification
```

```
Bytes → Encoder (Soft Patching) → Hebbian Memory → Reasoning Loop → MLP Decoder → Bytes
        (Overlap=2)               (Input-Dep λ)    (ACT Exit)       (Parallel)
```
- ByteLatentEncoder: Soft Patching (Kernel=6, Stride=4) for smoother boundaries.
- HebbianMemory: Input-Dependent Decay ($\lambda_t = \sigma(W x_t)$) for selective memory (a sketch follows this list).
- RecurrentReasoningBlock: Adaptive Computation Time (ACT) with Exit Gate.
- LocalAutoregressiveHead: Parallel MLP Decoder (4x faster than GRU).
- HybridBlock: Gated Fusion (Sigmoid) + SwiGLU + RMSNorm.
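For concreteness, here is a minimal sketch of the recurrence implied by the input-dependent decay above. The projection names (`W_decay`, `W_k`, `W_v`, `W_q`), the scalar decay gate, and the outer-product fast-weight update are illustrative assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn

class HebbianMemorySketch(nn.Module):
    """Illustrative fast-weight memory with input-dependent decay (lambda_t = sigmoid(W x_t))."""

    def __init__(self, d_model: int):
        super().__init__()
        self.W_decay = nn.Linear(d_model, 1)    # per-step decay gate (assumption: scalar gate)
        self.W_k = nn.Linear(d_model, d_model)  # key projection (assumption)
        self.W_v = nn.Linear(d_model, d_model)  # value projection (assumption)
        self.W_q = nn.Linear(d_model, d_model)  # query projection (assumption)

    def forward(self, x: torch.Tensor, alpha: float = 0.99) -> torch.Tensor:
        # x: (batch, seq_len, d_model); alpha is the curriculum plasticity coefficient
        B, T, D = x.shape
        M = x.new_zeros(B, D, D)                # fast-weight matrix per batch element
        outputs = []
        for t in range(T):                      # single pass over the sequence -> O(N) in length
            x_t = x[:, t]
            lam = torch.sigmoid(self.W_decay(x_t))   # lambda_t = sigma(W x_t), in (0, 1)
            k, v, q = self.W_k(x_t), self.W_v(x_t), self.W_q(x_t)
            # Selective forgetting: lambda_t near 1 keeps old associations, near 0 forgets them.
            M = lam.unsqueeze(-1) * M + alpha * torch.einsum('bi,bj->bij', v, k)
            outputs.append(torch.einsum('bij,bj->bi', M, q))  # read-out: M_t q_t
        return torch.stack(outputs, dim=1)      # (B, T, D)
```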
See docs/architecture.md for technical details.
✅ No Tokenization - Universal byte-level processing
✅ Linear Complexity - O(N) attention with Hebbian memory
✅ Smart Memory - Input-Dependent Decay (can "lock" important info)
✅ Curriculum Learning - 3-stage developmental training
✅ Adaptive Reasoning - Dynamic thinking steps (ACT)
✅ Modern Components - SwiGLU, RMSNorm, Soft Patching
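The soft patching listed above can be pictured as a strided 1-D convolution over byte embeddings: with kernel 6 and stride 4, neighbouring patches share 2 bytes, which is the Overlap=2 in the pipeline diagram. A minimal sketch under that assumption (embedding size, padding, and layer names are illustrative):

```python
import torch
import torch.nn as nn

class SoftPatchEncoderSketch(nn.Module):
    """Illustrative soft patching: kernel=6, stride=4 -> adjacent patches overlap by 2 bytes."""

    def __init__(self, d_model: int = 512, vocab_size: int = 256):
        super().__init__()
        self.byte_embed = nn.Embedding(vocab_size, d_model)  # one embedding per byte value
        self.patcher = nn.Conv1d(d_model, d_model, kernel_size=6, stride=4, padding=1)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, n_bytes) -> patches: (batch, n_patches, d_model)
        x = self.byte_embed(byte_ids).transpose(1, 2)  # (B, d_model, n_bytes) for Conv1d
        return self.patcher(x).transpose(1, 2)         # each patch blends 6 bytes, stride 4

patches = SoftPatchEncoderSketch()(torch.randint(0, 256, (1, 256)))
print(patches.shape)  # (1, 64, 512): roughly one patch per 4 bytes
```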
| Stage | Steps | Plasticity (α) | Data | Purpose |
|---|---|---|---|---|
| 1. Childhood | 0-3K | 0.10 | Dictionary | Lexical grounding |
| 2. Youth | 3K-8K | 0.50 | Stories | Syntactic scaffolding |
| 3. Adulthood | 8K-20K | 0.99 | Wikipedia | Semantic expansion |
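A small sketch of how this schedule could be queried during training; the stage boundaries and α values come straight from the table, while the function name and data-source strings are illustrative:

```python
def curriculum_stage(step: int) -> tuple[float, str]:
    """Map a global training step to (plasticity alpha, data source) per the curriculum table."""
    if step < 3_000:      # Stage 1: Childhood - lexical grounding
        return 0.10, "dictionary"
    elif step < 8_000:    # Stage 2: Youth - syntactic scaffolding
        return 0.50, "stories"
    else:                 # Stage 3: Adulthood - semantic expansion
        return 0.99, "wikipedia"
```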
Metrics:
- Final BPC: 1.85 (↓77% from initialization)
- Best Val BPC: 1.78
- Training Time: ~50 minutes (CUDA GPU)
- Stability: 0 NaN occurrences across 20K steps
Progress:
Step 0: BPC = 8.04 (Random initialization)
Step 5K: BPC = 2.23 (Initial curriculum complete)
Step 10K: BPC = 1.98 (Mid-training)
Step 20K: BPC = 1.85 (Final)
Improvement: 6.19 BPC reduction (77% improvement)
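The reported improvement follows directly from the endpoints above, and, assuming BPC here is the byte-level cross-entropy expressed in bits, it converts from a nats-based training loss by dividing by ln 2 (the 1.282-nat value below is hypothetical, chosen only to illustrate the conversion):

```python
import math

ce_loss_nats = 1.282                      # hypothetical cross-entropy in nats per byte
print(round(ce_loss_nats / math.log(2), 2))   # ~1.85 BPC, assuming byte-level CE

print(round(8.04 - 1.85, 2))              # 6.19 absolute BPC reduction
print(round((8.04 - 1.85) / 8.04 * 100))  # 77 (% improvement relative to initialization)
```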
Problem: Float16 overflow in Hebbian Memory with low plasticity (α=0.1)
Solution: Force float32 computation for memory module
```python
@torch.amp.autocast('cuda', enabled=False)
def forward(self, x):
    input_dtype = x.dtype
    x = x.float()  # Bypass AMP for numerical stability
    # ... Hebbian computation in float32, producing `out` ...
    return out.to(input_dtype)
```

This fix enables stable 20K+ step training with AMP enabled.
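A hedged usage sketch of how the rest of the training step can still run under autocast while the decorator above locally keeps the memory module in float32 (the model, batch, and optimizer names are placeholders):

```python
import torch

scaler = torch.amp.GradScaler('cuda')

def train_step(model, batch, optimizer):
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast('cuda', dtype=torch.float16):
        loss = model(batch)            # HebbianMemory.forward still computes in float32 internally
    scaler.scale(loss).backward()      # scaled backward to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```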
- Architecture Guide - Technical deep dive
- Training Guide - Training from scratch
- Inference Guide - Generation and sampling
- API Reference - Code documentation
- RFC 007: Curriculum Learning - Phase 7 design
- best_model_curriculum.pth - Best checkpoint (Val BPC: 1.78)
- last_model_curriculum.pth - Final model state (20K steps)
- metrics_curriculum.json - Full training metrics
- Extended Training: 30K-50K steps for further convergence
- Larger Model: Increase d_model=768, n_layers=8
- Longer Context: Extend to 2048 token window
- Fine-tuning: Domain-specific Turkish datasets
- Adaptive plasticity scheduling
- Multi-stage curriculum optimization
- Cross-lingual transfer learning
- Sparse Hebbian memory
```bibtex
@software{agiformer2025,
  title={AGIFORMER: Byte-Level Language Model with Hebbian Memory and Neuroplasticity},
  author={inkbytefo},
  year={2025},
  note={Phase 7: Curriculum Learning with Dynamic Plasticity},
  url={https://github.com/inkbytefo/agi-former}
}
```

MIT License - see LICENSE file for details.
- Built with PyTorch
- Turkish Wikipedia dataset (trwiki)
- Turkish Dictionary dataset (TDK)
- Inspired by Fast Weights, Linear Transformers, and developmental neuroscience
Developer: inkbytefo
Phase: 7 (Curriculum Learning & Neuroplasticity)
Status: Production Ready ✅
Last Updated: 2025-11-23