BERT-style MLM — an 8-layer transformer with GQA, RoPE, sliding-window + global attention, RMSNorm, and weight tying, optimized for speed, memory, and contextual modeling.
🧠 Core: 8L | 320d | 8H → GQA (2 groups) | RMSNorm | Weight Tied
🧭 Position: Rotary (RoPE) — extrapolates beyond 1024
👁️ Attention: Sliding Window (16) + Global Tokens → Vectorized Masks
⚡ Innovations: 4x smaller KV cache • No learned pos-embeds • Local+Global context
Prototype research code — not production-ready. Learning by building.
✅ Potentially ideal for: Efficient training/inference • Medium-length NLP tasks • Memory-constrained environments
💡 Target domain: chemistry & molecular generation (SELFIES)
🚀 The architecture is potentially generalizable to other sequence domains.
RougeBERT is a hybrid transformer architecture that combines modern efficiency techniques with BERT-style masked language modeling. The model integrates several key innovations:
- 8-layer transformer with 320-dimensional hidden states and 8 attention heads
- Grouped Query Attention (GQA) with 2 key-value groups, reducing memory usage while maintaining performance (sketched below)
- RMSNorm instead of LayerNorm for improved training stability and efficiency
- Weight tying between input embeddings and output head for parameter efficiency
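For concreteness, here is a minimal PyTorch sketch of the GQA key/value sharing, RMSNorm, and weight tying listed above; the module layout, tensor shapes, and placeholder vocabulary size are illustrative assumptions rather than the repo's actual code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the features (no mean-centering, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

# GQA: 8 query heads share 2 key/value groups, so K/V projections (and the KV cache) shrink 4x.
d_model, n_heads, n_kv_groups = 320, 8, 2
head_dim = d_model // n_heads                              # 40
q_proj = nn.Linear(d_model, n_heads * head_dim)            # 320 -> 320
kv_proj = nn.Linear(d_model, 2 * n_kv_groups * head_dim)   # 320 -> 160 (K and V together)

x = torch.randn(1, 12, d_model)                            # (batch, seq, dim)
q = q_proj(x).view(1, 12, n_heads, head_dim)
k, v = kv_proj(x).chunk(2, dim=-1)
k = k.view(1, 12, n_kv_groups, head_dim).repeat_interleave(n_heads // n_kv_groups, dim=2)
v = v.view(1, 12, n_kv_groups, head_dim).repeat_interleave(n_heads // n_kv_groups, dim=2)
# q, k, v are now all (1, 12, 8, 40) and can go through standard multi-head attention.

# Weight tying: the MLM output head reuses the input embedding matrix.
vocab_size = 1000                                          # placeholder value
tok_emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = tok_emb.weight                            # shared parameters
```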
- Rotary Position Embedding (RoPE) replaces traditional learned position embeddings
- Provides better length extrapolation and relative position understanding
- Configured for sequences up to 1024 tokens with RoPE
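A minimal sketch of the rotary-embedding idea with the dimensions above (8 heads × 40-dim); the rotary base of 10000 and the half-split rotation layout are assumptions, since implementations differ in detail.

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to x of shape (batch, seq, heads, head_dim).

    Channel pairs are rotated by angles that grow with position, so relative offsets
    are encoded implicitly and the model can extrapolate past its training length.
    """
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]          # (1, seq, 1, half)
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied to every (x1, x2) channel pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 1024, 8, 40)    # queries for a 1024-token sequence
q_rot = rotary_embed(q)            # same shape; no learned position table involved
```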
- Sliding window attention with configurable window size (default: 16 tokens)
- Global attention tokens that can attend to and be attended by all positions
- Combines local efficiency with global context modeling
- Vectorized attention mask computation for optimal performance
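One way to build the combined sliding-window + global-token mask fully vectorized (no Python loops), as a sketch; the symmetric-window interpretation and the choice of which tokens are global are assumptions, not necessarily the repo's exact semantics.

```python
import torch

def build_attention_mask(seq_len, window=16, global_idx=None):
    """Boolean (seq, seq) mask: True means position i may attend to position j."""
    pos = torch.arange(seq_len)
    # Local band: each token sees neighbours within the sliding window.
    local = (pos[:, None] - pos[None, :]).abs() <= window // 2
    if global_idx is None:
        return local
    is_global = torch.zeros(seq_len, dtype=torch.bool)
    is_global[torch.as_tensor(global_idx)] = True
    # Global tokens attend everywhere and are attended to from everywhere.
    return local | is_global[:, None] | is_global[None, :]

mask = build_attention_mask(seq_len=64, window=16, global_idx=[0])  # e.g. a [CLS]-like global token
# Typical use: scores.masked_fill(~mask, float("-inf")) before the softmax.
```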
- Data: 1,000 SMILES molecular representations from `sample_1k_smi_42.csv`, a 1K-molecule sample from a combined curated dataset built from COCONUTDB (Sorokina et al., 2021), ChemBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023)
- Split: 70% train / 15% validation / 15% test (stratified random split, seed=42)
- Tokenization: FastChemTokenizer with max sequence length of 512 tokens
- Task: Masked Language Modeling (MLM) with 15% token masking probability (see the masking sketch below)
- Architecture Comparison: RougeBERT (hybrid model) vs RoBERTa baseline (~9M parameters each)
- Training: 10 epochs, batch size 16, gradient accumulation 4 steps, learning rate 1e-5
- Optimizer: Ranger21 with AdaBelief, warmup, and MADGrad components
- Early Stopping: Patience of 10 validation steps based on validation loss
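A sketch of the masking and optimization pieces from this list, assuming standard BERT-style 80/10/10 replacement and the public ranger21 package; special-token handling, the exact Ranger21 arguments, and the loop skeleton are simplified assumptions, not the repo's training script.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: select 15% of positions; of those, 80% become [MASK],
    10% become a random token, 10% stay unchanged. Unselected labels are set to -100."""
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100                          # loss is computed only on masked positions

    masked_ids = input_ids.clone()
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    masked_ids[to_mask] = mask_token_id
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~to_mask
    masked_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]
    return masked_ids, labels

# Demo on a dummy batch (batch size 16, max sequence length 512 as above).
ids = torch.randint(5, 1000, (16, 512))
masked_ids, labels = mask_tokens(ids, mask_token_id=4, vocab_size=1000)

# Optimizer and gradient-accumulation skeleton (Ranger21 call signature is an assumption;
# check the ranger21 package docs before copying):
#   from ranger21 import Ranger21
#   optimizer = Ranger21(model.parameters(), lr=1e-5,
#                        num_epochs=10, num_batches_per_epoch=len(train_loader))
#   accum = 4
#   for step, batch in enumerate(train_loader):       # batches of size 16 -> effective batch 64
#       inputs, labels = mask_tokens(batch["input_ids"], mask_id, vocab_size)
#       (model(inputs, labels=labels).loss / accum).backward()
#       if (step + 1) % accum == 0:
#           optimizer.step(); optimizer.zero_grad()
```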
- Perplexity: Primary metric for language modeling quality (lower is better; see the metrics sketch below)
- MLM Accuracy: Token-level accuracy on masked positions
- Validation Loss: Cross-entropy loss on held-out validation set
- Evaluation Frequency: Every 10 training steps with continuous monitoring
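A minimal sketch of how these metrics can be computed, assuming labels carry -100 at unmasked positions (as in the masking sketch above): perplexity is the exponential of the mean masked cross-entropy, and accuracy is taken only over masked tokens.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def mlm_metrics(logits, labels):
    """logits: (batch, seq, vocab); labels: (batch, seq) with -100 at unmasked positions."""
    loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)
    perplexity = math.exp(loss.item())                       # lower is better

    mask = labels != -100
    preds = logits.argmax(dim=-1)
    accuracy = (preds[mask] == labels[mask]).float().mean().item()
    return {"val_loss": loss.item(), "perplexity": perplexity, "mlm_accuracy": accuracy}

# Demo with random tensors (shapes only; real inputs come from the validation set).
logits = torch.randn(2, 8, 100)
labels = torch.full((2, 8), -100)
labels[:, 3] = torch.randint(100, (2,))
print(mlm_metrics(logits, labels))
```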
This project is a learning experiment — all contributions are welcome!
- 🧠 Have a better way to implement the methods?
- 📊 Want to add evaluation metrics?
- ✨ Found a bug? Please open an issue!
👉 Please:
- Keep changes minimal and focused.
- Add comments if you change core logic.
This is NOT a production model.
- Built during late-night prototyping sessions 🌙
- Not thoroughly validated or benchmarked due to compute constraints
- Some components are heuristic and unproven
- May crash, overfit, or generate nonsense (especially outside molecular data)
- I’m still learning PyTorch, attention mechanisms, and transformer internals
Use this code to learn and experiment — not to deploy.
MIT
@article{sorokina2021coconut,
title={COCONUT online: Collection of Open Natural Products database},
author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
journal={Journal of Cheminformatics},
volume={13},
number={1},
pages={2},
year={2021},
doi={10.1186/s13321-020-00478-9}
}

@article{zdrazil2023chembl,
title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
journal={Nucleic Acids Research},
year={2023},
volume={gkad1004},
doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
title={ChemBL34},
year={2023},
doi={10.6019/CHEMBL.database.34}
}

@article{Gallo2023,
author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
journal = {Nucleic Acids Research},
year = {2023},
month = jan,
day = {6},
volume = {51},
number = {D1},
pages = {D654-D659},
doi = {10.1093/nar/gkac1008}
}

@article{wright2021ranger21,
title={Ranger21: a synergistic deep learning optimizer},
author={Wright, Less and Demeure, Nestor},
year={2021},
journal={arXiv preprint arXiv:2106.13731},
}
