
🚩 RougeBERT: A Lightweight Experimental BERT-like Hybrid Transformer with RoPE+GQA

An optimized BERT-style MLM: an 8-layer transformer with GQA, RoPE, sliding-window + global attention, RMSNorm, and weight tying, built for speed, memory efficiency, and contextual modeling.

🧠 Core: 8L | 320d | 8H → GQA (2 groups) | RMSNorm | Weight Tied
🧭 Position: Rotary (RoPE) — extrapolates beyond 1024
👁️ Attention: Sliding Window (16) + Global Tokens → Vectorized Masks
⚡ Innovations: 4x smaller KV cache • No learned pos-embeds • Local+Global context

Prototype research code — not production-ready. Learning by building.

✅ Potentially ideal for: efficient training/inference • medium-length NLP tasks • memory-constrained environments
💡 Target domain: chemistry & molecular generation (SELFIES)
🚀 The architecture is potentially generalizable to other sequence domains


Model Architecture

RougeBERT is a hybrid transformer architecture that combines modern efficiency techniques with BERT-style masked language modeling. The model integrates several key innovations:

Core Architecture

  • 8-layer transformer with 320-dimensional hidden states and 8 attention heads
  • Grouped Query Attention (GQA) with 2 key-value groups, reducing memory usage while maintaining performance
  • RMSNorm instead of LayerNorm for improved training stability and efficiency
  • Weight tying between input embeddings and output head for parameter efficiency
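
As a rough illustration of two of these pieces, here is a minimal PyTorch sketch of RMSNorm and of the GQA key/value grouping implied by the 320-dim / 8-head / 2-group configuration above. The module and variable names are illustrative, not the repository's actual classes.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescales by 1/RMS(x); no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

# GQA with 8 query heads sharing 2 key/value groups: the K/V projections
# (and hence the KV cache) are 4x smaller than in full multi-head attention.
hidden, n_heads, n_kv_groups = 320, 8, 2
head_dim = hidden // n_heads                                # 40
q_proj  = nn.Linear(hidden, n_heads * head_dim)             # 320 -> 320
kv_proj = nn.Linear(hidden, 2 * n_kv_groups * head_dim)     # 320 -> 160 (K and V together)

x = torch.randn(1, 32, hidden)                              # (batch, seq, hidden)
q = q_proj(x).view(1, 32, n_heads, head_dim)
k, v = kv_proj(x).chunk(2, dim=-1)
k = k.view(1, 32, n_kv_groups, head_dim)
# Each stored KV group serves 8 // 2 = 4 query heads at attention time,
# e.g. by expanding it: k.repeat_interleave(n_heads // n_kv_groups, dim=2)

# Weight tying: the MLM output head reuses the input embedding matrix,
# e.g. lm_head.weight = token_embedding.weight
```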

Positional Encoding

  • Rotary Position Embedding (RoPE) replaces traditional learned position embeddings
  • Provides better length extrapolation and relative position understanding
  • Configured for sequences up to 1024 tokens (RoPE allows extrapolation beyond this)
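
A hedged sketch of the rotation RoPE applies to query/key features; this is the generic formulation, not necessarily the exact code used in this repository:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate feature pairs of x (batch, seq, heads, head_dim) by position-dependent angles."""
    _, seq, _, d = x.shape
    half = d // 2
    # One frequency per feature pair, positions 0..seq-1 (no learned parameters)
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Applied to both queries and keys, so their dot product depends only on
    # the relative distance between positions.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because the angles are computed on the fly, nothing is tied to a fixed maximum length, which is why extrapolation past the 1024-token configuration is possible (within limits).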

Attention Mechanism

  • Sliding window attention with configurable window size (default: 16 tokens)
  • Global attention tokens that can attend to, and be attended to by, all positions
  • Combines local efficiency with global context modeling
  • Vectorized attention mask computation for optimal performance
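
A minimal sketch of how a combined local + global boolean mask can be built with vectorized tensor operations. The default window size mirrors the one above, while the way global tokens are selected (e.g. flagging position 0) is an assumption for illustration:

```python
from typing import Optional
import torch

def local_global_mask(seq_len: int, window: int = 16,
                      global_idx: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Return a (seq_len, seq_len) boolean mask where True means attention is allowed."""
    pos = torch.arange(seq_len)
    # Local band: each token attends to neighbours within the sliding window
    allowed = (pos[:, None] - pos[None, :]).abs() <= window // 2
    if global_idx is not None:
        allowed[global_idx, :] = True   # global tokens attend to all positions
        allowed[:, global_idx] = True   # all positions attend to global tokens
    return allowed

# Example: 32-token sequence, position 0 treated as a global ([CLS]-like) token
mask = local_global_mask(32, window=16, global_idx=torch.tensor([0]))
```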

Evaluation vs. RoBERTa on MLM (WIP; currently: 1K dataset, 10 epochs, validation every 10 steps)

Dataset & Preprocessing

  • Data: 1,000 SMILES molecular representations from sample_1k_smi_42.csv, a 1K-molecule sample drawn from a combined curated dataset built from COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023)
  • Split: 70% train / 15% validation / 15% test (stratified random split, seed=42)
  • Tokenization: FastChemTokenizer with max sequence length of 512 tokens
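
For reference, a 70/15/15 random split with seed 42 can be reproduced roughly as below; the stratification key used in the actual experiment is not specified here, so this sketch uses a plain random split on the file named above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sample_1k_smi_42.csv")                                    # 1,000 SMILES strings
train_df, rest = train_test_split(df, test_size=0.30, random_state=42)      # 70% train
val_df, test_df = train_test_split(rest, test_size=0.50, random_state=42)   # 15% val / 15% test
```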

Training Setup

  • Task: Masked Language Modeling (MLM) with 15% token masking probability
  • Architecture Comparison: RougeBERT (hybrid model) vs RoBERTa baseline (~9M parameters each)
  • Training: 10 epochs, batch size 16, gradient accumulation 4 steps, learning rate 1e-5
  • Optimizer: Ranger21 with AdaBelief, warmup, and MADGrad components
  • Early Stopping: Patience of 10 validation steps based on validation loss
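
The 15% masking can be sketched as follows. This shows the standard BERT-style 80/10/10 replacement scheme, which may differ in detail from the repository's training script; mask_token_id and vocab_size are placeholders.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """Select ~15% of tokens; of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    labels = input_ids.clone()
    selected = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~selected] = -100                    # loss is computed on masked positions only

    ids = input_ids.clone()
    to_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    ids[to_mask] = mask_token_id

    to_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~to_mask
    ids[to_random] = torch.randint(vocab_size, labels.shape)[to_random]
    return ids, labels
```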

Evaluation Metrics

  • Perplexity: Primary metric for language modeling quality (lower is better)
  • MLM Accuracy: Token-level accuracy on masked positions
  • Validation Loss: Cross-entropy loss on held-out validation set
  • Evaluation Frequency: Every 10 training steps with continuous monitoring
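
Given model logits and labels that carry -100 on non-masked positions, perplexity and MLM accuracy reduce to the following minimal sketch:

```python
import torch
import torch.nn.functional as F

def mlm_metrics(logits: torch.Tensor, labels: torch.Tensor):
    """logits: (batch, seq, vocab); labels: (batch, seq) with -100 on non-masked positions."""
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
                           ignore_index=-100)
    masked = labels != -100
    accuracy = (logits.argmax(dim=-1)[masked] == labels[masked]).float().mean()
    return loss.exp().item(), accuracy.item()   # perplexity (lower is better), MLM accuracy
```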

Learning Curves of the Two Models on the 1K Dataset over 10 Epochs

(Learning-curve plots for RougeBERT and the RoBERTa baseline; see the repository for the images.)

🔧 Contributing

This project is a learning experiment — all contributions are welcome!

  • 🧠 Have a better way to implement the methods?
  • 📊 Want to add evaluation metrics?
  • ✨ Found a bug? Please open an issue!

👉 Please:

  • Keep changes minimal and focused.
  • Add comments if you change core logic.

⚠️ Disclaimer

This is NOT a production model.

  • Built during late-night prototyping sessions 🌙
  • Not thoroughly validated or benchmarked due to compute constraints
  • Some components are heuristic and unproven
  • May crash, overfit, or generate nonsense (especially outside molecular data)
  • I’m still learning PyTorch, attention mechanisms, and transformer internals

Use this code to learn and experiment — not to deploy.

📜 License

MIT

📚 References

COCONUTDB

@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}

ChEMBL34

@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChemBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}

SuperNatural3

@article{Gallo2023,
  author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
  journal = {Nucleic Acids Research},
  year = {2023},
  month = jan,
  day = {6},
  volume = {51},
  number = {D1},
  pages = {D654-D659},
  doi = {10.1093/nar/gkac1008}
}

Ranger21 Optimizer

@article{wright2021ranger21,
  title={Ranger21: a synergistic deep learning optimizer},
  author={Wright, Less and Demeure, Nestor},
  journal={arXiv preprint arXiv:2106.13731},
  year={2021}
}
