
Add masked diffusion language model #13

Merged
gonzalobenegas merged 1 commit into main from masked-diffusion
Dec 2, 2025

Conversation

@gonzalobenegas (Member)

No description provided.

@gonzalobenegas (Member, Author)

@claude please review this PR, be concise

@claude

claude bot commented Dec 2, 2025

Claude finished @gonzalobenegas's task.


PR Review: Add Masked Diffusion Language Model

This PR cleanly implements a Diffusion Language Model (DLM) alongside the existing MLM/CLM objectives. The implementation follows the project's patterns well.

Summary

  • Adds DLM training objective with variable masking ratio r ~ Uniform(0, 1)
  • New classes: DLM, DLMDataModule, DLMLitModule
  • Two model configs: dlm_transformer_small (128d, 6L) and dlm_transformer_base (768d, 12L)

Code Quality

  • Proper docstrings throughout
  • Type hints present
  • Good code reuse via GeneralMaskedLM base class
  • Consistent PyTorch Lightning patterns

Architecture

  • DLM inherits from GeneralMaskedLM (shares prepare_for_loss with MLM)
  • DLMDataModule implements variable masking in apply_dlm_masking()
  • DLMLitModule uses MaskedLMAdapter (same as MLM, both bidirectional)
  • Proper separation: masking in data module, model logic in components

Key Implementation Details

  • Masking: Per-sequence random ratio (glm_experiments/data/lm_datamodule.py:108-144)
    • Samples r ~ Uniform(0,1) per sequence
    • 100% of selected tokens → [MASK] (no random-token/keep split, unlike MLM's 80/10/10 scheme)
  • Loss: Filters to masked positions only (labels != -100), same as MLM
  • Eval: Uses compute_llr_mlm (correct for bidirectional models)
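A minimal sketch of this masking scheme (token ids, array shapes, and the NumPy formulation are assumptions for illustration; the actual `apply_dlm_masking()` in `glm_experiments/data/lm_datamodule.py` presumably operates on PyTorch tensors):

```python
import numpy as np

MASK_ID = 4        # hypothetical [MASK] token id (repo value unknown)
IGNORE_INDEX = -100

def apply_dlm_masking(input_ids: np.ndarray, rng: np.random.Generator):
    """Per-sequence variable-ratio masking as described above: sample
    r ~ Uniform(0, 1) for each sequence and send every selected token
    to [MASK] (no 80/10/10 replacement as in MLM)."""
    input_ids = input_ids.copy()
    labels = np.full_like(input_ids, IGNORE_INDEX)  # loss ignores -100
    for i in range(input_ids.shape[0]):
        r = rng.uniform()                                  # masking ratio for this sequence
        selected = rng.uniform(size=input_ids.shape[1]) < r
        labels[i, selected] = input_ids[i, selected]       # supervise only masked positions
        input_ids[i, selected] = MASK_ID                   # 100% of selected tokens -> [MASK]
    return input_ids, labels
```

Because the labels are `-100` everywhere except the masked positions, the same masked-position loss filtering used for MLM applies unchanged.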

Minor Issues

  1. Workflow files: Trailing newline removals (.github/workflows/*.yml) - cosmetic formatting change
  2. Import order: Fixed in experiments/training_data/shard_dataset.py (numpy before pandas) ✅
  3. Nosec comments: Added nosec B615 for HF Hub downloads ✅

Small Suggestions (non-blocking)

  • Consider adding a comment in apply_dlm_masking() explaining why no token replacement (vs MLM's 80/10/10)
  • Experiment config has compile: false - consider documenting why (overhead vs benefit tradeoff)

Tests
⚠️ No new tests added. Per CLAUDE.md testing approach, DLM should have tests for:

  • apply_dlm_masking() logic (easy, no GPU needed)
  • Config instantiation (easy, no GPU needed)
  • Skip expensive training tests
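A pytest-style sketch of what such CPU-only tests could look like. The function body below is a stand-in with the semantics described in this review; a real test would instead import `apply_dlm_masking` from `glm_experiments.data.lm_datamodule`, and the token ids are assumptions:

```python
import numpy as np

MASK_ID, IGNORE_INDEX = 4, -100  # hypothetical ids

def apply_dlm_masking(ids, rng):
    # Stand-in mirroring the reviewed behavior: per-sequence r ~ U(0,1),
    # selected tokens -> [MASK], labels only at masked positions.
    ids = ids.copy()
    labels = np.full_like(ids, IGNORE_INDEX)
    for i in range(ids.shape[0]):
        selected = rng.uniform(size=ids.shape[1]) < rng.uniform()
        labels[i, selected] = ids[i, selected]
        ids[i, selected] = MASK_ID
    return ids, labels

def test_masked_positions_become_mask_token():
    rng = np.random.default_rng(0)
    orig = np.arange(40).reshape(4, 10)
    ids, labels = apply_dlm_masking(orig, rng)
    selected = labels != IGNORE_INDEX
    assert np.all(ids[selected] == MASK_ID)
    assert np.all(labels[selected] == orig[selected])

def test_unselected_positions_untouched():
    rng = np.random.default_rng(1)
    orig = np.arange(40).reshape(4, 10)
    ids, labels = apply_dlm_masking(orig, rng)
    selected = labels != IGNORE_INDEX
    assert np.all(ids[~selected] == orig[~selected])
```

Config instantiation can be covered similarly cheaply by calling `hydra.utils.instantiate` on each new config and asserting the expected class comes back, without running any training steps.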

Config Validation

  • Uses _target_ for Hydra instantiation
  • Proper config composition with defaults
  • Small model for debugging (100 steps, 10 warmup)
  • Base model follows CS336 defaults (AdamW, beta=[0.9, 0.98])

Overall Assessment
Clean implementation following existing patterns. The main gap is test coverage for the new masking function and configs.

Recommendation: Approve with suggestion to add basic tests for apply_dlm_masking().


@gonzalobenegas gonzalobenegas merged commit 8ccef02 into main Dec 2, 2025
5 of 6 checks passed
@gonzalobenegas gonzalobenegas deleted the masked-diffusion branch December 2, 2025 01:40