Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective RLVR #398
Athe-kunal wants to merge 48 commits into natolambert:main from …n-in-policy-gradients
Conversation
Configurable KL estimators, DAPO reward/filter refactor, and rollout collection refactor
Avoids reverse-parsing the chat template from decoded tokens.
Plumb response_penalties through compute_rewards, the rollout Experience, and the per-rollout wandb log alongside the existing reward/correctness/format averages. Ignore E731/B023 in ruff to allow the small avg() lambda used to keep the log call compact.
Shorten field names and document them in the dataclass docstring alongside the existing reward fields.
Also tweak DAPO l_cache from 256 to 128 in the config.
Add SAPO and DAPO to the policy gradients intro list, and correct the DAPO expansion to "Decoupled Clip and Dynamic sAmpling Policy Optimization (ByteDance, 2025)" to match the paper and loss.py docstring.
merge origin main
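For context, the "configurable KL estimators" in the first commit above are presumably the standard k1/k2/k3 approximations from Schulman's "Approximating KL Divergence" note. A minimal sketch of what such a switch could look like — the function name and signature are my assumptions, not this PR's actual interface:

```python
import torch

def kl_penalty(logp: torch.Tensor, ref_logp: torch.Tensor, estimator: str = "k3") -> torch.Tensor:
    """Per-token estimates of KL(pi || pi_ref) from samples drawn under pi."""
    log_ratio = ref_logp - logp  # log(pi_ref(x) / pi(x))
    if estimator == "k1":
        return -log_ratio                         # unbiased, high variance
    if estimator == "k2":
        return 0.5 * log_ratio.pow(2)             # biased, lower variance
    if estimator == "k3":
        return log_ratio.exp() - 1.0 - log_ratio  # unbiased, low variance
    raise ValueError(f"unknown KL estimator: {estimator}")
```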
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1c3b0d9cda
```python
if self.cfg.max_entropy_tokens != -1 and old_entropy is not None:
    tok_ids = sequence_ids[sample_idx, 1:][action_mask[sample_idx]]
    tok_ent = old_entropy[sample_idx][action_mask[sample_idx]].float()
    threshold = torch.quantile(tok_ent, 1.0 - self.cfg.max_entropy_tokens / 100.0)
```
**Guard quantile call against empty sampled completion**
When max_entropy_tokens != -1, this path assumes the randomly chosen sample has at least one generated action token. With pad_token_id == eos_token_id (set earlier in this file), a rollout that emits EOS immediately produces an empty tok_ent, and torch.quantile on an empty tensor raises at runtime. Because sample_idx is random, a single zero-length completion in the batch can crash logging and halt training.
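One way to guard it — a sketch only; the helper name and the skip-on-empty behavior are my assumptions, not necessarily the fix the PR adopts:

```python
import torch

def sample_entropy_threshold(tok_ent: torch.Tensor, top_pct: float) -> torch.Tensor | None:
    """Entropy cutoff for the top `top_pct`% tokens of one sampled completion.

    Returns None for a zero-length completion (e.g. immediate EOS with
    pad_token_id == eos_token_id) so the caller can skip highlighting
    instead of crashing in torch.quantile.
    """
    if tok_ent.numel() == 0:
        return None
    return torch.quantile(tok_ent.float(), 1.0 - top_pct / 100.0)
```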
```python
    Threshold is computed batch-wide over all unmasked positions (Wang et al., 2025).
    """
    flat_entropy = entropy[action_mask.bool()]
    threshold = torch.quantile(flat_entropy, 1.0 - top_pct / 100.0)
```
**Handle empty masked entropy set before quantile**
The high-entropy mask helper assumes at least one True entry in action_mask. If a microbatch has no valid action tokens (for example, all sampled completions terminate immediately and are masked out), flat_entropy is empty and this torch.quantile call throws, breaking the optimizer step. Adding an empty-mask fallback is needed to keep training robust.
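A possible fallback for the batch-wide helper — a sketch; the `high_entropy_mask` name, its signature, and the all-False fallback are my assumptions, not the PR's code:

```python
import torch

def high_entropy_mask(entropy: torch.Tensor, action_mask: torch.Tensor, top_pct: float) -> torch.Tensor:
    """Boolean mask over the top `top_pct`% highest-entropy action tokens.

    Threshold is computed batch-wide over all unmasked positions (Wang et al., 2025).
    Returns an all-False mask when no action tokens survive masking, instead of
    letting torch.quantile raise on an empty tensor.
    """
    mask = action_mask.bool()
    flat_entropy = entropy[mask]
    if flat_entropy.numel() == 0:
        # Every sampled completion was masked out; nothing to select.
        return torch.zeros_like(mask)
    threshold = torch.quantile(flat_entropy.float(), 1.0 - top_pct / 100.0)
    return (entropy >= threshold) & mask
```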
@natolambert and @zafstojano — adding this important paper, Beyond the 80/20 Rule, on high-entropy forking tokens. Below I have highlighted the top 20% highest-entropy tokens:

This PR implements the 80/20 paper. It adds the following config option:

```yaml
max_entropy_tokens: 20  # -1 = all tokens; set to e.g. 20 to use the top 20% highest-entropy (forking) tokens
```
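For illustration, here is how such a knob could gate the policy-gradient loss, reusing the `high_entropy_mask` sketch above. The names and wiring are hypothetical; the PR's actual loss code may differ:

```python
def masked_pg_loss(log_probs, advantages, entropy, action_mask, max_entropy_tokens=20):
    """Per-token PG loss restricted to high-entropy (forking) tokens.

    max_entropy_tokens = -1 keeps every action token; 20 keeps only the
    top-20% highest-entropy positions, per the 80/20 paper.
    """
    mask = action_mask.bool()
    if max_entropy_tokens != -1:
        mask = high_entropy_mask(entropy, action_mask, top_pct=float(max_entropy_tokens))
    per_token = -(log_probs * advantages)
    # Average only over the tokens that contribute gradient.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```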