
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective RLVR#398

Open
Athe-kunal wants to merge 48 commits into natolambert:main from Athe-kunal:main

Conversation

@Athe-kunal
Contributor

This PR implements the Beyond the 80/20 Rule paper.

It adds the following:

  1. An entropy computation and a max_entropy_tokens: 20 config option (-1 = all tokens; set to e.g. 20 to use only the top 20% highest-entropy, i.e. forking, tokens).
  2. This option is translated into the loss, and the entropy of the current and old policy is logged to wandb.
  3. A way to visualize the high-entropy tokens. It is pretty interesting: words like "let's see" and "break" are highlighted as high-entropy tokens.
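A minimal sketch of how such a setting could translate into the loss (hypothetical helper name and signature, not the PR's actual code): compute per-token entropy, keep only the top-N% highest-entropy tokens batch-wide, and average the policy-gradient loss over just those positions.

```python
import torch

def masked_pg_loss(per_token_loss, entropy, action_mask, max_entropy_tokens=20):
    """Average per-token loss over the top `max_entropy_tokens`% highest-entropy
    action tokens (hypothetical sketch; -1 means use all tokens)."""
    keep = action_mask.bool()
    if max_entropy_tokens != -1:
        flat = entropy[keep]
        # Batch-wide entropy threshold, e.g. max_entropy_tokens=20 keeps the top 20%.
        threshold = torch.quantile(flat.float(), 1.0 - max_entropy_tokens / 100.0)
        keep = keep & (entropy >= threshold)
    # clamp(min=1) guards the division when no token survives the mask.
    return (per_token_loss * keep).sum() / keep.sum().clamp(min=1)
```

With max_entropy_tokens=-1 this reduces to the usual masked mean over all action tokens.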

zafstojano and others added 17 commits April 29, 2026 21:40
Plumb response_penalties through compute_rewards, the rollout
Experience, and the per-rollout wandb log alongside the existing
reward/correctness/format averages. Ignore E731/B023 in ruff to
allow the small avg() lambda used to keep the log call compact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Shorter field names and document them in the dataclass docstring
alongside the existing reward fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Also tweak DAPO l_cache from 256 to 128 in the config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add SAPO and DAPO to the policy gradients intro list, and correct the
DAPO expansion to "Decoupled Clip and Dynamic sAmpling Policy
Optimization (ByteDance, 2025)" to match the paper and loss.py docstring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c3b0d9cda


if self.cfg.max_entropy_tokens != -1 and old_entropy is not None:
    tok_ids = sequence_ids[sample_idx, 1:][action_mask[sample_idx]]
    tok_ent = old_entropy[sample_idx][action_mask[sample_idx]].float()
    threshold = torch.quantile(tok_ent, 1.0 - self.cfg.max_entropy_tokens / 100.0)

P1: Guard quantile call against empty sampled completion

When max_entropy_tokens != -1, this path assumes the randomly chosen sample has at least one generated action token. With pad_token_id == eos_token_id (set earlier in this file), a rollout that emits EOS immediately produces an empty tok_ent, and torch.quantile on an empty tensor raises at runtime. Because sample_idx is random, a single zero-length completion in the batch can crash logging and halt training.
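One possible guard (a sketch under the assumptions above, with a hypothetical helper name): skip the per-sample visualization when the sampled completion has zero action tokens, instead of letting torch.quantile raise on an empty tensor.

```python
import torch

def sample_entropy_threshold(tok_ent: torch.Tensor, max_entropy_tokens: int):
    """Hypothetical guard: return None for a zero-length completion so the
    caller can skip visualization instead of crashing on torch.quantile."""
    if tok_ent.numel() == 0:
        return None  # immediate-EOS rollout: nothing to visualize
    return torch.quantile(tok_ent.float(), 1.0 - max_entropy_tokens / 100.0)
```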


Threshold is computed batch-wide over all unmasked positions (Wang et al., 2025).
"""
flat_entropy = entropy[action_mask.bool()]
threshold = torch.quantile(flat_entropy, 1.0 - top_pct / 100.0)

P1: Handle empty masked entropy set before quantile

The high-entropy mask helper assumes at least one True entry in action_mask. If a microbatch has no valid action tokens (for example, all sampled completions terminate immediately and are masked out), flat_entropy is empty and this torch.quantile call throws, breaking the optimizer step. Adding an empty-mask fallback is needed to keep training robust.
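The suggested fallback could look like this (a sketch with a hypothetical helper name, not the repo's actual function): return an all-False mask when no action tokens survive masking, so the optimizer step degrades gracefully instead of throwing.

```python
import torch

def high_entropy_token_mask(entropy: torch.Tensor, action_mask: torch.Tensor,
                            top_pct: float) -> torch.Tensor:
    """Keep only the top `top_pct`% highest-entropy action tokens, batch-wide."""
    flat_entropy = entropy[action_mask.bool()]
    if flat_entropy.numel() == 0:
        # Empty-mask fallback: no valid action tokens in this microbatch.
        return torch.zeros_like(action_mask, dtype=torch.bool)
    threshold = torch.quantile(flat_entropy.float(), 1.0 - top_pct / 100.0)
    return action_mask.bool() & (entropy >= threshold)
```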


@Athe-kunal
Contributor Author

Athe-kunal commented May 5, 2026

@natolambert and @zafstojano, adding this important paper, Beyond the 80/20 Rule, on high-entropy forking tokens. Below I have highlighted the top 20% high-entropy tokens.

Wandb report

[Screenshot, 2026-05-05: sample completion with the top 20% high-entropy tokens highlighted]

