Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective RLVR #398
Athe-kunal wants to merge 48 commits into natolambert:main from …n-in-policy-gradients
Conversation
Configurable KL estimators, DAPO reward/filter refactor, and rollout collection refactor
Avoids reverse-parsing the chat template from decoded tokens.
Plumb response_penalties through compute_rewards, the rollout Experience, and the per-rollout wandb log alongside the existing reward/correctness/format averages. Ignore E731/B023 in ruff to allow the small avg() lambda used to keep the log call compact.
Shorten field names and document them in the dataclass docstring alongside the existing reward fields.
Also tweak DAPO l_cache from 256 to 128 in the config.
Add SAPO and DAPO to the policy gradients intro list, and correct the DAPO expansion to "Decoupled Clip and Dynamic sAmpling Policy Optimization (ByteDance, 2025)" to match the paper and loss.py docstring.
merge origin main
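For context, the "configurable KL estimators" in the first commit above are presumably the standard k1/k2/k3 approximations from Schulman's "Approximating KL Divergence" note. A minimal sketch of what such a switch could look like — the function name and signature are my assumptions, not this PR's actual interface:

```python
import torch

def kl_penalty(logp: torch.Tensor, ref_logp: torch.Tensor, estimator: str = "k3") -> torch.Tensor:
    """Per-token estimates of KL(pi || pi_ref) from samples drawn under pi."""
    log_ratio = ref_logp - logp  # log(pi_ref(x) / pi(x))
    if estimator == "k1":
        return -log_ratio                         # unbiased, high variance
    if estimator == "k2":
        return 0.5 * log_ratio.pow(2)             # biased, lower variance
    if estimator == "k3":
        return log_ratio.exp() - 1.0 - log_ratio  # unbiased, low variance
    raise ValueError(f"unknown KL estimator: {estimator}")
```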
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1c3b0d9cda
```python
if self.cfg.max_entropy_tokens != -1 and old_entropy is not None:
    tok_ids = sequence_ids[sample_idx, 1:][action_mask[sample_idx]]
    tok_ent = old_entropy[sample_idx][action_mask[sample_idx]].float()
    threshold = torch.quantile(tok_ent, 1.0 - self.cfg.max_entropy_tokens / 100.0)
```
**Guard quantile call against empty sampled completion**
When max_entropy_tokens != -1, this path assumes the randomly chosen sample has at least one generated action token. With pad_token_id == eos_token_id (set earlier in this file), a rollout that emits EOS immediately produces an empty tok_ent, and torch.quantile on an empty tensor raises at runtime. Because sample_idx is random, a single zero-length completion in the batch can crash logging and halt training.
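One way to guard it — a sketch only; the helper name and the skip-on-empty behavior are my assumptions, not necessarily the fix the PR adopts:

```python
import torch

def sample_entropy_threshold(tok_ent: torch.Tensor, top_pct: float) -> torch.Tensor | None:
    """Entropy cutoff for the top `top_pct`% tokens of one sampled completion.

    Returns None for a zero-length completion (e.g. immediate EOS with
    pad_token_id == eos_token_id) so the caller can skip highlighting
    instead of crashing in torch.quantile.
    """
    if tok_ent.numel() == 0:
        return None
    return torch.quantile(tok_ent.float(), 1.0 - top_pct / 100.0)
```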
```python
    Threshold is computed batch-wide over all unmasked positions (Wang et al., 2025).
    """
    flat_entropy = entropy[action_mask.bool()]
    threshold = torch.quantile(flat_entropy, 1.0 - top_pct / 100.0)
```
**Handle empty masked entropy set before quantile**
The high-entropy mask helper assumes at least one True entry in action_mask. If a microbatch has no valid action tokens (for example, all sampled completions terminate immediately and are masked out), flat_entropy is empty and this torch.quantile call throws, breaking the optimizer step. Adding an empty-mask fallback is needed to keep training robust.
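A possible fallback for the batch-wide helper — a sketch; the `high_entropy_mask` name, its signature, and the all-False fallback are my assumptions, not the PR's code:

```python
import torch

def high_entropy_mask(entropy: torch.Tensor, action_mask: torch.Tensor, top_pct: float) -> torch.Tensor:
    """Boolean mask over the top `top_pct`% highest-entropy action tokens.

    Threshold is computed batch-wide over all unmasked positions (Wang et al., 2025).
    Returns an all-False mask when no action tokens survive masking, instead of
    letting torch.quantile raise on an empty tensor.
    """
    mask = action_mask.bool()
    flat_entropy = entropy[mask]
    if flat_entropy.numel() == 0:
        # Every sampled completion was masked out; nothing to select.
        return torch.zeros_like(mask)
    threshold = torch.quantile(flat_entropy.float(), 1.0 - top_pct / 100.0)
    return (entropy >= threshold) & mask
```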
@natolambert and @zafstojano — adding this important paper, Beyond the 80/20 Rule, on high-entropy forking tokens. Below I have highlighted the top 20% highest-entropy tokens:

This PR implements the 80/20 paper. It adds the following config option:

```yaml
max_entropy_tokens: 20  # -1 = all tokens; set to e.g. 20 to use the top 20% highest-entropy (forking) tokens
```
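For illustration, here is how such a knob could gate the policy-gradient loss, reusing the `high_entropy_mask` sketch above. The names and wiring are hypothetical; the PR's actual loss code may differ:

```python
def masked_pg_loss(log_probs, advantages, entropy, action_mask, max_entropy_tokens=20):
    """Per-token PG loss restricted to high-entropy (forking) tokens.

    max_entropy_tokens = -1 keeps every action token; 20 keeps only the
    top-20% highest-entropy positions, per the 80/20 paper.
    """
    mask = action_mask.bool()
    if max_entropy_tokens != -1:
        mask = high_entropy_mask(entropy, action_mask, top_pct=float(max_entropy_tokens))
    per_token = -(log_probs * advantages)
    # Average only over the tokens that contribute gradient.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```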