Add SDPO to policy gradients #387

Draft
Shekswess wants to merge 1 commit into natolambert:main from Shekswess:code/sdpo-implementation

Conversation


@Shekswess commented Apr 25, 2026

Summary

Refs #360.

This PR adds SDPO (Self-Distillation Policy Optimization) to code/policy_gradients/, following the issue discussion that SDPO fits the online policy-gradient training loop better than direct_alignment/.

Changes include:

  • SDPOLoss, implemented as GRPO plus a token-level reverse-KL self-distillation term (a rough sketch of one way these pieces could fit together follows this list).
  • SDPO teacher-input construction from successful rollouts within each group.
  • Completion/log-prob extraction via action_mask, so variable-length generations do not accidentally include prompt tokens.
  • Optional SDPO stabilization knobs: sdpo_teacher_ema_rate and sdpo_is_clip.
  • policy_gradients/configs/sdpo.yaml, README/changelog updates, and inclusion in run_all_policy_gradients.sh.
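For intuition, here is a minimal sketch of what such a loss could look like. Everything below is an assumption for illustration: the tensor names, shapes, and the exact form of the GRPO term are not taken from this PR, and the sdpo_is_clip / sdpo_teacher_ema_rate knobs are wired in only schematically.

```python
import torch


def sdpo_loss(
    policy_logprobs: torch.Tensor,   # (B, T) log-probs of sampled tokens, current policy
    teacher_logprobs: torch.Tensor,  # (B, T) log-probs of the same tokens, teacher
    old_logprobs: torch.Tensor,      # (B, T) log-probs under the behavior policy
    advantages: torch.Tensor,        # (B,) group-normalized GRPO advantages
    action_mask: torch.Tensor,       # (B, T) 1 on completion tokens, 0 on prompt/pad
    kl_coef: float = 0.1,            # weight on the self-distillation term (assumed)
    sdpo_is_clip: float = 10.0,      # cap on importance ratios (assumed semantics)
) -> torch.Tensor:
    mask = action_mask.float()
    n_tokens = mask.sum().clamp(min=1.0)
    # Importance ratios against the behavior policy, clipped for stability.
    ratio = torch.exp(policy_logprobs - old_logprobs).clamp(max=sdpo_is_clip)
    # GRPO-style policy-gradient term, averaged over completion tokens only.
    pg_loss = -(ratio * advantages.unsqueeze(-1) * mask).sum() / n_tokens
    # Per-token reverse-KL estimate KL(policy || teacher) on the sampled tokens.
    rkl = ((policy_logprobs - teacher_logprobs) * mask).sum() / n_tokens
    return pg_loss + kl_coef * rkl


@torch.no_grad()
def update_teacher_ema(teacher, policy, sdpo_teacher_ema_rate: float = 0.99):
    # One plausible reading of the sdpo_teacher_ema_rate knob: the teacher
    # tracks the policy as an exponential moving average of its weights.
    for t, p in zip(teacher.parameters(), policy.parameters()):
        t.lerp_(p, 1.0 - sdpo_teacher_ema_rate)
```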

Validation

Local checks passed:

  • .venv/bin/python -m compileall policy_gradients (run from code/)
  • uvx ruff check code/policy_gradients
  • policy_gradients/configs/sdpo.yaml loads successfully with policy_gradients.config.load_config (sketch below)
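The config check boils down to something like the following; the module path and function name come from the check above, while the rest is just for illustration:

```python
from policy_gradients.config import load_config

# Smoke-test that the SDPO config parses; run from the code/ directory.
cfg = load_config("policy_gradients/configs/sdpo.yaml")
print(cfg)
```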

I also ran reduced local wandb probe runs in the sdpo-test project, since I do not have enough compute for the full reference run:

Final reduced-run summary: avg_reward=0.7708, loss=-0.2717, grad_norm=3.9375.

Maintainer Request

@natolambert, could you run the full policy_gradients/configs/sdpo.yaml reference job on proper compute and decide which official wandb run should go into the README table? I left the README status as "pending full validation" rather than treating my reduced local run as the canonical result.

@zafstojano (Collaborator)

Probably a good idea to first merge #385, since it decouples the rollouts from the policy updates - a breaking change.

@nathanlambert commented Apr 26, 2026


lgtm 👍

@natolambert (Owner)

> lgtm 👍

🫨🫨🤨🤨

@zafstojano (Collaborator)

#385 has been merged and introduced a fairly large refactoring.

I know #360 suggests incorporating SDPO under policy_gradients/, and while that is technically feasible, it would hurt readability.

If we want to compare the performance of distillation vs. RL, we should make an effort to log the same metrics to the same WandB project for a 1-to-1 comparison.

Apart from that, I think it would be beneficial to have a separate module in the code directory for self_distillation.

For example, trl also implements SDPO under a self_distillation module in the experimental directory:
https://github.com/huggingface/trl/tree/main/trl/experimental/self_distillation

With that said, I think we can look at how the new layout of the policy_gradients module organizes its functions and start a new module along the same lines. Roughly the same structure should work (a rough skeleton sketch follows the list):

  • train.py that should contain almost nothing beyond a main() method that reads almost like pseudo-code. Extreme focus on readability and conciseness.
  • utils.py that hides the ugliness of computing the tensors (KL, log probs, etc.), seeding functions, attention implementation, printing/logging, etc.
  • Experience and ReplayBuffer storing the tensors needed for optimization
  • RolloutEngine for generating rollouts
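To make the proposal concrete, here is a rough, hypothetical skeleton of that layout. Every name below simply mirrors the list above; none of it is an existing API, and the method bodies are placeholders.

```python
"""Hypothetical skeleton for a new code/self_distillation module (illustrative only)."""
from dataclasses import dataclass

import torch


@dataclass
class Experience:
    """Tensors needed for one optimization step (field names are illustrative)."""
    logprobs: torch.Tensor
    teacher_logprobs: torch.Tensor
    action_mask: torch.Tensor
    rewards: torch.Tensor


class ReplayBuffer:
    """Holds Experience objects between the rollout and update phases."""

    def __init__(self) -> None:
        self._items: list[Experience] = []

    def extend(self, batch: list[Experience]) -> None:
        self._items.extend(batch)

    def drain(self) -> list[Experience]:
        items, self._items = self._items, []
        return items


class RolloutEngine:
    """Generates rollouts, decoupled from policy updates (per #385)."""

    def __init__(self, policy: torch.nn.Module) -> None:
        self.policy = policy

    def rollout(self) -> list[Experience]:
        raise NotImplementedError  # the generation loop would live here


def main() -> None:
    # train.py should read almost like pseudo-code:
    # build models -> generate rollouts -> fill the buffer -> update the policy,
    # with all tensor math (KL, log probs, masking) hidden in utils.py.
    ...
```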

I might be missing something, but I think that covers most of my points.

