Add SDPO to policy gradients #387

Draft
Shekswess wants to merge 1 commit into natolambert:main from Shekswess:code/sdpo-implementation

Conversation


@Shekswess commented Apr 25, 2026

Summary

Refs #360.

This PR adds SDPO (Self-Distillation Policy Optimization) to code/policy_gradients/, following the issue discussion that SDPO fits the online policy-gradient training loop better than direct_alignment/.

Changes include:

  • SDPOLoss, implemented as GRPO plus a token-level reverse-KL self-distillation term (a rough sketch of one way these pieces could fit together follows this list).
  • SDPO teacher-input construction from successful rollouts within each group.
  • Completion/log-prob extraction via action_mask, so variable-length generations do not accidentally include prompt tokens.
  • Optional SDPO stabilization knobs: sdpo_teacher_ema_rate and sdpo_is_clip.
  • policy_gradients/configs/sdpo.yaml, README/changelog updates, and inclusion in run_all_policy_gradients.sh.
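For intuition, here is a minimal sketch of what such a loss could look like. Everything below is an assumption for illustration: the tensor names, shapes, and the exact form of the GRPO term are not taken from this PR, and the sdpo_is_clip / sdpo_teacher_ema_rate knobs are wired in only schematically.

```python
import torch


def sdpo_loss(
    policy_logprobs: torch.Tensor,   # (B, T) log-probs of sampled tokens, current policy
    teacher_logprobs: torch.Tensor,  # (B, T) log-probs of the same tokens, teacher
    old_logprobs: torch.Tensor,      # (B, T) log-probs under the behavior policy
    advantages: torch.Tensor,        # (B,) group-normalized GRPO advantages
    action_mask: torch.Tensor,       # (B, T) 1 on completion tokens, 0 on prompt/pad
    kl_coef: float = 0.1,            # weight on the self-distillation term (assumed)
    sdpo_is_clip: float = 10.0,      # cap on importance ratios (assumed semantics)
) -> torch.Tensor:
    mask = action_mask.float()
    n_tokens = mask.sum().clamp(min=1.0)
    # Importance ratios against the behavior policy, clipped for stability.
    ratio = torch.exp(policy_logprobs - old_logprobs).clamp(max=sdpo_is_clip)
    # GRPO-style policy-gradient term, averaged over completion tokens only.
    pg_loss = -(ratio * advantages.unsqueeze(-1) * mask).sum() / n_tokens
    # Per-token reverse-KL estimate KL(policy || teacher) on the sampled tokens.
    rkl = ((policy_logprobs - teacher_logprobs) * mask).sum() / n_tokens
    return pg_loss + kl_coef * rkl


@torch.no_grad()
def update_teacher_ema(teacher, policy, sdpo_teacher_ema_rate: float = 0.99):
    # One plausible reading of the sdpo_teacher_ema_rate knob: the teacher
    # tracks the policy as an exponential moving average of its weights.
    for t, p in zip(teacher.parameters(), policy.parameters()):
        t.lerp_(p, 1.0 - sdpo_teacher_ema_rate)
```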

Validation

Local checks passed:

  • .venv/bin/python -m compileall policy_gradients (run from code/)
  • uvx ruff check code/policy_gradients
  • policy_gradients/configs/sdpo.yaml loads successfully with policy_gradients.config.load_config (sketch below)
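The config check boils down to something like the following; the module path and function name come from the check above, while the rest is just for illustration:

```python
from policy_gradients.config import load_config

# Smoke-test that the SDPO config parses; run from the code/ directory.
cfg = load_config("policy_gradients/configs/sdpo.yaml")
print(cfg)
```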

I also ran reduced local wandb probe runs in the sdpo-test project, since I do not have enough compute for the full reference run:

Final reduced-run summary: avg_reward=0.7708, loss=-0.2717, grad_norm=3.9375.

Maintainer Request

@natolambert, could you run the full policy_gradients/configs/sdpo.yaml reference job on proper compute and decide which official wandb run should go into the README table? I left the README status as "pending full validation" rather than treating my reduced local run as the canonical result.

@zafstojano (Collaborator)

Probably a good idea to first merge #385, since it decouples the rollouts from the policy updates - a breaking change.

@nathanlambert commented Apr 26, 2026


lgtm 👍

@natolambert (Owner)

> lgtm 👍

🫨🫨🤨🤨

@zafstojano (Collaborator)

#385 has been merged and introduced a fairly large refactoring.

I know #360 suggests incorporating SDPO under policy_gradients/, and while that is technically feasible, it would hurt readability.

If we want to compare the performance of distillation vs. RL, we should make an effort to log the same metrics to the same WandB project for a 1-to-1 comparison.

Apart from that, I think it would be beneficial to have a separate module in the code directory for self_distillation.

For example, trl also implements SDPO under a self_distillation module in the experimental directory:
https://github.com/huggingface/trl/tree/main/trl/experimental/self_distillation

With that said, I think we can look at how the new layout of the policy_gradients module organizes its functions and start a new module along the same lines. Roughly the same structure should work (a rough skeleton sketch follows the list):

  • train.py that should contain almost nothing beyond a main() method that reads almost like pseudo-code. Extreme focus on readability and conciseness.
  • utils.py that hides the ugliness of computing the tensors (KL, log probs, etc.), seeding functions, attention implementation, printing/logging, etc.
  • Experience and ReplayBuffer storing the tensors needed for optimization
  • RolloutEngine for generating rollouts
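To make the proposal concrete, here is a rough, hypothetical skeleton of that layout. Every name below simply mirrors the list above; none of it is an existing API, and the method bodies are placeholders.

```python
"""Hypothetical skeleton for a new code/self_distillation module (illustrative only)."""
from dataclasses import dataclass

import torch


@dataclass
class Experience:
    """Tensors needed for one optimization step (field names are illustrative)."""
    logprobs: torch.Tensor
    teacher_logprobs: torch.Tensor
    action_mask: torch.Tensor
    rewards: torch.Tensor


class ReplayBuffer:
    """Holds Experience objects between the rollout and update phases."""

    def __init__(self) -> None:
        self._items: list[Experience] = []

    def extend(self, batch: list[Experience]) -> None:
        self._items.extend(batch)

    def drain(self) -> list[Experience]:
        items, self._items = self._items, []
        return items


class RolloutEngine:
    """Generates rollouts, decoupled from policy updates (per #385)."""

    def __init__(self, policy: torch.nn.Module) -> None:
        self.policy = policy

    def rollout(self) -> list[Experience]:
        raise NotImplementedError  # the generation loop would live here


def main() -> None:
    # train.py should read almost like pseudo-code:
    # build models -> generate rollouts -> fill the buffer -> update the policy,
    # with all tensor math (KL, log probs, masking) hidden in utils.py.
    ...
```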

I might be missing something, but I think that covers most of my points.

