diff --git a/book/chapters/07-reasoning.md b/book/chapters/07-reasoning.md
index 49f5d673..6156e582 100644
--- a/book/chapters/07-reasoning.md
+++ b/book/chapters/07-reasoning.md
@@ -11,6 +11,7 @@ page-title: Reasoning
search-title: "Chapter 7: Reasoning"
next-chapter: "Direct Alignment"
next-url: "08-direct-alignment"
+lecture-label: "Lecture 5: Reasoning (Chap. 7)"
---
# Reasoning Training & Inference-Time Scaling
diff --git a/pyproject.toml b/pyproject.toml
index e16ccfef..064b8525 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -19,5 +19,6 @@ skills = [
]
teach = [
# TODO: pin back to a PyPI release when colloquium development slows down.
- "colloquium @ git+https://github.com/natolambert/colloquium.git",
+ # Testing PR #25 (animations) — revert to HEAD after merge
+ "colloquium @ git+https://github.com/natolambert/colloquium.git@refs/pull/25/head",
]
diff --git a/teach/course/lec5-chap7.md b/teach/course/lec5-chap7.md
new file mode 100644
index 00000000..74306176
--- /dev/null
+++ b/teach/course/lec5-chap7.md
@@ -0,0 +1,819 @@
+---
+title: "Lecture 5: Reasoning Training & Inference-Time Scaling"
+author: "Nathan Lambert"
+fonts:
+ heading: "Rubik"
+ body: "Poppins"
+bibliography: refs.bib
+figure_captions: true
+footer:
+ left: "rlhfbook.com"
+ center: "Lecture 5"
+ right: "Lambert {n}/{N}"
+custom_css: |
+ .slide--section-break { background: #F28482; }
+ :root {
+ --colloquium-progress-fill: #F28482;
+ }
+ .slide--title-sidebar h1 {
+ font-size: 2.5em;
+ letter-spacing: 0;
+ }
+---
+
+
+
+
+# Lecture 5: Reasoning Training & Inference-Time Scaling
+
+
+rlhfbook.com
+
+
+
+Course on RLHF and post-training. Chapter 7
+
+---
+
+
+## Lecture 5: Reasoning training & inference-time scaling
+
+
+
+```box
+title: Overview
+tone: muted
+compact: true
+content: |
+ 1. Introduction
+ 2. Key Related Works
+ 3. Training Overview
+```
+
+|||
+
+```box
+title: Core Training Pipeline
+tone: accent
+compact: true
+content: |
+ 4. Instruction Tuning
+ 5. Reward Models
+ 6. Reinforcement Learning
+ 7. **Reasoning**
+ 8. Direct Alignment
+ 9. Rejection Sampling
+```
+
+|||
+
+```box
+title: Data & Preferences
+tone: muted
+compact: true
+content: |
+ 10. What are Preferences
+ 11. Preference Data
+ 12. Synthetic Data & CAI
+```
+
+===
+
+
+
+```box
+title: Practical Considerations
+tone: muted
+compact: true
+content: |
+ 13. Tool Use
+ 14. Over-optimization
+ 15. Regularization
+ 16. Evaluation
+ 17. Product & Character
+```
+
+|||
+
+```box
+title: Appendices
+tone: muted
+compact: true
+content: |
+ A. Key Definitions
+ B. Style Benchmarks
+ C. References
+```
+
+|||
+
+```box
+title: Lectures
+tone: surface
+compact: true
+content: |
+ 1. Overview (Ch. 1-3)
+ 2. IFT, RM, RS (Ch. 4,5,9)
+ 3. RL Theory (Ch. 6 pt 1)
+ 4. RL Practice (Ch. 6 pt 2)
+ **5. Reasoning (Ch. 7)**
+```
+
+---
+
+## From RL to reasoning
+
+Lectures 3-4 covered the **math and implementation** of policy gradient RL for language models: PPO, GRPO, loss aggregation, async training.
+
+This lecture: **where those algorithms go when you scale them up on verifiable problems** -- and the wave of models that resulted.
+
+Two parts:
+
+1. **What changes in the recipe** -- implementation decisions that differ from standard RLHF RL
+2. **The reasoning model landscape** -- key 2025 models, grouped by what they teach
+
+---
+
+## What this lecture covers
+
+```box
+title: Lecture outline
+tone: accent
+content: |
+ 1. **RLVR foundations** -- RLHF vs RLVR, the feedback loop, key terminology
+ 2. **What changes in the recipe** -- difficulty filtering, KL removal, infrastructure shifts
+ 3. **The reasoning model landscape** -- case studies grouped by lesson
+ 4. **Looking ahead** -- where reasoning training is going
+```
+
+---
+
+## The LeCun cake
+
+At NeurIPS 2016, Yann LeCun introduced the cake metaphor:
+
+> If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning.
+
+With modern language models, the analogy is complete:
+
+- **Self-supervised learning** on internet data = the bulk of the cake
+- **Supervised fine-tuning** for instructions = the icing
+- **Reinforcement learning** (RLHF, then RLVR) = the cherry on top
+
+
+
+Reasoning models graduated RL from **cherry-on-top to a load-bearing component** of the training stack.
+
+---
+
+
+## RLHF vs RLVR: How reward changes everything
+
+**RLHF** -- subjective scoring:
+
+> *Explain opportunity cost in economics.*
+>
+> Scoring requires judging clarity, accuracy, completeness -- all learned preferences with no definitive answer.
+
+|||
+
+**RLVR** -- verifiable scoring:
+
+> *What is the sum of all primes < 20?*
+>
+> `extracted_answer == 77` → Reward = 1
+>
+> *Write `fib(n)` returning the nth Fibonacci number.*
+>
+> `assert fib(10) == 55` → All tests pass → Reward = 1
+
+Often no learned reward model is needed.
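+
+A minimal sketch of such a verification function (the boxed-answer extraction and the exact reward values are illustrative, not from any specific codebase):
+
+```python
+import re
+
+def extract_final_answer(completion: str) -> str | None:
+    """Pull the final boxed answer out of a completion, if present."""
+    match = re.search(r"\\boxed\{([^}]*)\}", completion)
+    return match.group(1).strip() if match else None
+
+def math_reward(completion: str, ground_truth: str) -> float:
+    """Binary verifiable reward: 1.0 iff the extracted answer matches."""
+    answer = extract_final_answer(completion)
+    return 1.0 if answer == ground_truth else 0.0
+
+# math_reward("... The answer is $\boxed{77}$.", "77") -> 1.0
+```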
+
+---
+
+## The RLVR feedback loop
+
+
+
+
+
+
+The RL algorithms (PPO, GRPO) are the same as in lectures 3-4. The key change: **reward comes from a verification function**, not a learned model.
+
+---
+
+## Key terms for this lecture
+
+
+
+**Evaluation metrics**:
+
+- **pass@1**: Accuracy on a single sample per problem
+- **pass@K**: Generate $K$ completions, report whether *any* is correct -- measures coverage
+- **IFEval**: Instruction-following evaluation benchmark
+- **GPQA**: Graduate-level QA benchmark for science reasoning
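+
+For reference, the commonly used unbiased pass@K estimator given $n$ samples with $c$ correct (a minimal sketch):
+
+```python
+from math import comb
+
+def pass_at_k(n: int, c: int, k: int) -> float:
+    """Unbiased estimate of pass@k from n samples, c of which are correct."""
+    if n - c < k:
+        return 1.0
+    return 1.0 - comb(n - c, k) / comb(n, k)
+
+# Example: n=16 samples, c=4 correct -> pass@1 = 0.25, pass@8 ~ 0.96
+```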
+
+|||
+
+**Algorithm / architecture terms**:
+
+- **DAPO**: Decoupled Clip and Dynamic sAmpling Policy Optimization -- a GRPO variant with relaxed, asymmetric clipping and dynamic sampling [@yu2025dapo]
+- **CISPO**: Clipped Importance-Sampling Policy Optimization -- clips the importance-sampling weights rather than the per-token updates
+- **MTP**: Multi-Token Prediction -- pretraining objective that predicts multiple future tokens
+
+---
+
+
+
+## What changes in the RL recipe
+
+---
+
+
+## Standard RLHF RL vs RLVR for reasoning
+
+**RLHF RL (lectures 3-4)**:
+
+- Reward model scores
+- KL penalty to reference model
+- Short completions (~100-500 tokens)
+- Hundreds of RL steps
+- PPO or GRPO
+
+|||
+
+**RLVR for reasoning**:
+
+- Verification function (binary or partial credit)
+- Often **no KL penalty**
+- Completions can be **1K-30K tokens**
+- Thousands of RL steps
+- GRPO variants dominate
+
+Same policy-gradient family -- different reward signal and systems regime.
+
+---
+
+## What thinking tokens look like
+
+
+
+**Standard model** answers directly:
+
+> *What is the sum of all prime numbers less than 20?*
+>
+> The primes less than 20 are 2, 3, 5, 7, 11, 13, 17, 19. Their sum is **77**.
+
+**Reasoning model** generates `<think>` tokens before answering:
+
+> `<think>` I need to find all primes less than 20. Starting from 2: 2 is prime. 3 is prime. 4 = 2×2, not prime. 5 is prime... Let me add them up: 2+3=5, 5+5=10, 10+7=17, 17+11=28, 28+13=41, 41+17=58, 58+19=77. `</think>`
+>
+> The answer is $\boxed{77}$.
+>
+> **Verification**: `extracted_answer == 77` → Reward = 1
+
+For harder problems, thinking can be **thousands of tokens**.
+
+---
+
+
+## RL training vs inference-time scaling
+
+
+
+
+
+|||
+
+
+
+Both axes show log-linear performance gains. RL training **shifts the curve**; inference-time scaling **moves along it**. They are complementary, not competing.
+
+---
+
+## Offline difficulty filtering
+
+The model can only learn from problems where there is a **gradient signal**.
+
+- If pass rate is **0%**: all completions fail → advantages are all equal → zero gradient
+- If pass rate is **100%**: all completions succeed → same problem
+- Sweet spot: **20-80% pass rate** per prompt
+
+Recipe: sample $N$ completions per prompt before training, keep prompts in the productive range.
+
+Used by Seed-Thinking 1.5 [@seed2025seed], Open-Reasoner-Zero [@hu2025openreasonerzero], Phi-4 [@abdin2025phi4], MiMo [@xia2025mimo], Skywork OR-1 [@he2025skyworkor1].
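+
+A minimal sketch of this offline filtering pass, assuming illustrative `sample_completions` and `verify` helpers:
+
+```python
+def filter_prompts(prompts, sample_completions, verify,
+                   n=16, min_rate=0.2, max_rate=0.8):
+    """Keep only prompts whose pre-RL pass rate leaves a gradient signal."""
+    kept = []
+    for prompt, ground_truth in prompts:
+        completions = sample_completions(prompt, n=n)
+        pass_rate = sum(verify(c, ground_truth) for c in completions) / n
+        if min_rate <= pass_rate <= max_rate:
+            kept.append((prompt, ground_truth))
+    return kept
+```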
+
+---
+
+## Online filtering and difficulty curriculum
+
+Offline filtering is a snapshot -- the model improves during training, shifting the difficulty distribution.
+
+Solutions:
+
+- **Per-batch online filtering**: Skip prompts that are now too easy or too hard
+- **Difficulty schedules**: Save harder problems for later in training
+- **Dynamic resampling**: Re-evaluate difficulty periodically
+
+Used by Kimi 1.5 [@team2025kimi], Magistral [@mistral2025magistral], Llama-Nemotron [@bercovich2025llamanemotron], MiMo [@xia2025mimo].
+
+---
+
+## Zero-gradient filtering in practice
+
+
+
+A more precise version used in OLMo 3 Think:
+
+Within each batch, skip any prompt group where **all** $G$ completions succeed **or** all fail.
+
+- Advantage = 0 for every completion in that group → zero gradient
+- "Free" -- no extra sampling needed, just discard before the gradient step
+
+Combined with **active sampling**: resample to fill the batch with non-zero-gradient groups, maintaining the target batch size.
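+
+In code, the per-group check reduces to a single line (sketch; `group_rewards` is the list of binary rewards for one prompt's $G$ completions):
+
+```python
+def has_gradient_signal(group_rewards: list[float]) -> bool:
+    """A prompt group only contributes gradient if its rewards are not all identical."""
+    return len(set(group_rewards)) > 1
+
+# Active sampling: keep drawing replacement prompt groups until the batch
+# is filled with groups that pass this check, then take the gradient step.
+```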
+
+---
+
+## Removing the KL penalty
+
+In RLHF (lectures 3-4): KL penalty prevents the policy from drifting too far from the reference model. **Essential** when reward models can be gamed.
+
+In RLVR: rewards are **ground truth** (not a learned proxy), so over-optimization is less of a risk.
+
+Removing KL allows the model to **explore more freely** during long training runs, discovering novel reasoning strategies the reference model never exhibited.
+
+Used by Magistral [@mistral2025magistral], Open-Reasoner-Zero [@hu2025openreasonerzero], Skywork OR-1 [@he2025skyworkor1].
+
+---
+
+## Relaxed and asymmetric clipping
+
+Standard PPO/GRPO uses symmetric clipping:
+
+$$\text{clip}(\rho_t, 1-\varepsilon, 1+\varepsilon)$$
+
+**DAPO** [@yu2025dapo] and related variants propose **asymmetric clipping** -- wider on the upside to encourage exploration of new reasoning behaviors.
+
+This matters more for reasoning because the action space is larger and the model needs to **discover** novel strategies, not just refine known ones.
+
+Used by Magistral [@mistral2025magistral], INTELLECT-2 [@primeintellectteam2025intellect2reasoningmodeltrained].
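+
+A per-token sketch of the decoupled clip (the $\varepsilon$ values are illustrative, not any paper's exact settings):
+
+```python
+import torch
+
+def asymmetric_clip_loss(ratio, advantage, eps_low=0.2, eps_high=0.28):
+    """PPO-style token loss with a wider upper clip bound, so positively
+    rewarded (exploratory) tokens can be pushed up more aggressively."""
+    unclipped = ratio * advantage
+    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantage
+    return -torch.minimum(unclipped, clipped)
+```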
+
+---
+
+## Format and language consistency rewards
+
+Beyond binary correctness, many models add small **auxiliary rewards**:
+
+**Format rewards**: Encourage `<think>...</think>` blocks before answers, penalize malformed reasoning blocks. Makes answer extraction, tooling, and distillation much easier.
+
+**Language consistency**: Penalize language switching mid-reasoning. Common in multilingual models where the model might reason in English but answer in Chinese (or vice versa).
+
+These are not about correctness -- they're about making reasoning **predictable and usable**.
+
+Used by DeepSeek R1 [@guo2025deepseek], Magistral [@mistral2025magistral], Skywork OR-1 [@he2025skyworkor1].
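+
+A minimal sketch of a format check (the tag names and the 0.1 bonus are illustrative):
+
+```python
+import re
+
+THINK_BLOCK = re.compile(r"^<think>.*?</think>", re.DOTALL)
+
+def format_reward(completion: str) -> float:
+    """Small auxiliary bonus when the completion opens with a single
+    well-formed <think>...</think> block before the final answer."""
+    return 0.1 if THINK_BLOCK.match(completion.strip()) else 0.0
+```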
+
+---
+
+## Length penalties and overthinking
+
+Without intervention, RL-trained models generate **longer and longer** reasoning traces. Not always useful -- "overthinking" wastes compute.
+
+Mitigation strategies:
+
+- **Progressive length extension** (Kimi 1.5 [@team2025kimi]): gradually increase the target length during training
+- **Small length penalty** (INTELLECT-2 [@primeintellectteam2025intellect2reasoningmodeltrained]): penalize excessive trace length throughout
+- **Overlong filtering**: discard completions that exceed a length threshold (also helps throughput)
+
+Goal: teach the model to reason **efficiently**, not just verbosely.
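+
+One simple way a soft length penalty could look (the budget and slope are illustrative):
+
+```python
+def length_penalty(num_tokens: int, budget: int = 8192, slope: float = 1e-4) -> float:
+    """Subtract a small amount of reward for every token beyond the budget."""
+    return -slope * max(0, num_tokens - budget)
+
+# Shaped reward, e.g.: verify(...) + format_reward(...) + length_penalty(len(tokens))
+```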
+
+---
+
+## Loss normalization: Group vs batch
+
+Recall from lecture 4: loss aggregation strategy matters.
+
+- **Standard GRPO**: normalizes advantages within each prompt group
+
+$$\hat{A}_i = \frac{R_i - \mu_G}{\sigma_G}$$
+
+- **Batch-level normalization**: normalizes across the entire batch -- avoids per-group biases when groups have very different difficulty levels
+- **Token-level vs sequence-level**: normalizing loss by total tokens across the batch reduces length bias (Dr. GRPO [@liu2025understanding])
+
+Used by Magistral [@mistral2025magistral], MiMo [@xia2025mimo].
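+
+A sketch of the two normalization choices over per-completion rewards (illustrative, not any particular codebase):
+
+```python
+import torch
+
+def advantages(rewards: torch.Tensor, group_ids: torch.Tensor, per_group: bool) -> torch.Tensor:
+    """rewards: [N] scalar rewards; group_ids: [N] prompt-group index per completion."""
+    if per_group:  # standard GRPO: normalize within each prompt group
+        adv = torch.empty_like(rewards)
+        for g in group_ids.unique():
+            mask = group_ids == g
+            adv[mask] = (rewards[mask] - rewards[mask].mean()) / (rewards[mask].std() + 1e-6)
+        return adv
+    # batch-level: one mean/std across the whole batch
+    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
+```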
+
+---
+
+
+## The infrastructure bottleneck
+
+
+
+Reasoning completions are **long and variable** in length.
+
+Result: inference (rollout generation) dominates training time.
+
+From OLMo 3:
+
+- Learner GPUs sit idle **~75%** of the time
+- **5-14x** more compute for inference than training
+- Static batching wastes **up to 54%** of compute
+
+|||
+
+
+
+---
+
+## Off-policy and asynchronous updates
+
+As completions get longer, synchronous rollout-then-train becomes **wasteful**.
+
+Moving to async:
+
+- **Actors** generate completions continuously
+- **Learner** consumes them as available
+- Trade-off: data is slightly stale (off-policy), but throughput increases dramatically
+
+Partial-to-full async used by Seed-Thinking 1.5 [@seed2025seed], INTELLECT-2 [@primeintellectteam2025intellect2reasoningmodeltrained], and others.
+
+This is the "algorithm to systems" shift -- **keeping the GPUs busy** matters as much as the loss function.
+
+---
+
+## Parallel test-time compute scaling
+
+Combining answers from multiple parallel rollouts improves over a single rollout.
+
+- **Majority voting**: Sample $N$, take the most common answer
+- **Scoring model**: Use a learned selector to pick the best answer
+- **Best-of-N**: Score with a reward model or verifier, take the highest
+
+pass@K measures this potential; pass@1 measures the deployed policy. The gap between them shows how much inference-time scaling can help.
+
+Used at inference by DeepSeek R1 [@guo2025deepseek], Phi-4 [@abdin2025phi4].
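+
+A minimal sketch of majority voting, assuming answers have already been extracted from each parallel completion:
+
+```python
+from collections import Counter
+
+def majority_vote(extracted_answers: list[str | None]) -> str | None:
+    """Return the most common non-null answer across N parallel samples."""
+    answers = [a for a in extracted_answers if a is not None]
+    return Counter(answers).most_common(1)[0][0] if answers else None
+```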
+
+---
+
+## Summary: RLVR recipe changes vs RLHF
+
+| Decision | RLHF RL (Lec 3-4) | RLVR for reasoning |
+|:---------|:-------------------|:-------------------|
+| Reward signal | Learned RM | Verification function |
+| KL penalty | Essential | Often removed |
+| Clipping | Symmetric | Asymmetric / relaxed |
+| Completion length | ~100-500 tokens | ~1K-30K tokens |
+| Difficulty filtering | Rarely | Standard practice |
+| Loss normalization | Per-group | Per-group or per-batch |
+| Training duration | ~100s of steps | ~1000s of steps |
+| Infrastructure | Synchronous OK | Async near-mandatory |
+
+---
+
+## Common failure modes
+
+
+
+- **No RL headroom**: Starting policy solves ~0% or ~100% of training problems → no gradient signal
+- **Over-specialization**: Single-domain RL improves one metric while harming adjacent behaviors
+- **Length pathologies**: Models overthink (wasting compute) or collapse to short answers
+- **Verifier bottlenecks**: Slow code execution or brittle test infrastructure caps experiment velocity
+- **Off-policy drift**: Asynchronous actors generate stale data, which requires in-flight update strategies
+- **Contamination**: Training prompts that overlap with eval benchmarks give false optimism
+
+---
+
+## Cross-model empirical findings
+
+Three results that appeared independently across multiple teams:
+
+- **Text-only reasoning boosts multimodal performance**: MiMo-VL and Magistral [@mistral2025magistral] found that text-only reasoning RL *after* multimodal training improves vision tasks
+- **Mixed-domain RL prevents over-optimization**: Training on math alone leads to degradation on general chat; mixing in code and instruction following is safer [@teamolmo2025olmo3]
+- **Midtraining determines RL ceiling**: How much math/code is in pretraining data sets the upper bound on what RL can achieve [@xia2025mimo]
+
+---
+
+
+
+## The reasoning model landscape
+
+---
+
+## The research that came before
+
+The ideas behind RLVR aren't new -- they were explored before o1/R1 made them mainstream:
+
+- **STaR** [@zelikman2022star] and **Quiet-STaR** [@Zelikman2024QuietSTaRLM]: self-taught reasoning with ground-truth rewards (2022-2024)
+- **TRICE** [@hoffman2023training]: MCMC-inspired optimization for reasoning traces
+- **VinePPO** [@VinePPO]: PPO with binary math rewards on GSM8K/MATH
+- **Tulu 3** [@lambert2024t]: PPO for math correctness while maintaining broad capabilities
+
+The difference: these weren't scaled up to nearly the same degree, or they sacrificed general performance for specialized gains. 2025 was about scale-up and synthesis, not spontaneous invention.
+
+---
+
+## Why does RL work now?
+
+
+
+- **Stability is much more tractable**: Still a first-class research problem (entropy collapse, long-horizon credit assignment), but tooling and recipes are mature enough for widespread adoption
+- **Open-source tooling**: TRL, Open Instruct [@lambert2024t], veRL [@sheng2024hybridflow], OpenRLHF [@hu2024openrlhf]
+- **Base models are good enough**: Multiple sources suggest RL reasoning training only became viable with models from ~2024 onwards -- a capability floor was needed
+- **Verifiable domains provide clean signal**: Math and code give unambiguous rewards, avoiding the reward hacking problems of RLHF
+
+---
+
+## How to read the landscape
+
+25+ reasoning model reports landed in 2025 alone. Rather than going chronologically, we group models by **what each one teaches us**:
+
+- **The pioneer** -- DeepSeek R1 cracked open the door
+- **The replicators** -- Open-Reasoner-Zero, Phi-4: is the recipe reproducible?
+- **End-to-end pipelines** -- MiMo, OLMo 3: pretraining → post-training as one system
+- **Toggleable reasoning** -- Llama-Nemotron, Qwen 3: reasoning as a product mode
+- **Stability engineering** -- Skywork OR-1: making long RL runs actually work
+
+---
+
+
+## DeepSeek R1: The catalyst
+
+
+
+The anchor release for the open reasoning wave.
+
+**R1-Zero**: Pure RL on a base model. No SFT warm-start. Showed that large-scale RL *alone* can induce chain-of-thought reasoning.
+
+**The full R1 recipe**: Cold-start SFT → large-scale RL → distillation of smaller models.
+
+Open weights, 671B MoE.
+
+|||
+
+
+
+---
+
+## DeepSeek R1: What it taught us
+
+
+
+**R1-Zero** proved that RL alone produces reasoning behavior:
+
+- Emergent self-verification and backtracking
+- Thinking tokens appear without being taught
+- Strong math/code gains
+
+
+
+But R1-Zero also had problems: language mixing mid-reasoning, poor formatting, inconsistent output structure.
+
+The full R1 recipe re-introduced cold-start SFT to fix these, then scaled RL further. Also released distilled smaller models -- distillation as an alternative path to RL.
+
+---
+
+## Open-Reasoner-Zero: The minimalist replication
+
+
+
+If DeepSeek R1 proved the concept, Open-Reasoner-Zero proved it was **reproducible**.
+
+- Fully open: model, data, and code
+- Vanilla PPO with GAE ($\lambda=1, \gamma=1$) and simple rule-based rewards
+- No KL penalty
+- Showed the recipe is not a DeepSeek-specific trick
+
+One of the clearest "minimalism wins" results. Start here if you want to understand the basic recipe.
+
+---
+
+## Phi-4: Small model, careful recipe
+
+
+
+14B parameters (Microsoft). Excels at STEM reasoning despite small size.
+
+Key lesson: **model quality and data curation can compensate for scale**.
+
+- Curated set of "teachable" prompts and synthetic reasoning demonstrations
+- Short phase of outcome-based RL after SFT
+- Uses offline difficulty filtering and majority voting at inference
+
+The best small-model argument in the reasoning table.
+
+---
+
+## MiMo: End-to-end reasoning pipeline
+
+
+
+Xiaomi controls the **entire pipeline** from pretraining through post-training.
+
+Key lesson: **pretraining data choices dramatically affect RL headroom**.
+
+- Three-stage data mixing during pretraining (25T tokens)
+- Multi-Token Prediction (MTP) during pretraining
+- Multi-domain RL to prevent over-optimization on a single task type
+
+"MiMo is the best rebuttal to the idea that reasoning is just a late-stage RL patch."
+
+---
+
+## Llama-Nemotron: Toggleable reasoning
+
+
+
+Multi-size models with a **system prompt toggle** for thinking on/off.
+
+- Not every query needs 10K thinking tokens
+- Open weights AND data
+- Uses online difficulty curriculum and length-controlled RL training
+
+The practical UX insight: reasoning should be a **dial, not a switch**.
+
+---
+
+## Toggleable reasoning is becoming standard
+
+Many models now support reasoning on/off:
+
+- **Llama-Nemotron** [@bercovich2025llamanemotron]: system prompt toggle
+- **Qwen 3** [@yang2025qwen3]: `/think` and `/no_think` modes + thinking budget
+- **K2-V2**: low / medium / high reasoning effort
+- **GLM-4.5**: thinking vs direct response modes
+
+Training this requires either length-controlled RL or multi-stage SFT with both thinking and non-thinking demonstrations.
+
+This is a **UX-driven training decision** -- not just about capability.
+
+---
+
+## Skywork OR-1: Fighting entropy collapse
+
+
+
+The best "stability and ablations" paper in the table.
+
+- Studies **entropy dynamics** during long-CoT RL training
+- Argues that avoiding premature entropy collapse is critical for final performance
+- Fully open: weights, data, AND code
+
+"The paper to cite when someone says the high-level recipe is enough by itself." Stability engineering matters as much as the algorithm.
+
+---
+
+
+## OLMo 3 Think: The fully open reasoning model
+
+
+
+The most comprehensive open documentation of a reasoning model lifecycle.
+
+Releases: stages, checkpoints, data, infrastructure, hyperparameters.
+
+"If you want to study how reasoning training actually works, this is the model."
+
+|||
+
+
+
+
+
+Key lessons: DPO is a better RL start than SFT alone. Mixed-domain RL prevents over-optimization. Zero-gradient filtering and active sampling are essential. Performance was still improving when the run ended.
+
+---
+
+## What the landscape tells us
+
+
+
+- **Algorithm is table stakes**: Most models use GRPO or close variants -- the differentiator is systems engineering and data
+- **Open weights is the norm**: Nearly all models release weights; open *process* (data, code, checkpoints) is rarer and more valuable
+- **Reasoning toggle is becoming standard**: Users and developers want controllable thinking, not always-on long CoT
+- **Agentic absorption**: Later models (Kimi K2, GLM-4.5, DeepSeek V3.2) blend reasoning with tool use and agentic behavior -- reasoning is becoming a substrate, not a product category
+
+---
+
+
+
+## Looking ahead
+
+---
+
+## The expanding scope of RLVR
+
+RLVR started with math and code because they have the **strongest automatic feedback loops**: symbolic equivalence, unit tests, compilation.
+
+It is expanding to:
+
+- **Precise instruction following**: Verifiable constraints (length, format, inclusion/exclusion rules)
+- **Agentic tasks**: Did the agent complete the task in the environment?
+- **Quality preservation**: LM-judge signals to maintain general capabilities during reasoning RL
+
+"The core to progress on RLVR is having a variety and depth of verifiable problems."
+
+---
+
+## Open questions
+
+- Is RL training **discovering** new capabilities, or **eliciting** what pretraining already learned?
+- How far can reasoning training go without better pretraining data?
+- Will agentic RL (tool use + reasoning) require fundamentally different recipes?
+- Can we systematically study the scaling properties of RL for reasoning? [@khatri2025art]
+
+---
+
+## Lecture summary
+
+1. **RLVR** -- verification functions replace reward models; same policy-gradient family, different signal and systems regime
+2. **Recipe changes** -- difficulty filtering, no KL, relaxed clipping, format rewards, async infrastructure
+3. **The landscape** -- 25+ models in 2025; DeepSeek R1 pioneered, the community rapidly iterated
+4. **Cross-cutting patterns** -- toggleable reasoning, algorithm-to-systems shift, open weights vs open process
+5. **The cake metaphor** -- RL moved from cherry on top to load-bearing component
+
+---
+
+
+## Resources
+
+
+
+```box
+title: Book & Course
+tone: accent
+compact: true
+content: |
+ - rlhfbook.com — Chapter 7
+ - Course slides & recordings
+ - GitHub: natolambert/rlhf-book
+```
+
+|||
+
+```box
+title: Key Papers
+tone: surface
+compact: true
+content: |
+ - DeepSeek R1
+ - OLMo 3 Think
+ - DAPO
+ - Tulu 3
+```
+
+===
+
+
+
+```box
+title: Codebases
+tone: surface
+compact: true
+content: |
+ - TRL (Hugging Face)
+ - Open Instruct (Ai2)
+ - veRL (Bytedance)
+ - OpenRLHF
+```
+
+|||
+
+```box
+title: Further Reading
+tone: surface
+compact: true
+content: |
+ - Skywork OR-1
+ - Magistral
+ - Open-Reasoner-Zero
+ - OpenThoughts
+```
+
+---
+
+## Course outline
+
+1. Introduction & Training Overview -- Chapters 1-3
+2. IFT, Reward Models, Rejection Sampling -- Chapters 4, 5, 9
+3. RL Theory -- Chapter 6 (Part 1)
+4. RL Implementation & Practice -- Chapter 6 (Part 2)
+5. **Reasoning -- Chapter 7**
+6. Direct Alignment Algorithms -- Chapter 8
+7. ...
+
+---
+
+
+## Thank you
+
+Questions and discussion welcome.
+
+**Nathan Lambert**
+
+rlhfbook.com | interconnects.ai
+
+===
+
+
diff --git a/teach/course/refs.bib b/teach/course/refs.bib
index 180f55ef..1c75b48f 100644
--- a/teach/course/refs.bib
+++ b/teach/course/refs.bib
@@ -681,3 +681,186 @@ @article{beukman2026preventing
journal={arXiv preprint arXiv:2603.06009},
year={2026}
}
+
+@article{hu2025openreasonerzero,
+ title={Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model},
+  author={Hu, Jingcheng and Zhang, Yinmin and Han, Qi and Jiang, Daxin and Zhang, Xiangyu and Shum, Heung-Yeung},
+ journal={arXiv preprint arXiv:2503.24290},
+ year={2025}
+}
+
+@article{abdin2025phi4,
+ title={Phi-4-Reasoning Technical Report},
+ author={Abdin, Marah and Agarwal, Sahaj and Awadallah, Ahmed and others},
+ journal={arXiv preprint arXiv:2504.21318},
+ year={2025}
+}
+
+@article{bercovich2025llamanemotron,
+  title={Llama-Nemotron: Efficient Reasoning Models},
+ author={Bercovich, Akhiad and Levy, Itay and Golan, Izik and others},
+ journal={arXiv preprint arXiv:2505.00949},
+ year={2025}
+}
+
+@article{xia2025mimo,
+ title={MiMo: Unlocking the Reasoning Potential of Language Model--From Pretraining to Posttraining},
+ author={Xia, Bingquan and Shen, Bowen and Zhu, Dawei and Zhang, Di and Wang, Gang and Zhang, Hailin and Liu, Huaqiu and Xiao, Jiebao and Dong, Jinhao and Zhao, Liang and others},
+ journal={arXiv preprint arXiv:2505.07608},
+ year={2025}
+}
+
+@article{yang2025qwen3,
+ title={Qwen3 technical report},
+ author={Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and others},
+ journal={arXiv preprint arXiv:2505.09388},
+ year={2025}
+}
+
+@article{he2025skyworkor1,
+ title={Skywork Open Reasoner 1 Technical Report},
+ author={He, Jujie and Liu, Jiacai and Liu, Chris Yuhao and others},
+ journal={arXiv preprint arXiv:2505.22312},
+ year={2025}
+}
+
+@techreport{mistral2025magistral,
+ title={Magistral: Scaling Reinforcement Learning for Reasoning in Large Language Models},
+ author={{Mistral AI}},
+ institution={Mistral AI},
+ year={2025},
+ month={June},
+ url={https://mistral.ai/static/research/magistral.pdf},
+ note={Technical report}
+}
+
+@article{team2025kimi,
+  title={Kimi k1.5: Scaling Reinforcement Learning with LLMs},
+ author={Team, Kimi and Du, Angang and Gao, Bofei and Xing, Bowei and Jiang, Changjiu and Chen, Cheng and Li, Cheng and Xiao, Chenjun and Du, Chenzhuang and Liao, Chonghua and others},
+ journal={arXiv preprint arXiv:2501.12599},
+ year={2025}
+}
+
+@misc{seed2025seed,
+ title={Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning},
+ author={ByteDance Seed and Jiaze Chen and Tiantian Fan and Xin Liu and Lingjun Liu and Zhiqi Lin and Mingxuan Wang and Chengyi Wang and Xiangpeng Wei and Wenyuan Xu and Yufeng Yuan and Yu Yue and Lin Yan and Qiying Yu and Xiaochen Zuo and Chi Zhang and Ruofei Zhu and Zhecheng An and Zhihao Bai and Yu Bao and Xingyan Bin and Jiangjie Chen and Feng Chen and Hongmin Chen and Riwei Chen and Liangqiang Chen and Zixin Chen and Jinsong Chen and Siyan Chen and Kaiyuan Chen and Zhi Chen and Jin Chen and Jiecao Chen and Jinxin Chi and Weinan Dai and Ning Dai and Jiahui Dai and Shihan Dou and Yantao Du and Zhengyin Du and Jianhui Duan and Chen Dun and Ting-Han Fan and Jiazhan Feng and Junda Feng and Ziyuan Feng and Yuwei Fu and Wenqi Fu and Hanjie Fu and Hao Ge and Hongyi Guo and Mingji Han and Li Han and Wenhao Hao and Xintong Hao and Qianyu He and Jerry He and Feng He and Wen Heng and Zehua Hong and Qi Hou and Liang Hu and Shengding Hu and Nan Hu and Kai Hua and Qi Huang and Ziyue Huang and Hongzhi Huang and Zihao Huang and Ting Huang and Wenhao Huang and Wei Jia and Bin Jia and Xiaoying Jia and Yuhua Jiang and Haobin Jiang and Ziheng Jiang and Kaihua Jiang and Chengquan Jiang and Jianpeng Jiao and Xiaoran Jin and Xing Jin and Xunhao Lai and Zheng Li and Xiang Li and Liyi Li and Hongkai Li and Zheng Li and Shengxian Wan and Ya Wang and Yunshui Li and Chenggang Li and Niuniu Li and Siyu Li and Xi Li and Xiao Li and Aoyan Li and Yuntao Li and Nianning Liang and Xinnian Liang and Haibin Lin and Weijian Lin and Ye Lin and Zhicheng Liu and Guanlin Liu and Guanlin Liu and Chenxiao Liu and Yan Liu and Gaohong Liu and Juncai Liu and Chundian Liu and Deyi Liu and Kaibo Liu and Siyao Liu and Qi Liu and Yongfei Liu and Kang Liu and Gan Liu and Boyi Liu and Rui Long and Chenwei Lou and Weiqiang Lou and Xiang Luo and Yao Luo and Caiping Lv and Heyang Lv and Bole Ma and Qianli Ma and Hongzhi Ma and Yiyuan Ma and Jin Ma and Wenchang Ma and Tingting Ma and Chen Mao and Qiyang Min and Zhe Nan and Guanghan Ning and Jinxiang Ou and Haojie Pan and Renming Pang and Yanghua Peng and Tao Peng and Lihua Qian and Lihua Qian and Mu Qiao and Meng Qu and Cheng Ren and Hongbin Ren and Yong Shan and Wei Shen and Ke Shen and Kai Shen and Guangming Sheng and Jinlong Shi and Wenlei Shi and Guang Shi and Shuai Shuai Cao and Yuxin Song and Zuquan Song and Jing Su and Yifan Sun and Tao Sun and Zewei Sun and Borui Wan and Zihan Wang and Xiaohui Wang and Xi Wang and Shuguang Wang and Jun Wang and Qinlong Wang and Chenyuan Wang and Shuai Wang and Zihan Wang and Changbao Wang and Jiaqiang Wang and Shihang Wang and Xuwu Wang and Zaiyuan Wang and Yuxuan Wang and Wenqi Wang and Taiqing Wang and Chengzhi Wei and Houmin Wei and Ziyun Wei and Shufa Wei and Zheng Wu and Yonghui Wu and Yangjun Wu and Bohong Wu and Shuang Wu and Jingqiao Wu and Ning Wu and Shuangzhi Wu and Jianmin Wu and Chenguang Xi and Fan Xia and Yuqiao Xian and Liang Xiang and Boren Xiang and Bowen Xiao and Zhen Xiao and Xia Xiao and Yongsheng Xiao and Chao Xin and Shulin Xin and Yuwen Xiong and Jingjing Xu and Ziwen Xu and Chenyin Xu and Jiayi Xu and Yifan Xu and Wei Xu and Yufei Xu and Shikun Xu and Shipeng Yan and Shen Yan and Qingping Yang and Xi Yang and Tianhao Yang and Yuehang Yang and Yuan Yang and Ximing Yang and Zeyu Yang and Guang Yang and Yifan Yang and Xuesong Yao and Bairen Yi and Fan Yin and Jianian Yin and Ziqiang Ying and Xiangyu Yu and Hongli Yu and Song Yu and Menghan Yu and Huan Yu and Siyu Yuan and Jun Yuan and Yutao Zeng and Tianyang Zhan and Zheng 
Zhang and Yun Zhang and Mofan Zhang and Wang Zhang and Ru Zhang and Zhi Zhang and Tianqi Zhang and Xinyi Zhang and Zhexi Zhang and Sijun Zhang and Wenqiang Zhang and Xiangxiang Zhang and Yongtao Zhang and Yuyu Zhang and Ge Zhang and He Zhang and Yue Zhang and Renjie Zheng and Ningxin Zheng and Zhuolin Zheng and Yaowei Zheng and Chen Zheng and Xiaoyun Zhi and Wanjun Zhong and Cheng Zhong and Zheng Zhong and Baoquan Zhong and Xun Zhou and Na Zhou and Huan Zhou and Hang Zhu and Defa Zhu and Wenjia Zhu and Lei Zuo},
+ year={2025},
+ eprint={2504.13914},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2504.13914}
+}
+
+@misc{primeintellectteam2025intellect2reasoningmodeltrained,
+ title={INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning},
+ author={Prime Intellect Team and Sami Jaghouar and Justus Mattern and Jack Min Ong and Jannik Straube and Manveer Basra and Aaron Pazdera and Kushal Thaman and Matthew Di Ferrante and Felix Gabriel and Fares Obeid and Kemal Erdem and Michael Keiblinger and Johannes Hagemann},
+ year={2025},
+ eprint={2505.07291},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG},
+ url={https://arxiv.org/abs/2505.07291},
+}
+
+@misc{deepseekai2025v32,
+  title={DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models},
+  author={{DeepSeek-AI}},
+  year={2025},
+  eprint={2512.02556},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2512.02556}
+}
+
+@inproceedings{zelikman2022star,
+  title={{ST}aR: Bootstrapping Reasoning With Reasoning},
+  author={Eric Zelikman and Yuhuai Wu and Jesse Mu and Noah Goodman},
+  booktitle={Advances in Neural Information Processing Systems},
+  editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
+  year={2022},
+  url={https://openreview.net/forum?id=_3ELRdg2sgI}
+}
+
+@inproceedings{Zelikman2024QuietSTaRLM,
+  title={Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking},
+  author={Eric Zelikman and Georges Harik and Yijia Shao and Varuna Jayasiri and Nick Haber and Noah D. Goodman},
+  booktitle={First Conference on Language Modeling (COLM)},
+  year={2024}
+}
+
+@inproceedings{hoffman2023training,
+  title={Training Chain-of-Thought via Latent-Variable Inference},
+  author={Matthew Douglas Hoffman and Du Phan and David Dohan and Sholto Douglas and Tuan Anh Le and Aaron T Parisi and Pavel Sountsov and Charles Sutton and Sharad Vikram and Rif A. Saurous},
+  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
+  year={2023},
+  url={https://openreview.net/forum?id=a147pIS2Co}
+}
+
+@misc{VinePPO,
+ title={VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment},
+ author={Amirhossein Kazemnejad and Milad Aghajohari and Eva Portelance and Alessandro Sordoni and Siva Reddy and Aaron Courville and Nicolas Le Roux},
+ year={2024},
+ eprint={2410.01679},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG},
+ url={https://arxiv.org/abs/2410.01679},
+}
+
+@article{muennighoff2025s1,
+ title={s1: Simple test-time scaling},
+ author={Muennighoff, Niklas and Yang, Zitong and Shi, Weijia and Li, Xiang Lisa and Fei-Fei, Li and Hajishirzi, Hannaneh and Zettlemoyer, Luke and Liang, Percy and Cand{\`e}s, Emmanuel and Hashimoto, Tatsunori},
+ journal={arXiv preprint arXiv:2501.19393},
+ year={2025}
+}
+
+@article{khatri2025art,
+  title={The art of scaling reinforcement learning compute for LLMs},
+ author={Khatri, Devvrit and Madaan, Lovish and Tiwari, Rishabh and Bansal, Rachit and Duvvuri, Sai Surya and Zaheer, Manzil and Dhillon, Inderjit S and Brandfonbrener, David and Agarwal, Rishabh},
+ journal={arXiv preprint arXiv:2510.13786},
+ year={2025}
+}
+
+@article{aggarwal2025l1,
+ title={L1: Controlling how long a reasoning model thinks with reinforcement learning},
+ author={Aggarwal, Pranjal and Welleck, Sean},
+ journal={arXiv preprint arXiv:2503.04697},
+ year={2025}
+}
+
+@misc{shao2025spurious,
+ title={Spurious Rewards: Rethinking Training Signals in RLVR},
+ author={Rulin Shao and Shuyue Stella Li and Rui Xin and Scott Geng and Yiping Wang and Sewoong Oh and Simon Shaolei Du and Nathan Lambert and Sewon Min and Ranjay Krishna and Yulia Tsvetkov and Hannaneh Hajishirzi and Pang Wei Koh and Luke Zettlemoyer},
+ year={2025},
+ howpublished={\url{https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking-Training-Signals-in-RLVR-1f4df34dac1880948858f95aeb88872f}},
+ note={Notion Blog}
+}
+
+@misc{anthropic2025claude4,
+ author = {{Anthropic}},
+ title = {Claude 4},
+ year = {2025},
+ month = {May},
+ url = {https://www.anthropic.com/news/claude-4},
+ note = {Accessed: 2025-06-13}
+}
+
+@article{liu2025inference,
+ title={Inference-Time Scaling for Generalist Reward Modeling},
+ author={Liu, Zijun and Wang, Peiyi and Xu, Runxin and Ma, Shirong and Ruan, Chong and Li, Peng and Liu, Yang and Wu, Yu},
+ journal={arXiv preprint arXiv:2504.02495},
+ year={2025}
+}
+
+@inproceedings{sheng2024hybridflow,
+ title = {HybridFlow: A Flexible and Efficient RLHF Framework},
+ booktitle = {European Conference on Computer Systems (EuroSys)},
+ author = {Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu},
+ year = {2025},
+}
+
+@article{hu2024openrlhf,
+ title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},
+ author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao},
+ journal={arXiv preprint arXiv:2405.11143},
+ year={2024}
+}