Draft: Lecture 5 — Reasoning Training & Inference-Time Scaling by natolambert · Pull Request #327 · natolambert/rlhf-book

natolambert · 2026-04-01T20:41:17Z

Summary

New lecture 5 covering Chapter 7 (Reasoning Training & Inference-Time Scaling), built with colloquium.

Changes

teach/course/lec5-chap7.md — ~53 slides organized as:
- Intro (7 slides): RLVR concept, LeCun cake metaphor, RLHF vs RLVR scoring, feedback loop
- Model landscape (20 slides): ~10 key models grouped by lesson (DeepSeek R1, Open-Reasoner-Zero, Phi-4, MiMo, Llama-Nemotron, Qwen 3, MiniMax-M1, Skywork OR-1, Magistral, OLMo 3 Think, DeepSeek V3.2), plus cross-model patterns and pre-o1 research context
- Recipe changes (16 slides): What differs from RLHF RL — difficulty filtering, no KL, relaxed clipping, format rewards, length penalties, loss normalization, async infrastructure, test-time scaling
- Outro (8 slides): Looking ahead, open questions, summary, resources
teach/course/refs.bib — 23 new bib entries for reasoning papers
book/chapters/07-reasoning.md — Added lecture-label metadata

🤖 Generated with Claude Code

New lecture covering the reasoning model landscape and RLVR implementation details that differ from standard RLHF RL.

- ~53 slides: intro/RLVR recap, model landscape (grouped by lesson), recipe changes (difficulty filtering, no KL, async infra, etc.), looking ahead
- Add 23 bib entries to teach/course/refs.bib for reasoning papers
- Add lecture-label metadata to chapter 7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Major restructure based on reviewer feedback:

- Reorder to method-first: RLVR foundations → recipe changes → model landscape (was landscape → recipe)
- Replace goldfish poem with math thinking-tokens example
- Tighten claims: "often no RM needed", "stability is much more tractable", "same policy-gradient family"
- Add glossary slide (pass@K, DAPO, CISPO, MTP, IFEval, GPQA)
- Add failure-modes slide (6 common failure patterns)
- Move pre-o1 research before model landscape
- Rename "model table" → "landscape"
- Compress landscape: cut 4 standalone model slides, mention inline

Replace all duplicated-slide reveals with colloquium PR #25 animations:

-  for incremental list reveals
-  for punchlines and progressive content

Point colloquium dep at PR #25 branch for testing animations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5cb43f063

chatgpt-codex-connector · 2026-04-01T22:09:35Z

    # TODO: pin back to a PyPI release when colloquium development slows down.
-    "colloquium @ git+https://github.com/natolambert/colloquium.git",
+    # Testing PR #25 (animations) — revert to HEAD after merge
+    "colloquium @ git+https://github.com/natolambert/colloquium.git@refs/pull/25/head",


Pin colloquium dependency to immutable revision

The teach extra now points to refs/pull/25/head, which is a mutable GitHub PR ref; if that PR branch is force-pushed, closed, or garbage-collected, pip/uv install .[teach] can fail or silently pull different code over time. This makes slide builds non-reproducible and can break onboarding/CI unexpectedly, so this should be pinned to a stable tag or commit SHA instead of a moving PR ref.
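
To make the reviewer's point concrete, here is a minimal sketch of a check that flags git requirements whose ref is not a full commit SHA. The helper names `git_ref` and `is_immutable_pin` are hypothetical and not part of this repository, and the 40-hex SHA used in the usage lines is a made-up placeholder, not a real colloquium commit. The check is deliberately stricter than the review comment: it accepts only full SHAs and treats tags as unpinned, since a tag can also be moved.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only; not part of rlhf-book or colloquium.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def git_ref(requirement: str) -> Optional[str]:
    """Return the ref of a 'name @ git+URL@ref' requirement.

    Returns None if the requirement is not a git URL, and "" if it is a
    git URL with no explicit ref (i.e. it tracks the default branch).
    """
    _, sep, url = requirement.partition(" @ ")
    url = url.strip()
    if not sep or not url.startswith("git+"):
        return None
    # Drop scheme and host so a 'user@host' part is not mistaken for a ref.
    _, _, after_scheme = url.partition("://")
    _, _, path = after_scheme.partition("/")
    _, _, ref = path.partition("@")
    # Strip any '#egg=...' style fragment.
    return ref.split("#", 1)[0]


def is_immutable_pin(requirement: str) -> bool:
    """True only when the git ref is a full 40-character commit SHA."""
    ref = git_ref(requirement)
    return bool(ref) and _FULL_SHA.fullmatch(ref) is not None


base = "colloquium @ git+https://github.com/natolambert/colloquium.git"
placeholder_sha = "0123456789abcdef0123456789abcdef01234567"  # placeholder, not a real commit

print(is_immutable_pin(base + "@refs/pull/25/head"))  # mutable PR ref
print(is_immutable_pin(base))                          # default branch
print(is_immutable_pin(base + "@" + placeholder_sha))  # full SHA
```

A check along these lines could run in CI over the dependency list, though whether that is worth adding here is a judgment call for the maintainers.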

github-actions Bot deployed to preview April 1, 2026 20:46 View deployment

chatgpt-codex-connector Bot reviewed Apr 1, 2026

github-actions Bot deployed to preview April 1, 2026 22:14 View deployment

natolambert marked this pull request as draft April 1, 2026 22:14

natolambert wants to merge 2 commits into main from lecture-5-reasoning
