Draft: Lecture 5 — Reasoning Training & Inference-Time Scaling #327

Draft
natolambert wants to merge 2 commits into main from lecture-5-reasoning

@natolambert
Owner

Summary

New lecture 5 covering Chapter 7 (Reasoning Training & Inference-Time Scaling), built with colloquium.

Changes

  • teach/course/lec5-chap7.md — ~53 slides organized as:
    • Intro (7 slides): RLVR concept, LeCun cake metaphor, RLHF vs RLVR scoring, feedback loop
    • Model landscape (20 slides): ~10 key models grouped by lesson (DeepSeek R1, Open-Reasoner-Zero, Phi-4, MiMo, Llama-Nemotron, Qwen 3, MiniMax-M1, Skywork OR-1, Magistral, OLMo 3 Think, DeepSeek V3.2), plus cross-model patterns and pre-o1 research context
    • Recipe changes (16 slides): What differs from RLHF RL — difficulty filtering, no KL, relaxed clipping, format rewards, length penalties, loss normalization, async infrastructure, test-time scaling
    • Outro (8 slides): Looking ahead, open questions, summary, resources
  • teach/course/refs.bib — 23 new bib entries for reasoning papers
  • book/chapters/07-reasoning.md — Added lecture-label metadata

🤖 Generated with Claude Code

New lecture covering the reasoning model landscape and RLVR
implementation details that differ from standard RLHF RL.

- ~53 slides: intro/RLVR recap, model landscape (grouped by lesson),
  recipe changes (difficulty filtering, no KL, async infra, etc.),
  looking ahead
- Add 23 bib entries to teach/course/refs.bib for reasoning papers
- Add lecture-label metadata to chapter 7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Major restructure based on reviewer feedback:

- Reorder to method-first: RLVR foundations → recipe changes →
  model landscape (was landscape → recipe)
- Replace goldfish poem with math thinking-tokens example
- Tighten claims: "often no RM needed", "stability is much more
  tractable", "same policy-gradient family"
- Add glossary slide (pass@K, DAPO, CISPO, MTP, IFEval, GPQA)
- Add failure-modes slide (6 common failure patterns)
- Move pre-o1 research before model landscape
- Rename "model table" → "landscape"
- Compress landscape: cut 4 standalone model slides, mention inline

Replace all duplicated-slide reveals with colloquium PR #25 animations:
- <!-- animate: bullets --> for incremental list reveals
- <!-- step --> for punchlines and progressive content

Point colloquium dep at PR #25 branch for testing animations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5cb43f063

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread on pyproject.toml:
# TODO: pin back to a PyPI release when colloquium development slows down.
"colloquium @ git+https://github.com/natolambert/colloquium.git",
# Testing PR #25 (animations) — revert to HEAD after merge
"colloquium @ git+https://github.com/natolambert/colloquium.git@refs/pull/25/head",

P1: Pin colloquium dependency to immutable revision

The teach extra now points to refs/pull/25/head, which is a mutable GitHub PR ref; if that PR branch is force-pushed, closed, or garbage-collected, pip/uv install .[teach] can fail or silently pull different code over time. This makes slide builds non-reproducible and can break onboarding/CI unexpectedly, so this should be pinned to a stable tag or commit SHA instead of a moving PR ref.
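
One way to catch this kind of moving ref before it lands is a small check on the dependency strings. The sketch below is an illustration only: `is_immutable_git_pin` is a hypothetical helper (not part of pip, uv, or colloquium), and it assumes requirements are written as `name @ git+URL@ref` and that only a full 40-character commit SHA counts as immutable. Tags are treated as mutable here, which is stricter than the suggestion above.

```python
import re

# A full 40-character hex string, i.e. a complete Git commit SHA-1.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_immutable_git_pin(requirement: str) -> bool:
    """Return True if a `name @ git+URL@ref` requirement pins a full commit SHA.

    Hypothetical helper for illustration; it is not an API of any packaging tool.
    """
    # Split "name @ url" into the package name and the direct-reference URL.
    _, sep, url = requirement.partition(" @ ")
    if not sep or not url.startswith("git+"):
        return False
    # Drop any "#fragment" (for example "#subdirectory=..."), then the scheme.
    url = url.split("#", 1)[0]
    path = url.split("://", 1)[-1]
    if "@" not in path:
        return False  # no ref at all: resolves to whatever the default branch is
    # Whatever follows the last "@" is the ref: branch, tag, PR ref, or SHA.
    ref = path.rsplit("@", 1)[1]
    return _FULL_SHA.fullmatch(ref.lower()) is not None


if __name__ == "__main__":
    base = "colloquium @ git+https://github.com/natolambert/colloquium.git"
    placeholder_sha = "0" * 40  # placeholder only, not a real colloquium commit
    for spec in (base, base + "@refs/pull/25/head", base + "@" + placeholder_sha):
        print(is_immutable_git_pin(spec), spec)
```

Run against the lines in this thread, the unpinned URL and the `refs/pull/25/head` ref are both rejected, and only a spec ending in a full commit SHA passes. A check like this could run in CI over the `teach` extra, under the assumption that the project wants every Git dependency pinned by SHA.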
