diff --git a/book/templates/course.html b/book/templates/course.html
index ca67700e..14ba6733 100644
--- a/book/templates/course.html
+++ b/book/templates/course.html
@@ -286,20 +286,6 @@

Welcome to the Course

Lectures

@@ -352,6 +338,17 @@

Lecture 4: RL Implementation & Practice

Source

Q&A 1: Reader questions

Covering lectures so far · Answering questions from the issue tracker and Discord

diff --git a/teach/course/qa-01.md b/teach/course/qa-01.md
new file mode 100644
index 00000000..dfb9edd9
--- /dev/null
+++ b/teach/course/qa-01.md
@@ -0,0 +1,179 @@
---
title: "Q&A 1: Reader questions"
author: "Nathan Lambert"
fonts:
  heading: "Rubik"
  body: "Poppins"
bibliography: refs.bib
figure_captions: true
footer:
  left: "rlhfbook.com/course"
  center: "Q&A 1"
  right: "Lambert {n}/{N}"
custom_css: |
  .slide--section-break { background: #F28482; }
  :root {
    --colloquium-progress-fill: #F28482;
  }
---

# Q&A 1: Reader questions from the series
rlhfbook.com/course

Nathan Lambert

Round 1

Spring 2026

Answering questions posted on the book issue tracker, Discord, and YouTube comments.

---

## Lecture 2 — IFT, reward models, & rejection sampling

---

## Q: One model vs. many for synthetic SFT data?

> "For synthetic data (for SFT), what's the theory backing whether one model is better versus more than one model to generate completions?"

Two angles to discuss:

- **Capability ceiling** — can a single strong teacher cover the distribution, or do you need diversity (ensembling teachers) to avoid stylistic mode collapse?
- **Diversity vs. consistency trade-off** — multi-teacher data adds coverage but mixes formatting and reasoning habits; single-teacher data is cleaner but narrower.

Not much formal theory here — mostly empirical. Relevant: Self-Instruct, Orca, Tulu / OLMoE data mixes.

---

## Q: Should the teacher's distribution match the base model?

> "Is there any consideration for choosing a teacher model that has a similar distribution as the base you're fine tuning?"

Tension:

- **Closer distribution** → SFT loss is lower, less "forgetting," easier to learn style. But the ceiling is bounded by the teacher.
- **Farther distribution** (e.g., distilling from a much stronger model) → bigger capability gains, but larger distribution shift ⇒ more forgetting and sometimes spurious style transfer.

Practical heuristic: the teacher should be **strictly stronger** on the target skill; distribution match matters more when you're worried about regressions on already-good behaviors.

---

## Q: Is there a unified RM benchmark across RM types?

> "I recall you said you worked on the RewardBench project… is there a benchmarking of the different RM models on the same labeled dataset? …we have a dataset with comparison generation info for pairs (prompt, chosen, rejected), and then the completions in the same dataset also have labels per token allowing to train with outcome reward model (ORM)."

What the reader is asking: can we compare Bradley-Terry pairwise RMs vs. ORMs vs. PRMs vs. generative RMs on **one dataset** that carries both preference pairs and per-token / outcome labels?

- Such a dataset is rare — most preference sets don't have step / outcome labels, and most ORM datasets don't have comparisons.
- PRM800K-style data is closest, but not directly comparable to preference datasets.
- This is a real open gap — worth discussing what the right evaluation object would look like.

---

## Q: How important is balancing domains in an SFT mix?

> "When composing dataset from different domains/tasks, how is it important to balance the proportion of data from each task/domain. I assume it's important to have them balanced or the training is robust enough to counter small imbalances."

Key points to cover:

- Mix ratios **do** matter, especially for smaller models and shorter training runs.
- Heavily overrepresented domains leak style/format into neighboring tasks.
- Upsampling small-but-important domains (e.g., safety, math) is standard.
- References: Tulu / OLMoE mix ablations, DeepSeek-R1 data composition notes.

---

## Lecture 3 — RL motivation & math

---

## Q: Is GAE the only way to compute the advantage in PPO?

> "Post lecture 3, I'm looking at the cheatsheet and am wondering if the advantage for PPO can be computed in multiple ways and GAE (Generalized Advantage Estimator) is one of the many ways. If not, can we update the formula for PPO where we can define how $A_{t}$ is computed…"

The PPO clipped objective (as written on the [cheatsheet](https://rlhfbook.com/rl-cheatsheet/)):

$$J(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t(\theta)\, \hat{A}_t,\; \text{clip}(\rho_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\, \hat{A}_t\right)\right]$$

treats $\hat{A}_t$ as a plug-in advantage estimate. GAE is one choice; Monte-Carlo returns, $n$-step TD, and group-relative baselines (GRPO-style) are others.
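To make the plug-in point concrete, here is a minimal sketch of three interchangeable choices for $\hat{A}_t$. It is ours, not the book's reference code; the function names and the zero-terminal-value assumption are illustrative only.

```python
import torch

def mc_advantage(rewards, values, gamma=1.0):
    """Monte-Carlo: discounted return-to-go minus the value baseline."""
    T = rewards.shape[0]
    returns = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns - values

def gae_advantage(rewards, values, gamma=1.0, lam=0.95):
    """GAE: exponentially weighted sum of one-step TD errors."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0   # assume V = 0 past the horizon
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def group_relative_advantage(group_rewards):
    """GRPO-style: score each completion's reward against its group."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
```

Whichever estimator you pick, its output is the $\hat{A}_t$ that multiplies $\rho_t(\theta)$ in the clipped objective above.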
---

## Q: Should the notation section define $T$, $K$, $G$?

> "Also, for the sake of completeness, we can add $T$, $K$, $G$ in the notation if that's helpful."

Symbols the reader is flagging:

- $T$ — trajectory / episode length (timestep horizon)
- $K$ — number of PPO epochs per batch of rollouts
- $G$ — group size for GRPO-style advantage estimation

Currently introduced inline in their respective chapters but not consolidated in the notation table.

---

## Q: What is the shape of $\rho$?

> "What is the shape of rho? Is it a vector with length of the action space (vocabulary)?"

Recall the importance sampling ratio at timestep $t$:

$$\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$$

- For a **single sampled action** $a_t$: $\rho_t$ is a **scalar** — a ratio of two probabilities, not a distribution over the vocabulary.
- Per rollout of length $T$: you get a **vector of length $T$** (one $\rho_t$ per timestep).
- Per batch of $B$ rollouts: a $[B, T]$ tensor.

The policies $\pi_\theta(\cdot \mid s_t)$ and $\pi_{\theta_\text{old}}(\cdot \mid s_t)$ are distributions over the vocabulary, but $\rho_t$ evaluates both at the same *realized* action. A tensor-shape sketch appears at the end of the deck.

---

## Q: Are slides 59–61 in Lecture 3 out of order?

> "Slide 59, 60, 61 seem to be in jumbled order. The flow if I understand right is: 'What can we do with more conservative gradients?' → introduces 2 problems; 'PPO core idea 1: constrained updates' → trust regions/clipping; 'PPO core idea 2: Importance sampling' → importance sampling."

Current order in `lec3-chap6-p1.md`:

1. **PPO core idea 1: constrained updates** — introduces trust regions
2. **What can we do with more conservative gradients?** — raises 2 follow-up problems (multi-step updates, off-policy data)
3. **PPO core idea 2: Importance sampling** — resolves the off-policy problem

Srinath's proposed order swaps (1) and (2). Worth discussing the intended narrative arc.

---

## More answers & follow-ups

Deeper slides and diagrams to be added as questions accumulate.
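---

## Follow-up sketch: the shape of $\rho_t$

A minimal PyTorch sketch of the shape question above; the toy dimensions are ours and chosen only for illustration:

```python
import torch

B, T, V = 2, 5, 32000                # rollouts per batch, tokens per rollout, vocab size
logits_new = torch.randn(B, T, V)    # current policy pi_theta
logits_old = torch.randn(B, T, V)    # frozen behavior policy pi_theta_old
actions = torch.randint(V, (B, T))   # realized tokens a_t

# Evaluate both distributions at the same sampled token, then exponentiate the log-ratio.
logp_new = logits_new.log_softmax(-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
logp_old = logits_old.log_softmax(-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
rho = (logp_new - logp_old).exp()

assert rho.shape == (B, T)           # one scalar ratio per sampled token, not a vector over V
```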