Commit ff1b062

Rename bitwise-exact-rl.md to bitwise-consistent-train-inference.md
Signed-off-by: Bram Wasti <[email protected]>
1 parent afa6b12 commit ff1b062

1 file changed (+3, -3)

_posts/2025-11-10-bitwise-exact-rl.md renamed to _posts/2025-11-10-bitwise-consistent-train-inference.md

Lines changed: 3 additions & 3 deletions
@@ -1,12 +1,12 @@
 ---
 layout: post
-title: "Bitwise-exact Batch Invariant On-Policy Reinforcement Learning with vLLM and TorchTitan"
+title: "No More Train/Inference Mismatch: Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan"
 author: "vLLM and TorchTitan Teams"
 ---
 
-In the septillions of flops used to pre-train models, this mismatch between values has largely been avoidable. Pre-training typically runs at a fixed batch size which induces the same reduction kernels to be run - often side-stepping the issue entirely.
+Across the septillions of FLOPs used in pre-training, numerical mismatches have had effectively imperceptible impact. Pre-training typically runs at a fixed batch size which induces the same reduction kernels to be run - often side-stepping the issue entirely.
 
-Reinforcement learning, on the other hand, seems to almost exclusively run different reduction algorithms due to its inference-heavy (and thus largely latency and memory-bound) nature. Kernels optimized for low-batch size inference typically run reductions all at once, whereas kernels for training models parallelize heavily to reuse data and amp up compute utilization. That means the generators and the trainers are typically running completely different kernels!
+Reinforcement learning, on the other hand, seems to almost exclusively run different reduction algorithms due to its inference-heavy (and thus largely latency and memory-bound) nature. Kernels optimized for low-batch size inference typically run reductions without tiling, whereas kernels for training models parallelize heavily to reuse data and amp up compute utilization. That means the generators and the trainers are typically running completely different kernels!
 
 So intuitively, why might this be an issue? A rudimentary explanation is that the training becomes implicitly “off-policy” because the outputs from the generator do not match the outputs a trainer might produce given the same inputs.
 
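The paragraphs in the diff hinge on a basic numerical fact: floating-point addition is not associative, so a kernel that accumulates a reduction in a single pass and a kernel that splits it into tiled partial sums can return bitwise-different results for identical inputs. The following pure-Python sketch illustrates that effect; it is not part of this commit, and the chunk size of 128 is an arbitrary choice for the example.

```python
# Illustrative sketch: the same values reduced in two different orders.
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

# "Flat" reduction: one left-to-right accumulation over the whole list,
# loosely analogous to an untiled low-batch inference kernel.
flat = sum(xs)

# "Tiled" reduction: partial sums over fixed-size chunks, then a reduction
# over the partials, loosely analogous to a heavily parallelized training kernel.
chunk = 128
partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
tiled = sum(partials)

print(flat == tiled)      # typically False: the results differ in the last bits
print(abs(flat - tiled))  # small, but nonzero, discrepancy
```

Even a last-bit discrepancy of this kind means the trainer's recomputed probabilities do not exactly match what the generator sampled from, which is the implicit off-policy drift the post describes.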
