Fix #31: route MTP verify through HSS orchestrator (PR #37 follow-up) #47
gbanyan wants to merge 4 commits into
Issue Avarok-Cybersecurity#31 was claimed fixed by PR Avarok-Cybersecurity#37 (chunked-prefill slide removal). On hardware, long prompts with --high-speed-swap + --speculative still produce silently wrong attention output: the "ZEBRA-1947-MOONFISH" needle returns "ZEBRA-1947-M" / "ZEBRA-1944".

Root cause: `decode_multi_seq` (the K-token verify entry point used by all `verify_a/b/c/c2/d` paths) calls the production paged-decode kernel directly, which reads K/V from `meta.block_table` (HBM only). Under HSS, HBM is capped at `cache_blocks_per_seq` blocks (~1024 tokens at default cap=64), so the verify attention sees only the recent ~cap×bs context and misses the long-context history that lives only on disk.

The single-token decode path in `decode/attention_forward.rs:424` *does* check `high_speed_swap_engaged` and routes through the HSS orchestrator (`attend_layer_on_stream`), which reads the full history from disk. The multi-Q tile kernel doesn't exist (Phase 6.2.b), and the trait signature for `decode_multi_seq` doesn't even pass `disk_block_ids` / `disk_last_offloaded_per_layer`, so routing through the orchestrator from the multi-seq path requires a sweeping signature change.

Surgical workaround: when HSS is engaged, fall back to `decode_batched` (which by default loops over N sequential single-token `decode` calls, each properly routed through the orchestrator). This mirrors what the SSM branch immediately below already does.

Cost: ~k× attention launches per verify step under HSS; correctness restored.
Differential test on Sehyo/Qwen3.5-122B-A10B-NVFP4 with --high-speed-swap --high-speed-swap-cache-blocks-per-seq 64 --speculative --num-drafts 1, 8569-token NIAH prompt, needle at chunk-3 boundary, temperature=0:

| Image               | Output                                  |
|---------------------|-----------------------------------------|
| main + HSS-on       | ZEBRA-1947-M / ZEBRA-1944 (wrong)       |
| this PR + HSS-on    | ZEBRA-1947-MOONFISH (4/4 deterministic) |
| any image + HSS-off | ZEBRA-1947-MOONFISH                     |
| any image + no-MTP  | ZEBRA-1947-MOONFISH                     |

Patches applied to all four verify modules (K=2/3/γ/4) for defense-in-depth even though only K=2 is exercised by --num-drafts 1.

Not fixed in this PR: `decode_b.rs:413` (multi-sequence batched decode+prefill path, max_batch_size > 1). Different fix shape: it needs a per-sequence loop. It doesn't fire at max_batch_size=1.

Closes Avarok-Cybersecurity#31 (the part PR Avarok-Cybersecurity#37 missed).
Latent bug surfaced while diagnosing Avarok-Cybersecurity#31. The offload's `start = last.min(total - 1)` heuristic was designed for decode ("re-offload the active block since new slots get written into it") and silently misses the analogous case during chunked prefill.

`reshape_and_cache_flash` writes only the chunk's own token slots, so when chunk N ended mid-block (chunk_size not a multiple of block_size) it left the boundary block's tail slots zero on disk after the post-chunk-N offload. Chunk N+1 fills those tail slots in HBM (its first tokens land in those slots), but the offload's `start = last` skipped re-pushing the boundary block, so disk's boundary-block tail stayed permanently zero. Decode reads the full history from disk via `attend_layer_on_stream`, so the zero slots silently corrupt attention for chunk-boundary positions.

Verified via instrumented build with per-block zero-slot count:

    chunk-1 offload (block 127): zero_slots_K=4/16 (slots 12..15)
    chunk-2 offload (block 127): zero_slots_K=0/16 (this fix)
    chunk-2 offload (block 191): zero_slots_K=8/16 (slots 8..15)
    chunk-3 offload (block 191): zero_slots_K=0/16 (this fix)

Empirically this fix alone does not change end-to-end output on the Avarok-Cybersecurity#31 differential test (the MTP-verify routing fix in the previous commit dominates: disk content is correct without this fix once verify reads the right blocks). But the partial-block-on-disk state is still incorrect, so it is worth fixing in case any other code path relies on that block being intact (or future changes shift sensitivity).

Cost: ~one extra D2H per layer per chunk transition. Negligible.
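The boundary-block arithmetic above can be checked with a small sketch. `unwritten_tail` is a hypothetical helper, not a function from this PR, and the token counts used in the note below (2044 and 3064) are inferred from the instrumented block/slot numbers with an assumed block size of 16; they are not stated in the commit.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of the PR): given the number of tokens
/// written to the KV cache so far, return the index of the partially filled
/// boundary block and the range of slots in it that are still unwritten.
/// Returns `None` when the write ended exactly on a block boundary.
pub fn unwritten_tail(tokens_written: usize, block_size: usize) -> Option<(usize, Range<usize>)> {
    assert!(block_size > 0, "block_size must be non-zero");
    let filled = tokens_written % block_size;
    if filled == 0 {
        // Block-aligned: no boundary block is left half-written.
        return None;
    }
    Some((tokens_written / block_size, filled..block_size))
}

/// A chunk transition needs the boundary block pushed to disk again exactly
/// when the previous chunk left a tail unwritten (assumption of this sketch).
pub fn needs_repush(prev_tokens_written: usize, block_size: usize) -> bool {
    unwritten_tail(prev_tokens_written, block_size).is_some()
}
```

Under those assumptions, `unwritten_tail(2044, 16)` yields block 127 with slots 12..16 unwritten and `unwritten_tail(3064, 16)` yields block 191 with slots 8..16 unwritten, which lines up with the `zero_slots_K=4/16` and `zero_slots_K=8/16` readings.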
All contributors have signed the CLA. Thank you!
tbraun96 left a comment
Thanks for the fix! The code looks good to me; the comments will be helpful moving forward. Just have to satisfy the CLA gate to become a contributor, then we can merge.
I have read the CLA Document and I hereby sign the CLA
Throughput follow-up (separate from the correctness fix)

After the fix verified correct on the 8569-token NIAH, I bumped

Observation

At 131K config, HSS engaged, MTP-2 enabled, ~13K-token tools-active client request:

Atlas's prior reported numbers (49.7 tok/s at 131K with HSS+MTP from earlier trials) appear to have come from the buggy code path this PR fixes

Where the cost goes (per attention call in the orchestrator)

Scratch pool seems undersized for long context

Startup log shows:

At 131K context with NVFP4: ~6250 blocks per layer × 12 attention layers ≈ 75 000 (layer, block) pairs of working set. The 8192-block resident pool can hold ~11% of that. Each decode step's per-layer attention thrashes through cycle-by-cycle disk I/O: the same blocks are read, evicted, and re-read across layers within a single step. So disk bandwidth isn't the floor; cache thrashing is. I'm going to dig into whether the pool size is exposed via CLI / env (Atlas docs mention

Other low-hanging items I noticed (non-blocking, just listing)

Happy to test any of the above on the same 8569-token NIAH config we have nailed down. cc @tbraun96 since you wrote the prefill-side companion path in #37.
@gbanyan thanks for the detailed write-up — independently arrived at the same three items you flagged after live-profiling MiniMax-M2.7-NVFP4 EP=2 ( The 16× HSS slowdown reproduces on a much smaller workload too: 143-token, no-tools reply on M2.7-NVFP4 lands at 1.8 tok/s with Pushed
These don't address your scratch-pool sizing point; agreed that's the highest-leverage knob for the 131K case (75K working-set vs 8192 scratch ≈ 11% coverage = thrash). Live-tested commits 165fdac / 9153e0a / ff0011b on the M2.7 EP=2 short-prompt config: bench-loop deltas are within noise on this short-context workload, and the dominant per-token cost is the sync

Image with all three:
Update: scratch-pool sizing isn't the bottleneck

Tested
Within noise — bigger pool doesn't help. Reason: at 8.5K context the working set is 536 blocks/layer × 12 layers = 6432 (layer, block) pairs, which already fits in the default 8192-block pool. There's no thrashing to fix at this scale. The real per-call cost is inside
So the throughput floor is per-call latency × launches, not disk bandwidth or pool capacity. Three optimization angles, in rough priority:
For our deployment we're rolling back to vLLM for now and keeping #47 open to track upstream throughput work. Happy to retest any optimization on the same NIAH config.
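The working-set figures quoted in this thread can be reproduced with a short sketch. The function names are hypothetical, and the block size of 16 is an assumption taken from the `zero_slots_K=.../16` instrumentation rather than from any configuration shown here.

```rust
/// Number of KV blocks needed to hold `tokens` tokens (rounds up).
pub fn blocks_for(tokens: usize, block_size: usize) -> usize {
    assert!(block_size > 0, "block_size must be non-zero");
    (tokens + block_size - 1) / block_size
}

/// (layer, block) pairs touched by one decode step across attention layers.
pub fn working_set(blocks_per_layer: usize, attention_layers: usize) -> usize {
    blocks_per_layer * attention_layers
}

/// Fraction of the working set that a resident pool can hold, capped at 1.0.
pub fn coverage(pool_blocks: usize, working_set: usize) -> f64 {
    if working_set == 0 {
        return 1.0;
    }
    (pool_blocks as f64 / working_set as f64).min(1.0)
}
```

With these helpers, `working_set(blocks_for(8569, 16), 12)` gives 6432 pairs, which an 8192-block pool covers fully, while `working_set(6250, 12)` gives 75000 pairs, of which the same pool covers about 11%. That matches both observations: no eviction at 8.5K context, heavy eviction at 131K.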
@tbraun96: apologies, my "scratch-pool isn't the bottleneck" follow-up just above was posted without reading your reply first; my bad on the comment crossover. After re-reading, our findings line up rather than disagree, but I want to be explicit about what we did vs didn't claim so the thread isn't confusing for future readers:

What we measured was specifically a 4× pool bump (8192 → 32768) on an 8569-token prompt, a context where the working set already fit in the default pool. So the "no thrashing to fix at this scale" framing is correct narrowly: at that context size, scaling the pool can't do useful work because no eviction fires. That's exactly the case your perf-branch commit #3 (skip predictor when

What we missed was your offload-path sync

Thanks again for the detailed write-up and the debug image, really appreciated.
…follow-up)
The HSS predictor's A_g buffer is sized:
A_g = num_layers × max_blocks × num_kv_heads × block_size × r × 2
With the default r=32 and a max-seq-len-sized block pool, A_g can hit
2+ GB and OOM at install time. The MiniMax-M2.7-NVFP4 EP=2 head at
--max-seq-len 65536 reliably fails with:
--high-speed-swap install failed: cuMemAlloc_v2(2080882688) failed: 2
The user got nothing actionable from that — same error regardless of
--high-speed-swap-resident-blocks or any other knob, because the
A_g allocation is sized by max_blocks not resident_blocks.
This change:
1. Adds `cuMemGetInfo_v2` FFI in cuda_min.rs + a `mem_info()` helper.
2. In `Predictor::new_on_stream`, preflights the A_g size against
95% of free HBM before calling `cuMemAlloc_v2`. On failure, bails
with an actionable error that names the four knobs the user can
actually turn:
- --high-speed-swap-rank: halves A_g
- --max-seq-len: shrinks the block pool
- --kv-cache-dtype nvfp4: halves KV-pool footprint
- --gpu-memory-utilization: leave more HBM for HSS scratch
and shows the A_g sizing formula so they can do the arithmetic
themselves.
3. Allocation behaviour on the happy path is unchanged — only the
error path is improved.
Verified on the same EP=2 setup that failed previously:
rank=32 → A_g 1.94 GB needed, 0.70 GB free → preflight tells user
'try --high-speed-swap-rank 16'.
rank=16 → A_g 0.97 GB needed, 0.86 GB free → preflight tells user
'try --high-speed-swap-rank 8'.
rank=8 → A_g 0.49 GB → installs cleanly → server serves end-to-end
(MiniMax-M2.7-NVFP4 EP=2, 64K context, fp8 KV, prefix
cache, HSS-engaged).
So the preflight is doing both jobs: failing fast with a fix, and the
fix it suggests actually works.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
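The A_g sizing formula and the 95%-of-free-HBM preflight described in this commit can be sketched as follows. This is a minimal illustration, not the code in `Predictor::new_on_stream`: the type and function names are hypothetical, the error text is only modelled on the description, and the shape used in the note below (62 layers, 4097 blocks, 8 KV heads, block size 16) is merely one combination that multiplies out to the logged 2080882688 bytes, not values stated in the source.

```rust
/// Hypothetical container for the terms of the A_g sizing formula:
/// A_g = num_layers × max_blocks × num_kv_heads × block_size × r × 2
#[derive(Clone, Copy, Debug)]
pub struct AgShape {
    pub num_layers: u64,
    pub max_blocks: u64,
    pub num_kv_heads: u64,
    pub block_size: u64,
    pub rank: u64,
}

impl AgShape {
    /// Size of the A_g buffer in bytes, following the formula term by term.
    pub fn bytes(&self) -> u64 {
        self.num_layers * self.max_blocks * self.num_kv_heads * self.block_size * self.rank * 2
    }
}

/// Preflight check: succeed only if `needed` fits within 95% of `free`.
/// On failure, return a message naming the knobs listed in the commit.
pub fn preflight(needed: u64, free: u64) -> Result<(), String> {
    let budget = (free as u128) * 95 / 100;
    if (needed as u128) <= budget {
        return Ok(());
    }
    Err(format!(
        "A_g needs {needed} bytes but only {budget} of {free} free bytes are usable; \
         try lowering --high-speed-swap-rank or --max-seq-len, using \
         --kv-cache-dtype nvfp4, or adjusting --gpu-memory-utilization"
    ))
}
```

Under the assumed shape, rank 32 yields 2080882688 bytes, and halving the rank to 16 and then 8 yields 1040441344 and 520220672 bytes, which is the halving behaviour the `--high-speed-swap-rank` hint relies on.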
Summary
Issue #31 (HSS + chunked-prefill silent output corruption on long prompts) was claimed fixed by PR #37, but on hardware the bug still reproduces with `--high-speed-swap` + `--speculative` + long prompts. The root cause is a different missing HSS orchestrator routing: it is in the K-token verify path, not the prefill path #37 patched.

Repro
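Sehyo/Qwen3.5-122B-A10B-NVFP4 on a single GB10, `:3.0.0` image semantics:

Send an 8569-token NIAH prompt with the needle `ZEBRA-1947-MOONFISH` placed near a chunk-3 boundary (~position 4084), `temperature=0`:

| Image               | Output                                       |
|---------------------|----------------------------------------------|
| main + HSS-on       | ZEBRA-1947-M / ZEBRA-1944 (non-deterministic) |
| this PR + HSS-on    | ZEBRA-1947-MOONFISH (deterministic)          |
| any image + HSS-off | ZEBRA-1947-MOONFISH                          |
| any image + no-MTP  | ZEBRA-1947-MOONFISH                          |

Notice the matrix isolates the bug to HSS on × MTP on.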
Root cause
`decode_multi_seq` (the K-token verify entry used by `verify_a/b/c/c2/d`) calls the production paged-decode kernel directly, which reads K/V from `meta.block_table` (HBM). Under HSS, HBM is capped at `cache_blocks_per_seq × block_size` tokens (~1024 at default cap=64), so verify attention sees only the recent context window and misses the long-context history that lives only on disk.

The single-token decode at `decode/attention_forward.rs:424` does check `high_speed_swap_engaged` and routes through the orchestrator (`attend_layer_on_stream`). The multi-Q tile kernel doesn't exist (Phase 6.2.b), and the trait signature for `decode_multi_seq` doesn't even pass `disk_block_ids` / `disk_last_offloaded_per_layer`, so routing through the orchestrator from the multi-seq path needs a sweeping signature change.

Fix (commit 1)
Surgical workaround: when HSS is engaged, fall back to `decode_batched` (which by default loops over N sequential single-token `decode` calls, each properly routed through the orchestrator). This mirrors what the SSM branch immediately below already does for the same correctness reason.

Cost: ~k× attention launches per verify step under HSS. Functional correctness restored.
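The shape of that fallback can be sketched with a toy trait. The method names `decode`, `decode_multi_seq`, and `decode_batched` and the `high_speed_swap_engaged` flag come from the description above; the signatures, the `AttentionBackend` trait, the `CountingBackend` mock, and `verify_attention` are all hypothetical simplifications and do not reflect the real trait in the repository.

```rust
/// Hypothetical, heavily simplified stand-in for the attention backend.
pub trait AttentionBackend {
    /// Single-token decode; in the real code this path is HSS-aware.
    fn decode(&mut self, token: u32) -> u32;
    /// Fused K-token path; in the real code this reads HBM only.
    fn decode_multi_seq(&mut self, tokens: &[u32]) -> Vec<u32>;
    /// Default batched path: N sequential single-token decodes.
    fn decode_batched(&mut self, tokens: &[u32]) -> Vec<u32> {
        tokens.iter().map(|&t| self.decode(t)).collect()
    }
}

/// Mock backend that only counts kernel launches per path.
#[derive(Default)]
pub struct CountingBackend {
    pub single_launches: usize,
    pub multi_launches: usize,
}

impl AttentionBackend for CountingBackend {
    fn decode(&mut self, token: u32) -> u32 {
        self.single_launches += 1;
        token
    }
    fn decode_multi_seq(&mut self, tokens: &[u32]) -> Vec<u32> {
        self.multi_launches += 1;
        tokens.to_vec()
    }
}

/// Verify-step dispatch: under HSS, trade one fused launch for k
/// single-token launches so every query goes through the HSS-aware path.
pub fn verify_attention<B: AttentionBackend>(
    backend: &mut B,
    tokens: &[u32],
    high_speed_swap_engaged: bool,
) -> Vec<u32> {
    if high_speed_swap_engaged {
        backend.decode_batched(tokens)
    } else {
        backend.decode_multi_seq(tokens)
    }
}
```

In this toy model a K=2 verify with HSS engaged records two single-token launches and no fused launch, which is the ~k× launch cost described above; with HSS off the fused path is used unchanged.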
Patches applied to all four verify modules (K=2/3/γ/4) for defense-in-depth even though only K=2 is exercised by `--num-drafts 1`.

Bonus fix (commit 2)

While instrumenting offload state I found a separate latent bug. The offload's `start = last.min(total - 1)` heuristic was designed for decode ("re-offload the active block since new slots get written into it") and silently misses the analogous case during chunked prefill where the boundary block's tail slots get filled by the next chunk.

Verified via instrumentation:
This fix alone doesn't change end-to-end output on #31 (the verify-routing fix dominates), but the partial-block-on-disk state was wrong and worth fixing.
Out of scope
`decode_b.rs:413` has the same `decode_multi_seq` pattern but for multi-sequence batched decode+prefill. Different fix shape: it needs a per-sequence loop. It doesn't fire at `max_batch_size=1`, so it didn't affect the repro. Worth a follow-up PR.

Test plan

- `ZEBRA-1947-MOONFISH` returned 4/4 deterministic runs
- `:debug` image before merge: recommended given the history of PR #37 ("Fix #31 (real): ensure_blocks_through_prefill must NEVER slide"), which was merged unverified. I'd suggest the same 8569-token NIAH I used; happy to share the exact prompt construction if useful.

Closes

#31 (the part PR #37 missed)
🤖 Generated with Claude Code