Skip to content

Fix #31: route MTP verify through HSS orchestrator (PR #37 follow-up)#47

Open
gbanyan wants to merge 4 commits into
Avarok-Cybersecurity:mainfrom
gbanyan:fix/hss-mtp-verify-routing
Open

Fix #31: route MTP verify through HSS orchestrator (PR #37 follow-up)#47
gbanyan wants to merge 4 commits into
Avarok-Cybersecurity:mainfrom
gbanyan:fix/hss-mtp-verify-routing

Conversation

@gbanyan
Copy link
Copy Markdown

@gbanyan gbanyan commented May 9, 2026

Summary

Issue #31 (HSS + chunked-prefill silent output corruption on long prompts) was claimed fixed by PR #37, but on hardware the bug still reproduces with --high-speed-swap + --speculative + long prompts. Root cause is a different missing HSS orchestrator routing — in the K-token verify path, not the prefill path #37 patched.

Repro

Sehyo/Qwen3.5-122B-A10B-NVFP4 on a single GB10, :3.0.0 image semantics:

spark serve ... \
  --high-speed-swap \
  --high-speed-swap-cache-blocks-per-seq 64 \
  --max-seq-len 16384 \
  --speculative --num-drafts 1 \
  --kv-cache-dtype nvfp4 --kv-high-precision-layers auto

Send an 8569-token NIAH prompt with the needle ZEBRA-1947-MOONFISH placed near a chunk-3 boundary (~position 4084), temperature=0:

Configuration Output
main + HSS on + MTP on (3 runs) ZEBRA-1947-M / ZEBRA-1944 (non-deterministic)
this PR + HSS on + MTP on (4 runs) ZEBRA-1947-MOONFISH (deterministic)
any image + HSS off + MTP on ZEBRA-1947-MOONFISH
any image + HSS on + MTP off ZEBRA-1947-MOONFISH

Notice the matrix isolates the bug to HSS on × MTP on.

Root cause

decode_multi_seq (the K-token verify entry used by verify_a/b/c/c2/d) calls the production paged-decode kernel directly, which reads K/V from meta.block_table (HBM). Under HSS, HBM is capped at cache_blocks_per_seq × block_size tokens (~1024 at default cap=64), so verify attention sees only the recent context window and misses the long-context history that lives only on disk.

The single-token decode at decode/attention_forward.rs:424 does check high_speed_swap_engaged and routes through the orchestrator (attend_layer_on_stream). The multi-Q tile kernel doesn't exist (Phase 6.2.b), and the trait signature for decode_multi_seq doesn't even pass disk_block_ids / disk_last_offloaded_per_layer, so routing through the orchestrator from the multi-seq path needs a sweeping signature change.

Fix (commit 1)

Surgical workaround: when HSS is engaged, fall back to decode_batched (which by default loops over N sequential single-token decode calls — each properly routed through the orchestrator). This mirrors what the SSM branch immediately below already does for the same correctness reason.

Cost: ~k× attention launches per verify step under HSS. Functional correctness restored.

Patches applied to all four verify modules (K=2/3/γ/4) for defense-in-depth even though only K=2 is exercised by --num-drafts 1.

Bonus fix (commit 2)

While instrumenting offload state I found a separate latent bug. The offload's start = last.min(total - 1) heuristic was designed for decode ("re-offload the active block since new slots get written into it") and silently misses the analogous case during chunked prefill where the boundary block's tail slots get filled by the next chunk.

Verified via instrumentation:

chunk-1 offload (block 127): zero_slots_K=4/16  (slots 12..15)
chunk-2 offload (block 127): zero_slots_K=0/16  (after fix — re-offloaded)
chunk-2 offload (block 191): zero_slots_K=8/16  (slots 8..15)
chunk-3 offload (block 191): zero_slots_K=0/16  (after fix)

This fix alone doesn't change end-to-end output on #31 (the verify-routing fix dominates), but the partial-block-on-disk state was wrong and worth fixing.

Out of scope

decode_b.rs:413 has the same decode_multi_seq pattern but for multi-sequence batched decode+prefill. Different fix shape — needs per-sequence loop. Doesn't fire at max_batch_size=1 so didn't affect the repro. Worth a follow-up PR.

Test plan

  • 8569-token NIAH, needle at chunk-3 boundary: ZEBRA-1947-MOONFISH returned 4/4 deterministic runs
  • 8569-token NIAH, needle at chunk-1 boundary (position ~2046): correct
  • HSS off + MTP on: still correct (no regression on non-HSS path)
  • Author verification on a :debug image before merge — recommended given PR Fix #31 (real): ensure_blocks_through_prefill must NEVER slide #37's history (merged unverified). I'd suggest the same 8569-token NIAH I used; happy to share the exact prompt construction if useful.

Closes

#31 (the part PR #37 missed)

🤖 Generated with Claude Code

gbanyan added 2 commits May 10, 2026 01:17
Issue Avarok-Cybersecurity#31 was claimed fixed by PR Avarok-Cybersecurity#37 (chunked-prefill slide
removal). On hardware, long prompts with --high-speed-swap +
--speculative still produce silently-wrong attention output —
"ZEBRA-1947-MOONFISH" needle returns "ZEBRA-1947-M" / "ZEBRA-1944".

Root cause: `decode_multi_seq` (the K-token verify entry point used
by all `verify_a/b/c/c2/d` paths) calls the production paged-decode
kernel directly, which reads K/V from `meta.block_table` (HBM only).
Under HSS, HBM is capped at `cache_blocks_per_seq` blocks (~1024
tokens at default cap=64), so the verify attention sees only the
recent ~cap×bs context and misses the long-context history that
lives only on disk.

The single-token decode path in `decode/attention_forward.rs:424`
*does* check `high_speed_swap_engaged` and routes through the HSS
orchestrator (`attend_layer_on_stream`) which reads the full history
from disk. The multi-Q tile kernel doesn't exist (Phase 6.2.b),
and the trait signature for `decode_multi_seq` doesn't even pass
`disk_block_ids` / `disk_last_offloaded_per_layer`, so routing
through the orchestrator from the multi-seq path requires a
sweeping signature change.

Surgical workaround: when HSS is engaged, fall back to
`decode_batched` (which by default loops over N sequential
single-token `decode` calls — each properly routed through the
orchestrator). Mirrors what the SSM branch immediately below
already does. Cost: ~k× attention launches per verify step under
HSS; correctness restored.

Differential test on Sehyo/Qwen3.5-122B-A10B-NVFP4 with
--high-speed-swap --high-speed-swap-cache-blocks-per-seq 64
--speculative --num-drafts 1, 8569-token NIAH prompt, needle at
chunk-3 boundary, temperature=0:

| Image                | Output                          |
|----------------------|---------------------------------|
| main + HSS-on        | ZEBRA-1947-M / ZEBRA-1944 (wrong) |
| this PR + HSS-on     | ZEBRA-1947-MOONFISH (4/4 deterministic) |
| any image + HSS-off  | ZEBRA-1947-MOONFISH             |
| any image + no-MTP   | ZEBRA-1947-MOONFISH             |

Patches applied to all four verify modules (K=2/3/γ/4) for
defense-in-depth even though only K=2 is exercised by --num-drafts 1.

Not fixed in this PR: `decode_b.rs:413` (multi-sequence batched
decode+prefill path, max_batch_size > 1). Different fix shape —
needs per-sequence loop. Doesn't fire at max_batch_size=1.

Closes Avarok-Cybersecurity#31 (the part PR Avarok-Cybersecurity#37 missed).
Latent bug surfaced while diagnosing Avarok-Cybersecurity#31. The offload's
`start = last.min(total - 1)` heuristic was designed for decode
("re-offload the active block since new slots get written into it")
and silently misses the analogous case during chunked prefill.

`reshape_and_cache_flash` writes only the chunk's own token slots,
so when chunk N ended mid-block (chunk_size not a multiple of
block_size) it left the boundary block's tail slots zero on disk
after the post-chunk-N offload. Chunk N+1 fills those tail slots
in HBM (its first tokens land in those slots), but the offload's
`start = last` skipped re-pushing the boundary block, so disk's
boundary-block tail stayed permanently zero. Decode reads the full
history from disk via `attend_layer_on_stream`, so the zero slots
silently corrupt attention for chunk-boundary positions.

Verified via instrumented build with per-block zero-slot count:
  chunk-1 offload (block 127): zero_slots_K=4/16  (slots 12..15)
  chunk-2 offload (block 127): zero_slots_K=0/16  (this fix)
  chunk-2 offload (block 191): zero_slots_K=8/16  (slots 8..15)
  chunk-3 offload (block 191): zero_slots_K=0/16  (this fix)

Empirically this fix alone does not change end-to-end output on the
Avarok-Cybersecurity#31 differential test (the MTP-verify routing fix in the previous
commit dominates — disk content is correct without this fix once
verify reads the right blocks). But the partial-block-on-disk state
is still incorrect, so worth fixing in case any other code path
relies on that block being intact (or future changes shift sensitivity).

Cost: ~one extra D2H per layer per chunk transition. Negligible.
@gbanyan gbanyan requested review from AzeezIsh and tbraun96 as code owners May 9, 2026 17:18
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

All contributors have signed the CLA. Thank you!
Posted by the CLA Assistant Lite bot.

Copy link
Copy Markdown
Contributor

@tbraun96 tbraun96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! The code looks good to me; the comments will be helpful moving forward. Just have to satisfy the CLA gate to become a contributor, then we can merge.

@gbanyan
Copy link
Copy Markdown
Author

gbanyan commented May 9, 2026

I have read the CLA Document and I hereby sign the CLA

@gbanyan
Copy link
Copy Markdown
Author

gbanyan commented May 9, 2026

Throughput follow-up (separate from the correctness fix)

After the fix verified correct on the 8569-token NIAH, I bumped --max-seq-len to 131072 and tried longer-context tests. Hit a steep throughput cliff that I think is worth flagging here as a follow-up — it's not introduced by this PR, but this PR does expose it for the first time.

Observation

At 131K config, HSS engaged, MTP-2 enabled, ~13K-token tools-active client request:

verify=3.5s/step  →  ~0.5 tok/s end-to-end

Atlas's prior reported numbers (49.7 tok/s at 131K with HSS+MTP from earlier trials) appear to have come from the buggy code path this PR fixes — decode_multi_seq calling run_paged_decode directly, which is fast (on-device kernel, no per-call sync) but reads HBM-only (the bug). The correct path through attend_layer_on_stream was, as far as I can tell, never actually exercised by MTP verify under HSS before this fix. So the throughput observation isn't a regression so much as a first measurement.

Where the cost goes (per attention call in the orchestrator)

  1. predictor.score_blocks_on_stream over all max_blocks_per_layer (~8193 in the config), regardless of context length.
  2. stream_sync after the predictor D2H — full GPU serialization point. With my K=2 fix → 24 syncs per verify step.
  3. Tile loop with disk reads.

Scratch pool seems undersized for long context

Startup log shows:

--high-speed-swap installing: ... scratch=8192 blocks ... max_blocks=1025

At 131K context with NVFP4: ~6250 blocks per layer × 12 attention layers ≈ 75 000 (layer, block) pairs of working set. The 8192-block resident pool can hold ~11% of that. Each decode step's per-layer attention thrashes through cycle-by-cycle disk I/O — same blocks read, evicted, re-read across layers within a single step. So disk bandwidth isn't the floor; cache thrashing is.

I'm going to dig into whether the pool size is exposed via CLI / env (Atlas docs mention qd=8 rank=32 but I haven't found a knob for scratch). Sizing it to fit the working set seems like the highest-leverage knob. If you have guidance on the right place to look, happy to test on hardware.

Other low-hanging items I noticed (non-blocking, just listing)

  • Predictor score_blocks runs over the full max_blocks_per_layer (e.g. 8193) every call regardless of seq_block_ids.len() — could be tile/seq-bounded for small contexts.
  • stream_sync per attend call is the K-multiplier in this PR's cost. A queued-async-then-sync-once pattern (like the offload path uses) would erase that.
  • Could also imagine a multi-Q tile kernel that handles n_q > 1 directly, removing the need for the sequential fallback this PR adds entirely.

Happy to test any of the above on the same 8569-token NIAH config we have nailed down.

cc @tbraun96 since you wrote the prefill-side companion path in #37.

@tbraun96
Copy link
Copy Markdown
Contributor

tbraun96 commented May 9, 2026

@gbanyan thanks for the detailed write-up — independently arrived at the same three items you flagged after live-profiling MiniMax-M2.7-NVFP4 EP=2 (atlas-gb10:debug on 2× GB10, the same :debug tag I push these images to).

The 16× HSS slowdown reproduces on a much smaller workload too: 143-token, no-tools reply on M2.7-NVFP4 lands at 1.8 tok/s with --high-speed-swap vs 28.6 tok/s without (config-only diff: drop --high-speed-swap, otherwise identical EP=2 + FP8 KV + thinking-on baseline). So the cliff isn't 131K-context-specific — short-context decodes also hit the per-attend-call overhead you describe.

Pushed perf/hss-decode-overhead (3 commits, branched off this PR) addressing the stream_sync and predictor-bound items directly:

  1. Drop the per-read() stream_sync(stream) in IoUringBackend::read(). The "PosixBackend semantics" contract was never actually load-bearing — every caller immediately follows with copy_h_to_d_async + an attention kernel on the same stream, which serialises behind the H2D copies queued inside read(). With the sync gone, deeper QD is actually utilised, so I bump --high-speed-swap-qd default 8 → 32 (validation cap 64 → 256). io_uring DBMS tuning paper (arxiv 2512.04859) measured 2.05× from this stack on the same hardware class.

  2. Coalesce K+V offload D2H under one synchronize. Adds copy_d2h_async_on_stream to the GPU trait (CUDA backend = cuMemcpyDtoHAsync_v2 no-sync); HSS offload issues both copies async then syncs once. Halves per-layer host syncs on the offload path (was 2/layer × 62 layers = 124 syncs/token).

  3. Skip predictor scoring when seq_block_ids.len() ≤ cfg.resident_blocks. No eviction can fire this call → score buffer feeds nothing useful → the project_q + score_blocks + D2H + whole-stream sync are pure overhead. Gate the whole pipeline on need_eviction. eviction.touch still runs for LRU correctness.

These don't address your scratch-pool sizing point — agreed that's the highest-leverage knob for the 131K case (75K working-set vs 8192 scratch ≈ 11% coverage = thrash). resident_blocks is already exposed via --high-speed-swap-resident-blocks; the 8192 default is from crates/spark-server/src/main_modules/serve_phases/build.rs:91. For your 131K NVFP4 config a resident_blocks ≈ working-set size (75K) at NVFP4 group_bytes would land around ~300 MiB pinned-pool — not unreasonable on the 119.7 GB unified LPDDR5X. Worth a microbench.

Live-tested commits 165fdac / 9153e0a / ff0011b on the M2.7 EP=2 short-prompt config: bench-loop deltas are within noise on this short-context workload — the dominant per-token cost is the sync pwrite × 16/layer × 62 layers in the offload path (write_from_host does wait_buffer_free + memcpy + libc::pwrite, all sync), which needs an async-write refactor through the io_uring SQE path to address. Separate work. Should compound with your prefill-side correctness fix + a scratch-pool resize on the long-context regime where the architecture is doing the right things just at low queue depth and high sync count.

Image with all three: avarok/atlas-gb10:debug (also tagged atlas-gb10:hss-perf3 locally), baked from atlas-builder:r8 + master HSS-#31 fixes #35/#36/#37 + this PR's two commits + my three perf commits, on top of atlas-gb10:fix42 runtime layers.

@gbanyan
Copy link
Copy Markdown
Author

gbanyan commented May 9, 2026

Update: scratch-pool sizing isn't the bottleneck

Tested --high-speed-swap-resident-blocks 32768 (4× the default 8192) on the same 8569-token NIAH config:

Pool size verify per K=2 step end-to-end
8192 (default) 3.5s ~0.5 tok/s @ 13K context
32768 (4×) 2.85 – 4.27s ~0.3 tok/s @ 8.5K context

Within noise — bigger pool doesn't help. Reason: at 8.5K context the working set is 536 blocks/layer × 12 layers = 6432 (layer, block) pairs, which already fits in the default 8192-block pool. There's no thrashing to fix at this scale.

The real per-call cost is inside attend_layer_on_stream:

  • predictor.score_blocks over max_blocks_per_layer = 8193 per call (fixed, scales with max-seq-len, not actual context length or pool)
  • stream_sync per layer per Q → 24 full GPU syncs per K=2 verify step under this PR's sequential fallback
  • 12 sequential per-layer kernel launches without CUDA graph capture (graph capture is disabled under HSS in verify_b/c/c2/d by use_graphs = … && !hss_engaged)

So the throughput floor is per-call latency × launches, not disk bandwidth or pool capacity. Three optimization angles, in rough priority:

  1. Batch the predictor scoring + tile launches across the K queries so a K=2 verify step is one launch instead of two (recovers ~half of this PR's cost).
  2. Defer stream_sync to step boundary (queue everything async, sync once at end). Score copies to a pinned host buffer that's already-async-readable.
  3. Allow CUDA graph capture under HSS for the per-layer body — the host I/O is already on a separate stream, so the device-side graph could be safe.

For our deployment we're rolling back to vLLM for now and keeping #47 open to track upstream throughput work. Happy to retest any optimization on the same NIAH config.

@gbanyan
Copy link
Copy Markdown
Author

gbanyan commented May 9, 2026

@tbraun96 — apologies, my "scratch-pool isn't the bottleneck" follow-up just above was posted without reading your reply first; my bad on the comment crossover. After re-reading, our findings line up rather than disagree, but I want to be explicit about what we did vs didn't claim so the thread isn't confusing for future readers:

What we measured was specifically a 4× pool bump (8192 → 32768) on an 8569-token prompt — context where the working set already fit in the default pool. So the "no thrashing to fix at this scale" framing is correct narrowly: at that context size, scaling the pool can't do useful work because no eviction fires. That's exactly the case your perf-branch commit #3 (skip predictor when seq_block_ids.len() ≤ cfg.resident_blocks) targets — at small-context decodes the predictor pipeline is pure overhead. We never got to actually exercise scratch-pool sizing at the regime where it would matter (the 131K case, 75K working set vs default 8192). So the headline "pool sizing doesn't help" is misleading in isolation — it's accurate for 8.5K, not for 131K, and the latter is the regime where your sizing recommendation (resident_blocks ~ working set, ~300 MiB pinned) is the lever.

What we missed was your offload-path sync pwrite × 16/layer × 62 layers diagnosis — we didn't have visibility into write_from_host semantics, so our "real cost is per-call latency × launches" framing under-attributed cost to the offload side. Your :debug image's three perf commits target the read path while the sync-write item lives separately, makes sense.

Thanks again for the detailed write-up and the debug image — really appreciated.

tbraun96 added a commit that referenced this pull request May 20, 2026
…follow-up)

The HSS predictor's A_g buffer is sized:
  A_g = num_layers × max_blocks × num_kv_heads × block_size × r × 2

With the default r=32 and a max-seq-len-sized block pool, A_g can hit
2+ GB and OOM at install time. The MiniMax-M2.7-NVFP4 EP=2 head at
--max-seq-len 65536 reliably fails with:

  --high-speed-swap install failed: cuMemAlloc_v2(2080882688) failed: 2

The user got nothing actionable from that — same error regardless of
--high-speed-swap-resident-blocks or any other knob, because the
A_g allocation is sized by max_blocks not resident_blocks.

This change:

1. Adds `cuMemGetInfo_v2` FFI in cuda_min.rs + a `mem_info()` helper.

2. In `Predictor::new_on_stream`, preflights the A_g size against
   95% of free HBM before calling `cuMemAlloc_v2`. On failure, bails
   with an actionable error that names the four knobs the user can
   actually turn:

     - --high-speed-swap-rank: halves A_g
     - --max-seq-len: shrinks the block pool
     - --kv-cache-dtype nvfp4: halves KV-pool footprint
     - --gpu-memory-utilization: leave more HBM for HSS scratch

   and shows the A_g sizing formula so they can do the arithmetic
   themselves.

3. Allocation behaviour on the happy path is unchanged — only the
   error path is improved.

Verified on the same EP=2 setup that failed previously:

  rank=32 → A_g 1.94 GB needed, 0.70 GB free → preflight tells user
            'try --high-speed-swap-rank 16'.
  rank=16 → A_g 0.97 GB needed, 0.86 GB free → preflight tells user
            'try --high-speed-swap-rank 8'.
  rank=8  → A_g 0.49 GB → installs cleanly → server serves end-to-end
            (MiniMax-M2.7-NVFP4 EP=2, 64K context, fp8 KV, prefix
            cache, HSS-engaged).

So the preflight is doing both jobs: failing fast with a fix, and the
fix it suggests actually works.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants