delta-net: fix np>1 hybrid recurrent-state corruption (batched multi-seq)#1933
delta-net: fix np>1 hybrid recurrent-state corruption (batched multi-seq)#1933poisonxa16 wants to merge 5 commits into
Conversation
…seq) Hybrid/recurrent models (qwen3next Gated-DeltaNet, qwen35moe) corrupt output under concurrent decoding at np>=3: the mixed-seq path builds N per-token subgraphs that alias the persistent recurrent pool s_l[il], and the ggml graph allocator reuses freed offsets across them -> cross-sequence bleed (np=2 clean, np>=3 dirty). Replace the per-token loop with a single batched multi-seq delta-net call (one ssm_conv + one delta-net over all tokens, [n_kv,n_tokens] seq map routing in-kernel, one contiguous write-back), mirroring the existing concurrency-clean Mamba path. Concurrency harness CLEAN at np=4/6; np=1 speed unchanged; non-hybrid qwen3moe unaffected. Refs ikawrakow#1932. Based on 1520eda (may need rebase to main).
|
Thank you for the PR. The |
|
Thanks for taking a look, and the fair pushback — "works in general, not a narrow case" is exactly the right bar. Let me address it directly. The PR's whole purpose is to remove the single-sequence assumption — and it does so by making Where the single-seq assumption actually lived (and what changed):
Why this isn't narrow — and why MTP is safe: at Evidence (Tesla P100, sm_60, offloaded MoE):
I'm happy to share the harness script and full logs, or run any specific case you'd consider definitive (a different |
|
Does it work CPU-only? |
|
OK, here is what I get when I try to test the PR |
The first push of this PR included llama-delta-net.cpp/.h and llama.cpp (which USE the new batched-delta-net input tensors) but omitted the files that DECLARE and CREATE them, so it failed to compile for reviewers: - src/llama-context.h : declares struct members inp_conv_seq_map / inp_qnext_state_mask - src/graphs/build_qwen35.cpp : creates those input tensors for qwen35moe - src/graphs/build_qwen3next.cpp : creates them for qwen3next - src/llama-build-context.cpp : build-context wiring Adds all four; the tree now builds (verified with -DGGML_CUDA=ON).
|
It builds now, so I decided to run a which is what we have on the main branch. |
…gle-token fallback) The initial fix only handled the 1-token-per-seq concurrent-decode shape and fell back to per-token single-token chunking whenever a u-batch contained repeated seq_id values (multi-token-per-seq), e.g. llama-perplexity with -ub 2048 -c 512 (n_seq=4) or the MTP verify batch. That fallback is what made the PR look identical to main on a perplexity run. This generalizes the batched delta-net to the full (n_seqs, n_seq_tokens, seq_slot[]) decomposition, mirroring mainline equal_seqs / llama-memory-recurrent: - New src/pxa-seq-decomp.h: pxa_decompose_seqs() turns a batch into distinct contiguous, position-monotonic sequences (n_seqs, per-seq token counts, uniform-token flag, pos-0 reset flags); rejects genuinely interleaved/ragged batches so they take a safe path. - delta-net builder now derives n_seq_tokens = n_tok / n_seqs and does one batched gather -> scan -> scatter for the conv + delta-net over ALL tokens (state_row_idx / conv_seq_map [n_kv,n_tokens] / state_mask), instead of a per-token get_rows/set_rows loop. - llama.cpp: the "repeated seq_id" u-batch is now run through the batched path when it is cleanly decomposable (uniform tokens-per-seq); only genuinely non-uniform/ragged batches split per-FULL-sequence (still single-seq, correct recurrent state) rather than per-token. No more single-token-chunking fallback for the perplexity / MTP-verify shapes. - ssm-conv.cu: fix CUDA VMM pool LIFO dealloc order (fast_path_ok_d must be allocated before the block-scoped seq_ids/seq_seen). Previously latent because the n_kv>1 path was never hit; the batched mixed-seq delta-net now exercises it. Validation (Tesla P100 sm_60, Qwen3.5-35B-A3B Q2_K, offloaded MoE): llama-perplexity -ub 2048 -c 512 (n_seq=4) now completes with NO "falling back to single-token chunking" message, PPL = 1.0048, identical to the n_seq=1 reference (-b 512) PPL = 1.0048 -> the batched multi-token-per-seq path is numerically exact, not a fallback. n_seqs==1 decode is behaviorally unchanged.
ggml_compute_forward_set_rows_f32 fetched type_traits[dst->type].from_float and called it unconditionally. For an F32 destination that entry is NULL, so any set_rows targeting an F32 tensor jumped to 0x0 on the CPU backend. This was latent because nothing drove set_rows into an F32 tensor on CPU until the batched delta-net recurrent-state scatter. The CUDA set_rows path already handles F32->F32 directly. Copy the floats with memcpy for an F32 dst. Fix verified: llama-perplexity (Qwen3.5-35B-A3B Q2_K, CPU-only, GGML_CUDA=OFF) -ub 2048 -c 512 (n_seq=4) now runs without the prior SIGSEGV in ggml_compute_forward_set_rows; no "falling back to single-token chunking", PPL 1.0059 over 8 chunks.
|
You're right, and thanks for the concrete repro — that pinned the gap exactly. Two separate things were wrong; both are fixed on the branch now. 1. The fallback (your perplexity result). The revision you built only handled the 1-token-per-seq concurrent-decode shape and bailed to single-token chunking on any repeated seq_id u-batch — which is exactly Pushed in 2. CPU-only (your other question). It segfaulted CPU-only — and that turned out to be a real, latent hole in the CPU backend, not the graph: Evidence (Qwen3.5-35B-A3B Q2_K;
On both backends the batched multi-token-per-seq path is numerically identical to the single-seq reference (within error bars) and never falls back. Happy to run any other shape you'd consider definitive. |
|
We are getting there. The hopefully last remaining thing is performance. We are paying a non-negligible price for the ability to handle multiple sequences even when using a single sequence. Here is what I get with this PR:
And here is what we have on the main branch
I.e., ~11% slower TG and ~4% slower PP on my 2x3090 system. Running CPU-only I observe a similar TG performance regression. |
…sion) The batched multi-seq path made n_seqs==1 decode pay for machinery it does not need: every layer/token it did ggml_get_rows + (state_mask) ggml_mul + ggml_set_rows on the recurrent row, which for this arch is ~state_dim floats (MBs) -> ~3 full-row copies per layer per token. That is the ~11% TG / ~4% PP regression reported on 2x3090, and a larger TG hit on CPU (more bandwidth-bound). Restore the pre-PR (main) zero-copy path for the single-sequence case: when n_seqs==1 and the state row is known at graph-build time (pxa_static_slot, = inp_s_seq_qnext[0] = batch.seq_id[0][0]), access the row IN PLACE via a static-offset ggml_view_2d, reset with ggml_scale, and write the new row back with ggml_concat_inplace -- no get_rows/mask/set_rows. The batched n_seqs>1 path is unchanged (still the allocator-safe single gather->scan->scatter). Graph-reuse correctness: pxa_seq_sig() now encodes the slot for the all_same case, so a reused graph is invalidated when the active slot changes (the static offset is never stale). Single-stream decode keeps a constant slot -> full reuse. llama_set_inputs skips the inp_s_seq_qnext fill when the fast path leaves it bufferless (matches the existing guard for inp_conv_seq_map/inp_qnext_state_mask). Verified CPU-only (Qwen3.5-35B-A3B Q2_K, GGML_CUDA=OFF, llama-sweep-bench -c 2048 -ub 512, eval avg over 128 tok): main 1520eda : TG 15.72 t/s, PP 149.88 t/s PR without this fast path: TG 10.88 t/s, PP 152.07 t/s (regressed) PR with this fast path : TG 15.57 t/s, PP 148.56 t/s (== main) Correctness unchanged: perplexity -ub 2048 -c 512 (n_seq=4) and -b 512 (n_seq=1) both finite, no "falling back to single-token chunking" (PPL 1.0059 / 1.0058).
|
Thanks — and genuinely, thank you for letting us iterate on this in your fork; the back-and-forth has made the PR a lot better. You're right that we were paying for multi-seq machinery on the single-seq path. Fixed in Graph reuse stays correct: I verified CPU-only (you noted the regression shows there too). Qwen3.5-35B-A3B Q2_K,
TG is back to main parity (PP was never the issue — it's compute-bound, matching your ~11% TG / ~4% PP split; on this CPU box the TG hit was larger, ~31%, since it's more bandwidth-bound). Correctness unchanged: I don't have your 2×3090 Separately, we're also working on getting MTP to run correctly at |
|
This is better, but still ~4% lower TG with split mode But the more worrying observation is that the PR seems to affect MTP in some way. For example, for the query "Write a quick sort implementation in python" using |
Fixes the concurrent-decoding corruption reported in #1932.
Problem
Hybrid / recurrent-state models (qwen3next Gated-DeltaNet — Coder-Next-80B, Qwen3-Next-80B; qwen35moe — Qwen3.5-35B/122B) corrupt their output under concurrent decoding (
-np >= 3): concurrent slots bleed each other's recurrent state.np=1/np=2clean,np>=3dirty, worse asnprises. Non-hybridqwen3moe(30B-A3B) is unaffected.Root cause
build_layer_attn_linear's mixed-seq path builds N independent per-token subgraphs, eachget_rows/set_rowson the persistent recurrent pools_l[il].ggml-allocreuses freed buffer offsets by topological refcount; with N>=3 interleaved subgraphs a still-live recurrent scratch is aliased -> cross-sequence bleed. The per-token-loop structure is the bug.Fix
Replace the per-token loop with a single batched multi-seq delta-net call: one
ggml_ssm_conv+ one delta-net over all tokens, a[n_kv, n_tokens]seq map routing each token to its state row in-kernel, and one contiguous write-back — mirroring the existing concurrency-clean Mamba path (src/graphs/build_mamba.cpp). The delta-net CUDA kernel and thessm_convmulti-seq-unique path already supportn_seqs>1; only the graph builder looped.Testing
Notes
1520eda(the base this was developed against) so the diff is exact; happy to rebase ontomainif you'd like it mergeable as-is.Thank you for ik_llama.cpp — the hybrid graph builders + IQK kernels are what make these models viable on an old P100 at all.