Qwen3_xml Tool Parser #73
Open
G-Deca wants to merge 33 commits into
The bundled `mtp.safetensors` for `Qwen/Qwen3.6-27B-FP8` carries a single full-attention layer + a dense gate/up/down MLP (no router, no experts). Atlas's MTP loader assumed every MTP head was MoE-shaped, so the dense loader (`Qwen35DenseWeightLoader`) just stubbed `load_mtp_weights → None` and `--speculative` silently no-opped.

Changes:
- `MtpWeights` gains `dense_ffn: Option<DenseExpertWeight>`. `load_mtp` auto-detects FFN flavor by inspecting weight names: presence of `mtp.layers.0.mlp.gate_proj.weight` without a `.gate.weight` router selects the dense path; the MoE fields are populated with NULL placeholders.
- `MtpHead` gains a `dense_ffn_generic` projection triple. The constructor short-circuits all MoE quantization when dense weights are present (NVFP4 mode is rejected: Qwen3.6-27B-FP8 ships an FP8 MTP head, so users pass `--mtp-quantization fp8` or `bf16`).
- `MtpHead::forward_one` step 10 dispatches `dense_ffn_forward_generic` when the dense triple is populated. The dense path reuses the existing `dense_gemv_*` and `moe_silu_mul` kernels, so there is no new kernel wiring and no PTX change.
- `Qwen35DenseWeightLoader::load_mtp_weights` now calls `load_mtp` when `mtp.fc.weight` is present in the store. The single auto-detecting loader handles both MoE (35B-A3B) and dense (27B) variants.
- `factory::build` warns when `--speculative` is requested but no MTP weights were loaded (was: silent no-op).

Tested via `scripts/check.sh check -p spark-model -p spark-server`. End-to-end validation pending sparkrun on dgx2 with the `qwen3.6-27b-dense-fp8-mtp-atlas` recipe.
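The name-based detection described above can be sketched as follows. This is a minimal illustration, not the Atlas loader: `FfnFlavor` and `detect_mtp_ffn_flavor` are hypothetical names, and the full router key `mtp.layers.0.mlp.gate.weight` is an assumption inferred from the `.gate.weight` suffix mentioned in the commit.

```rust
use std::collections::HashSet;

/// FFN layout of an MTP head, inferred from tensor names alone.
#[derive(Debug, PartialEq, Eq)]
enum FfnFlavor {
    Dense,
    Moe,
    Unknown,
}

/// Hypothetical helper: classify the MTP FFN from the names in the weight store.
/// The router key below is an assumed spelling of the `.gate.weight` router tensor.
fn detect_mtp_ffn_flavor(names: &HashSet<&str>) -> FfnFlavor {
    let has_router = names.contains("mtp.layers.0.mlp.gate.weight");
    let has_dense_gate = names.contains("mtp.layers.0.mlp.gate_proj.weight");
    if has_router {
        // A router tensor means expert routing, so treat the head as MoE.
        FfnFlavor::Moe
    } else if has_dense_gate {
        // gate_proj without a router: a plain gate/up/down MLP.
        FfnFlavor::Dense
    } else {
        FfnFlavor::Unknown
    }
}
```

Checking the router first keeps the MoE path unchanged for checkpoints that were already loading correctly; only a store with `gate_proj` and no router takes the dense path.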
Qwen3.6-27B-FP8 is the dense text sibling of the Qwen3.6-VL family. Its config.json declares the same `vision_config` block as the VL siblings (Qwen ships them as one config family), so since `82f4794 (Apple Metal)` widened `is_qwen3_vl()` to also match `qwen3_5 + vision`, the routing in `loader_for_config` sends the 27B dense checkpoint to `Qwen3VLWeightLoader`, which assumes MoE and panics on the missing `mlp.gate.weight`.

The fix: check `is_qwen35_dense()` (requires `num_experts == 0`) BEFORE `is_qwen3_vl()`. Dense text-only checkpoints are unambiguously distinguishable by `num_experts == 0`; VL-MoE always has experts.

Repro:
spark serve Qwen/Qwen3.6-27B-FP8
→ "Failed to build model: Weight 'model.layers.0.mlp.gate.weight' not found in store"

After the fix the dense loader handles it (and now also picks up the bundled MTP head; see 43a04d2).
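The ordering fix can be illustrated with a small routing sketch. The `ModelConfig` struct, the `Loader` enum, and the `model_type` strings are assumptions made for this example; only the predicate names and the "dense check first" ordering come from the commit.

```rust
/// Minimal stand-in for the config fields the routing decision reads (hypothetical).
struct ModelConfig {
    model_type: String,
    num_experts: u32,
    has_vision_config: bool,
}

#[derive(Debug, PartialEq, Eq)]
enum Loader {
    Qwen35Dense,
    Qwen3Vl,
    Other,
}

/// Dense text checkpoints are identified by having no experts.
fn is_qwen35_dense(c: &ModelConfig) -> bool {
    c.model_type == "qwen3_5" && c.num_experts == 0
}

/// Widened VL predicate: also matches `qwen3_5` configs that carry a vision block.
fn is_qwen3_vl(c: &ModelConfig) -> bool {
    c.model_type == "qwen3_vl" || (c.model_type == "qwen3_5" && c.has_vision_config)
}

/// The dense check must run first: a dense config with a `vision_config`
/// block also satisfies the widened VL predicate.
fn loader_for_config(c: &ModelConfig) -> Loader {
    if is_qwen35_dense(c) {
        Loader::Qwen35Dense
    } else if is_qwen3_vl(c) {
        Loader::Qwen3Vl
    } else {
        Loader::Other
    }
}
```

With the two branches swapped, the 27B dense config (vision block present, zero experts) would fall into the VL arm, which is the failure the repro shows.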
…litting them (Avarok-Cybersecurity#62) * fix(thinkbrake): defer forced </think> past code fences instead of splitting them F2 confidence early-stop (and the thinking-budget cap) forced `</think>` the instant they tripped — including mid-line inside a ```python block the model was drafting in its reasoning. Code tokens are near- deterministic (top-1 >=0.95 for long runs) so F2 trips trivially on a code block; the forced boundary split a statement and the reasoning parser then cut reasoning_content mid-token, leaking the rest into content. User report: "stops half-way through thinking while generating a codeblock." Fix: track ``` code-fence parity per sequence via the atomic fence token id (resolved once in tokenizer_runtime, fail-open if a tokenizer splits it). F2 keeps DETECTING/arming everywhere (a model can ramble in code forever and must stay brakeable), but the forced `</think>` INJECTION is deferred until the fence closes — a safe boundary right after the code block, never mid-statement. THINK_LOOP period-repeat watchdog is unchanged and still active inside fences. Pure decisions extracted to SSOT helpers, called by production and asserted by tests: toggle_code_fence, confidence_run_step, should_inject_think_end. spark-server scheduler::helpers suite 18/18 green under #![deny(warnings)]. Verification: live repro (fibonacci, Qwen3.6-27B-FP8 + MTP) confirmed the mid-codeblock split is gone and the ```python block is emitted intact. End-to-end re-check of the defer-refinement's post-fence brake timing is the remaining live step. 
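The fence-parity tracking and deferred injection described in this commit can be sketched as two pure functions. The function names match the helpers the commit lists, but the signatures here are guesses for illustration, not the actual Atlas API.

```rust
/// Flip fence parity when the sampled token is the atomic ``` token.
/// `fence_token` is `None` when the tokenizer splits the fence into several
/// tokens; in that case parity never changes (fail-open).
fn toggle_code_fence(in_fence: bool, token: u32, fence_token: Option<u32>) -> bool {
    match fence_token {
        Some(id) if id == token => !in_fence,
        _ => in_fence,
    }
}

/// Detection may arm the brake anywhere, but the forced `</think>` is only
/// injected once no code fence is open.
fn should_inject_think_end(brake_armed: bool, in_fence: bool) -> bool {
    brake_armed && !in_fence
}
```

Keeping both decisions free of scheduler state is what lets the same functions be called from production code and asserted directly in unit tests.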
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(watchdog): digit-normalized content-loop detector for template degeneration Companion to the thinkbrake fence fix: after deferring the forced </think> past code fences, live testing surfaced a separate content- phase degeneration — Qwen3.6-27B at temp=0 emits a fixed line template with varying numeric payload (`- B(46) = 104509868777\n- B(47) = 273508641\n …`) until max_tokens. The exact-token content-loop watchdog structurally cannot catch this: the integer tokens differ every line so no fixed token period repeats. Fix: detect_content_token_loop_normalized maps numeric tokens to a sentinel AND run-length-collapses consecutive sentinels. Run-collapse is essential — Qwen3.6 is digit-level (`104509868777` → 12 single-digit tokens, `273508641` → 9), so a 1:1 map alone leaves variable-length sentinel runs and the period still varies; collapsing makes `- B(<digits>) = <digits>\n` identical regardless of digit count. Numeric-token mask (the 10 single-digit ids of 248070) built once at startup in tokenizer_runtime via decode_with_special (NOT id_to_token — that returns raw byte-level BPE pieces with the `Ġ` space marker), exposed via a set/get OnceLock mirroring enable_loop_watchdog. OR-ed into the existing content-loop watchdog at decode_logits_step.rs under the SAME per-model enable_loop_watchdog gate (qwen3.6-27b / qwen3.5-27b only). FP guard: CONTENT_LOOP_NORM_MIN_REPEATS=4 and the matched period must contain >=1 sentinel AND >=1 structural token, so pure-number columns and pure-prose loops stay the exact detector's job. scheduler::helpers 23/23 green (5 new incl. variable-length-digit-run + exact-path regression), clean under #![deny(warnings)]. Live (27B-FP8+MTP): degeneration repro bounded 2000 -> 577 tokens (watchdog fired), thinkbrake code-fence fix still intact, normal coding prompt finish=stop (no false-positive early stop). 
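A minimal sketch of the digit-normalized detector described above, assuming a tail-anchored period check. `normalize_digits`, `has_normalized_loop`, and `DIGIT_SENTINEL` are hypothetical names, and the real `detect_content_token_loop_normalized` may differ in how it scans for the period.

```rust
use std::collections::HashSet;

/// Sentinel that stands in for any run of digit tokens (assumed value).
const DIGIT_SENTINEL: u32 = u32::MAX;

/// Map digit tokens to a sentinel and collapse consecutive sentinels into one,
/// so numbers of different lengths normalize to the same single token.
fn normalize_digits(tokens: &[u32], digit_ids: &HashSet<u32>) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    for &t in tokens {
        let mapped = if digit_ids.contains(&t) { DIGIT_SENTINEL } else { t };
        if mapped == DIGIT_SENTINEL && out.last() == Some(&DIGIT_SENTINEL) {
            continue; // run-length collapse
        }
        out.push(mapped);
    }
    out
}

/// True if the tail of `seq` is `min_repeats` back-to-back copies of one period
/// (length 1..=max_period) containing at least one sentinel and one other token.
fn has_normalized_loop(seq: &[u32], max_period: usize, min_repeats: usize) -> bool {
    if min_repeats == 0 {
        return false;
    }
    for period in 1..=max_period {
        let span = period * min_repeats;
        if span > seq.len() {
            break;
        }
        let tail = &seq[seq.len() - span..];
        let unit = &tail[..period];
        let repeats = tail.chunks(period).all(|chunk| chunk == unit);
        let has_digit = unit.contains(&DIGIT_SENTINEL);
        let has_structure = unit.iter().any(|&t| t != DIGIT_SENTINEL);
        if repeats && has_digit && has_structure {
            return true;
        }
    }
    false
}
```

The two guards on the period mirror the false-positive rule in the commit: a pure-number column has no structural token and a pure-prose loop has no sentinel, so both are left to the exact-token detector.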
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(qwen3.6-27b): thinking_default=true (reasoning on by default) qwen3.6-27b is a reasoning model but its MODEL.toml [behavior] had no thinking_default key → ModelBehavior default (false) → resolve_thinking returned enable_thinking=false for any client that doesn't explicitly opt in. Open WebUI (and most OpenAI-compatible clients) send plain requests with no enable_thinking flag, so the chat template injected an empty `<think>\n\n</think>` thinking-off marker and reasoning_content was always empty — thinking never displayed. Set thinking_default=true. Plain requests now reason by default; explicit per-request enable_thinking=false (or reasoning effort, etc.) still wins via resolve_thinking. Verified live (27B-FP8+MTP): plain request → reasoning_tokens=176, reasoning_content populated, finish=stop. Build-time embedded via atlas-kernels build_parse.rs. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(qwen3.6-27b): anti-repetition sampling stack + bounded in-fence </think> defer Surfaced by the OpenWebUI "3D chess game" stress prompt (the canonical long-code hard case). Two coupled root causes, both Atlas-side: 1. Sampling presets on this branch had presence_penalty=0.0 and no DRY/LZ — the exact condition commit 6c69d3f live-bisected as the cause of the announce/restart + verbatim-CSS-block degeneration on hard code-gen (QwenLM/Qwen3.6 Avarok-Cybersecurity#88/#115/#145, confirmed upstream on official vLLM/BF16). Replicated 6c69d3f's battle-tested stack (presence_penalty=1.5 + lz_penalty=0.2 + DRY 0.8/1.75/2) on the presets THIS branch's build_sampling actually selects (thinking_text / thinking_coding / non_thinking — no thinking_coding routing here; tools left 0.0). XTC not ported on this branch; the digit-normalized content-loop watchdog is the safety net. 2. 
Regression from the thinkbrake fence-defer (cd0ca9d): when the model writes its whole deliverable as a ```code block INSIDE <think>, the fence never closes so should_inject_think_end deferred the forced </think> indefinitely. The budget brake fired at 256 but reasoning ran to 3025 tokens and the real answer was trapped in reasoning_content with a 499-char content stub. Bounded the deferral: THINK_DEFER_BUDGET_FACTOR=3 (hard-inject </think> past 3x budget even mid-fence; absolute ceiling 2048 when budget=None).

Live (27B-FP8+MTP, 3D-chess prompt): before = 401-token CSS-loop garbage; after = reasoning capped exactly at 768 (=3x256), content 7234 chars with 27 THREE. calls, Scene/WebGLRenderer/Camera/OrbitControls, valid </script></html>, finish=stop, no watchdog. scheduler::helpers 24/24 green.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qwen3.6-27b): retune presence_penalty 1.5 → 1.0 (temp=0.6 sweep)

presence_penalty=1.5 (replicated from 6c69d3f) over-penalized on THIS branch's realistic path: Open WebUI sends no sampling params so requests run at the thinking_text preset default temp=0.6 (stochastic), where a flat 1.5 per-seen-token penalty boosts EOS once common code tokens are exhausted → premature termination (574 tok, 0 THREE. calls). The earlier validation passed only because it used temp=0 (greedy resists early EOS), which was the wrong regime.

Live presence_penalty sweep at temp=0.6 on the 3D-chess prompt:
1.5 → 574 tok, 0 THREE. (premature EOS)
1.0 → 6782 chars, 17 THREE., </script></body></html>, finish=stop, single coherent impl, no loop ← chosen
0.5 → degenerate outlier
0.3 → complete but 3× wasteful announce/restart rewrites

1.0 is the balance point between premature-termination (high pp) and restart-spam (low pp). DRY (multi-token) + LZ (n-gram) + the digit-normalized content-loop watchdog carry the precision loop-breaking presence (single-token) shouldn't be doing alone.
Loop/degeneration layer now resolved (no watchdog fires, no restart spam across regimes). NOTE: a separate, non-Atlas factor remains — Open WebUI injects an empty `system: "User Context:\n\n"` which the model reacts to with terse output (isolated: removing it 3×'s the generation). Tracked for a follow-up Atlas robustness fix (neutralize content-free system messages) + user-side Open WebUI system-prompt config. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chat): neutralize content-free client system messages Open WebUI injects an empty RAG/context system message — `"User Context:\n\n"` (trims to the bare label `User Context:`) — when no custom system prompt is set. Models react to a content-free system directive by producing terse / prematurely-terminated output: isolated 2026-05-17, the identical 3D-chess request WITHOUT that system message produced 3x the generation (1499 vs 469 completion tokens, 14 vs 3 THREE. calls). The string is purely client-side (zero matches in Atlas source); Atlas relayed it faithfully and the jinja template rendered it correctly — the model's terseness is a reasonable reaction to a meaningless instruction. We can't fix the client, so Atlas adapts: a leading system message whose trimmed content is empty, or a single short bare `Label:` line with no payload (`User Context:`, `Context:`, `System:`), is dropped before templating so it can't poison generation. Model-agnostic (one site, pre-jinja) — not per-template. Conservative: any multi-line or post-colon content is a real prompt and is never stripped (unit tests cover empty/whitespace, the OpenWebUI residue, and substantive prompts incl. label-like-with-payload and long prose ending in ':'). Live: exact Open WebUI request → log "Dropped content-free client system message dropped=User Context:", no degeneration. 
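The drop rule for content-free system messages might look roughly like the predicate below. It is a sketch of the stated rule only; the function name and the 32-byte bound for a "short" label are assumptions, not values taken from Atlas.

```rust
/// Returns true when a leading system message carries no usable instruction:
/// empty after trimming, or a single short bare `Label:` line with no payload.
fn is_content_free_system_message(content: &str) -> bool {
    const MAX_LABEL_LEN: usize = 32; // assumed bound for "short"
    let trimmed = content.trim();
    if trimmed.is_empty() {
        return true;
    }
    // Anything multi-line or long is treated as a real prompt and kept.
    if trimmed.contains('\n') || trimmed.len() > MAX_LABEL_LEN {
        return false;
    }
    match trimmed.strip_suffix(':') {
        // A bare label has exactly one colon, at the end, and a non-empty name.
        Some(label) => !label.is_empty() && !label.contains(':'),
        // Text after the colon (or no colon at all) is payload.
        None => false,
    }
}
```

The predicate errs on the side of keeping messages: `"Context: be terse"` and long prose ending in a colon both return `false`, matching the conservative behaviour the commit describes.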
(Residual length variance on this prompt is upstream Qwen3.6 temp=0.6 behavior — QwenLM/Qwen3.6 Avarok-Cybersecurity#88 — not Atlas; Open WebUI sends no sampling params so requests run at the preset default temp=0.6.) Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: green the pipeline (LoC split + clippy + fmt) Make the full PR Avarok-Cybersecurity#62 merge chain pass CI. Three classes of fix: 1. LoC ≤500 (file-size-cap): split scheduler/helpers.rs (834 → 476) by moving its `#[cfg(test)] mod thinking_loop_tests` (360 lines) to helpers_tests.rs via `#[path]` — logical child of `helpers`, so `use super::*` resolves exactly as before; zero production change. (decode_logits_step.rs / serve.rs are on the CI allow_list — left.) 2. clippy (deny clippy::all) — fixes, several PRE-EXISTING on the base branch and unmasked once upstream crates compiled clean: - atlas-core config/methods.rs: manual checked division → checked_div - atlas-kernels build_codegen.rs: generated consts emit `&str`/`&[u8]` not `&'static …` (clears 15 redundant_static_lifetimes in the generated target_ptx.rs; fn-return `'static` kept — not linted) - spark-runtime buffers/sizes.rs + spark-server preflight.rs: manual checked division → checked_div().map().unwrap_or() - spark-model forward_layers.rs + spark-server phase_promote_prefills.rs: descending sort_by → sort_(unstable_)by_key(|x| Reverse(..)) - spark-server openai/annotations.rs: loop+let-else-break → while let - spark-server scheduler/helpers.rs: iter().any(|t| t==X) → contains(&X) 3. cargo fmt --all (workspace) — normalizes the clippy edits plus pre-existing base fmt debt (mtp_head/new.rs, qwen35_dense.rs, decode_logits_seq/step.rs). Verified in atlas-builder: fmt --check 0 diffs, no non-allowlisted .rs >500, `cargo clippy --tests --workspace --keep-going` zero errors, `cargo test --workspace` all green (incl. 342 spark-server + 23 helpers). 
Behavior-preserving throughout (test-mod move, semantically identical clippy rewrites, whitespace-only fmt). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Azeez Ishaqui <debaterishaqui@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
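For readers unfamiliar with the clippy rewrites listed in the CI commit, the three idioms look like this in isolation. The function names are made up for the example; only the standard-library calls (`checked_div`, `sort_unstable_by_key`, `Reverse`, `contains`) are the ones the commit names.

```rust
use std::cmp::Reverse;

/// Division that yields `fallback` instead of panicking when `items` is zero
/// (replaces a manual `if items == 0 { .. } else { total / items }`).
fn per_item_budget(total: usize, items: usize, fallback: usize) -> usize {
    total.checked_div(items).unwrap_or(fallback)
}

/// Descending sort via `Reverse`, replacing `sort_by(|a, b| b.cmp(a))`.
fn sort_descending(lengths: &mut [u32]) {
    lengths.sort_unstable_by_key(|&len| Reverse(len));
}

/// Membership test via `contains`, replacing `iter().any(|t| *t == x)`.
fn has_token(tokens: &[u32], x: u32) -> bool {
    tokens.contains(&x)
}
```

Each rewrite is semantically identical to the form it replaces, which is why the commit can describe the clippy pass as behavior-preserving.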
…oken Reconcile the thinkbrake code-fence work with main's Avarok-Cybersecurity#56 chunked-prefill refactor (#37513bf split scheduler/phase_continue_prefills.rs into a 4-file dir-module) and Avarok-Cybersecurity#57 tool-call-parser system-prompt injection. Conflicts (2), both resolved by taking main's side: - atlas-kernels/build_codegen.rs: trivial; both sides independently dropped 'static from the generated const type. main's const_ty rename + preamble comment is canonical; branch's redundant change discarded. - scheduler/phase_continue_prefills.rs: structural; took main's 4-file dispatcher (run_standard/run_batched_mixed/run_batched_prefill). Re-threaded code_fence_token: Option<u32> (additive plumbing, after think_start_token / before tool_call_start_token, no logic change) into main's new module boundaries so the thinkbrake fence signal reaches the existing toggle_code_fence SSOT in process_decode_logits: - continue_in_progress_prefills sig + its run_standard_chunk_loop and run_batched_mixed_step calls - run_standard.rs run_standard_chunk_loop sig + process_decode_logits call - run_batched_mixed.rs run_batched_mixed_step sig + process_decode_logits call mod.rs/decode_logits_step.rs/decode_step.rs retained the branch's code_fence_token via clean 3-way auto-merge (verified, no E0061). Audits: Avarok-Cybersecurity#57 (api/chat/mod.rs only) is disjoint from the branch's thinkbrake decode logic (decode_logits_seq.rs only) — no interaction. tokenizers 0.23/safetensors 0.7 dep bumps compile clean workspace-wide. Verified (mirrors CI exactly): cargo fmt --all --check 0 diffs; no non-allowlisted .rs >500 LoC; cargo clippy --workspace --tests zero diagnostics; cargo test --workspace 662 passed / 0 failed (incl. thinkbrake fence/defer/loop suites green). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors PR Avarok-Cybersecurity#61 onto feat/qwen3.6-dense-mtp. The CLI default of `nvfp4` makes the dense-FFN MTP head guard (mtp_head/new.rs:58-65) reject Qwen3.6-27B-FP8 (ships an FP8/bf16 MTP head), so `--speculative` silently disabled MTP (has_proposer()=false → use_speculative=false). bf16 is accepted by the guard → `--speculative` now engages dense MTP. Higher accuracy; opt into nvfp4 explicitly with --mtp-quantization nvfp4. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
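The interaction between the CLI default and the dense-FFN guard can be modeled in a few lines. This is a hypothetical reduction of the behaviour described above; the enum, function names, and error text are invented for illustration and do not reproduce `mtp_head/new.rs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MtpQuantization {
    Nvfp4,
    Fp8,
    Bf16,
}

/// Hypothetical guard: a dense-FFN MTP head accepts fp8 or bf16 but not nvfp4.
fn check_mtp_quantization(has_dense_ffn: bool, quant: MtpQuantization) -> Result<(), String> {
    if has_dense_ffn && quant == MtpQuantization::Nvfp4 {
        return Err("dense-FFN MTP head does not support nvfp4; use fp8 or bf16".to_string());
    }
    Ok(())
}

/// When the guard rejects the mode, no proposer is built and speculation stays off,
/// even though `--speculative` was requested.
fn use_speculative(requested: bool, has_dense_ffn: bool, quant: MtpQuantization) -> bool {
    requested && check_mtp_quantization(has_dense_ffn, quant).is_ok()
}
```

Under this model the old `nvfp4` default yields `use_speculative == false` for the dense 27B head, while the new `bf16` default lets the same request engage MTP.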
Phase 0 of the Qwen3.6-27B long-code plan. Paired-seed N=10 driver (harness.py) + acorn-loose AST analyzer (analyze.mjs) + paired diff (compare.py), frozen canonical 3D-chess prompt. analyze.mjs uses acorn-loose so duplicate-declaration detection sees the degenerate tail (strict-parse fallback would mask it). Validated on a known-degenerate sample (dup=2, completeness_pass=false) and a valid control (completeness_pass=true). results/ git-ignored. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… mode Degenerate output usually never emits the closing ``` or </script>, so the closed-only extractor returned "" exactly on failing samples and masked valid_js_line_count / duplicate_declaration_count (gate metric was unaffected). Handle unclosed trailing fence, no-fence raw HTML, and unclosed <script>. Add harness.py --reanalyze (re-score saved samples with SSOT summarize(); no regeneration). Validated: masked baseline seed3 went 0/0,dup=0 -> 379 lines,dup=3; controls unchanged. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…escalating resample Root cause (controlled greedy A/B, 2026-05-18): on long structured code the model falls into a repetition loop; the loop-watchdog only *truncated* (content kill / force-</think>) in BOTH the MTP and non-MTP paths — MTP was never the lever. This converts truncate→recover. - New scheduler/resample.rs (SSOT for Phase-2 tuning): per-output-length penalty ramp + escalation factor + `resample_penalty_factor` (unit-tested). - decode_logits_step.rs: content-loop watchdog now escalates `resample_escalation` (compounding per re-fire) to steer the model OUT of the loop; only after RESAMPLE_MAX_ESC un-cleared escalations falls back to the original hard finish; decays on recovery. RESAMPLE_MAX_ESC=0 ⇒ exactly the old kill behaviour. - decode_logits_seq.rs: scale presence (clamped < EOS cliff) / lz / DRY by resample_penalty_factor(output_len, escalation). RAMP_SLOPE=0 ⇒ identity. - resample_escalation:u8 added to ActiveSeq+SwappedSeq + all 5 ctor sites + both lifecycle mappings (mirrors think_watchdog_fires exactly). Verified: cargo check + clippy clean, fmt 0 diffs, LoC ≤500 (helpers back to 476; new resample.rs 74), resample unit tests 3/3. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atchdog→escalating resample" This reverts 571e2e3. The vLLM-oracle diagnosis (2026-05-18) proved the Qwen3.6-27B-FP8 long-code degeneration is an Atlas bug where Atlas's anti-repetition PENALTIES are part of the cause, not the cure: same model+prompt+temp on vLLM (no penalties, no MTP) produces a complete game, while Atlas's penalty stack induces premature-EOS (documented in the MODEL.toml 6c69d3f bisect comment). Phase-2 ADDED a per-length penalty ramp — directionally backwards. Its watchdog→escalate half is also moot: the real failure is a fuzzy loop with period ~80 tok > the 64-tok CONTENT_LOOP_PERIOD_MAX, so the watchdog never detects it. Full revert restores the known-good baseline; the correct fixes (penalty-preset realignment, thinking_default, prompt-template divergence) are tracked separately and gated on the N≥10 paired harness. See memory project_qwen36_27b_degeneration_rootcause. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o-thinking) Additive, PCND-clean: --presence-penalty/--frequency-penalty/--no-thinking are only sent when explicitly passed; default cell unchanged (server's MODEL.toml preset). Enables the N≥10 paired data gate (shipped preset vs vLLM-minimal) that Avarok-Cybersecurity#67's MODEL.toml realignment is contingent on — so the preset change is data-driven, not n=1 (feedback_no_n1_stochastic_ab). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mbed_scale guard

Three independent fixes restore Qwen3.6-27B-FP8 long-code generation to match vLLM behavior (~14.5x later degeneration, complete coherent code).

1. weight_loader/qwen35_dense.rs: A_log and dt_bias must be FP32. Consumer kernels (ssm_preprocess.cu, mamba2_ssm_decode.cu) declare them `const float*`; loading via `dense()` kept BF16 storage, reinterpreting 48-elt BF16 (96B) as 48-elt FP32 -> per-head scrambled decay gates. The MoE sister loader (ssm_qwen35.rs:59-62) already had this fix, with an explicit warning about exponential GDR amplification at long context; the dense path missed the mirror.

2. weight_loader/qwen35_dense.rs + layers/qwen3_ssm/init.rs: ATLAS_FP8_SSM_PREFILL=1 routes SSM in_proj_qkv/out_proj through a native FP8 prefill GEMM (bf16_to_fp8 -> fp8_gemm_n128), eliminating the BF16-truncation intermediate that, amplified by k-conv's tiny weights (||conv-k|| ~ 18x smaller than ||conv-v||), rotated the k-band conv output direction. NVFP4 fallback preserved for decode batched paths. Mirrors the MoE set_fp8_weights pattern.

3. model/impl_a3.rs: scale_embeddings_fp32 now applies the same no-embed-scale guard as scale_embeddings_bf16. Without it, every non-Gemma model that opted into fp32-residual hard-failed with "Module 'embed_scale' not loaded".

Verification (greedy, --no-thinking, --max-tokens 6000, vs aligned 54-tok HF oracle and end-to-end generation):
conv-k cos 0.550 -> 0.9998
recur per-head cos 0.99 -> 1.00, with magnitude ratio 0.82 -> 0.99
gnorm per-head cos 0.82 -> 0.999
tokens_to_first_degeneration 1196 -> 17327

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
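The first fix is easier to see with the bit layouts side by side. The sketch below is host-side Rust written for illustration, not the Atlas loader or the CUDA kernels; it assumes little-endian storage and models BF16 values as raw `u16` bit patterns.

```rust
/// Widen one BF16 value (raw bits) to f32: BF16 is the upper 16 bits of an f32.
fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Correct load: widen every BF16 element, producing one f32 per element.
fn widen_bf16(raw: &[u16]) -> Vec<f32> {
    raw.iter().map(|&b| bf16_bits_to_f32(b)).collect()
}

/// The bug in miniature: reading the same bytes as f32 fuses two neighbouring
/// BF16 values into one float (little-endian: first element is the low half).
/// A 48-element BF16 buffer holds only 24 floats' worth of bytes, so a kernel
/// expecting 48 floats sees fused values and then reads past the buffer.
fn reinterpret_as_f32(raw: &[u16]) -> Vec<f32> {
    raw.chunks_exact(2)
        .map(|pair| f32::from_bits(((pair[1] as u32) << 16) | (pair[0] as u32)))
        .collect()
}
```

For the BF16 pair `[0x3F80, 0xBF00]` (1.0 and -0.5), `widen_bf16` returns both values exactly, while `reinterpret_as_f32` returns a single float close to -0.5 whose low mantissa bits are the bit pattern of the first element.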
Thanks for the contribution. Before we can merge, please sign our Contributor License Agreement by replying to this comment with exactly:
I have read the CLA Document and I hereby sign the CLA
2 out of 3 committers have signed the CLA.
Author
recheck
Sweep 2026-05-20: every Qwen-family MODEL.toml under kernels/gb10 now declares 'thinking_default = true' in [behavior]. Reasoning-tier Qwen3.5/3.6 models materially benefit from CoT on multi-step prompts; the prior implicit default of false (or explicit false on qwen3-next-80b-a3b) silenced thinking unless the caller passed chat_template_kwargs.enable_thinking=true. Per-request override and CLI --disable-thinking kill-switch retain precedence per the documented ladder (CLI flag > request body > MODEL.toml [behavior]). Note: MODEL.toml is compile-time-baked via atlas-kernels/build.rs → build_codegen.rs, so a rebuild is required for these edits to take effect in the running binary. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
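The precedence ladder (CLI flag > request body > MODEL.toml [behavior]) reduces to a small pure function. The signature below is an assumption made for the sketch; the real `resolve_thinking` may take additional inputs such as reasoning effort.

```rust
/// Resolve whether thinking is enabled for one request.
/// - `cli_disable_thinking`: the `--disable-thinking` kill-switch
/// - `request_enable_thinking`: `chat_template_kwargs.enable_thinking`, if the client sent it
/// - `model_default`: `thinking_default` from MODEL.toml [behavior]
fn resolve_thinking(
    cli_disable_thinking: bool,
    request_enable_thinking: Option<bool>,
    model_default: bool,
) -> bool {
    if cli_disable_thinking {
        return false; // the CLI kill-switch wins over everything
    }
    // An explicit per-request value beats the model default.
    request_enable_thinking.unwrap_or(model_default)
}
```

With `thinking_default = true`, a plain request that omits `enable_thinking` now resolves to `true`, while an explicit `false` in the request body still turns reasoning off.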
…stigation Distilled playbook for tracking down quality regressions where Atlas output diverges from a reference framework (vLLM / HF) on the same model checkpoint. Grounded in commit 3ebc08a (the GDN decay-gate precision + FP8 SSM prefill fixes that produced the 14.2× improvement in tokens_to_first_degeneration on Qwen3.6-27B-FP8). Covers the cheapest-signal-first elimination ladder, the byte-exact HF CPU oracle pattern, per-head magnitude as a pointer-bug fingerprint (std/min/max diagnostic), the sister-loader diff rule, reversal discipline, and the polling cadence for multi-hour reproductions. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous false value (inherited from Qwen3.5's 'Think-Satisfy' guard) unconditionally suppressed thinking on every tool-active turn, i.e. every opencode interaction, since opencode always carries tools.

Verified post-rebuild on atlas-gb10:fix sha 5072cb1562c4:
tools_active=true, --num-drafts=1, thinking_in_tools=false → 0 reasoning_content chunks
tools_active=true, --num-drafts=1, thinking_in_tools=true → 34 reasoning_content chunks

Qwen3.6's reasoning is meaningfully better than 3.5's so Think-Satisfy shouldn't re-emerge; max_thinking_budget=512 caps any drift and F28 auto-disables thinking when the previous message is a tool error. One-line revert if tool-call density regresses.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three changes (audit per DEBUGGING_METHODOLOGY.md):

1. weight_loader/qwen35_dense.rs: drop ATLAS_FP8_SSM_PREFILL env gate. The fix has been live since 3ebc08a, verified 14.2× degeneration-onset improvement and matching vLLM behavior class. Unconditional now for every FP8-on-disk dense Qwen3.6 variant. Also switch norm.weight to dense_f32_safe (mirrors the MoE sister loader's FP32-aware path; defensive against checkpoints that ship norm weights as fp32).

2. weight_loader/qwen35/load_layers/linear_attn_arms.rs: same FP8 SSM prefill path cross-ported to the MoE A3B loader. Sister-loader audit (3 parallel agents per DEBUGGING_METHODOLOGY.md §6) found that the MoE A3B has identical asymmetric conv weights (k-segment ~18× smaller than v) and was therefore vulnerable to the same conv-k SNR collapse via FP8→BF16→NVFP4→BF16 triple-quant chain. Now both variants dispatch SSM prefill through fp8_gemm_n128 (BF16 act × FP8 weight, FP32 accumulator) with NVFP4 retained as decode/batch fallback.

3. DEBUGGING_METHODOLOGY.md appendix: update Bug #1 description to note the env gate has been removed and the fix cross-ported to MoE.

Bug #2 audit was clean: a codebase-wide sweep of every const float* kernel parameter found all SSM scalar params (A_log, dt_bias, conv biases, D_param across Qwen + Nemotron) already on the dense_keep_f32 / dense_bf16_as_f32 path. No new pointer-alias sites.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Quick follow-up to 7d5e8fc: the MoE A3B Bug #1 cross-port allocates native-FP8 SSM weights inside build_linear_attention_nvfp4 (called per LinearAttention layer), which was silent. Add one tracing::info! inside the if-Fp8Dequanted block so startup logs visibly confirm the fix is active; this mirrors the dense loader's top-level log (which fires once before the layer loop). For 35B-A3B with 30 SSM layers, expect 30 'SSM[...] ... native FP8 prefill GEMM' lines. Quick rebuild only, no behavior change. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hunk-position lesson
Adds env-gated layer-0..N GDN intermediate dumper for the MoE A3B path:
ATLAS_GDN_DUMP=<dir> — base output dir
ATLAS_GDN_DUMP_LAYERS=0,15,29 — SSM-layer indices to capture
ATLAS_GDN_DUMP_N_SSM=30 — SSM-layer-count for counter modulo
The dumper hooks into trait_prefill.rs at 4 sites (post-conv, post-l2norm,
post-recurrence, post-gnorm) and captures the last token's BF16 slice.
Under chunked prefill, the file is overwritten on every scheduler chunk so
the on-disk dump corresponds to the LAST chunk's last-token (== position
L-1 of the full prefill, not chunk_len-1 of the first chunk).
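As a rough illustration of how the dump knobs could be interpreted, the sketch below parses an `ATLAS_GDN_DUMP_LAYERS`-style list and maps a running call counter onto an SSM-layer index. Both helper names are hypothetical, and dropping malformed or out-of-range entries is an assumption rather than documented Atlas behaviour.

```rust
use std::collections::BTreeSet;

/// Parse a comma list such as "0,15,29" into the set of SSM-layer indices to
/// capture. Malformed entries and indices >= `n_ssm` are ignored (assumed policy).
fn parse_dump_layers(spec: &str, n_ssm: usize) -> BTreeSet<usize> {
    spec.split(',')
        .filter_map(|entry| entry.trim().parse::<usize>().ok())
        .filter(|&idx| idx < n_ssm)
        .collect()
}

/// Map a monotonically increasing per-call counter to an SSM-layer index by
/// wrapping modulo the SSM-layer count. Returns `None` when `n_ssm` is zero.
fn ssm_layer_for_call(call_counter: usize, n_ssm: usize) -> Option<usize> {
    call_counter.checked_rem(n_ssm)
}
```

Because the counter wraps once per forward pass, every scheduler chunk revisits the same layer indices, which is why the last chunk's write is the one left on disk.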
Companion Python tooling:
hf_gdn_ref_a3b.py — HF CPU oracle for A3B (32 v-heads, conv_dim=8192,
hidden=2048). Generated TOK list from Atlas /tokenize.
gdn_chain_diff_a3b.py — diff comparator with A3B segment shapes
(q/k/v = 2048/2048/4096).
Methodology doc update: §3 gotcha #4 (don't use FP8 checkpoint as oracle:
HF silently ignores scale_inv), #5 (when comparing across SUT configs,
guarantee identical dumped positions or you'll see methodological noise).
§7 reversal log: chunked-prefill drift hypothesis added (refuted).
Investigation findings on Qwen3.6-35B-A3B at L=16k:
Layer 0: cos > 0.99985 — byte-perfect to BF16 floor at every stage.
Layer 15: gnorm cos 0.64 — depth + long-context quantization noise.
Layer 29: gnorm cos 0.71 — same class.
At L=1244 the L15/L29 drift is much smaller (gnorm 0.92 / 0.86).
Final model output remains coherent (sub-perceptual drift).
No discrete numerical bug — drift is expected quantized-inference noise.
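The `cos` figures quoted in these findings are cosine similarities between an Atlas dump and the HF reference at the same stage. The companion comparator is a Python script; the Rust functions below are an independent sketch of the two metrics (cosine and magnitude ratio), with names chosen for this example.

```rust
/// Cosine similarity between two equally sized vectors, accumulated in f64.
/// Returns `None` for mismatched lengths, empty input, or a zero-norm vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Ratio of L2 norms, ||a|| / ||b||. Cosine ignores scale, so this catches the
/// case where direction matches but magnitude has drifted.
fn magnitude_ratio(a: &[f32], b: &[f32]) -> Option<f64> {
    let norm = |v: &[f32]| v.iter().map(|&x| (x as f64) * (x as f64)).sum::<f64>().sqrt();
    let denom = norm(b);
    if denom == 0.0 {
        None
    } else {
        Some(norm(a) / denom)
    }
}
```

Reporting both numbers matters because a vector scaled by a constant still has cosine 1.0; only the magnitude ratio reveals that kind of mismatch.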
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two changes in service of long-context drift investigation:
1. Multi-layer dump support — ATLAS_GDN_DUMP_LAYERS=0,1,2,3,15,29 (comma list
of SSM-layer indices). Per-stage AtomicBool latches per (layer_idx, stage);
counter wraps mod ATLAS_GDN_DUMP_N_SSM. Last-call overwrites so dumps land
at position L-1 of the full prefill (not chunk_len-1 of the first chunk).
2. ATLAS_DISABLE_WY4=1 — forces fallback to single-token persistent kernel
(or split4) for kernel-numerics isolation. WY4 was a suspect; turned out
not to be the dominant noise source.
Investigation summary (Qwen3.6-35B-A3B, L=16k, layer-0..L15 walk against HF
BF16 baseline):
L0 cos=0.99988 (byte-perfect at all L: 31, 1244, 16100); this proves the
Bug #1/#2/#3 fixes work and rules out first-layer kernel bugs.
L1 gnorm cos=0.69547 at L=16k vs 0.94266 at L=1244
→ real, length-dependent drift starting at layer 1.
L3+ gnorm cos in 0.23-0.67 range — non-monotonic compounding.
Mechanism: FP8/NVFP4 weight-quantization noise on the per-layer in_proj_qkv
(specifically the W_z portion fed into the gnorm silu(z) gate). silu is a
non-linear amplifier near z≈0, so small per-layer FP8 noise in z becomes
large drift in gnorm output. At long L, the SSM recurrence H_t = g*H_{t-1}
+ v*k^T amplifies tiny per-step noise through the multiplicative gate
(H decay ratio compounds e^(ε·L) over L steps).
Refuted hypotheses (all evidence in commit + tests):
- Chunked-prefill state-precision loss (refuted with corrected dumper)
- FP8 KV cache as primary culprit (only +0.05 cos with BF16 KV)
- BF16 residual stream as primary culprit (FP32 residual is *worse* at L1)
- WY4 algebraic correction (disabling makes L1 worse, not better)
True fix requires loader-side precision schedule: load qkvz weight (or at
least its W_z slice) at BF16 unconditionally, accepting ~50 MB/layer extra
VRAM. Roughly 1.5 GB more for the A3B at 30 SSM layers. Not done in this
patch — needs design discussion on whether to make this default-on, env-gated,
or per-model.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Diagnostic env override that skips both FP8 and NVFP4 paths and uses the BF16 in_proj_qkvz weight via ops::dense_gemm. Tests whether weight quantization is the dominant source of layer-1+ long-context drift.

Result (Qwen3.6-35B-A3B, L=16k, vs HF BF16 baseline):
Stage      FP8/NVFP4 (default)   BF16 weights forced
L0 gnorm   0.99988               1.00000 (perfect match)
L1 gnorm   0.69547               0.4274 (WORSE by 0.27)
L2 gnorm   0.93958               0.9513 (+0.01)
L3 gnorm   0.23399               0.4655 (+0.23)

Key inference: at L0 BF16 weights give byte-perfect match (cos=1.0000) proving Atlas's BF16 GEMM == HF's BF16 GEMM when there's no SSM-state or downstream-layer mixing. At L1, identical BF16 weights produce a substantially DIFFERENT result from HF (cos 0.43, magnitude diverges too). This rules out FP8/NVFP4 weight precision as the dominant drift source at deeper layers under long context.

Remaining suspects (real causes):
- Atlas's GDN prefill kernels (WY4/persistent/split4) vs HF's chunk_gated_delta_rule have different FP32-accumulation reduction orders. Per-step recurrence errors accumulate differently.
- L2-norm reduction order on q,k may differ.
- Gated RMSNorm kernel's silu(z) is computed with __expf (fast intrinsic) on Atlas but expf (IEEE) on HF; small per-element differences amplify through the non-linear gate.
- Attention layer (model layer 3+) at L=16k has FP8 weights and a 16k-token softmax; its output noise propagates into downstream SSM layers.

Each of these is a non-trivial kernel-level investigation. The dumper + Python comparator from this commit chain (+ the per-layer walk pattern proven here) are the right infrastructure to drill further when prioritized.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hts override
Three diagnostic env knobs for long-context drift investigation:
ATLAS_DISABLE_WY4=1 — fall back to single-token persistent kernel
ATLAS_FORCE_PERSISTENT=1 — force per-token persistent at any k
ATLAS_GDN_BF16_WEIGHTS=1 — skip FP8/NVFP4 paths, use BF16 dense GEMM
Per-substep dumpers added at:
pre_norm (layer input from residual stream)
post_norm (post-input_norm, into in_proj_qkv)
post_qkvz (post-deinterleave Q|K|V|Z, into conv1d)
Investigation findings on Qwen3.6-35B-A3B at L=16k vs HF Qwen3.6-35B-A3B
BF16 baseline:
L0 gnorm cos = 0.99988 (clean at all L)
L1 gnorm cos = 0.43-0.70 depending on config
L3+ drift compounds non-monotonically
Combinations tried (no single fix found):
default (FP8/NVFP4): L1 gnorm 0.695
BF16 KV cache: L1 gnorm 0.704 (+0.01)
ATLAS_FP32_RESIDUAL=1: L1 gnorm 0.386 (worse)
ATLAS_DISABLE_WY4=1 → split4: L1 gnorm 0.301 (worse)
ATLAS_GDN_BF16_WEIGHTS=1: L1 gnorm 0.427 (worse — and proves
weight precision is NOT the cause)
ATLAS_FORCE_PERSISTENT=1: L1 gnorm 0.595 (slightly worse)
Max precision (BF16w + FP32res + persistent): L1 gnorm 0.648
Key inference: at L0 with BF16 weights, cos=1.00000 (byte-perfect)
proving Atlas's GEMM == HF's GEMM with identical inputs. At L1, same
configuration gives cos=0.43, meaning L1's INPUT differs from HF's L1
input. The drift must come from the L0→L1 transition: residual_add
chain that includes L0's out_proj output + post_attn_norm + MoE
output (NOT covered by ATLAS_GDN_BF16_WEIGHTS which only affects
qkvz GEMM, NOT the MoE experts or out_proj).
Remaining suspect: MoE expert quantization noise propagating via the
residual stream. The MoE has 256 experts with FP8/NVFP4 weights and
per-token routing; each token's MoE output has small quant noise that
ADDS to the residual, drifting L1+ inputs vs HF.
The pre_norm/post_norm dumpers have a subtle bug at small-L (capture
zeros for short prompts but real data for long) — buffer layout
interaction. Filed but not blocking the main finding.
Cleanest next move: instrument MoE output dumping + comparison vs HF
to confirm/refute the MoE-quant-noise hypothesis. If confirmed, loading
MoE experts at BF16 (large memory cost) is the only real fix path.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The DFlash drafter constructor unconditionally applied YaRN scaling with
factor=64 / original_max_position_embeddings=4096 / beta_fast=32 /
beta_slow=1 hardcoded. The v2 2026-04-27 Qwen3.6-DFlash drafter ships
`rope_scaling: null` (plain RoPE), so every low-frequency RoPE pair was
mis-scaled — pairs 0..11 divided by 64, pairs 11..26 ramped — landing
drafter Q/K rotations in the wrong angular basis at every layer.
Replace the hardcoded YaRN block with a config-driven loader:
* `DflashConfig` gains `rope_theta` (default 10M, matches Qwen3.6) and
an optional `rope_scaling` block mirroring HF transformers' Qwen3
config so `serde_json::from_str` works directly on the drafter's
`config.json`.
* `BlockDiffusionDraftHead::from_weights` reads that block:
- `None` ⇒ plain RoPE (`inv_freq[j] = 1 / θ^(2j/dim)`).
- `Some(yarn)` ⇒ existing YaRN formula, parameters from config.
- Other / unrecognised `rope_type` ⇒ warn and fall back to plain.
Tested against the v2 Qwen3.6-DFlash drafter; per-layer drafter hidden
states are now bit-perfect with the PyTorch reference forward (cos=1.0
through all 5 drafter layers). End-to-end DFlash speculative decoding
still has a separate bug downstream of the drafter — investigation
ongoing — but the RoPE basis is now correct and worth landing on its
own.
Refs: Avarok internal thread #development 2026-05-19.
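A minimal sketch of the config-driven selection described above, assuming the drafter's `rope_scaling.rope_type` is the string "yarn" when YaRN is in use. `RopeMode`, `resolve_rope_mode` and `plain_inv_freq` are hypothetical names for illustration, not the identifiers used in `BlockDiffusionDraftHead::from_weights`, and the YaRN formula itself is left out.

```rust
/// Which RoPE variant to build. Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum RopeMode {
    /// `rope_scaling: null` in config.json: plain RoPE.
    Plain,
    /// A recognised YaRN block: parameters come from the config.
    Yarn,
    /// Unrecognised `rope_type`: warn and fall back to plain RoPE.
    FallbackPlain,
}

/// Map the optional `rope_type` string onto a mode.
/// The literal "yarn" is an assumption about the config's spelling.
pub fn resolve_rope_mode(rope_type: Option<&str>) -> RopeMode {
    match rope_type {
        None => RopeMode::Plain,
        Some("yarn") => RopeMode::Yarn,
        Some(_) => RopeMode::FallbackPlain,
    }
}

/// Plain RoPE inverse frequencies: inv_freq[j] = 1 / theta^(2j/dim),
/// one entry per rotary pair (dim / 2 entries in total).
pub fn plain_inv_freq(theta: f64, dim: usize) -> Vec<f64> {
    (0..dim / 2)
        .map(|j| 1.0 / theta.powf((2 * j) as f64 / dim as f64))
        .collect()
}
```

With plain RoPE the first pair always has `inv_freq[0] = 1` and later pairs decrease monotonically, which gives a quick sanity check against an accidental division by a scaling factor.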
Adds dump hooks at:
  out_proj : SSM out_proj output (between gnorm and residual chain)
  moe_out  : MoE final output (post shared-expert blend)
Definitive findings (Qwen3.6-35B-A3B, L=16k, BF16 weights forced for qkvz):
  L0 SSM out_proj cos vs HF:        0.99210 (clean; SSM out_proj works)
  L0 MoE out cos vs HF combined:    0.42130  |Atlas|/|HF| = 2.266 (drifted)
  L0 MoE out cos vs HF routed-only: 0.40608  |Atlas|/|HF| = 2.368 (drifted)
The 2.37× magnitude inflation against HF's routed-only output proves the
drift originates in the ROUTED-expert pathway, NOT the shared expert
(which only contributes ~4% magnitude in HF).
Ruled out:
- Top-K routing normalization (kernel reviewed, dispatch.rs:72 hardcodes
  norm_topk_prob=true for qwen3_5_moe family)
- Shared expert sigmoid gating (Atlas applies it via moe_batched_blend)
- Chunked prefill (refuted earlier in this branch)
- SSM kernels (L0 gnorm + out_proj are clean)
Remaining suspect: FP8 expert weight scaling. Each of 256 experts has 3
FP8-quantized matrices (gate/up/down), each with per-block weight_scale
and per-matrix weight_scale_2. An indexing or magnitude error in scale
loading would systematically inflate the routed weighted sum by the same
factor for all tokens. The 2.37× ratio is consistent with a
~sqrt(K=8)≈2.83 or a single global mis-scaled weight matrix.
Cleanest next experiment: dump a single expert's GEMM input + output +
weight_scale + weight_scale_2 from Atlas, compute HF expert(x) on the
same input using BF16 weights, and compare. If Atlas's per-expert output
differs by a constant factor from HF's, the FP8 scale loading is the bug.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
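For readers reproducing the `cos` and `|Atlas|/|HF|` figures quoted in these commits, the two metrics can be sketched as below. This is an illustrative helper, not the Python comparator used in the investigation; `cosine_similarity` and `norm_ratio` are hypothetical names, and accumulating in f64 is an assumption made to keep the metric itself from adding noise.

```rust
/// L2 norm of an activation vector, accumulated in f64.
fn l2_norm(v: &[f32]) -> f64 {
    v.iter().map(|&x| (x as f64) * (x as f64)).sum::<f64>().sqrt()
}

/// Cosine similarity between two dumped activation vectors.
/// Returns None when lengths differ or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(&x, &y)| x as f64 * y as f64).sum();
    let (na, nb) = (l2_norm(a), l2_norm(b));
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na * nb))
}

/// Magnitude ratio |a| / |b|, e.g. |Atlas| / |HF|.
/// Returns None when the reference vector has zero norm.
pub fn norm_ratio(a: &[f32], b: &[f32]) -> Option<f64> {
    let nb = l2_norm(b);
    if nb == 0.0 {
        None
    } else {
        Some(l2_norm(a) / nb)
    }
}
```

Reporting both numbers matters here: a cosine near 1 with a ratio far from 1 points at a scale error, while a low cosine with a matching magnitude points at a direction change such as different expert selection.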
… cause localized to L0 MoE routing
Diagnostic env knobs added:
ATLAS_DUMP_EXPERT_IDS=1 — logs top-K expert indices + weights per token
per layer (in moe/forward_prefill_fp8.rs)
+ ATLAS_GATE_INPUT log (router_in last_tok |x|+first5)
+ ATLAS_GATE_LOGITS log (raw pre-softmax top-10 + stats)
ATLAS_DUMP_EMBED=1 — logs hidden_dst magnitude pre/post scale_embeddings
(in prefill_b/embed_chunk.rs)
ATLAS_GDN_BF16_WEIGHTS=1 — extension: also installs BF16 out_proj_dense in
the MoE A3B loader so the dispatcher takes the
dense_gemm BF16 path (overrides FP8/NVFP4 out_proj).
Investigation chain (Qwen3.6-35B-A3B, L=16k, last token = pos 16098 = tok 271):
✅ embedding lookup: Atlas bit-identical to HF (|x|=0.3164, first5 match)
✅ embed_tokens.weight tensor: bit-identical between FP8 and BF16 checkpoints
✅ gate.weight tensor: bit-identical between checkpoints
✅ post_attention_layernorm.weight: bit-identical between checkpoints
✅ SSM out_proj cos vs HF: 0.99210 (clean)
✅ SSM gnorm cos vs HF: 0.99988 (clean)
❌ Gate INPUT direction:
Atlas |x|=23.06 first5=[ 0.352, -0.359, -0.598, 0.070, 0.408]
HF |x|=25.43 first5=[-0.832, -0.027, 0.297, -0.046, 0.036]
Magnitudes match within 10% but per-element directions completely differ.
❌ Gate LOGITS top-10:
Atlas: [102, 131, 15, 21, 228, 52, 93, 96, 14, 116] (raw, mean -5.66)
HF: [ 81, 140, 158, 208, 200, 132, 115, 86, 95, 206] (post-softmax)
❌ Top-K=8 expert selection overlap: 0/8 (ZERO common experts)
❌ MoE output cos vs HF: 0.42 (catastrophic divergence)
❌ BF16 out_proj fix attempt: NO meaningful change (still 0/8 overlap)
— eliminates SSM out_proj FP8 quant as the cause
Conclusion: with all upstream weights and SSM outputs bit-identical or
sufficiently close, the divergent gate input MUST come from a structural
difference in either residual_add_rms_norm (kernel formula/precision)
or some intermediate step I haven't instrumented yet. The MoE block's
top-K routing is so sensitive to gate-input direction that even a small
upstream noise produces 0/8 overlap with HF's selection.
Remaining suspects for the deeper drill:
1. residual_add_rms_norm kernel: Atlas's formula and HF's torch RMSNorm
may differ in some accumulation detail (eps, reduction order, post-norm
weight multiplication order). The kernel's BF16 input/output with
FP32 reduction is correct, but the formula might differ subtly.
2. The 'hidden' buffer at residual_add_rms_norm time may not be the
pristine embedding — some other op may have written to it (vision
overlay? marconi snapshot restore? warmup leftovers?).
3. post_attention_layernorm WEIGHT loading: bit-identical on disk
but possibly loaded with wrong stride/transpose in Atlas's
'load_dense_ffn' or similar.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
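The "0/8 overlap" metric used above can be sketched as follows: take the top-K indices of each router's logits and count the shared experts. `top_k_indices` and `overlap` are hypothetical helper names, not functions from the Atlas codebase, and ties are broken by lower index purely for determinism.

```rust
/// Indices of the `k` largest logits, highest first.
/// The sort is stable, so equal logits keep their lower index first.
pub fn top_k_indices(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(k);
    idx
}

/// Number of expert ids present in both selections.
pub fn overlap(a: &[usize], b: &[usize]) -> usize {
    a.iter().filter(|&&i| b.contains(&i)).count()
}
```

Applied to the first eight entries of the two top-10 lists logged in this commit, `overlap` returns 0, matching the reported 0/8. Softmax is monotonic, so comparing raw-logit indices against post-softmax indices does not change the selected set.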
The v1 `moe_fp8_grouped_gemm` kernel was originally documented as just
having a coalescing performance bug. While debugging Qwen3.6-35B-A3B-FP8
producing gibberish at 16k context, captured per-expert dumps showed v1
has a NUMERICAL bug for some (token, expert) tile combinations:
chunk-4 last-token, expert 200, up_proj output:
v1 (default): |x|=28.0 ← 5× too large
v2 (coalesced): |x|=?? ← correct shape
HF baseline (BF16 oracle): |x| ~ 5
The amplification propagates: up=28 then silu(gate)*up=8.4 then
down_proj→4.5, vs the other 7 experts' down output ~0.2-0.6.
At the prefill-chunk-4 boundary this drives a 42% residual-stream
amplification (Atlas L1 input |x|=0.819 vs HF 0.577); v2 brings it to
6% (0.611 vs 0.577) and Atlas now produces a coherent summary of a
16k-token prompt instead of gibberish.
Verification:
- L0 chunk-4 routed-only MoE output: v1=0.44, v2=0.26, HF=0.21
- 16k-context generation: now produces clean summary
- v1 path preserved as env override ATLAS_FP8_MOE_COALESCED=0
Diagnostic instrumentation kept (gated on ATLAS_DUMP_EXPERT_IDS=1):
ATLAS_GATE_INPUT, ATLAS_GATE_LOGITS, ATLAS_EXPERT_IDS,
ATLAS_MOE_OUT, ATLAS_ROUTED_ONLY, ATLAS_SHARED_OUT,
ATLAS_SHARED_GATE; pre-norm SUM/HIDDEN/OUTPROJ in qwen3_ssm.
Zero overhead when env unset. Plus a diagnostic
ATLAS_FORCE_NVFP4_MOE=1 path in qwen35 load_layers for future
same-class bug bisection.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The expert_gate_out/up_out/down_out buffers were only zeroed when
`ctx.comm.is_some()` (EP mode). In single-GPU mode they were left
uninitialized, which is a problem because:
  max_m_tiles = (avg_per_expert * 2).div_ceil(64).max(1)
assumes peak-per-expert ≤ 2× average; skewed routing violates this, so
the grouped GEMM kernel skips rows past max_m_tiles*64. Those rows keep
STALE DATA from the previous prefill (or uninitialized memory on the
first prefill), which propagates through unpermute_reduce as spurious
contributions to the routed-MoE output.
Effect was chunk-size + run-history dependent:
- Same prompt, same gate_input (25.97), different ATLAS_ROUTED_ONLY:
    4-chunk first-fire:   0.65
    4-chunk after-warmup: 0.35
    2-chunk:              1.13
    All vs HF baseline:   0.21
After fix (zero-init unconditional):
  3 runs same prompt: 0.186, 0.186, 0.184 (deterministic, -14% vs HF)
  Full L0..L39 sweep vs HF: 40/40 within 5% magnitude, 6+/8 expert overlap
  (Previously L1-L6 were 20-38% over with 1-5/8 overlap)
Also: one-shot tracing log to verify v2-vs-v1 kernel selection at runtime
(helpers_a.rs), useful for future bisection.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
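The stale-row mechanism described in this commit can be illustrated with a CPU-side toy model. This is a sketch under simplifying assumptions: `fake_grouped_gemm` and `run_prefill` are hypothetical stand-ins for the GPU kernel and for `unpermute_reduce`, and each "row" is a single float.

```rust
/// Stand-in for a grouped GEMM whose grid only covers the first `cap`
/// rows: rows past the cap are never written.
pub fn fake_grouped_gemm(out: &mut [f32], computed: &[f32], cap: usize) {
    let n = cap.min(computed.len()).min(out.len());
    out[..n].copy_from_slice(&computed[..n]);
}

/// One "prefill": optionally zero the reused buffer, run the capped
/// kernel, then sum every row the reduce step would read.
pub fn run_prefill(buf: &mut [f32], computed: &[f32], cap: usize, zero_init: bool) -> f32 {
    if zero_init {
        buf.fill(0.0);
    }
    fake_grouped_gemm(buf, computed, cap);
    // Stand-in for unpermute_reduce: it reads all assigned rows,
    // including those the kernel skipped.
    buf.iter().take(computed.len()).sum()
}
```

With a buffer that still holds values from an earlier run, the sum depends on that history; with the unconditional zero-init the skipped rows contribute nothing, so the result is deterministic but still low until the tile cap itself is fixed.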
The previous heuristic max_m_tiles = (avg_per_expert * 2).div_ceil(64)
silently truncated heavily-loaded experts. Observed in
Qwen3.6-35B-A3B-FP8 at 4097-token chunk:
ATLAS_EXPERT_LOAD: n_tokens=4097 avg=129 max=929 (expert 227)
max_m_tiles=5 kernel_cap=320 truncated=true
Expert 227 had 929 tokens assigned but the kernel grid only covered
320 rows — the remaining 609 rows were zeroed (after the zero-init
fix) but never computed. Those 609 tokens lost their expert-227
contribution entirely, producing a SYSTEMATIC -14% bias in routed-MoE
output at L0.
Fix: size max_m_tiles for the worst possible case where one expert
takes all tokens:
max_m_tiles = (num_tokens * top_k).div_ceil(64)
This launches some empty tiles for under-utilized experts, but each
early-exits on `m_idx >= M_expert` so the overhead is small relative
to the previous correctness bug.
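The arithmetic behind the truncation can be checked with a short sketch using the numbers from the ATLAS_EXPERT_LOAD log above. The function names and the `TILE_M` constant are hypothetical; `top_k = 8` is taken from the Top-K=8 routing mentioned earlier in this branch.

```rust
/// Rows covered by one M-tile of the grouped GEMM (64 in this commit).
pub const TILE_M: usize = 64;

/// Previous heuristic: assume no expert exceeds 2x the average load.
pub fn max_m_tiles_heuristic(avg_per_expert: usize) -> usize {
    (avg_per_expert * 2).div_ceil(TILE_M).max(1)
}

/// New sizing: worst case where one expert receives every routed token.
pub fn max_m_tiles_worst_case(num_tokens: usize, top_k: usize) -> usize {
    (num_tokens * top_k).div_ceil(TILE_M)
}

/// Rows assigned to the busiest expert that the kernel grid never reaches.
pub fn truncated_rows(max_per_expert: usize, max_m_tiles: usize) -> usize {
    max_per_expert.saturating_sub(max_m_tiles * TILE_M)
}
```

For avg=129 the heuristic gives 5 tiles (320 rows), so expert 227's 929 tokens lose 609 rows. The worst-case sizing for 4097 tokens and top_k=8 gives 513 tiles, which covers any possible per-expert load at the cost of launching tiles that exit early.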
Result vs HF baseline (16k prompt, L0..L39 chunk-4 last-token):
Before all fixes: L0..L6 had 20-38% over-amplification + 1-5/8 overlap
After v2+zero-init: -14% bias systematic across all layers
After this fix:
- L0 MOE_OUT: 0.2084 vs HF 0.2154 (-3.3%)
- Mean ratio 0.996, stddev 0.0085, all 40 layers in [0.977, 1.021]
- Mean overlap 7.5/8, 21/40 layers perfect 8/8 overlap
- = FP8 quantization noise floor
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- /.dgx2-work/, /tasks/, /target-rebuild/, /spark-fastbin-*
- bench/longcode/hang-forensics/{*.log,*.jsonl,*.py,*.sh,...}
- docker/gb10/Dockerfile.{fix,67fix,combo,ffnbf16,fp32res,hangdiag,zeropen}
(the canonical Dockerfile + Dockerfile.fast + Dockerfile.fence +
Dockerfile.gemma-fp32 stay tracked)
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the FP8 path fixes (commits 34626d3 + adf39ce) to the NVFP4
grouped-GEMM forward (`forward_prefill.rs`). Same two bugs, same symptoms
expected:
1. Zero-init expert output buffers unconditionally (was only in EP mode
   via `ctx.comm.is_some()`). Some grouped-GEMM kernel paths skip rows
   past per-expert end; without zero-init those rows kept stale data from
   previous prefills which contaminated the unpermute_reduce sum.
2. max_m_tiles = (num_tokens * top_k).div_ceil(64) (worst case), not
   (avg * 2).div_ceil(64). Real MoE routers concentrate experts ~7× the
   average (observed on Qwen3.6-A3B at chunk=4097: avg=129, max=929 for
   expert 227). The Poisson(avg) assumption in the prior comment is wrong
   for trained routers. The (avg*2) truncation silently lost ~14% of
   routed-MoE magnitude per layer.
Verified on AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4: 16k-context generation
now produces a clean, coherent code summary.
Models likely impacted (NVFP4 routed-expert MoE): MiniMax M2.7-NVFP4,
Mistral-Small-4-119B-NVFP4, Nemotron-3-Nano-30B-A3B-NVFP4,
Qwen3-VL-30B-A3B-NVFP4, Qwen3.6-A3B-heretic-NVFP4, Gemma-4-31B-NVFP4,
Qwen3.5-27B-NVFP4. All had latent under-counting and run-to-run
non-determinism in their routed-MoE path.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Problem
Qwen3.6-35B-A3B-FP8 with thinking mode enabled (--tool-call-parser qwen3_coder) produces empty-string values for
required tool call parameters. The model reasons about the call inside its thinking block, which causes the XML parameter
extractor to see whitespace-only content and emit "" for typed fields — breaking downstream clients that expect
integers, booleans, or arrays.
Solution
Add a qwen3_xml parser that is identical to qwen3_coder in wire format and grammar, but applies a schema-driven
type coercion pass after extraction. String values are rewritten to the JSON type declared in the tool's
parameters schema:
┌──────────────────┬───────────────────────────────────────────────────┐
│ Schema type │ Coercion │
├──────────────────┼───────────────────────────────────────────────────┤
│ integer / number │ "10" → 10, "3.14" → 3.14 │
├──────────────────┼───────────────────────────────────────────────────┤
│ boolean │ "true" / "True" → true, "false" / "False" → false │
├──────────────────┼───────────────────────────────────────────────────┤
│ array / object │ JSON-parse the string value │
├──────────────────┼───────────────────────────────────────────────────┤
│ null │ "null" → JSON null │
├──────────────────┼───────────────────────────────────────────────────┤
│ Anything else │ left as-is │
└──────────────────┴───────────────────────────────────────────────────┘
Coercion never panics and never drops fields — unrecognised types and unparseable values are left unchanged.
Qwen3CoderParser is unmodified; existing tests stay green.
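The scalar rows of the coercion table can be sketched as below. This is an illustration of the rule, not the parser's implementation: `Value` and `coerce` are hypothetical names standing in for whatever JSON type the server uses, trimming surrounding whitespace before parsing is an assumption, and the array/object row (which needs a real JSON parser) is deliberately left to the fallback arm.

```rust
/// Minimal stand-in for a JSON value (scalars only, for illustration).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
}

/// Rewrite an extracted string to the type declared in the tool schema.
/// Never panics and never drops the field: anything unrecognised or
/// unparseable comes back as the original string.
pub fn coerce(raw: &str, schema_type: &str) -> Value {
    let s = raw.trim(); // assumption: surrounding whitespace is ignorable
    let keep = || Value::Str(raw.to_string());
    match schema_type {
        "integer" => s.parse::<i64>().map(Value::Int).unwrap_or_else(|_| keep()),
        "number" => s
            .parse::<i64>()
            .map(Value::Int)
            .ok()
            // Reject inf/NaN, which Rust parses but JSON cannot represent.
            .or_else(|| s.parse::<f64>().ok().filter(|f| f.is_finite()).map(Value::Float))
            .unwrap_or_else(|| keep()),
        "boolean" => match s {
            "true" | "True" => Value::Bool(true),
            "false" | "False" => Value::Bool(false),
            _ => keep(),
        },
        "null" if s == "null" => Value::Null,
        // array / object would be JSON-parsed here; omitted in this sketch.
        _ => keep(),
    }
}
```

Falling back to the original string on every failure path is what keeps the pass safe to run unconditionally: a value the schema cannot explain reaches the later validation step unchanged.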
The hook runs at both call sites: non-streaming (build_choice_message) and streaming (handle_complete_tool_call +
handle_tool_call_delta), after backfill_required_params and before normalize_paths / validate_tool_calls.
kernels/gb10/qwen3.6-35b-a3b/MODEL.toml is updated to select qwen3_xml automatically via the Tier-2
[behavior].tool_call_parser override — no CLI flag needed.
Closes #
Test plan
Passes all standard tests, builds clean, and is confirmed to run on GX10 hardware with the following docker: