Qwen3_xml Tool Parser by G-Deca · Pull Request #73 · Avarok-Cybersecurity/atlas

G-Deca · 2026-05-20T03:08:37Z

Summary

Problem

Qwen3.6-35B-A3B-FP8 with thinking mode enabled (--tool-call-parser qwen3_coder) produces empty-string values for
required tool call parameters. The model reasons about the call inside <think>, which causes the XML parameter
extractor to see whitespace-only content and emit "" for typed fields — breaking downstream clients that expect
integers, booleans, or arrays.

Solution

Add a qwen3_xml parser that is identical to qwen3_coder in wire format and grammar, but applies a schema-driven
type coercion pass after extraction. String values are rewritten to the JSON type declared in the tool's
parameters schema:

┌──────────────────┬───────────────────────────────────────────────────┐
│ Schema type      │ Coercion                                          │
├──────────────────┼───────────────────────────────────────────────────┤
│ integer / number │ "10" → 10, "3.14" → 3.14                          │
├──────────────────┼───────────────────────────────────────────────────┤
│ boolean          │ "true" / "True" → true, "false" / "False" → false │
├──────────────────┼───────────────────────────────────────────────────┤
│ array / object   │ JSON-parse the string value                       │
├──────────────────┼───────────────────────────────────────────────────┤
│ null             │ "null" → JSON null                                │
├──────────────────┼───────────────────────────────────────────────────┤
│ Anything else    │ left as-is                                        │
└──────────────────┴───────────────────────────────────────────────────┘

Coercion never panics and never drops fields — unrecognised types and unparseable values are left unchanged.
Qwen3CoderParser is unmodified; existing tests stay green.
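
The coercion rules in the table can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Value` and `coerce` are hypothetical names, the enum stands in for a real JSON value type, and the array / object row is left out because it needs a JSON parser. Leaving `"3.14"` unchanged under `integer` is also an assumption about how unparseable values are treated.

```rust
/// Simplified stand-in for a JSON value (hypothetical; a real parser would
/// use a full JSON value type that also covers arrays and objects).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
}

/// Coerce a raw string parameter to the type named in the tool schema.
/// Unknown schema types and unparseable values fall through unchanged,
/// so the function never panics and never drops a field.
fn coerce(raw: &str, schema_type: &str) -> Value {
    let t = raw.trim();
    let coerced = match schema_type {
        "integer" => t.parse::<i64>().ok().map(Value::Int),
        "number" => t
            .parse::<i64>()
            .ok()
            .map(Value::Int)
            // Reject "inf" / "NaN", which Rust's f64 parser would accept.
            .or_else(|| t.parse::<f64>().ok().filter(|f| f.is_finite()).map(Value::Float)),
        "boolean" => match t {
            "true" | "True" => Some(Value::Bool(true)),
            "false" | "False" => Some(Value::Bool(false)),
            _ => None,
        },
        "null" if t == "null" => Some(Value::Null),
        _ => None,
    };
    // Anything that could not be coerced keeps its original string form.
    coerced.unwrap_or_else(|| Value::Str(raw.to_string()))
}
```

Returning the original string on every failure path is what keeps the pass safe to run on output from any model: the worst case is that a value stays exactly as the extractor produced it.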

The hook runs at both call sites: non-streaming (build_choice_message) and streaming (handle_complete_tool_call +
handle_tool_call_delta), after backfill_required_params and before normalize_paths / validate_tool_calls.

kernels/gb10/qwen3.6-35b-a3b/MODEL.toml is updated to select qwen3_xml automatically via the Tier-2
[behavior].tool_call_parser override — no CLI flag needed.

Closes #

Test plan

Passes all standard tests, builds clean, and is confirmed to run on GX10 hardware with the following Docker command:

```bash
docker run \
  -p 8888:8888 \
  --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  atlas-gb10 \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --port 8888 \
  --bind 0.0.0.0 \
  --max-seq-len 131072 \
  --kv-cache-dtype fp8 \
  --kv-high-precision-layers auto \
  --gpu-memory-utilization 0.90 \
  --scheduling-policy slai \
  --tool-call-parser qwen3_xml \
  --enable-prefix-caching \
  --speculative
```

- [ ] `cargo fmt --all -- --check`
- [ ] `ATLAS_SKIP_BUILD=1 cargo clippy --workspace --tests --all-features -- -Dwarnings`
- [ ] `bash scripts/check-license-headers.sh`
- [ ] Tested against a real model / hardware if the change affects runtime behaviour
- [ ] Added or updated tests where applicable

## Notes for reviewers

<!-- Design rationale, trade-offs, follow-ups you deferred, things you want a second opinion on. -->

## CLA

- [ ] I have read and agree to the [Contributor License Agreement](../CLA.md).

The bundled `mtp.safetensors` for `Qwen/Qwen3.6-27B-FP8` carries a single full-attention layer + a dense gate/up/down MLP (no router, no experts). Atlas's MTP loader assumed every MTP head was MoE-shaped, so the dense loader (`Qwen35DenseWeightLoader`) just stubbed `load_mtp_weights → None` and `--speculative` silently no-opped.

Changes:

- `MtpWeights` gains `dense_ffn: Option<DenseExpertWeight>`. `load_mtp` auto-detects FFN flavor by inspecting weight names: presence of `mtp.layers.0.mlp.gate_proj.weight` without a `.gate.weight` router selects the dense path; the MoE fields are populated with NULL placeholders.
- `MtpHead` gains a `dense_ffn_generic` projection triple. The constructor short-circuits all MoE quantization when dense weights are present (NVFP4 mode is rejected: Qwen3.6-27B-FP8 ships an FP8 MTP head, so users pass `--mtp-quantization fp8` or `bf16`).
- `MtpHead::forward_one` step 10 dispatches `dense_ffn_forward_generic` when the dense triple is populated. The dense path reuses the existing `dense_gemv_*` and `moe_silu_mul` kernels, with no new kernel wiring or PTX changes.
- `Qwen35DenseWeightLoader::load_mtp_weights` now calls `load_mtp` when `mtp.fc.weight` is present in the store. The single auto-detecting loader handles both MoE (35B-A3B) and dense (27B) variants.
- `factory::build` warns when `--speculative` is requested but no MTP weights were loaded (was: silent no-op).

Tested via `scripts/check.sh check -p spark-model -p spark-server`. End-to-end validation pending sparkrun on dgx2 with the `qwen3.6-27b-dense-fp8-mtp-atlas` recipe.
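
The name-based FFN detection described in this commit could look roughly like the sketch below. `FfnFlavor` and `detect_mtp_ffn` are hypothetical names, and matching the router by an `mtp.` prefix plus a `.gate.weight` suffix is an assumption; only the two weight-name patterns themselves come from the commit message.

```rust
/// FFN flavor of an MTP head, as inferred from weight names.
#[derive(Debug, PartialEq)]
enum FfnFlavor {
    Dense,
    Moe,
}

/// Hypothetical helper: classify the MTP head from the weight names
/// present in the store. Returns None when neither pattern is found.
fn detect_mtp_ffn(names: &[&str]) -> Option<FfnFlavor> {
    // A router weight marks a MoE head. Note that "gate_proj.weight"
    // does not end with ".gate.weight", so it is not mistaken for one.
    let has_router = names
        .iter()
        .any(|n| n.starts_with("mtp.") && n.ends_with(".gate.weight"));
    // A plain gate projection directly under the MLP marks a dense head.
    let has_dense_gate = names.contains(&"mtp.layers.0.mlp.gate_proj.weight");

    if has_router {
        Some(FfnFlavor::Moe)
    } else if has_dense_gate {
        Some(FfnFlavor::Dense)
    } else {
        None
    }
}
```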

Qwen3.6-27B-FP8 is the dense text sibling of the Qwen3.6-VL family. Its config.json declares the same `vision_config` block as the VL siblings (Qwen ships them as one config family), so since `82f4794 (Apple Metal)` widened `is_qwen3_vl()` to also match `qwen3_5 + vision`, the routing in `loader_for_config` sends the 27B dense checkpoint to `Qwen3VLWeightLoader`, which assumes MoE and panics on the missing `mlp.gate.weight`.

The fix: check `is_qwen35_dense()` (requires `num_experts == 0`) BEFORE `is_qwen3_vl()`. Dense text-only checkpoints are unambiguously distinguishable by `num_experts == 0`; VL-MoE always has experts.

Repro: `spark serve Qwen/Qwen3.6-27B-FP8` → "Failed to build model: Weight 'model.layers.0.mlp.gate.weight' not found in store"

After the fix the dense loader handles it (and now also picks up the bundled MTP head; see 43a04d2).
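
The ordering fix can be illustrated with a small sketch. The `ModelConfig` fields and the `Loader` enum are hypothetical simplifications; only the predicate names, `loader_for_config`, and the rule "dense check before VL check" are taken from the commit message.

```rust
#[derive(Debug, PartialEq)]
enum Loader {
    Qwen35Dense,
    Qwen3Vl,
    Other,
}

/// Hypothetical, reduced view of config.json.
struct ModelConfig {
    model_type: String,
    has_vision_config: bool,
    num_experts: u32,
}

impl ModelConfig {
    /// Dense text checkpoints have no experts at all.
    fn is_qwen35_dense(&self) -> bool {
        self.model_type == "qwen3_5" && self.num_experts == 0
    }

    /// Widened predicate: also matches qwen3_5 configs that carry a vision block.
    fn is_qwen3_vl(&self) -> bool {
        self.model_type == "qwen3_vl" || (self.model_type == "qwen3_5" && self.has_vision_config)
    }
}

fn loader_for_config(cfg: &ModelConfig) -> Loader {
    // Order matters: the dense check must win, because the dense 27B
    // config also satisfies the widened VL predicate.
    if cfg.is_qwen35_dense() {
        Loader::Qwen35Dense
    } else if cfg.is_qwen3_vl() {
        Loader::Qwen3Vl
    } else {
        Loader::Other
    }
}
```

With the two checks in the opposite order, the first test case below would route to the VL loader, which is the failure in the repro.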

…litting them (Avarok-Cybersecurity#62) * fix(thinkbrake): defer forced </think> past code fences instead of splitting them F2 confidence early-stop (and the thinking-budget cap) forced `</think>` the instant they tripped — including mid-line inside a ```python block the model was drafting in its reasoning. Code tokens are near- deterministic (top-1 >=0.95 for long runs) so F2 trips trivially on a code block; the forced boundary split a statement and the reasoning parser then cut reasoning_content mid-token, leaking the rest into content. User report: "stops half-way through thinking while generating a codeblock." Fix: track ``` code-fence parity per sequence via the atomic fence token id (resolved once in tokenizer_runtime, fail-open if a tokenizer splits it). F2 keeps DETECTING/arming everywhere (a model can ramble in code forever and must stay brakeable), but the forced `</think>` INJECTION is deferred until the fence closes — a safe boundary right after the code block, never mid-statement. THINK_LOOP period-repeat watchdog is unchanged and still active inside fences. Pure decisions extracted to SSOT helpers, called by production and asserted by tests: toggle_code_fence, confidence_run_step, should_inject_think_end. spark-server scheduler::helpers suite 18/18 green under #![deny(warnings)]. Verification: live repro (fibonacci, Qwen3.6-27B-FP8 + MTP) confirmed the mid-codeblock split is gone and the ```python block is emitted intact. End-to-end re-check of the defer-refinement's post-fence brake timing is the remaining live step. 
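
The two pure decisions named above can be sketched as follows. The signatures are assumptions, and the constant name `THINK_DEFER_ABS_CEILING` is hypothetical; the sketch also folds in the bounded deferral (3x budget, 2048 ceiling when no budget is set) that a later commit in this series adds.

```rust
/// Factor from the later bounded-defer fix in this series.
const THINK_DEFER_BUDGET_FACTOR: usize = 3;
/// Hypothetical name for the absolute ceiling used when no budget is set.
const THINK_DEFER_ABS_CEILING: usize = 2048;

/// Flip fence parity when the atomic fence token is seen.
/// Fail-open: with no resolved fence token, parity never changes.
fn toggle_code_fence(in_fence: bool, token: u32, fence_token: Option<u32>) -> bool {
    match fence_token {
        Some(f) if f == token => !in_fence,
        _ => in_fence,
    }
}

/// Decide whether the forced `</think>` may be injected now.
/// `armed` means the early-stop or budget brake has already tripped.
fn should_inject_think_end(
    armed: bool,
    in_fence: bool,
    reasoning_tokens: usize,
    budget: Option<usize>,
) -> bool {
    if !armed {
        return false;
    }
    if !in_fence {
        // Safe boundary: not inside a code block.
        return true;
    }
    // Inside a fence: defer, but only up to a hard ceiling.
    let ceiling = budget
        .map(|b| b.saturating_mul(THINK_DEFER_BUDGET_FACTOR))
        .unwrap_or(THINK_DEFER_ABS_CEILING);
    reasoning_tokens >= ceiling
}
```

Keeping detection (`armed`) separate from injection is what lets the brake stay active inside code while the boundary itself waits for the fence to close.
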
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(watchdog): digit-normalized content-loop detector for template degeneration Companion to the thinkbrake fence fix: after deferring the forced </think> past code fences, live testing surfaced a separate content- phase degeneration — Qwen3.6-27B at temp=0 emits a fixed line template with varying numeric payload (`- B(46) = 104509868777\n- B(47) = 273508641\n …`) until max_tokens. The exact-token content-loop watchdog structurally cannot catch this: the integer tokens differ every line so no fixed token period repeats. Fix: detect_content_token_loop_normalized maps numeric tokens to a sentinel AND run-length-collapses consecutive sentinels. Run-collapse is essential — Qwen3.6 is digit-level (`104509868777` → 12 single-digit tokens, `273508641` → 9), so a 1:1 map alone leaves variable-length sentinel runs and the period still varies; collapsing makes `- B(<digits>) = <digits>\n` identical regardless of digit count. Numeric-token mask (the 10 single-digit ids of 248070) built once at startup in tokenizer_runtime via decode_with_special (NOT id_to_token — that returns raw byte-level BPE pieces with the `Ġ` space marker), exposed via a set/get OnceLock mirroring enable_loop_watchdog. OR-ed into the existing content-loop watchdog at decode_logits_step.rs under the SAME per-model enable_loop_watchdog gate (qwen3.6-27b / qwen3.5-27b only). FP guard: CONTENT_LOOP_NORM_MIN_REPEATS=4 and the matched period must contain >=1 sentinel AND >=1 structural token, so pure-number columns and pure-prose loops stay the exact detector's job. scheduler::helpers 23/23 green (5 new incl. variable-length-digit-run + exact-path regression), clean under #![deny(warnings)]. Live (27B-FP8+MTP): degeneration repro bounded 2000 -> 577 tokens (watchdog fired), thinkbrake code-fence fix still intact, normal coding prompt finish=stop (no false-positive early stop). 
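
A rough sketch of the digit-normalized detector described above, under stated assumptions: the function signature, the `SENTINEL` value, and the tail-only scan are illustrative choices, and `PERIOD_MAX` reuses the 64-token limit mentioned elsewhere in this thread. Only the function name, the minimum repeat count of 4, and the sentinel-plus-structural-token guard come from the commit message.

```rust
use std::collections::HashSet;

const SENTINEL: u32 = u32::MAX; // assumed placeholder id for "some digits"
const MIN_REPEATS: usize = 4; // CONTENT_LOOP_NORM_MIN_REPEATS
const PERIOD_MAX: usize = 64; // assumed to match the exact detector's limit

/// Map digit tokens to SENTINEL and collapse consecutive sentinels,
/// so numbers of any digit count normalize to a single token.
fn normalize(tokens: &[u32], digits: &HashSet<u32>) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    for &t in tokens {
        let n = if digits.contains(&t) { SENTINEL } else { t };
        if n == SENTINEL && out.last() == Some(&SENTINEL) {
            continue; // run-length collapse
        }
        out.push(n);
    }
    out
}

/// Return the repeating period if the normalized tail repeats at least
/// MIN_REPEATS times and the period mixes digits with structure.
fn detect_content_token_loop_normalized(tokens: &[u32], digits: &HashSet<u32>) -> Option<usize> {
    let norm = normalize(tokens, digits);
    for period in 1..=PERIOD_MAX {
        let span = period * MIN_REPEATS;
        if span > norm.len() {
            break;
        }
        let tail = &norm[norm.len() - span..];
        let unit = &tail[..period];
        let has_sentinel = unit.contains(&SENTINEL);
        let has_structural = unit.iter().any(|&t| t != SENTINEL);
        if has_sentinel && has_structural && tail.chunks(period).all(|c| c == unit) {
            return Some(period);
        }
    }
    None
}
```

The guard on `has_sentinel` and `has_structural` is what leaves pure-number columns and pure-prose loops to the exact detector.
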
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(qwen3.6-27b): thinking_default=true (reasoning on by default) qwen3.6-27b is a reasoning model but its MODEL.toml [behavior] had no thinking_default key → ModelBehavior default (false) → resolve_thinking returned enable_thinking=false for any client that doesn't explicitly opt in. Open WebUI (and most OpenAI-compatible clients) send plain requests with no enable_thinking flag, so the chat template injected an empty `<think>\n\n</think>` thinking-off marker and reasoning_content was always empty — thinking never displayed. Set thinking_default=true. Plain requests now reason by default; explicit per-request enable_thinking=false (or reasoning effort, etc.) still wins via resolve_thinking. Verified live (27B-FP8+MTP): plain request → reasoning_tokens=176, reasoning_content populated, finish=stop. Build-time embedded via atlas-kernels build_parse.rs. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(qwen3.6-27b): anti-repetition sampling stack + bounded in-fence </think> defer Surfaced by the OpenWebUI "3D chess game" stress prompt (the canonical long-code hard case). Two coupled root causes, both Atlas-side: 1. Sampling presets on this branch had presence_penalty=0.0 and no DRY/LZ — the exact condition commit 6c69d3f live-bisected as the cause of the announce/restart + verbatim-CSS-block degeneration on hard code-gen (QwenLM/Qwen3.6 Avarok-Cybersecurity#88/#115/#145, confirmed upstream on official vLLM/BF16). Replicated 6c69d3f's battle-tested stack (presence_penalty=1.5 + lz_penalty=0.2 + DRY 0.8/1.75/2) on the presets THIS branch's build_sampling actually selects (thinking_text / thinking_coding / non_thinking — no thinking_coding routing here; tools left 0.0). XTC not ported on this branch; the digit-normalized content-loop watchdog is the safety net. 2. 
Regression from the thinkbrake fence-defer (cd0ca9d): when the model writes its whole deliverable as a ```code block INSIDE <think>, the fence never closes so should_inject_think_end deferred the forced </think> indefinitely — budget brake fired at 256 but reasoning ran to 3025 tokens and the real answer was trapped in reasoning_content with a 499-char content stub. Bounded the deferral: THINK_DEFER_BUDGET_FACTOR=3 (hard-inject </think> past 3x budget even mid-fence; absolute ceiling 2048 when budget=None). Live (27B-FP8+MTP, 3D-chess prompt): before = 401-token CSS-loop garbage; after = reasoning capped exactly at 768 (=3x256), content 7234 chars with 27 THREE. calls, Scene/WebGLRenderer/Camera/ OrbitControls, valid </script></html>, finish=stop, no watchdog. scheduler::helpers 24/24 green. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(qwen3.6-27b): retune presence_penalty 1.5 → 1.0 (temp=0.6 sweep) presence_penalty=1.5 (replicated from 6c69d3f) over-penalized on THIS branch's realistic path: Open WebUI sends no sampling params so requests run at the thinking_text preset default temp=0.6 (stochastic), where a flat 1.5 per-seen-token penalty boosts EOS once common code tokens are exhausted → premature termination (574 tok, 0 THREE. calls). The earlier validation passed only because it used temp=0 (greedy resists early EOS) — wrong regime. Live presence_penalty sweep at temp=0.6 on the 3D-chess prompt: 1.5 → 574 tok, 0 THREE. (premature EOS) 1.0 → 6782 chars, 17 THREE., </script></body></html>, finish=stop, single coherent impl, no loop ← chosen 0.5 → degenerate outlier 0.3 → complete but 3× wasteful announce/restart rewrites 1.0 is the balance point between premature-termination (high pp) and restart-spam (low pp). DRY (multi-token) + LZ (n-gram) + the digit-normalized content-loop watchdog carry the precision loop-breaking presence (single-token) shouldn't be doing alone. 
Loop/degeneration layer now resolved (no watchdog fires, no restart spam across regimes). NOTE: a separate, non-Atlas factor remains — Open WebUI injects an empty `system: "User Context:\n\n"` which the model reacts to with terse output (isolated: removing it 3×'s the generation). Tracked for a follow-up Atlas robustness fix (neutralize content-free system messages) + user-side Open WebUI system-prompt config. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chat): neutralize content-free client system messages Open WebUI injects an empty RAG/context system message — `"User Context:\n\n"` (trims to the bare label `User Context:`) — when no custom system prompt is set. Models react to a content-free system directive by producing terse / prematurely-terminated output: isolated 2026-05-17, the identical 3D-chess request WITHOUT that system message produced 3x the generation (1499 vs 469 completion tokens, 14 vs 3 THREE. calls). The string is purely client-side (zero matches in Atlas source); Atlas relayed it faithfully and the jinja template rendered it correctly — the model's terseness is a reasonable reaction to a meaningless instruction. We can't fix the client, so Atlas adapts: a leading system message whose trimmed content is empty, or a single short bare `Label:` line with no payload (`User Context:`, `Context:`, `System:`), is dropped before templating so it can't poison generation. Model-agnostic (one site, pre-jinja) — not per-template. Conservative: any multi-line or post-colon content is a real prompt and is never stripped (unit tests cover empty/whitespace, the OpenWebUI residue, and substantive prompts incl. label-like-with-payload and long prose ending in ':'). Live: exact Open WebUI request → log "Dropped content-free client system message dropped=User Context:", no degeneration. 
(Residual length variance on this prompt is upstream Qwen3.6 temp=0.6 behavior — QwenLM/Qwen3.6 Avarok-Cybersecurity#88 — not Atlas; Open WebUI sends no sampling params so requests run at the preset default temp=0.6.) Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: green the pipeline (LoC split + clippy + fmt) Make the full PR Avarok-Cybersecurity#62 merge chain pass CI. Three classes of fix: 1. LoC ≤500 (file-size-cap): split scheduler/helpers.rs (834 → 476) by moving its `#[cfg(test)] mod thinking_loop_tests` (360 lines) to helpers_tests.rs via `#[path]` — logical child of `helpers`, so `use super::*` resolves exactly as before; zero production change. (decode_logits_step.rs / serve.rs are on the CI allow_list — left.) 2. clippy (deny clippy::all) — fixes, several PRE-EXISTING on the base branch and unmasked once upstream crates compiled clean: - atlas-core config/methods.rs: manual checked division → checked_div - atlas-kernels build_codegen.rs: generated consts emit `&str`/`&[u8]` not `&'static …` (clears 15 redundant_static_lifetimes in the generated target_ptx.rs; fn-return `'static` kept — not linted) - spark-runtime buffers/sizes.rs + spark-server preflight.rs: manual checked division → checked_div().map().unwrap_or() - spark-model forward_layers.rs + spark-server phase_promote_prefills.rs: descending sort_by → sort_(unstable_)by_key(|x| Reverse(..)) - spark-server openai/annotations.rs: loop+let-else-break → while let - spark-server scheduler/helpers.rs: iter().any(|t| t==X) → contains(&X) 3. cargo fmt --all (workspace) — normalizes the clippy edits plus pre-existing base fmt debt (mtp_head/new.rs, qwen35_dense.rs, decode_logits_seq/step.rs). Verified in atlas-builder: fmt --check 0 diffs, no non-allowlisted .rs >500, `cargo clippy --tests --workspace --keep-going` zero errors, `cargo test --workspace` all green (incl. 342 spark-server + 23 helpers). 
Behavior-preserving throughout (test-mod move, semantically identical clippy rewrites, whitespace-only fmt). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Azeez Ishaqui <debaterishaqui@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
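
The "content-free system message" rule from the chat fix in this commit series can be sketched as a single predicate. The function name and the 32-character label limit are assumptions made for illustration; the commit message only says the label must be a single short bare `Label:` line with no payload.

```rust
/// Hypothetical upper bound on what counts as a "short" bare label.
const MAX_LABEL_CHARS: usize = 32;

/// True when a leading system message carries no usable instruction:
/// either empty after trimming, or a single short `Label:` line.
fn is_content_free_system(content: &str) -> bool {
    let t = content.trim();
    if t.is_empty() {
        return true;
    }
    if t.contains('\n') {
        // Multi-line content is treated as a real prompt.
        return false;
    }
    match t.strip_suffix(':') {
        // Bare label: nothing after the colon, no inner colon, and short.
        Some(label) => {
            !label.trim().is_empty()
                && !label.contains(':')
                && label.chars().count() <= MAX_LABEL_CHARS
        }
        // Any text after a colon (or no colon at all) is a real prompt.
        None => false,
    }
}
```

The length cap is what keeps long prose that merely ends in a colon from being stripped.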

…oken Reconcile the thinkbrake code-fence work with main's Avarok-Cybersecurity#56 chunked-prefill refactor (#37513bf split scheduler/phase_continue_prefills.rs into a 4-file dir-module) and Avarok-Cybersecurity#57 tool-call-parser system-prompt injection. Conflicts (2), both resolved by taking main's side: - atlas-kernels/build_codegen.rs: trivial; both sides independently dropped 'static from the generated const type. main's const_ty rename + preamble comment is canonical; branch's redundant change discarded. - scheduler/phase_continue_prefills.rs: structural; took main's 4-file dispatcher (run_standard/run_batched_mixed/run_batched_prefill). Re-threaded code_fence_token: Option<u32> (additive plumbing, after think_start_token / before tool_call_start_token, no logic change) into main's new module boundaries so the thinkbrake fence signal reaches the existing toggle_code_fence SSOT in process_decode_logits: - continue_in_progress_prefills sig + its run_standard_chunk_loop and run_batched_mixed_step calls - run_standard.rs run_standard_chunk_loop sig + process_decode_logits call - run_batched_mixed.rs run_batched_mixed_step sig + process_decode_logits call mod.rs/decode_logits_step.rs/decode_step.rs retained the branch's code_fence_token via clean 3-way auto-merge (verified, no E0061). Audits: Avarok-Cybersecurity#57 (api/chat/mod.rs only) is disjoint from the branch's thinkbrake decode logic (decode_logits_seq.rs only) — no interaction. tokenizers 0.23/safetensors 0.7 dep bumps compile clean workspace-wide. Verified (mirrors CI exactly): cargo fmt --all --check 0 diffs; no non-allowlisted .rs >500 LoC; cargo clippy --workspace --tests zero diagnostics; cargo test --workspace 662 passed / 0 failed (incl. thinkbrake fence/defer/loop suites green). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors PR Avarok-Cybersecurity#61 onto feat/qwen3.6-dense-mtp. The CLI default of `nvfp4` makes the dense-FFN MTP head guard (mtp_head/new.rs:58-65) reject Qwen3.6-27B-FP8 (ships an FP8/bf16 MTP head), so `--speculative` silently disabled MTP (has_proposer()=false → use_speculative=false). bf16 is accepted by the guard → `--speculative` now engages dense MTP. Higher accuracy; opt into nvfp4 explicitly with --mtp-quantization nvfp4. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 0 of the Qwen3.6-27B long-code plan. Paired-seed N=10 driver (harness.py) + acorn-loose AST analyzer (analyze.mjs) + paired diff (compare.py), frozen canonical 3D-chess prompt. analyze.mjs uses acorn-loose so duplicate-declaration detection sees the degenerate tail (strict-parse fallback would mask it). Validated on a known-degenerate sample (dup=2, completeness_pass=false) and a valid control (completeness_pass=true). results/ git-ignored. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… mode Degenerate output usually never emits the closing ``` or </script>, so the closed-only extractor returned "" exactly on failing samples and masked valid_js_line_count / duplicate_declaration_count (gate metric was unaffected). Handle unclosed trailing fence, no-fence raw HTML, and unclosed <script>. Add harness.py --reanalyze (re-score saved samples with SSOT summarize(); no regeneration). Validated: masked baseline seed3 went 0/0,dup=0 -> 379 lines,dup=3; controls unchanged. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
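
The fallback order for the extractor can be sketched as below. The harness itself is Python and JavaScript, so this Rust version is only an illustration of the logic; `extract_code` is a hypothetical name, and the unclosed `<script>` case from the commit is not covered here.

```rust
/// Pull the code payload out of a model response.
/// - closed fence: return the fenced body
/// - unclosed trailing fence: return everything after the opening fence line
/// - no fence at all: return the raw text (e.g. bare HTML)
fn extract_code(text: &str) -> &str {
    let Some(open) = text.find("```") else {
        return text;
    };
    let after_fence = &text[open + 3..];
    // Skip the rest of the opening line (the info string such as "js").
    let body = match after_fence.find('\n') {
        Some(nl) => &after_fence[nl + 1..],
        None => "",
    };
    match body.find("```") {
        Some(close) => &body[..close],
        None => body, // degenerate output that never closed the fence
    }
}
```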

…escalating resample Root cause (controlled greedy A/B, 2026-05-18): on long structured code the model falls into a repetition loop; the loop-watchdog only *truncated* (content kill / force-</think>) in BOTH the MTP and non-MTP paths — MTP was never the lever. This converts truncate→recover. - New scheduler/resample.rs (SSOT for Phase-2 tuning): per-output-length penalty ramp + escalation factor + `resample_penalty_factor` (unit-tested). - decode_logits_step.rs: content-loop watchdog now escalates `resample_escalation` (compounding per re-fire) to steer the model OUT of the loop; only after RESAMPLE_MAX_ESC un-cleared escalations falls back to the original hard finish; decays on recovery. RESAMPLE_MAX_ESC=0 ⇒ exactly the old kill behaviour. - decode_logits_seq.rs: scale presence (clamped < EOS cliff) / lz / DRY by resample_penalty_factor(output_len, escalation). RAMP_SLOPE=0 ⇒ identity. - resample_escalation:u8 added to ActiveSeq+SwappedSeq + all 5 ctor sites + both lifecycle mappings (mirrors think_watchdog_fires exactly). Verified: cargo check + clippy clean, fmt 0 diffs, LoC ≤500 (helpers back to 476; new resample.rs 74), resample unit tests 3/3. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…atchdog→escalating resample" This reverts 571e2e3. The vLLM-oracle diagnosis (2026-05-18) proved the Qwen3.6-27B-FP8 long-code degeneration is an Atlas bug where Atlas's anti-repetition PENALTIES are part of the cause, not the cure: same model+prompt+temp on vLLM (no penalties, no MTP) produces a complete game, while Atlas's penalty stack induces premature-EOS (documented in the MODEL.toml 6c69d3f bisect comment). Phase-2 ADDED a per-length penalty ramp — directionally backwards. Its watchdog→escalate half is also moot: the real failure is a fuzzy loop with period ~80 tok > the 64-tok CONTENT_LOOP_PERIOD_MAX, so the watchdog never detects it. Full revert restores the known-good baseline; the correct fixes (penalty-preset realignment, thinking_default, prompt-template divergence) are tracked separately and gated on the N≥10 paired harness. See memory project_qwen36_27b_degeneration_rootcause. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…o-thinking) Additive, PCND-clean: --presence-penalty/--frequency-penalty/--no-thinking are only sent when explicitly passed; default cell unchanged (server's MODEL.toml preset). Enables the N≥10 paired data gate (shipped preset vs vLLM-minimal) that Avarok-Cybersecurity#67's MODEL.toml realignment is contingent on — so the preset change is data-driven, not n=1 (feedback_no_n1_stochastic_ab). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mbed_scale guard

Three independent fixes restore Qwen3.6-27B-FP8 long-code generation to match vLLM behavior (~14.5x later degeneration, complete coherent code).

1. weight_loader/qwen35_dense.rs: A_log and dt_bias must be FP32. Consumer kernels (ssm_preprocess.cu, mamba2_ssm_decode.cu) declare them `const float*`; loading via `dense()` kept BF16 storage, reinterpreting 48-elt BF16 (96B) as 48-elt FP32 -> per-head scrambled decay gates. The MoE sister loader (ssm_qwen35.rs:59-62) had this fix already with an explicit warning about exponential GDR amplification at long context; the dense path missed the mirror.
2. weight_loader/qwen35_dense.rs + layers/qwen3_ssm/init.rs: ATLAS_FP8_SSM_PREFILL=1 routes SSM in_proj_qkv/out_proj through a native FP8 prefill GEMM (bf16_to_fp8 -> fp8_gemm_n128), eliminating the BF16-truncation intermediate that, amplified by k-conv's tiny weights (||conv-k|| ~ 18x smaller than ||conv-v||), rotated the k-band conv output direction. NVFP4 fallback preserved for decode batched paths. Mirrors the MoE set_fp8_weights pattern.
3. model/impl_a3.rs: scale_embeddings_fp32 now applies the same no-embed-scale guard as scale_embeddings_bf16. Without it, every non-Gemma model that opted into fp32-residual hard-failed with "Module 'embed_scale' not loaded".

Verification (greedy, --no-thinking, --max-tokens 6000, vs aligned 54-tok HF oracle and end-to-end generation):

- conv-k cos 0.550 -> 0.9998
- recur per-head cos 0.99 -> 1.00 with magnitude ratio 0.82 -> 0.99
- gnorm per-head cos 0.82 -> 0.999
- tokens_to_first_degeneration 1196 -> 17327

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
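
To make the first fix concrete, the sketch below contrasts widening BF16 values to FP32 with viewing the same bytes as FP32. The function names are illustrative and not taken from the Atlas loader; the bit layout (BF16 is the upper 16 bits of an IEEE-754 f32) is standard.

```rust
/// Widen one BF16 value (raw bits) to f32: BF16 is the top half of an f32.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Correct load: one f32 per BF16 element.
fn widen_bf16(raw: &[u16]) -> Vec<f32> {
    raw.iter().map(|&b| bf16_to_f32(b)).collect()
}

/// The bug class: the same bytes viewed as little-endian f32,
/// which fuses two neighbouring BF16 values into one float.
fn reinterpret_as_f32(raw: &[u16]) -> Vec<f32> {
    raw.chunks_exact(2)
        .map(|p| f32::from_bits((p[0] as u32) | ((p[1] as u32) << 16)))
        .collect()
}
```

A 48-element BF16 buffer is 96 bytes, which holds only 24 floats, so a kernel that reads 48 `float` values from it sees fused pairs for the first half and reads past the buffer for the rest.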

github-actions · 2026-05-20T03:08:46Z

Thanks for the contribution. Before we can merge, please sign our Contributor License Agreement by replying to this comment with exactly:

I have read the CLA Document and I hereby sign the CLA


2 out of 3 committers have signed the CLA.
✅ [tbraun96](https://github.com/tbraun96)
✅ [rrstesiak](https://github.com/rrstesiak)
❌ @G-Deca
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

G-Deca · 2026-05-20T03:09:12Z

I have read the CLA Document and I hereby sign the CLA

G-Deca · 2026-05-20T03:14:11Z

recheck

Sweep 2026-05-20: every Qwen-family MODEL.toml under kernels/gb10 now declares 'thinking_default = true' in [behavior]. Reasoning-tier Qwen3.5/3.6 models materially benefit from CoT on multi-step prompts; the prior implicit default of false (or explicit false on qwen3-next-80b-a3b) silenced thinking unless the caller passed chat_template_kwargs.enable_thinking=true. Per-request override and CLI --disable-thinking kill-switch retain precedence per the documented ladder (CLI flag > request body > MODEL.toml [behavior]). Note: MODEL.toml is compile-time-baked via atlas-kernels/build.rs → build_codegen.rs, so a rebuild is required for these edits to take effect in the running binary. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
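
The precedence ladder (CLI flag > request body > MODEL.toml [behavior]) reduces to a few lines. The signature below is an assumption for illustration; only the function name `resolve_thinking` and the ordering come from the commit messages.

```rust
/// Resolve whether thinking is enabled for one request.
/// - `cli_disable`: the `--disable-thinking` kill-switch
/// - `request`: per-request `enable_thinking`, if the client sent one
/// - `model_default`: `thinking_default` from MODEL.toml [behavior]
fn resolve_thinking(cli_disable: bool, request: Option<bool>, model_default: bool) -> bool {
    if cli_disable {
        return false; // kill-switch always wins
    }
    // An explicit request value beats the baked-in model default.
    request.unwrap_or(model_default)
}
```

A plain request from a client that sends no flag therefore lands on the model default, which is why flipping `thinking_default` changes behaviour for such clients.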

…stigation Distilled playbook for tracking down quality regressions where Atlas output diverges from a reference framework (vLLM / HF) on the same model checkpoint. Grounded in commit 3ebc08a (the GDN decay-gate precision + FP8 SSM prefill fixes that produced the 14.2× improvement in tokens_to_first_degeneration on Qwen3.6-27B-FP8). Covers the cheapest-signal-first elimination ladder, the byte-exact HF CPU oracle pattern, per-head magnitude as a pointer-bug fingerprint (std/min/max diagnostic), the sister-loader diff rule, reversal discipline, and the polling cadence for multi-hour reproductions. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
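
The per-head comparison the playbook relies on (cosine similarity plus magnitude ratio against an oracle) might be computed along these lines. This is a minimal sketch with hypothetical function names; it assumes flat buffers laid out head after head and a non-zero `head_dim`.

```rust
/// Euclidean norm, accumulated in f64 for stability.
fn norm(v: &[f32]) -> f64 {
    v.iter().map(|&x| (x as f64) * (x as f64)).sum::<f64>().sqrt()
}

/// Cosine similarity; returns 0.0 if either vector is all zeros.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(&x, &y)| (x as f64) * (y as f64)).sum();
    let (na, nb) = (norm(a), norm(b));
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// For each head, return (cosine, |sut| / |oracle|).
/// A cosine near 1 with a ratio far from 1 points at a scaling problem
/// rather than a direction problem. `head_dim` must be non-zero.
fn per_head_stats(sut: &[f32], oracle: &[f32], head_dim: usize) -> Vec<(f64, f64)> {
    sut.chunks(head_dim)
        .zip(oracle.chunks(head_dim))
        .map(|(s, o)| (cosine(s, o), norm(s) / norm(o)))
        .collect()
}
```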

The previous false value (inherited from Qwen3.5's 'Think-Satisfy' guard) unconditionally suppressed thinking on every tool-active turn — i.e. every opencode interaction, since opencode always carries tools. Verified post-rebuild on atlas-gb10:fix sha 5072cb1562c4: tools_active=true, --num-drafts=1, thinking_in_tools=false → 0 reasoning_content chunks tools_active=true, --num-drafts=1, thinking_in_tools=true → 34 reasoning_content chunks Qwen3.6's reasoning is meaningfully better than 3.5's so Think-Satisfy shouldn't re-emerge; max_thinking_budget=512 caps any drift and F28 auto-disables thinking when the previous message is a tool error. One-line revert if tool-call density regresses. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Three changes (audit per DEBUGGING_METHODOLOGY.md):

1. weight_loader/qwen35_dense.rs: drop ATLAS_FP8_SSM_PREFILL env gate. The fix has been live since 3ebc08a, verified 14.2× degeneration-onset improvement and matching vLLM behavior class. Unconditional now for every FP8-on-disk dense Qwen3.6 variant. Also switch norm.weight to dense_f32_safe (mirrors the MoE sister loader's FP32-aware path; defensive against checkpoints that ship norm weights as fp32).
2. weight_loader/qwen35/load_layers/linear_attn_arms.rs: same FP8 SSM prefill path cross-ported to the MoE A3B loader. Sister-loader audit (3 parallel agents per DEBUGGING_METHODOLOGY.md §6) found that the MoE A3B has identical asymmetric conv weights (k-segment ~18× smaller than v) and was therefore vulnerable to the same conv-k SNR collapse via FP8→BF16→NVFP4→BF16 triple-quant chain. Now both variants dispatch SSM prefill through fp8_gemm_n128 (BF16 act × FP8 weight, FP32 accumulator) with NVFP4 retained as decode/batch fallback.
3. DEBUGGING_METHODOLOGY.md appendix: update Bug #1 description to note the env gate has been removed and the fix cross-ported to MoE.

Bug #2 audit was clean: codebase-wide sweep of every const float* kernel parameter found all SSM scalar params (A_log, dt_bias, conv biases, D_param across Qwen + Nemotron) already on the dense_keep_f32 / dense_bf16_as_f32 path. No new pointer-alias sites.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Quick follow-up to 7d5e8fc: the MoE A3B Bug #1 cross-port allocates native-FP8 SSM weights inside build_linear_attention_nvfp4 (called per LinearAttention layer), which was silent. Add one tracing::info! inside the if-Fp8Dequanted block so startup logs visibly confirm the fix is active; this mirrors the dense loader's top-level log (which fires once before the layer loop). For 35B-A3B with 30 SSM layers, expect 30 'SSM[...] ... native FP8 prefill GEMM' lines. Quick rebuild only, no behavior change.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…hunk-position lesson

Adds env-gated layer-0..N GDN intermediate dumper for the MoE A3B path:

- ATLAS_GDN_DUMP=<dir>: base output dir
- ATLAS_GDN_DUMP_LAYERS=0,15,29: SSM-layer indices to capture
- ATLAS_GDN_DUMP_N_SSM=30: SSM-layer-count for counter modulo

The dumper hooks into trait_prefill.rs at 4 sites (post-conv, post-l2norm, post-recurrence, post-gnorm) and captures the last token's BF16 slice. Under chunked prefill, the file is overwritten on every scheduler chunk so the on-disk dump corresponds to the LAST chunk's last-token (== position L-1 of the full prefill, not chunk_len-1 of the first chunk).

Companion Python tooling:

- hf_gdn_ref_a3b.py: HF CPU oracle for A3B (32 v-heads, conv_dim=8192, hidden=2048). Generated TOK list from Atlas /tokenize.
- gdn_chain_diff_a3b.py: diff comparator with A3B segment shapes (q/k/v = 2048/2048/4096).

Methodology doc update: §3 gotcha #4 (don't use FP8 checkpoint as oracle; HF silently ignores scale_inv), #5 (when comparing across SUT configs, guarantee identical dumped positions or you'll see methodological noise). §7 reversal log: chunked-prefill drift hypothesis added (refuted).

Investigation findings on Qwen3.6-35B-A3B at L=16k:

- Layer 0: cos > 0.99985; byte-perfect to BF16 floor at every stage.
- Layer 15: gnorm cos 0.64; depth + long-context quantization noise.
- Layer 29: gnorm cos 0.71; same class.

At L=1244 the L15/L29 drift is much smaller (gnorm 0.92 / 0.86). Final model output remains coherent (sub-perceptual drift). No discrete numerical bug; drift is expected quantized-inference noise.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
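
How the layer list and the counter modulo might fit together is sketched below. Both function names are hypothetical, and the assumption that a call counter increments once per SSM-layer invocation (so `counter % N_SSM` recovers the layer index) is inferred from the "counter modulo" wording rather than confirmed by the source.

```rust
/// Parse a comma-separated list such as "0,15,29" into layer indices.
/// Entries that are not valid unsigned integers are skipped.
fn parse_dump_layers(raw: &str) -> Vec<usize> {
    raw.split(',')
        .filter_map(|s| s.trim().parse::<usize>().ok())
        .collect()
}

/// Decide whether the current SSM-layer call should be dumped.
/// `call_counter` is assumed to count SSM-layer invocations across chunks,
/// so wrapping it by the SSM layer count yields the layer index.
fn should_dump(call_counter: usize, n_ssm: usize, layers: &[usize]) -> bool {
    n_ssm != 0 && layers.contains(&(call_counter % n_ssm))
}
```

Because the counter keeps wrapping on every chunk, a selected layer is written again each time, which matches the last-chunk-wins behaviour described above.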

Two changes in service of long-context drift investigation:

1. Multi-layer dump support: ATLAS_GDN_DUMP_LAYERS=0,1,2,3,15,29 (comma list of SSM-layer indices). Per-stage AtomicBool latches per (layer_idx, stage); counter wraps mod ATLAS_GDN_DUMP_N_SSM. Last-call overwrites so dumps land at position L-1 of the full prefill (not chunk_len-1 of the first chunk).
2. ATLAS_DISABLE_WY4=1: forces fallback to single-token persistent kernel (or split4) for kernel-numerics isolation. WY4 was a suspect; turned out not to be the dominant noise source.

Investigation summary (Qwen3.6-35B-A3B, L=16k, layer-0..L15 walk against HF BF16 baseline):

  L0  cos=0.99988 (byte-perfect at all L: 31, 1244, 16100). Proves Bug #1/#2/#3 fixes work and rules out first-layer kernel bugs.
  L1  gnorm cos=0.69547 at L=16k vs 0.94266 at L=1244 → real, length-dependent drift starting at layer 1.
  L3+ gnorm cos in 0.23-0.67 range, non-monotonic compounding.

Mechanism: FP8/NVFP4 weight-quantization noise on the per-layer in_proj_qkv (specifically the W_z portion fed into the gnorm silu(z) gate). silu is a non-linear amplifier near z≈0, so small per-layer FP8 noise in z becomes large drift in gnorm output. At long L, the SSM recurrence H_t = g*H_{t-1} + v*k^T amplifies tiny per-step noise through the multiplicative gate (H decay ratio compounds e^(ε·L) over L steps).

Refuted hypotheses (all evidence in commit + tests):
- Chunked-prefill state-precision loss (refuted with corrected dumper)
- FP8 KV cache as primary culprit (only +0.05 cos with BF16 KV)
- BF16 residual stream as primary culprit (FP32 residual is *worse* at L1)
- WY4 algebraic correction (disabling makes L1 worse, not better)

True fix requires loader-side precision schedule: load qkvz weight (or at least its W_z slice) at BF16 unconditionally, accepting ~50 MB/layer extra VRAM. Roughly 1.5 GB more for the A3B at 30 SSM layers. Not done in this patch; needs design discussion on whether to make this default-on, env-gated, or per-model.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Diagnostic env override that skips both FP8 and NVFP4 paths and uses the BF16 in_proj_qkvz weight via ops::dense_gemm. Tests whether weight quantization is the dominant source of layer-1+ long-context drift.

Result (Qwen3.6-35B-A3B, L=16k, vs HF BF16 baseline):

  Stage      FP8/NVFP4 (default)   BF16 weights forced
  L0 gnorm   0.99988               1.00000 (perfect match)
  L1 gnorm   0.69547               0.4274  (WORSE by 0.27)
  L2 gnorm   0.93958               0.9513  (+0.01)
  L3 gnorm   0.23399               0.4655  (+0.23)

Key inference: at L0 BF16 weights give byte-perfect match (cos=1.0000) proving Atlas's BF16 GEMM == HF's BF16 GEMM when there's no SSM-state or downstream-layer mixing. At L1, identical BF16 weights produce a substantially DIFFERENT result from HF (cos 0.43, magnitude diverges too). This rules out FP8/NVFP4 weight precision as the dominant drift source at deeper layers under long context.

Remaining suspects (real causes):
- Atlas's GDN prefill kernels (WY4/persistent/split4) vs HF's chunk_gated_delta_rule have different FP32-accumulation reduction orders. Per-step recurrence errors accumulate differently.
- L2-norm reduction order on q,k may differ.
- Gated RMSNorm kernel's silu(z) is computed with __expf (fast intrinsic) on Atlas but expf (IEEE) on HF; small per-element differences amplify through the non-linear gate.
- Attention layer (model layer 3+) at L=16k has FP8 weights and a 16k-token softmax; its output noise propagates into downstream SSM layers.

Each of these is a non-trivial kernel-level investigation. The dumper + Python comparator from this commit chain (+ the per-layer walk pattern proven here) are the right infrastructure to drill further when prioritized.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
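The cos and |Atlas|/|HF| figures quoted throughout these commits are vector comparisons between a dumped slice and a reference slice. The sketch below shows one plausible way to compute both metrics. It is an illustration only: the actual comparator is the Python script gdn_chain_diff_a3b.py, whose internals are not shown in this PR, and the function names here are assumptions.

```rust
/// Cosine similarity between two equal-length vectors, accumulated in f64.
/// Returns None for mismatched lengths or when either vector has zero norm.
/// (Illustrative helper; not taken from the Atlas tooling.)
fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// Ratio of L2 norms |a| / |b|, the kind of magnitude check that catches
/// inflation which cosine alone cannot see. Returns None if |b| is zero.
fn norm_ratio(a: &[f32], b: &[f32]) -> Option<f64> {
    let l2 = |v: &[f32]| v.iter().map(|&x| (x as f64) * (x as f64)).sum::<f64>().sqrt();
    let nb = l2(b);
    if nb == 0.0 {
        None
    } else {
        Some(l2(a) / nb)
    }
}
```

Reporting both numbers matters because two vectors can point the same way (cos near 1) while differing in scale, and vice versa.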

…hts override

Three diagnostic env knobs for long-context drift investigation:

  ATLAS_DISABLE_WY4=1       fall back to single-token persistent kernel
  ATLAS_FORCE_PERSISTENT=1  force per-token persistent at any k
  ATLAS_GDN_BF16_WEIGHTS=1  skip FP8/NVFP4 paths, use BF16 dense GEMM

Per-substep dumpers added at:

  pre_norm   (layer input from residual stream)
  post_norm  (post-input_norm, into in_proj_qkv)
  post_qkvz  (post-deinterleave Q|K|V|Z, into conv1d)

Investigation findings on Qwen3.6-35B-A3B at L=16k vs HF Qwen3.6-35B-A3B BF16 baseline:

  L0 gnorm cos = 0.99988 (clean at all L)
  L1 gnorm cos = 0.43-0.70 depending on config
  L3+ drift compounds non-monotonically

Combinations tried (no single fix found):

  default (FP8/NVFP4):                          L1 gnorm 0.695
  BF16 KV cache:                                L1 gnorm 0.704 (+0.01)
  ATLAS_FP32_RESIDUAL=1:                        L1 gnorm 0.386 (worse)
  ATLAS_DISABLE_WY4=1 → split4:                 L1 gnorm 0.301 (worse)
  ATLAS_GDN_BF16_WEIGHTS=1:                     L1 gnorm 0.427 (worse, and proves weight precision is NOT the cause)
  ATLAS_FORCE_PERSISTENT=1:                     L1 gnorm 0.595 (slightly worse)
  Max precision (BF16w + FP32res + persistent): L1 gnorm 0.648

Key inference: at L0 with BF16 weights, cos=1.00000 (byte-perfect) proving Atlas's GEMM == HF's GEMM with identical inputs. At L1, same configuration gives cos=0.43, meaning L1's INPUT differs from HF's L1 input. The drift must come from the L0→L1 transition: residual_add chain that includes L0's out_proj output + post_attn_norm + MoE output (NOT covered by ATLAS_GDN_BF16_WEIGHTS which only affects qkvz GEMM, NOT the MoE experts or out_proj).

Remaining suspect: MoE expert quantization noise propagating via the residual stream. The MoE has 256 experts with FP8/NVFP4 weights and per-token routing; each token's MoE output has small quant noise that ADDS to the residual, drifting L1+ inputs vs HF.

The pre_norm/post_norm dumpers have a subtle bug at small L (capture zeros for short prompts but real data for long), a buffer layout interaction. Filed but not blocking the main finding.

Cleanest next move: instrument MoE output dumping + comparison vs HF to confirm/refute the MoE-quant-noise hypothesis. If confirmed, loading MoE experts at BF16 (large memory cost) is the only real fix path.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The DFlash drafter constructor unconditionally applied YaRN scaling with factor=64 / original_max_position_embeddings=4096 / beta_fast=32 / beta_slow=1 hardcoded. The v2 2026-04-27 Qwen3.6-DFlash drafter ships `rope_scaling: null` (plain RoPE), so every low-frequency RoPE pair was mis-scaled (pairs 0..11 divided by 64, pairs 11..26 ramped), landing drafter Q/K rotations in the wrong angular basis at every layer.

Replace the hardcoded YaRN block with a config-driven loader:

* `DflashConfig` gains `rope_theta` (default 10M, matches Qwen3.6) and an optional `rope_scaling` block mirroring HF transformers' Qwen3 config so `serde_json::from_str` works directly on the drafter's `config.json`.
* `BlockDiffusionDraftHead::from_weights` reads that block:
  - `None` ⇒ plain RoPE (`inv_freq[j] = 1 / θ^(2j/dim)`).
  - `Some(yarn)` ⇒ existing YaRN formula, parameters from config.
  - Other / unrecognised `rope_type` ⇒ warn and fall back to plain.

Tested against the v2 Qwen3.6-DFlash drafter; per-layer drafter hidden states are now bit-perfect with the PyTorch reference forward (cos=1.0 through all 5 drafter layers). End-to-end DFlash speculative decoding still has a separate bug downstream of the drafter (investigation ongoing), but the RoPE basis is now correct and worth landing on its own.

Refs: Avarok internal thread #development 2026-05-19.
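The plain-RoPE branch and the three-way selection described above can be sketched as follows. This is a simplified illustration, not the code in `BlockDiffusionDraftHead::from_weights`: the `RopeChoice` enum, both function names, and the use of the string "yarn" as the `rope_type` value are assumptions, and the YaRN formula itself is deliberately left out.

```rust
/// Plain RoPE inverse frequencies: inv_freq[j] = 1 / theta^(2j/dim)
/// for j in 0..dim/2. Assumes `dim` is even and non-zero.
fn plain_rope_inv_freq(theta: f64, dim: usize) -> Vec<f64> {
    (0..dim / 2)
        .map(|j| 1.0 / theta.powf(2.0 * j as f64 / dim as f64))
        .collect()
}

/// Which frequency formula to use (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum RopeChoice {
    Plain,
    Yarn,
}

/// Pick the formula from an optional `rope_type` read out of the config:
/// a missing block means plain RoPE, "yarn" means YaRN, and anything
/// unrecognised warns and falls back to plain.
fn choose_rope(rope_type: Option<&str>) -> RopeChoice {
    match rope_type {
        None => RopeChoice::Plain,
        Some("yarn") => RopeChoice::Yarn,
        Some(other) => {
            eprintln!("unrecognised rope_type {other:?}; falling back to plain RoPE");
            RopeChoice::Plain
        }
    }
}
```

Driving the choice from the config rather than from a constant is what keeps a drafter that ships `rope_scaling: null` on the unscaled frequencies.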

Adds dump hooks at:

  out_proj : SSM out_proj output (between gnorm and residual chain)
  moe_out  : MoE final output (post shared-expert blend)

Definitive findings (Qwen3.6-35B-A3B, L=16k, BF16 weights forced for qkvz):

  L0 SSM out_proj cos vs HF:         0.99210                     (clean: SSM out_proj works)
  L0 MoE out cos vs HF combined:     0.42130  |Atlas|/|HF| = 2.266  (drifted)
  L0 MoE out cos vs HF routed-only:  0.40608  |Atlas|/|HF| = 2.368  (drifted)

The 2.37× magnitude inflation against HF's routed-only output proves the drift originates in the ROUTED-expert pathway, NOT the shared expert (which only contributes ~4% magnitude in HF).

Ruled out:
- Top-K routing normalization (kernel reviewed, dispatch.rs:72 hardcodes norm_topk_prob=true for qwen3_5_moe family)
- Shared expert sigmoid gating (Atlas applies it via moe_batched_blend)
- Chunked prefill (refuted earlier in this branch)
- SSM kernels (L0 gnorm + out_proj are clean)

Remaining suspect: FP8 expert weight scaling. Each of 256 experts has 3 FP8-quantized matrices (gate/up/down), each with per-block weight_scale and per-matrix weight_scale_2. An indexing or magnitude error in scale loading would systematically inflate the routed weighted sum by the same factor for all tokens. The 2.37× ratio is consistent with a ~sqrt(K=8)≈2.83 or a single global mis-scaled weight matrix.

Cleanest next experiment: dump a single expert's GEMM input + output + weight_scale + weight_scale_2 from Atlas, compute HF expert(x) on the same input using BF16 weights, and compare. If Atlas's per-expert output differs by a constant factor from HF's, the FP8 scale loading is the bug.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… cause localized to L0 MoE routing

Diagnostic env knobs added:

  ATLAS_DUMP_EXPERT_IDS=1   logs top-K expert indices + weights per token per layer (in moe/forward_prefill_fp8.rs)
                            + ATLAS_GATE_INPUT log (router_in last_tok |x|+first5)
                            + ATLAS_GATE_LOGITS log (raw pre-softmax top-10 + stats)
  ATLAS_DUMP_EMBED=1        logs hidden_dst magnitude pre/post scale_embeddings (in prefill_b/embed_chunk.rs)
  ATLAS_GDN_BF16_WEIGHTS=1  extension: also installs BF16 out_proj_dense in the MoE A3B loader so the dispatcher takes the dense_gemm BF16 path (overrides FP8/NVFP4 out_proj).

Investigation chain (Qwen3.6-35B-A3B, L=16k, last token = pos 16098 = tok 271):

  ✅ embedding lookup: Atlas bit-identical to HF (|x|=0.3164, first5 match)
  ✅ embed_tokens.weight tensor: bit-identical between FP8 and BF16 checkpoints
  ✅ gate.weight tensor: bit-identical between checkpoints
  ✅ post_attention_layernorm.weight: bit-identical between checkpoints
  ✅ SSM out_proj cos vs HF: 0.99210 (clean)
  ✅ SSM gnorm cos vs HF: 0.99988 (clean)
  ❌ Gate INPUT direction:
       Atlas |x|=23.06 first5=[ 0.352, -0.359, -0.598,  0.070, 0.408]
       HF    |x|=25.43 first5=[-0.832, -0.027,  0.297, -0.046, 0.036]
     Magnitudes match within 10% but per-element directions completely differ.
  ❌ Gate LOGITS top-10:
       Atlas: [102, 131,  15,  21, 228,  52,  93, 96, 14, 116] (raw, mean -5.66)
       HF:    [ 81, 140, 158, 208, 200, 132, 115, 86, 95, 206] (post-softmax)
  ❌ Top-K=8 expert selection overlap: 0/8 (ZERO common experts)
  ❌ MoE output cos vs HF: 0.42 (catastrophic divergence)
  ❌ BF16 out_proj fix attempt: NO meaningful change (still 0/8 overlap); eliminates SSM out_proj FP8 quant as the cause

Conclusion: with all upstream weights and SSM outputs bit-identical or sufficiently close, the divergent gate input MUST come from a structural difference in either residual_add_rms_norm (kernel formula/precision) or some intermediate step I haven't instrumented yet. The MoE block's top-K routing is so sensitive to gate-input direction that even a small upstream noise produces 0/8 overlap with HF's selection.

Remaining suspects for the deeper drill:

1. residual_add_rms_norm kernel: Atlas's formula vs HF's torch RMSNorm differ in some accumulation detail (eps, reduction order, post-norm weight multiplication order). The kernel's BF16 input/output but FP32 reduction is correct, but the formula might subtly differ.
2. The 'hidden' buffer at residual_add_rms_norm time may not be the pristine embedding; some other op may have written to it (vision overlay? marconi snapshot restore? warmup leftovers?).
3. post_attention_layernorm WEIGHT loading: bit-identical on disk but possibly loaded with wrong stride/transpose in Atlas's 'load_dense_ffn' or similar.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
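The "N/8 overlap" figure above is a set intersection between the expert indices each side selects. A minimal sketch of that measurement, assuming top-K selection by raw logit value and distinct indices within each list (helper names are hypothetical, not the Atlas logging code):

```rust
/// Indices of the `k` largest logits, highest first. The sort is stable,
/// so ties keep the lower index first.
fn top_k_indices(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(k);
    idx
}

/// Number of expert indices present in both selections.
/// Assumes each slice holds distinct indices, as a top-K selection does.
fn overlap(a: &[usize], b: &[usize]) -> usize {
    a.iter().filter(|i| b.contains(i)).count()
}
```

Applied to the first eight entries of the two top-10 lists in the commit message, `overlap` yields 0, which matches the reported 0/8.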

The v1 `moe_fp8_grouped_gemm` kernel was originally documented as just having a coalescing performance bug. While debugging Qwen3.6-35B-A3B-FP8 producing gibberish at 16k context, captured per-expert dumps showed v1 has a NUMERICAL bug for some (token, expert) tile combinations:

  chunk-4 last-token, expert 200, up_proj output:
    v1 (default):              |x|=28.0  ← 5× too large
    v2 (coalesced):            |x|=??    ← correct shape
    HF baseline (BF16 oracle): |x| ~ 5

The amplification propagates: up=28 then silu(gate)*up=8.4 then down_proj→4.5, vs the other 7 experts' down output ~0.2-0.6. At the prefill-chunk-4 boundary this drives a 42% residual-stream amplification (Atlas L1 input |x|=0.819 vs HF 0.577); v2 brings it to 6% (0.611 vs 0.577) and Atlas now produces a coherent summary of a 16k-token prompt instead of gibberish.

Verification:
- L0 chunk-4 routed-only MoE output: v1=0.44, v2=0.26, HF=0.21
- 16k-context generation: now produces clean summary
- v1 path preserved as env override ATLAS_FP8_MOE_COALESCED=0

Diagnostic instrumentation kept (gated on ATLAS_DUMP_EXPERT_IDS=1): ATLAS_GATE_INPUT, ATLAS_GATE_LOGITS, ATLAS_EXPERT_IDS, ATLAS_MOE_OUT, ATLAS_ROUTED_ONLY, ATLAS_SHARED_OUT, ATLAS_SHARED_GATE; pre-norm SUM/HIDDEN/OUTPROJ in qwen3_ssm. Zero overhead when env unset.

Plus a diagnostic ATLAS_FORCE_NVFP4_MOE=1 path in qwen35 load_layers for future same-class bug bisection.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The expert_gate_out/up_out/down_out buffers were only zeroed when `ctx.comm.is_some()` (EP mode). In single-GPU mode they were left uninitialized, which is a problem because:

    max_m_tiles = (avg_per_expert * 2).div_ceil(64).max(1)

assumes peak-per-expert ≤ 2× average. Skewed routing violates this, so the grouped GEMM kernel skips rows past max_m_tiles*64. Those rows keep STALE DATA from the previous prefill (or uninitialized memory on the first prefill), which propagates through unpermute_reduce as spurious contributions to the routed-MoE output.

Effect was chunk-size + run-history dependent. Same prompt, same gate_input (25.97), different ATLAS_ROUTED_ONLY:

  4-chunk first-fire:   0.65
  4-chunk after-warmup: 0.35
  2-chunk:              1.13
  All vs HF baseline:   0.21

After fix (zero-init unconditional):

  3 runs same prompt: 0.186, 0.186, 0.184 (deterministic, -14% vs HF)
  Full L0..L39 sweep vs HF: 40/40 within 5% magnitude, 6+/8 expert overlap
  (Previously L1-L6 were 20-38% over with 1-5/8 overlap)

Also: one-shot tracing log to verify v2-vs-v1 kernel selection at runtime (helpers_a.rs), useful for future bisection.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
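The stale-data failure mode can be reproduced in a few lines of host code. The sketch below is a toy model only, with hypothetical names and a scalar "row" standing in for a row of expert output: a kernel that writes only the first `cap` rows, followed by a reduction over every assigned row.

```rust
/// Toy stand-in for a grouped-GEMM launch that covers only `cap` rows:
/// rows at or beyond `cap` are never written.
fn run_kernel(out: &mut [f32], values: &[f32], cap: usize) {
    for (dst, &v) in out.iter_mut().zip(values).take(cap) {
        *dst = v;
    }
}

/// Toy reduction over every assigned row, covered by the kernel or not.
fn reduce(out: &[f32], assigned: usize) -> f32 {
    out[..assigned].iter().sum()
}

/// One prefill over a reused output buffer. With `zero_init` false, rows the
/// kernel skipped still hold whatever the previous prefill left behind.
/// Assumes `values.len() <= out.len()`.
fn prefill(out: &mut [f32], values: &[f32], cap: usize, zero_init: bool) -> f32 {
    if zero_init {
        out.fill(0.0);
    }
    run_kernel(out, values, cap);
    reduce(out, values.len())
}
```

Running the same second prefill after two different first prefills gives two different sums unless `zero_init` is set, which is the run-history dependence described above. Zeroing makes the result deterministic but still short of the full sum, because the skipped rows contribute nothing.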

The previous heuristic

    max_m_tiles = (avg_per_expert * 2).div_ceil(64)

silently truncated heavily-loaded experts. Observed in Qwen3.6-35B-A3B-FP8 at 4097-token chunk:

    ATLAS_EXPERT_LOAD: n_tokens=4097 avg=129 max=929 (expert 227) max_m_tiles=5 kernel_cap=320 truncated=true

Expert 227 had 929 tokens assigned but the kernel grid only covered 320 rows; the remaining 609 rows were zeroed (after the zero-init fix) but never computed. Those 609 tokens lost their expert-227 contribution entirely, producing a SYSTEMATIC -14% bias in routed-MoE output at L0.

Fix: size max_m_tiles for the worst possible case where one expert takes all tokens:

    max_m_tiles = (num_tokens * top_k).div_ceil(64)

This launches some empty tiles for under-utilized experts, but each early-exits on `m_idx >= M_expert` so the overhead is small relative to the previous correctness bug.

Result vs HF baseline (16k prompt, L0..L39 chunk-4 last-token):

  Before all fixes:   L0..L6 had 20-38% over-amplification + 1-5/8 overlap
  After v2+zero-init: -14% bias systematic across all layers
  After this fix:
    - L0 MOE_OUT: 0.2084 vs HF 0.2154 (-3.3%)
    - Mean ratio 0.996, stddev 0.0085, all 40 layers in [0.977, 1.021]
    - Mean overlap 7.5/8, 21/40 layers perfect 8/8 overlap
    - = FP8 quantization noise floor

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
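The numbers in the ATLAS_EXPERT_LOAD line can be rechecked with the two formulas quoted above. The sketch below uses hypothetical function names; the tile height of 64 comes from the formulas, and top_k = 8 is taken from the "Top-K=8" figure in an earlier commit of this chain.

```rust
/// Rows covered by one M-tile, per the `div_ceil(64)` in the commit message.
const TILE_M: usize = 64;

/// Old heuristic: assume no expert receives more than 2x the average load.
fn max_m_tiles_heuristic(avg_per_expert: usize) -> usize {
    (avg_per_expert * 2).div_ceil(TILE_M).max(1)
}

/// New sizing: cover the case where one expert receives every routed token.
fn max_m_tiles_worst_case(num_tokens: usize, top_k: usize) -> usize {
    (num_tokens * top_k).div_ceil(TILE_M)
}

/// Rows assigned to the busiest expert that fall outside the launched grid.
fn truncated_rows(max_rows_for_expert: usize, max_m_tiles: usize) -> usize {
    max_rows_for_expert.saturating_sub(max_m_tiles * TILE_M)
}
```

With avg=129 the heuristic yields 5 tiles (320 rows), so an expert holding 929 rows loses 609 of them. With num_tokens=4097 and top_k=8 the worst-case sizing yields 513 tiles and nothing is truncated.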

- /.dgx2-work/, /tasks/, /target-rebuild/, /spark-fastbin-*
- bench/longcode/hang-forensics/{*.log,*.jsonl,*.py,*.sh,...}
- docker/gb10/Dockerfile.{fix,67fix,combo,ffnbf16,fp32res,hangdiag,zeropen} (the canonical Dockerfile + Dockerfile.fast + Dockerfile.fence + Dockerfile.gemma-fp32 stay tracked)

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mirrors the FP8 path fixes (commits 34626d3 + adf39ce) to the NVFP4 grouped-GEMM forward (`forward_prefill.rs`). Same two bugs, same symptoms expected:

1. Zero-init expert output buffers unconditionally (was only in EP mode via `ctx.comm.is_some()`). Some grouped-GEMM kernel paths skip rows past per-expert end; without zero-init those rows kept stale data from previous prefills which contaminated the unpermute_reduce sum.
2. max_m_tiles = (num_tokens * top_k).div_ceil(64) (worst case), not (avg * 2).div_ceil(64). Real MoE routers concentrate experts ~7× the average (observed on Qwen3.6-A3B at chunk=4097: avg=129, max=929 for expert 227). The Poisson(avg) assumption in the prior comment is wrong for trained routers. The (avg*2) truncation silently lost ~14% of routed-MoE magnitude per layer.

Verified on AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4: 16k-context generation now produces a clean, coherent code summary.

Models likely impacted (NVFP4 routed-expert MoE): MiniMax M2.7-NVFP4, Mistral-Small-4-119B-NVFP4, Nemotron-3-Nano-30B-A3B-NVFP4, Qwen3-VL-30B-A3B-NVFP4, Qwen3.6-A3B-heretic-NVFP4, Gemma-4-31B-NVFP4, Qwen3.5-27B-NVFP4. All had latent under-counting and run-to-run non-determinism in their routed-MoE path.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

tbraun96 and others added 11 commits May 10, 2026 21:00

G-Deca requested review from AzeezIsh and tbraun96 as code owners May 20, 2026 03:08

tbraun96 and others added 14 commits May 20, 2026 05:25

Merge branch 'main' into fix/dflash-rope-config-driven

7854ace

tbraun96 and others added 8 commits May 20, 2026 14:31

Merge branch 'main' into fix/dflash-rope-config-driven

5aff4b1

Merge branch 'feat/qwen3.6-dense-mtp' into fix/dflash-rope-config-driven

64b2a20

xml_parser init

cda8ce9

cleanup+tests

0c71450

G-Deca force-pushed the qwen3_xml_tool branch from 34e0b18 to 0c71450 Compare May 21, 2026 03:21


Qwen3_xml Tool Parser #73
G-Deca wants to merge 33 commits into Avarok-Cybersecurity:main from G-Deca:qwen3_xml_tool


3 participants
