Skip to content

Qwen3_xml Tool Parser#73

Open
G-Deca wants to merge 33 commits into
Avarok-Cybersecurity:mainfrom
G-Deca:qwen3_xml_tool
Open

Qwen3_xml Tool Parser#73
G-Deca wants to merge 33 commits into
Avarok-Cybersecurity:mainfrom
G-Deca:qwen3_xml_tool

Conversation

@G-Deca
Copy link
Copy Markdown

@G-Deca G-Deca commented May 20, 2026

Summary

Problem

Qwen3.6-35B-A3B-FP8 with thinking mode enabled (--tool-call-parser qwen3_coder) produces empty-string values for
required tool call parameters. The model reasons about the call inside , which causes the XML parameter
extractor to see whitespace-only content and emit "" for typed fields — breaking downstream clients that expect
integers, booleans, or arrays.

Solution

Add a qwen3_xml parser that is identical to qwen3_coder in wire format and grammar, but applies a schema-driven
type coercion pass after extraction. String values are rewritten to the JSON type declared in the tool's
parameters schema:

┌──────────────────┬───────────────────────────────────────────────────┐
│ Schema type │ Coercion │
├──────────────────┼───────────────────────────────────────────────────┤
│ integer / number │ "10" → 10, "3.14" → 3.14 │
├──────────────────┼───────────────────────────────────────────────────┤
│ boolean │ "true" / "True" → true, "false" / "False" → false │
├──────────────────┼───────────────────────────────────────────────────┤
│ array / object │ JSON-parse the string value │
├──────────────────┼───────────────────────────────────────────────────┤
│ null │ "null" → JSON null │
├──────────────────┼───────────────────────────────────────────────────┤
│ Anything else │ left as-is │
└──────────────────┴───────────────────────────────────────────────────┘

Coercion never panics and never drops fields — unrecognised types and unparseable values are left unchanged.
Qwen3CoderParser is unmodified; existing tests stay green.

The hook runs at both call sites: non-streaming (build_choice_message) and streaming (handle_complete_tool_call +
handle_tool_call_delta), after backfill_required_params and before normalize_paths / validate_tool_calls.

kernels/gb10/qwen3.6-35b-a3b/MODEL.toml is updated to select qwen3_xml automatically via the Tier-2
[behavior].tool_call_parser override — no CLI flag needed.

Closes #

Test plan

Passes all standard test, builds clean, confirmed runs on GX10 hardware with the following docker:

  -p 8888:8888 \
  --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  atlas-gb10 \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --port 8888 \
  --bind 0.0.0.0 \
  --max-seq-len 131072 \
  --kv-cache-dtype fp8 \
  --kv-high-precision-layers auto \
  --gpu-memory-utilization 0.90 \
  --scheduling-policy slai \
  --tool-call-parser qwen3_xml \
  --enable-prefix-caching \
  --speculative```

- [ ] `cargo fmt --all -- --check`
- [ ] `ATLAS_SKIP_BUILD=1 cargo clippy --workspace --tests --all-features -- -Dwarnings`
- [ ] `bash scripts/check-license-headers.sh`
- [ ] Tested against a real model / hardware if the change affects runtime behaviour
- [ ] Added or updated tests where applicable

## Notes for reviewers

<!-- Design rationale, trade-offs, follow-ups you deferred, things you want a second opinion on. -->

## CLA

- [ ] I have read and agree to the [Contributor License Agreement](../CLA.md).

tbraun96 and others added 11 commits May 10, 2026 21:00
The bundled `mtp.safetensors` for `Qwen/Qwen3.6-27B-FP8` carries a single
full-attention layer + a dense gate/up/down MLP (no router, no experts).
Atlas's MTP loader assumed every MTP head was MoE-shaped, so the dense
loader (`Qwen35DenseWeightLoader`) just stubbed `load_mtp_weights → None`
and `--speculative` silently no-opped.

Changes:

- `MtpWeights` gains `dense_ffn: Option<DenseExpertWeight>`. `load_mtp`
  auto-detects FFN flavor by inspecting weight names: presence of
  `mtp.layers.0.mlp.gate_proj.weight` without a `.gate.weight` router
  selects the dense path; the MoE fields are populated with NULL
  placeholders.
- `MtpHead` gains a `dense_ffn_generic` projection triple. The
  constructor short-circuits all MoE quantization when dense weights
  are present (NVFP4 mode is rejected — Qwen3.6-27B-FP8 ships an
  FP8 MTP head, so users pass `--mtp-quantization fp8` or `bf16`).
- `MtpHead::forward_one` step 10 dispatches `dense_ffn_forward_generic`
  when the dense triple is populated. The dense path reuses the
  existing `dense_gemv_*` and `moe_silu_mul` kernels — no new kernel
  wiring or PTX changes.
- `Qwen35DenseWeightLoader::load_mtp_weights` now calls `load_mtp` when
  `mtp.fc.weight` is present in the store. The single auto-detecting
  loader handles both MoE (35B-A3B) and dense (27B) variants.
- `factory::build` warns when `--speculative` is requested but no MTP
  weights were loaded (was: silent no-op).

Tested via `scripts/check.sh check -p spark-model -p spark-server`.
End-to-end validation pending sparkrun on dgx2 with the
`qwen3.6-27b-dense-fp8-mtp-atlas` recipe.
Qwen3.6-27B-FP8 is the dense text sibling of the Qwen3.6-VL family.
Its config.json declares the same `vision_config` block as the VL
siblings (Qwen ships them as one config family), so since
`82f4794 (Apple Metal)` widened `is_qwen3_vl()` to also match
`qwen3_5 + vision`, the routing in `loader_for_config` sends the
27B dense checkpoint to `Qwen3VLWeightLoader` — which assumes MoE
and panics on the missing `mlp.gate.weight`.

The fix: check `is_qwen35_dense()` (requires `num_experts == 0`) BEFORE
`is_qwen3_vl()`. Dense text-only checkpoints are unambiguously
distinguishable by `num_experts == 0`; VL-MoE always has experts.

Repro:

  spark serve Qwen/Qwen3.6-27B-FP8 → "Failed to build model:
  Weight 'model.layers.0.mlp.gate.weight' not found in store"

After fix the dense loader handles it (and now also picks up the
bundled MTP head — see 43a04d2).
…litting them (Avarok-Cybersecurity#62)

* fix(thinkbrake): defer forced </think> past code fences instead of splitting them

F2 confidence early-stop (and the thinking-budget cap) forced `</think>`
the instant they tripped — including mid-line inside a ```python block
the model was drafting in its reasoning. Code tokens are near-
deterministic (top-1 >=0.95 for long runs) so F2 trips trivially on a
code block; the forced boundary split a statement and the reasoning
parser then cut reasoning_content mid-token, leaking the rest into
content. User report: "stops half-way through thinking while generating
a codeblock."

Fix: track ``` code-fence parity per sequence via the atomic fence
token id (resolved once in tokenizer_runtime, fail-open if a tokenizer
splits it). F2 keeps DETECTING/arming everywhere (a model can ramble in
code forever and must stay brakeable), but the forced `</think>`
INJECTION is deferred until the fence closes — a safe boundary right
after the code block, never mid-statement. THINK_LOOP period-repeat
watchdog is unchanged and still active inside fences.

Pure decisions extracted to SSOT helpers, called by production and
asserted by tests: toggle_code_fence, confidence_run_step,
should_inject_think_end. spark-server scheduler::helpers suite 18/18
green under #![deny(warnings)].

Verification: live repro (fibonacci, Qwen3.6-27B-FP8 + MTP) confirmed
the mid-codeblock split is gone and the ```python block is emitted
intact. End-to-end re-check of the defer-refinement's post-fence brake
timing is the remaining live step.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(watchdog): digit-normalized content-loop detector for template degeneration

Companion to the thinkbrake fence fix: after deferring the forced
</think> past code fences, live testing surfaced a separate content-
phase degeneration — Qwen3.6-27B at temp=0 emits a fixed line template
with varying numeric payload (`- B(46) = 104509868777\n- B(47) =
273508641\n …`) until max_tokens. The exact-token content-loop watchdog
structurally cannot catch this: the integer tokens differ every line so
no fixed token period repeats.

Fix: detect_content_token_loop_normalized maps numeric tokens to a
sentinel AND run-length-collapses consecutive sentinels. Run-collapse is
essential — Qwen3.6 is digit-level (`104509868777` → 12 single-digit
tokens, `273508641` → 9), so a 1:1 map alone leaves variable-length
sentinel runs and the period still varies; collapsing makes
`- B(<digits>) = <digits>\n` identical regardless of digit count.

Numeric-token mask (the 10 single-digit ids of 248070) built once at
startup in tokenizer_runtime via decode_with_special (NOT id_to_token —
that returns raw byte-level BPE pieces with the `Ġ` space marker),
exposed via a set/get OnceLock mirroring enable_loop_watchdog. OR-ed
into the existing content-loop watchdog at decode_logits_step.rs under
the SAME per-model enable_loop_watchdog gate (qwen3.6-27b /
qwen3.5-27b only). FP guard: CONTENT_LOOP_NORM_MIN_REPEATS=4 and the
matched period must contain >=1 sentinel AND >=1 structural token, so
pure-number columns and pure-prose loops stay the exact detector's job.

scheduler::helpers 23/23 green (5 new incl. variable-length-digit-run +
exact-path regression), clean under #![deny(warnings)]. Live
(27B-FP8+MTP): degeneration repro bounded 2000 -> 577 tokens (watchdog
fired), thinkbrake code-fence fix still intact, normal coding prompt
finish=stop (no false-positive early stop).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qwen3.6-27b): thinking_default=true (reasoning on by default)

qwen3.6-27b is a reasoning model but its MODEL.toml [behavior] had no
thinking_default key → ModelBehavior default (false) → resolve_thinking
returned enable_thinking=false for any client that doesn't explicitly
opt in. Open WebUI (and most OpenAI-compatible clients) send plain
requests with no enable_thinking flag, so the chat template injected an
empty `<think>\n\n</think>` thinking-off marker and reasoning_content
was always empty — thinking never displayed.

Set thinking_default=true. Plain requests now reason by default;
explicit per-request enable_thinking=false (or reasoning effort, etc.)
still wins via resolve_thinking. Verified live (27B-FP8+MTP): plain
request → reasoning_tokens=176, reasoning_content populated,
finish=stop. Build-time embedded via atlas-kernels build_parse.rs.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qwen3.6-27b): anti-repetition sampling stack + bounded in-fence </think> defer

Surfaced by the OpenWebUI "3D chess game" stress prompt (the canonical
long-code hard case). Two coupled root causes, both Atlas-side:

1. Sampling presets on this branch had presence_penalty=0.0 and no
   DRY/LZ — the exact condition commit 6c69d3f live-bisected as the
   cause of the announce/restart + verbatim-CSS-block degeneration on
   hard code-gen (QwenLM/Qwen3.6 Avarok-Cybersecurity#88/#115/#145, confirmed upstream on
   official vLLM/BF16). Replicated 6c69d3f's battle-tested stack
   (presence_penalty=1.5 + lz_penalty=0.2 + DRY 0.8/1.75/2) on the
   presets THIS branch's build_sampling actually selects
   (thinking_text / thinking_coding / non_thinking — no thinking_coding
   routing here; tools left 0.0). XTC not ported on this branch; the
   digit-normalized content-loop watchdog is the safety net.

2. Regression from the thinkbrake fence-defer (cd0ca9d): when the model
   writes its whole deliverable as a ```code block INSIDE <think>, the
   fence never closes so should_inject_think_end deferred the forced
   </think> indefinitely — budget brake fired at 256 but reasoning ran
   to 3025 tokens and the real answer was trapped in reasoning_content
   with a 499-char content stub. Bounded the deferral:
   THINK_DEFER_BUDGET_FACTOR=3 (hard-inject </think> past 3x budget
   even mid-fence; absolute ceiling 2048 when budget=None).

Live (27B-FP8+MTP, 3D-chess prompt): before = 401-token CSS-loop
garbage; after = reasoning capped exactly at 768 (=3x256), content
7234 chars with 27 THREE. calls, Scene/WebGLRenderer/Camera/
OrbitControls, valid </script></html>, finish=stop, no watchdog.
scheduler::helpers 24/24 green.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qwen3.6-27b): retune presence_penalty 1.5 → 1.0 (temp=0.6 sweep)

presence_penalty=1.5 (replicated from 6c69d3f) over-penalized on THIS
branch's realistic path: Open WebUI sends no sampling params so
requests run at the thinking_text preset default temp=0.6 (stochastic),
where a flat 1.5 per-seen-token penalty boosts EOS once common code
tokens are exhausted → premature termination (574 tok, 0 THREE. calls).
The earlier validation passed only because it used temp=0 (greedy
resists early EOS) — wrong regime.

Live presence_penalty sweep at temp=0.6 on the 3D-chess prompt:
  1.5 → 574 tok, 0 THREE.   (premature EOS)
  1.0 → 6782 chars, 17 THREE., </script></body></html>, finish=stop,
        single coherent impl, no loop  ← chosen
  0.5 → degenerate outlier
  0.3 → complete but 3× wasteful announce/restart rewrites
1.0 is the balance point between premature-termination (high pp) and
restart-spam (low pp). DRY (multi-token) + LZ (n-gram) + the
digit-normalized content-loop watchdog carry the precision
loop-breaking presence (single-token) shouldn't be doing alone.

Loop/degeneration layer now resolved (no watchdog fires, no restart
spam across regimes). NOTE: a separate, non-Atlas factor remains —
Open WebUI injects an empty `system: "User Context:\n\n"` which the
model reacts to with terse output (isolated: removing it 3×'s the
generation). Tracked for a follow-up Atlas robustness fix
(neutralize content-free system messages) + user-side Open WebUI
system-prompt config.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chat): neutralize content-free client system messages

Open WebUI injects an empty RAG/context system message —
`"User Context:\n\n"` (trims to the bare label `User Context:`) — when
no custom system prompt is set. Models react to a content-free system
directive by producing terse / prematurely-terminated output: isolated
2026-05-17, the identical 3D-chess request WITHOUT that system message
produced 3x the generation (1499 vs 469 completion tokens, 14 vs 3
THREE. calls). The string is purely client-side (zero matches in Atlas
source); Atlas relayed it faithfully and the jinja template rendered it
correctly — the model's terseness is a reasonable reaction to a
meaningless instruction.

We can't fix the client, so Atlas adapts: a leading system message
whose trimmed content is empty, or a single short bare `Label:` line
with no payload (`User Context:`, `Context:`, `System:`), is dropped
before templating so it can't poison generation. Model-agnostic (one
site, pre-jinja) — not per-template. Conservative: any multi-line or
post-colon content is a real prompt and is never stripped (unit tests
cover empty/whitespace, the OpenWebUI residue, and substantive prompts
incl. label-like-with-payload and long prose ending in ':').

Live: exact Open WebUI request → log "Dropped content-free client
system message dropped=User Context:", no degeneration. (Residual
length variance on this prompt is upstream Qwen3.6 temp=0.6 behavior —
QwenLM/Qwen3.6 Avarok-Cybersecurity#88 — not Atlas; Open WebUI sends no sampling params so
requests run at the preset default temp=0.6.)

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: green the pipeline (LoC split + clippy + fmt)

Make the full PR Avarok-Cybersecurity#62 merge chain pass CI. Three classes of fix:

1. LoC ≤500 (file-size-cap): split scheduler/helpers.rs (834 → 476) by
   moving its `#[cfg(test)] mod thinking_loop_tests` (360 lines) to
   helpers_tests.rs via `#[path]` — logical child of `helpers`, so
   `use super::*` resolves exactly as before; zero production change.
   (decode_logits_step.rs / serve.rs are on the CI allow_list — left.)

2. clippy (deny clippy::all) — fixes, several PRE-EXISTING on the base
   branch and unmasked once upstream crates compiled clean:
   - atlas-core config/methods.rs: manual checked division → checked_div
   - atlas-kernels build_codegen.rs: generated consts emit `&str`/`&[u8]`
     not `&'static …` (clears 15 redundant_static_lifetimes in the
     generated target_ptx.rs; fn-return `'static` kept — not linted)
   - spark-runtime buffers/sizes.rs + spark-server preflight.rs: manual
     checked division → checked_div().map().unwrap_or()
   - spark-model forward_layers.rs + spark-server phase_promote_prefills.rs:
     descending sort_by → sort_(unstable_)by_key(|x| Reverse(..))
   - spark-server openai/annotations.rs: loop+let-else-break → while let
   - spark-server scheduler/helpers.rs: iter().any(|t| t==X) → contains(&X)

3. cargo fmt --all (workspace) — normalizes the clippy edits plus
   pre-existing base fmt debt (mtp_head/new.rs, qwen35_dense.rs,
   decode_logits_seq/step.rs).

Verified in atlas-builder: fmt --check 0 diffs, no non-allowlisted .rs
>500, `cargo clippy --tests --workspace --keep-going` zero errors,
`cargo test --workspace` all green (incl. 342 spark-server + 23
helpers). Behavior-preserving throughout (test-mod move, semantically
identical clippy rewrites, whitespace-only fmt).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oken

Reconcile the thinkbrake code-fence work with main's Avarok-Cybersecurity#56 chunked-prefill
refactor (#37513bf split scheduler/phase_continue_prefills.rs into a
4-file dir-module) and Avarok-Cybersecurity#57 tool-call-parser system-prompt injection.

Conflicts (2), both resolved by taking main's side:
- atlas-kernels/build_codegen.rs: trivial; both sides independently
  dropped 'static from the generated const type. main's const_ty rename
  + preamble comment is canonical; branch's redundant change discarded.
- scheduler/phase_continue_prefills.rs: structural; took main's 4-file
  dispatcher (run_standard/run_batched_mixed/run_batched_prefill).

Re-threaded code_fence_token: Option<u32> (additive plumbing, after
think_start_token / before tool_call_start_token, no logic change) into
main's new module boundaries so the thinkbrake fence signal reaches the
existing toggle_code_fence SSOT in process_decode_logits:
- continue_in_progress_prefills sig + its run_standard_chunk_loop and
  run_batched_mixed_step calls
- run_standard.rs run_standard_chunk_loop sig + process_decode_logits call
- run_batched_mixed.rs run_batched_mixed_step sig + process_decode_logits call
mod.rs/decode_logits_step.rs/decode_step.rs retained the branch's
code_fence_token via clean 3-way auto-merge (verified, no E0061).

Audits: Avarok-Cybersecurity#57 (api/chat/mod.rs only) is disjoint from the branch's
thinkbrake decode logic (decode_logits_seq.rs only) — no interaction.
tokenizers 0.23/safetensors 0.7 dep bumps compile clean workspace-wide.

Verified (mirrors CI exactly): cargo fmt --all --check 0 diffs;
no non-allowlisted .rs >500 LoC; cargo clippy --workspace --tests
zero diagnostics; cargo test --workspace 662 passed / 0 failed
(incl. thinkbrake fence/defer/loop suites green).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors PR Avarok-Cybersecurity#61 onto feat/qwen3.6-dense-mtp. The CLI default of `nvfp4`
makes the dense-FFN MTP head guard (mtp_head/new.rs:58-65) reject
Qwen3.6-27B-FP8 (ships an FP8/bf16 MTP head), so `--speculative`
silently disabled MTP (has_proposer()=false → use_speculative=false).
bf16 is accepted by the guard → `--speculative` now engages dense MTP.
Higher accuracy; opt into nvfp4 explicitly with --mtp-quantization nvfp4.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 0 of the Qwen3.6-27B long-code plan. Paired-seed N=10 driver
(harness.py) + acorn-loose AST analyzer (analyze.mjs) + paired diff
(compare.py), frozen canonical 3D-chess prompt. analyze.mjs uses
acorn-loose so duplicate-declaration detection sees the degenerate
tail (strict-parse fallback would mask it). Validated on a known
-degenerate sample (dup=2, completeness_pass=false) and a valid
control (completeness_pass=true). results/ git-ignored.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… mode

Degenerate output usually never emits the closing ``` or </script>, so
the closed-only extractor returned "" exactly on failing samples and
masked valid_js_line_count / duplicate_declaration_count (gate metric
was unaffected). Handle unclosed trailing fence, no-fence raw HTML, and
unclosed <script>. Add harness.py --reanalyze (re-score saved samples
with SSOT summarize(); no regeneration). Validated: masked baseline
seed3 went 0/0,dup=0 -> 379 lines,dup=3; controls unchanged.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…escalating resample

Root cause (controlled greedy A/B, 2026-05-18): on long structured
code the model falls into a repetition loop; the loop-watchdog only
*truncated* (content kill / force-</think>) in BOTH the MTP and
non-MTP paths — MTP was never the lever. This converts truncate→recover.

- New scheduler/resample.rs (SSOT for Phase-2 tuning): per-output-length
  penalty ramp + escalation factor + `resample_penalty_factor` (unit-tested).
- decode_logits_step.rs: content-loop watchdog now escalates
  `resample_escalation` (compounding per re-fire) to steer the model OUT
  of the loop; only after RESAMPLE_MAX_ESC un-cleared escalations falls
  back to the original hard finish; decays on recovery. RESAMPLE_MAX_ESC=0
  ⇒ exactly the old kill behaviour.
- decode_logits_seq.rs: scale presence (clamped < EOS cliff) / lz / DRY
  by resample_penalty_factor(output_len, escalation). RAMP_SLOPE=0 ⇒ identity.
- resample_escalation:u8 added to ActiveSeq+SwappedSeq + all 5 ctor sites
  + both lifecycle mappings (mirrors think_watchdog_fires exactly).

Verified: cargo check + clippy clean, fmt 0 diffs, LoC ≤500 (helpers
back to 476; new resample.rs 74), resample unit tests 3/3.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atchdog→escalating resample"

This reverts 571e2e3. The vLLM-oracle diagnosis (2026-05-18) proved the
Qwen3.6-27B-FP8 long-code degeneration is an Atlas bug where Atlas's
anti-repetition PENALTIES are part of the cause, not the cure: same
model+prompt+temp on vLLM (no penalties, no MTP) produces a complete
game, while Atlas's penalty stack induces premature-EOS (documented in
the MODEL.toml 6c69d3f bisect comment). Phase-2 ADDED a per-length
penalty ramp — directionally backwards. Its watchdog→escalate half is
also moot: the real failure is a fuzzy loop with period ~80 tok > the
64-tok CONTENT_LOOP_PERIOD_MAX, so the watchdog never detects it.
Full revert restores the known-good baseline; the correct fixes
(penalty-preset realignment, thinking_default, prompt-template
divergence) are tracked separately and gated on the N≥10 paired harness.
See memory project_qwen36_27b_degeneration_rootcause.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o-thinking)

Additive, PCND-clean: --presence-penalty/--frequency-penalty/--no-thinking
are only sent when explicitly passed; default cell unchanged (server's
MODEL.toml preset). Enables the N≥10 paired data gate (shipped preset vs
vLLM-minimal) that Avarok-Cybersecurity#67's MODEL.toml realignment is contingent on — so
the preset change is data-driven, not n=1 (feedback_no_n1_stochastic_ab).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mbed_scale guard

Three independent fixes restore Qwen3.6-27B-FP8 long-code generation to
match vLLM behavior (~14.5x later degeneration, complete coherent code).

1. weight_loader/qwen35_dense.rs: A_log and dt_bias must be FP32.
   Consumer kernels (ssm_preprocess.cu, mamba2_ssm_decode.cu) declare
   them `const float*`; loading via `dense()` kept BF16 storage,
   reinterpreting 48-elt BF16 (96B) as 48-elt FP32 -> per-head scrambled
   decay gates. MoE sister loader (ssm_qwen35.rs:59-62) had this fix
   already with an explicit warning about exponential GDR amplification
   at long context — dense path missed the mirror.

2. weight_loader/qwen35_dense.rs + layers/qwen3_ssm/init.rs:
   ATLAS_FP8_SSM_PREFILL=1 routes SSM in_proj_qkv/out_proj through a
   native FP8 prefill GEMM (bf16_to_fp8 -> fp8_gemm_n128), eliminating
   the BF16-truncation intermediate that, amplified by k-conv's tiny
   weights (||conv-k|| ~ 18x smaller than ||conv-v||), rotated the k-band
   conv output direction. NVFP4 fallback preserved for decode batched
   paths. Mirrors the MoE set_fp8_weights pattern.

3. model/impl_a3.rs: scale_embeddings_fp32 now applies the same
   no-embed-scale guard as scale_embeddings_bf16. Without it, every
   non-Gemma model that opted into fp32-residual hard-failed with
   "Module 'embed_scale' not loaded".

Verification (greedy, --no-thinking, --max-tokens 6000, vs aligned
54-tok HF oracle and end-to-end generation): conv-k cos 0.550 -> 0.9998,
recur per-head cos 0.99 -> 1.00 with magnitude ratio 0.82 -> 0.99, gnorm
per-head cos 0.82 -> 0.999, tokens_to_first_degeneration 1196 -> 17327.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@G-Deca G-Deca requested review from AzeezIsh and tbraun96 as code owners May 20, 2026 03:08
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 20, 2026

Thanks for the contribution. Before we can merge, please sign our Contributor License Agreement by replying to this comment with exactly:

I have read the CLA Document and I hereby sign the CLA


I have read the CLA Document and I hereby sign the CLA


2 out of 3 committers have signed the CLA.
✅ (tbraun96)[https://github.com/tbraun96]
✅ (rrstesiak)[https://github.com/rrstesiak]
@G-Deca
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@G-Deca
Copy link
Copy Markdown
Author

G-Deca commented May 20, 2026

I have read the CLA Document and I hereby sign the CLA

@G-Deca
Copy link
Copy Markdown
Author

G-Deca commented May 20, 2026

recheck

tbraun96 and others added 14 commits May 20, 2026 05:25
Sweep 2026-05-20: every Qwen-family MODEL.toml under kernels/gb10 now
declares 'thinking_default = true' in [behavior].

Reasoning-tier Qwen3.5/3.6 models materially benefit from CoT on
multi-step prompts; the prior implicit default of false (or explicit
false on qwen3-next-80b-a3b) silenced thinking unless the caller passed
chat_template_kwargs.enable_thinking=true.

Per-request override and CLI --disable-thinking kill-switch retain
precedence per the documented ladder (CLI flag > request body >
MODEL.toml [behavior]).

Note: MODEL.toml is compile-time-baked via atlas-kernels/build.rs →
build_codegen.rs, so a rebuild is required for these edits to take
effect in the running binary.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…stigation

Distilled playbook for tracking down quality regressions where Atlas
output diverges from a reference framework (vLLM / HF) on the same
model checkpoint. Grounded in commit 3ebc08a (the GDN decay-gate
precision + FP8 SSM prefill fixes that produced the 14.2× improvement
in tokens_to_first_degeneration on Qwen3.6-27B-FP8).

Covers the cheapest-signal-first elimination ladder, the byte-exact
HF CPU oracle pattern, per-head magnitude as a pointer-bug fingerprint
(std/min/max diagnostic), the sister-loader diff rule, reversal
discipline, and the polling cadence for multi-hour reproductions.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous false value (inherited from Qwen3.5's 'Think-Satisfy' guard)
unconditionally suppressed thinking on every tool-active turn — i.e.
every opencode interaction, since opencode always carries tools.

Verified post-rebuild on atlas-gb10:fix sha 5072cb1562c4:
  tools_active=true, --num-drafts=1, thinking_in_tools=false  →  0 reasoning_content chunks
  tools_active=true, --num-drafts=1, thinking_in_tools=true   → 34 reasoning_content chunks

Qwen3.6's reasoning is meaningfully better than 3.5's so Think-Satisfy
shouldn't re-emerge; max_thinking_budget=512 caps any drift and F28
auto-disables thinking when the previous message is a tool error.
One-line revert if tool-call density regresses.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three changes (audit per DEBUGGING_METHODOLOGY.md):

1. weight_loader/qwen35_dense.rs: drop ATLAS_FP8_SSM_PREFILL env gate.
   The fix has been live since 3ebc08a, verified 14.2× degeneration-onset
   improvement and matching vLLM behavior class. Unconditional now for
   every FP8-on-disk dense Qwen3.6 variant. Also switch norm.weight to
   dense_f32_safe (mirrors the MoE sister loader's FP32-aware path —
   defensive against checkpoints that ship norm weights as fp32).

2. weight_loader/qwen35/load_layers/linear_attn_arms.rs: same FP8 SSM
   prefill path cross-ported to the MoE A3B loader. Sister-loader audit
   (3 parallel agents per DEBUGGING_METHODOLOGY.md §6) found that the
   MoE A3B has identical asymmetric conv weights (k-segment ~18×
   smaller than v) and was therefore vulnerable to the same conv-k SNR
   collapse via FP8→BF16→NVFP4→BF16 triple-quant chain. Now both
   variants dispatch SSM prefill through fp8_gemm_n128 (BF16 act × FP8
   weight, FP32 accumulator) with NVFP4 retained as decode/batch
   fallback.

3. DEBUGGING_METHODOLOGY.md appendix: update Bug Avarok-Cybersecurity#1 description to
   note the env gate has been removed and the fix cross-ported to MoE.

Bug Avarok-Cybersecurity#2 audit was clean — codebase-wide sweep of every const float*
kernel parameter found all SSM scalar params (A_log, dt_bias, conv
biases, D_param across Qwen + Nemotron) already on the dense_keep_f32
/ dense_bf16_as_f32 path. No new pointer-alias sites.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Quick follow-up to 7d5e8fc: the MoE A3B Bug Avarok-Cybersecurity#1 cross-port allocates
native-FP8 SSM weights inside build_linear_attention_nvfp4 (called
per LinearAttention layer), which was silent. Add one tracing::info!
inside the if-Fp8Dequanted block so startup logs visibly confirm the
fix is active — mirrors the dense loader's top-level log (which fires
once before the layer loop).

For 35B-A3B with 30 SSM layers, expect 30 'SSM[...] ... native FP8
prefill GEMM' lines. Quick rebuild only, no behavior change.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hunk-position lesson

Adds env-gated layer-0..N GDN intermediate dumper for the MoE A3B path:
  ATLAS_GDN_DUMP=<dir>           — base output dir
  ATLAS_GDN_DUMP_LAYERS=0,15,29  — SSM-layer indices to capture
  ATLAS_GDN_DUMP_N_SSM=30        — SSM-layer-count for counter modulo

The dumper hooks into trait_prefill.rs at 4 sites (post-conv, post-l2norm,
post-recurrence, post-gnorm) and captures the last token's BF16 slice.
Under chunked prefill, the file is overwritten on every scheduler chunk so
the on-disk dump corresponds to the LAST chunk's last-token (== position
L-1 of the full prefill, not chunk_len-1 of the first chunk).

Companion Python tooling:
  hf_gdn_ref_a3b.py        — HF CPU oracle for A3B (32 v-heads, conv_dim=8192,
                             hidden=2048). Generated TOK list from Atlas /tokenize.
  gdn_chain_diff_a3b.py    — diff comparator with A3B segment shapes
                             (q/k/v = 2048/2048/4096).

Methodology doc update: §3 gotcha Avarok-Cybersecurity#4 (don't use FP8 checkpoint as oracle —
HF silently ignores scale_inv), Avarok-Cybersecurity#5 (when comparing across SUT configs,
guarantee identical dumped positions or you'll see methodological noise).
§7 reversal log: chunked-prefill drift hypothesis added (refuted).

Investigation findings on Qwen3.6-35B-A3B at L=16k:
  Layer 0:   cos > 0.99985 — byte-perfect to BF16 floor at every stage.
  Layer 15:  gnorm cos 0.64 — depth + long-context quantization noise.
  Layer 29:  gnorm cos 0.71 — same class.
  At L=1244 the L15/L29 drift is much smaller (gnorm 0.92 / 0.86).
  Final model output remains coherent (sub-perceptual drift).
  No discrete numerical bug — drift is expected quantized-inference noise.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two changes in service of long-context drift investigation:

1. Multi-layer dump support — ATLAS_GDN_DUMP_LAYERS=0,1,2,3,15,29 (comma list
   of SSM-layer indices). Per-stage AtomicBool latches per (layer_idx, stage);
   counter wraps mod ATLAS_GDN_DUMP_N_SSM. Last-call overwrites so dumps land
   at position L-1 of the full prefill (not chunk_len-1 of the first chunk).

2. ATLAS_DISABLE_WY4=1 — forces fallback to single-token persistent kernel
   (or split4) for kernel-numerics isolation. WY4 was a suspect; turned out
   not to be the dominant noise source.

Investigation summary (Qwen3.6-35B-A3B, L=16k, layer-0..L15 walk against HF
BF16 baseline):

  L0  cos=0.99988 (byte-perfect at all L: 31, 1244, 16100) — proves
      Bug#1/Avarok-Cybersecurity#2/Avarok-Cybersecurity#3 fixes work and rules out first-layer kernel bugs.
  L1  gnorm cos=0.69547 at L=16k  vs  0.94266 at L=1244
      → real, length-dependent drift starting at layer 1.
  L3+ gnorm cos in 0.23-0.67 range — non-monotonic compounding.

Mechanism: FP8/NVFP4 weight-quantization noise on the per-layer in_proj_qkv
(specifically the W_z portion fed into the gnorm silu(z) gate). silu is a
non-linear amplifier near z≈0, so small per-layer FP8 noise in z becomes
large drift in gnorm output. At long L, the SSM recurrence H_t = g*H_{t-1}
+ v*k^T amplifies tiny per-step noise through the multiplicative gate
(H decay ratio compounds e^(ε·L) over L steps).

Refuted hypotheses (all evidence in commit + tests):
 - Chunked-prefill state-precision loss (refuted with corrected dumper)
 - FP8 KV cache as primary culprit (only +0.05 cos with BF16 KV)
 - BF16 residual stream as primary culprit (FP32 residual is *worse* at L1)
 - WY4 algebraic correction (disabling makes L1 worse, not better)

True fix requires loader-side precision schedule: load qkvz weight (or at
least its W_z slice) at BF16 unconditionally, accepting ~50 MB/layer extra
VRAM. Roughly 1.5 GB more for the A3B at 30 SSM layers. Not done in this
patch — needs design discussion on whether to make this default-on, env-gated,
or per-model.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Diagnostic env override that skips both FP8 and NVFP4 paths and uses
the BF16 in_proj_qkvz weight via ops::dense_gemm. Tests whether weight
quantization is the dominant source of layer-1+ long-context drift.

Result (Qwen3.6-35B-A3B, L=16k, vs HF BF16 baseline):

  Stage        FP8/NVFP4 (default)     BF16 weights forced
  L0 gnorm     0.99988                 1.00000   (perfect match)
  L1 gnorm     0.69547                 0.4274    (WORSE by 0.27)
  L2 gnorm     0.93958                 0.9513    (+0.01)
  L3 gnorm     0.23399                 0.4655    (+0.23)

Key inference: at L0 BF16 weights give byte-perfect match (cos=1.0000)
proving Atlas's BF16 GEMM == HF's BF16 GEMM when there's no SSM-state
or downstream-layer mixing. At L1, identical BF16 weights produce a
substantially DIFFERENT result from HF (cos 0.43, magnitude diverges
too). This rules out FP8/NVFP4 weight precision as the dominant drift
source at deeper layers under long context.

Remaining suspects (real causes):
 - Atlas's GDN prefill kernels (WY4/persistent/split4) vs HF's
   chunk_gated_delta_rule have different FP32-accumulation reduction
   orders. Per-step recurrence errors accumulate differently.
 - L2-norm reduction order on q,k may differ.
 - Gated RMSNorm kernel's silu(z) is computed with __expf (fast intrinsic)
   on Atlas but expf (IEEE) on HF — small per-element differences amplify
   through the non-linear gate.
 - Attention layer (model layer 3+) at L=16k has FP8 weights and a 16k-
   token softmax; its output noise propagates into downstream SSM layers.

Each of these is a non-trivial kernel-level investigation. The dumper
+ Python comparator from this commit chain (+ the per-layer walk pattern
proven here) are the right infrastructure to drill further when prioritized.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hts override

Three diagnostic env knobs for long-context drift investigation:
  ATLAS_DISABLE_WY4=1          — fall back to single-token persistent kernel
  ATLAS_FORCE_PERSISTENT=1     — force per-token persistent at any k
  ATLAS_GDN_BF16_WEIGHTS=1     — skip FP8/NVFP4 paths, use BF16 dense GEMM

Per-substep dumpers added at:
  pre_norm    (layer input from residual stream)
  post_norm   (post-input_norm, into in_proj_qkv)
  post_qkvz   (post-deinterleave Q|K|V|Z, into conv1d)

Investigation findings on Qwen3.6-35B-A3B at L=16k vs HF Qwen3.6-35B-A3B
BF16 baseline:

  L0 gnorm cos = 0.99988 (clean at all L)
  L1 gnorm cos = 0.43-0.70 depending on config
  L3+ drift compounds non-monotonically

Combinations tried (no single fix found):
  default (FP8/NVFP4):             L1 gnorm 0.695
  BF16 KV cache:                   L1 gnorm 0.704  (+0.01)
  ATLAS_FP32_RESIDUAL=1:           L1 gnorm 0.386  (worse)
  ATLAS_DISABLE_WY4=1 → split4:    L1 gnorm 0.301  (worse)
  ATLAS_GDN_BF16_WEIGHTS=1:        L1 gnorm 0.427  (worse — and proves
                                     weight precision is NOT the cause)
  ATLAS_FORCE_PERSISTENT=1:        L1 gnorm 0.595  (slightly worse)
  Max precision (BF16w + FP32res + persistent):  L1 gnorm 0.648

Key inference: at L0 with BF16 weights, cos=1.00000 (byte-perfect)
proving Atlas's GEMM == HF's GEMM with identical inputs. At L1, same
configuration gives cos=0.43, meaning L1's INPUT differs from HF's L1
input. The drift must come from the L0→L1 transition: residual_add
chain that includes L0's out_proj output + post_attn_norm + MoE
output (NOT covered by ATLAS_GDN_BF16_WEIGHTS which only affects
qkvz GEMM, NOT the MoE experts or out_proj).

Remaining suspect: MoE expert quantization noise propagating via the
residual stream. The MoE has 256 experts with FP8/NVFP4 weights and
per-token routing; each token's MoE output has small quant noise that
ADDS to the residual, drifting L1+ inputs vs HF.

The pre_norm/post_norm dumpers have a subtle bug at small-L (capture
zeros for short prompts but real data for long) — buffer layout
interaction. Filed but not blocking the main finding.

Cleanest next move: instrument MoE output dumping + comparison vs HF
to confirm/refute the MoE-quant-noise hypothesis. If confirmed, load
MoE experts at BF16 (large memory cost) is the only real fix path.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The DFlash drafter constructor unconditionally applied YaRN scaling with
factor=64 / original_max_position_embeddings=4096 / beta_fast=32 /
beta_slow=1 hardcoded. The v2 2026-04-27 Qwen3.6-DFlash drafter ships
`rope_scaling: null` (plain RoPE), so every low-frequency RoPE pair was
mis-scaled — pairs 0..11 divided by 64, pairs 11..26 ramped — landing
drafter Q/K rotations in the wrong angular basis at every layer.

Replace the hardcoded YaRN block with a config-driven loader:

  * `DflashConfig` gains `rope_theta` (default 10M, matches Qwen3.6) and
    an optional `rope_scaling` block mirroring HF transformers' Qwen3
    config so `serde_json::from_str` works directly on the drafter's
    `config.json`.
  * `BlockDiffusionDraftHead::from_weights` reads that block:
      - `None` ⇒ plain RoPE (`inv_freq[j] = 1 / θ^(2j/dim)`).
      - `Some(yarn)` ⇒ existing YaRN formula, parameters from config.
      - Other / unrecognised `rope_type` ⇒ warn and fall back to plain.

Tested against the v2 Qwen3.6-DFlash drafter; per-layer drafter hidden
states are now bit-perfect with the PyTorch reference forward (cos=1.0
through all 5 drafter layers). End-to-end DFlash speculative decoding
still has a separate bug downstream of the drafter — investigation
ongoing — but the RoPE basis is now correct and worth landing on its
own.

Refs: Avarok internal thread #development 2026-05-19.
Adds dump hooks at:
  out_proj : SSM out_proj output (between gnorm and residual chain)
  moe_out  : MoE final output (post shared-expert blend)

Definitive findings (Qwen3.6-35B-A3B, L=16k, BF16 weights forced for qkvz):

  L0 SSM out_proj cos vs HF:    0.99210 (clean — SSM out_proj works)
  L0 MoE out cos vs HF combined: 0.42130  |Atlas|/|HF| = 2.266 (drifted)
  L0 MoE out cos vs HF routed-only: 0.40608  |Atlas|/|HF| = 2.368 (drifted)

The 2.37× magnitude inflation against HF's routed-only output proves the
drift originates in the ROUTED-expert pathway, NOT the shared expert
(which only contributes ~4% magnitude in HF).

Ruled out:
 - Top-K routing normalization (kernel reviewed, dispatch.rs:72
   hardcodes norm_topk_prob=true for qwen3_5_moe family)
 - Shared expert sigmoid gating (Atlas applies it via moe_batched_blend)
 - Chunked prefill (refuted earlier in this branch)
 - SSM kernels (L0 gnorm + out_proj are clean)

Remaining suspect: FP8 expert weight scaling. Each of 256 experts has
3 FP8-quantized matrices (gate/up/down), each with per-block weight_scale
and per-matrix weight_scale_2. An indexing or magnitude error in scale
loading would systematically inflate the routed weighted sum by the
same factor for all tokens. The 2.37× ratio is consistent with a
~sqrt(K=8)≈2.83 or a single global mis-scaled weight matrix.

Cleanest next experiment: dump a single expert's GEMM input + output +
weight_scale + weight_scale_2 from Atlas, compute HF expert(x) on the
same input using BF16 weights, and compare. If Atlas's per-expert
output differs by a constant factor from HF's, the FP8 scale loading
is the bug.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… cause localized to L0 MoE routing

Diagnostic env knobs added:
  ATLAS_DUMP_EXPERT_IDS=1  — logs top-K expert indices + weights per token
                              per layer (in moe/forward_prefill_fp8.rs)
                            + ATLAS_GATE_INPUT log (router_in last_tok |x|+first5)
                            + ATLAS_GATE_LOGITS log (raw pre-softmax top-10 + stats)
  ATLAS_DUMP_EMBED=1       — logs hidden_dst magnitude pre/post scale_embeddings
                              (in prefill_b/embed_chunk.rs)
  ATLAS_GDN_BF16_WEIGHTS=1 — extension: also installs BF16 out_proj_dense in
                              the MoE A3B loader so the dispatcher takes the
                              dense_gemm BF16 path (overrides FP8/NVFP4 out_proj).

Investigation chain (Qwen3.6-35B-A3B, L=16k, last token = pos 16098 = tok 271):

  ✅ embedding lookup: Atlas bit-identical to HF (|x|=0.3164, first5 match)
  ✅ embed_tokens.weight tensor: bit-identical between FP8 and BF16 checkpoints
  ✅ gate.weight tensor: bit-identical between checkpoints
  ✅ post_attention_layernorm.weight: bit-identical between checkpoints
  ✅ SSM out_proj cos vs HF: 0.99210 (clean)
  ✅ SSM gnorm cos vs HF: 0.99988 (clean)

  ❌ Gate INPUT direction:
       Atlas |x|=23.06 first5=[ 0.352, -0.359, -0.598,  0.070,  0.408]
       HF    |x|=25.43 first5=[-0.832, -0.027,  0.297, -0.046,  0.036]
       Magnitudes match within 10% but per-element directions completely differ.

  ❌ Gate LOGITS top-10:
       Atlas: [102, 131, 15, 21, 228, 52, 93, 96, 14, 116] (raw, mean -5.66)
       HF:    [ 81, 140, 158, 208, 200, 132, 115, 86, 95, 206] (post-softmax)

  ❌ Top-K=8 expert selection overlap: 0/8 (ZERO common experts)

  ❌ MoE output cos vs HF: 0.42 (catastrophic divergence)

  ❌ BF16 out_proj fix attempt: NO meaningful change (still 0/8 overlap)
     — eliminates SSM out_proj FP8 quant as the cause

Conclusion: with all upstream weights and SSM outputs bit-identical or
sufficiently close, the divergent gate input MUST come from a structural
difference in either residual_add_rms_norm (kernel formula/precision)
or some intermediate step I haven't instrumented yet. The MoE block's
top-K routing is so sensitive to gate-input direction that even a small
upstream noise produces 0/8 overlap with HF's selection.

Remaining suspects for the deeper drill:
  1. residual_add_rms_norm kernel: Atlas's formula vs HF's torch RMSNorm
     differ in some accumulation detail (eps, reduction order, post-norm
     weight multiplication order). The kernel's BF16 input/output but
     FP32 reduction is correct, but the formula might subtly differ.
  2. The 'hidden' buffer at residual_add_rms_norm time may not be the
     pristine embedding — some other op may have written to it (vision
     overlay? marconi snapshot restore? warmup leftovers?).
  3. post_attention_layernorm WEIGHT loading: bit-identical on disk
     but possibly loaded with wrong stride/transpose in Atlas's
     'load_dense_ffn' or similar.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The v1 `moe_fp8_grouped_gemm` kernel was originally documented as just
having a coalescing performance bug. While debugging Qwen3.6-35B-A3B-FP8
producing gibberish at 16k context, captured per-expert dumps showed v1
has a NUMERICAL bug for some (token, expert) tile combinations:

  chunk-4 last-token, expert 200, up_proj output:
    v1 (default): |x|=28.0   ← 5× too large
    v2 (coalesced): |x|=??   ← correct shape
  HF baseline (BF16 oracle): |x| ~ 5

The amplification propagates: up=28 then silu(gate)*up=8.4 then
down_proj→4.5, vs the other 7 experts' down output ~0.2-0.6.

At the prefill-chunk-4 boundary this drives a 42% residual-stream
amplification (Atlas L1 input |x|=0.819 vs HF 0.577); v2 brings it to
6% (0.611 vs 0.577) and Atlas now produces a coherent summary of a
16k-token prompt instead of gibberish.

Verification:
- L0 chunk-4 routed-only MoE output: v1=0.44, v2=0.26, HF=0.21
- 16k-context generation: now produces clean summary
- v1 path preserved as env override ATLAS_FP8_MOE_COALESCED=0

Diagnostic instrumentation kept (gated on ATLAS_DUMP_EXPERT_IDS=1):
  ATLAS_GATE_INPUT, ATLAS_GATE_LOGITS, ATLAS_EXPERT_IDS,
  ATLAS_MOE_OUT, ATLAS_ROUTED_ONLY, ATLAS_SHARED_OUT,
  ATLAS_SHARED_GATE; pre-norm SUM/HIDDEN/OUTPROJ in qwen3_ssm.
  Zero overhead when env unset. Plus a diagnostic
  ATLAS_FORCE_NVFP4_MOE=1 path in qwen35 load_layers for future
  same-class bug bisection.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tbraun96 and others added 8 commits May 20, 2026 14:31
The expert_gate_out/up_out/down_out buffers were only zeroed when
`ctx.comm.is_some()` (EP mode). In single-GPU mode they were left
uninitialized, which is a problem because:

  max_m_tiles = (avg_per_expert * 2).div_ceil(64).max(1)

assumes peak-per-expert ≤ 2× average — skewed routing violates this,
so the grouped GEMM kernel skips rows past max_m_tiles*64. Those rows
keep STALE DATA from the previous prefill (or uninitialized memory on
the first prefill), which propagates through unpermute_reduce as
spurious contributions to the routed-MoE output.

Effect was chunk-size + run-history dependent:
- Same prompt, same gate_input (25.97), different ATLAS_ROUTED_ONLY:
  4-chunk first-fire:  0.65
  4-chunk after-warmup: 0.35
  2-chunk:             1.13
  All vs HF baseline:  0.21

After fix (zero-init unconditional):
  3 runs same prompt: 0.186, 0.186, 0.184 (deterministic, -14% vs HF)
  Full L0..L39 sweep vs HF: 40/40 within 5% magnitude, 6+/8 expert overlap
  (Previously L1-L6 were 20-38% over with 1-5/8 overlap)

Also: one-shot tracing log to verify v2-vs-v1 kernel selection at
runtime (helpers_a.rs), useful for future bisection.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous heuristic max_m_tiles = (avg_per_expert * 2).div_ceil(64)
silently truncated heavily-loaded experts. Observed in
Qwen3.6-35B-A3B-FP8 at 4097-token chunk:

  ATLAS_EXPERT_LOAD: n_tokens=4097 avg=129 max=929 (expert 227)
                     max_m_tiles=5 kernel_cap=320 truncated=true

Expert 227 had 929 tokens assigned but the kernel grid only covered
320 rows — the remaining 609 rows were zeroed (after the zero-init
fix) but never computed. Those 609 tokens lost their expert-227
contribution entirely, producing a SYSTEMATIC -14% bias in routed-MoE
output at L0.

Fix: size max_m_tiles for the worst possible case where one expert
takes all tokens:

  max_m_tiles = (num_tokens * top_k).div_ceil(64)

This launches some empty tiles for under-utilized experts, but each
early-exits on `m_idx >= M_expert` so the overhead is small relative
to the previous correctness bug.

Result vs HF baseline (16k prompt, L0..L39 chunk-4 last-token):
  Before all fixes: L0..L6 had 20-38% over-amplification + 1-5/8 overlap
  After v2+zero-init: -14% bias systematic across all layers
  After this fix:
    - L0 MOE_OUT: 0.2084 vs HF 0.2154 (-3.3%)
    - Mean ratio 0.996, stddev 0.0085, all 40 layers in [0.977, 1.021]
    - Mean overlap 7.5/8, 21/40 layers perfect 8/8 overlap
    - = FP8 quantization noise floor

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- /.dgx2-work/, /tasks/, /target-rebuild/, /spark-fastbin-*
- bench/longcode/hang-forensics/{*.log,*.jsonl,*.py,*.sh,...}
- docker/gb10/Dockerfile.{fix,67fix,combo,ffnbf16,fp32res,hangdiag,zeropen}
  (the canonical Dockerfile + Dockerfile.fast + Dockerfile.fence +
  Dockerfile.gemma-fp32 stay tracked)

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the FP8 path fixes (commits 34626d3 + adf39ce) to the NVFP4
grouped-GEMM forward (`forward_prefill.rs`). Same two bugs, same
symptoms expected:

  1. Zero-init expert output buffers unconditionally (was only in EP
     mode via `ctx.comm.is_some()`). Some grouped-GEMM kernel paths
     skip rows past per-expert end; without zero-init those rows kept
     stale data from previous prefills which contaminated the
     unpermute_reduce sum.

  2. max_m_tiles = (num_tokens * top_k).div_ceil(64) (worst case),
     not (avg * 2).div_ceil(64). Real MoE routers concentrate experts
     ~7× the average (observed on Qwen3.6-A3B at chunk=4097:
     avg=129, max=929 for expert 227). The Poisson(avg) assumption
     in the prior comment is wrong for trained routers. The (avg*2)
     truncation silently lost ~14% of routed-MoE magnitude per layer.

Verified on AEON-7/Qwen3.6-35B-A3B-heretic-NVFP4: 16k-context
generation now produces a clean, coherent code summary.

Models likely impacted (NVFP4 routed-expert MoE): MiniMax M2.7-NVFP4,
Mistral-Small-4-119B-NVFP4, Nemotron-3-Nano-30B-A3B-NVFP4,
Qwen3-VL-30B-A3B-NVFP4, Qwen3.6-A3B-heretic-NVFP4, Gemma-4-31B-NVFP4,
Qwen3.5-27B-NVFP4. All had latent under-counting and run-to-run
non-determinism in their routed-MoE path.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants