Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
163 commits
Select commit Hold shift + click to select a range
2006195
feat(quality): TQ+ port P2 — KLD-vs-baseline harness for AURA quality…
TheTom May 26, 2026
728721c
test(quality): characterize AURA quality curve across bit-widths
TheTom May 26, 2026
0af5844
feat(aura): auto-asymmetric K-precision policy + aura8v4 / aura8v2 pr…
TheTom May 26, 2026
039b830
test(aura): pin matched-norm L2 correction across bit-widths + dynami…
TheTom May 26, 2026
6e1d7bf
feat(ops): Ops.auraFlashSdpa + supportsAuraFlashSdpa wrappers (model-…
TheTom May 26, 2026
4ca88cd
fix(ops): plumb kvStride through Ops.auraFlashSdpa
TheTom May 26, 2026
a0eb292
feat(aura): wire compressed flash decode in Qwen3Layer + Q pre-scale
TheTom May 27, 2026
66a1238
review(pr-15): credit Tom Turney on new files + auto-asym opt-in default
TheTom May 27, 2026
6f6e15d
telemetry: per-mixer Profile.signpost wrappers for decode/prefill att…
TheTom May 27, 2026
6a9817f
perf(aura): unify cache to activation dtype + wire 2-pass FA decode (…
TheTom May 27, 2026
dc767bf
perf(aura): default 2-pass blockSize 64 → 32 (sweep-validated +2-4% o…
TheTom May 28, 2026
3649f96
chore(aura): sync to metaltile #226 — rotation/boundaries → Tensor<T>
TheTom May 30, 2026
e88aec2
test(aura): drop hardcoded model path — env-driven FFAI_AURA_BENCH_MO…
TheTom May 30, 2026
9fb7625
test(aura): blockSize sweep — side-channel log + per-cell wall time
TheTom May 30, 2026
3f95b10
feat(loader): GGUF v3 reader + DeepSeek-V4 tensor bundle & block dequant
TheTom Jun 2, 2026
5358477
feat(ops): DeepSeek-V4 MoE/attention Metal ops + live-compile PSO cache
TheTom Jun 2, 2026
e1fc536
feat(model): DeepSeek-V4-Flash prefill + decode forward
TheTom Jun 2, 2026
ff8f48c
feat(kernels): NAX (matmul2d) + simdgroup MoE GEMM kernel sources
TheTom Jun 2, 2026
8ca4269
feat(cli): dsv4bench harness (prefill/decode throughput + correctness)
TheTom Jun 2, 2026
c70b0bf
test(dsv4): GGUF reader/ops/integration + Device scratch tests + M5 M…
TheTom Jun 2, 2026
e45aa83
chore(dsv4): scrub WIP labels from status comments + neutralize bench…
TheTom Jun 3, 2026
b4a2f65
fix(loader): nonisolated(unsafe) on resident-gather pool pointers for…
TheTom Jun 3, 2026
e720481
refactor(dsv4): address review — consolidate model files, drop dev/ +…
TheTom Jun 3, 2026
64f1502
refactor(dsv4): move system-free-memory check into MemoryStats
TheTom Jun 3, 2026
07e06fb
chore(test): neutralize ds4-model default path → deepseek-v4-flash
TheTom Jun 3, 2026
b563676
test(dsv4): pare integration suite to 4-point pattern + add GGUF cove…
TheTom Jun 4, 2026
2b82ecc
refactor(aura): move quality helpers to Telemetry/ + scrub external-r…
TheTom Jun 4, 2026
3cc2368
feat(rust): modular cross-platform inference engine skeleton
TheTom Jun 4, 2026
ee18bed
merge main (rust engine skeleton) into tom/wip/gguf-loader-and-dsv4
TheTom Jun 4, 2026
b9b209b
feat(rust/cuda): real Device impl + on-hardware smoke test
TheTom Jun 4, 2026
c7be66c
feat(ffi): C-ABI bridge so Swift consumes the shared engine layer
TheTom Jun 4, 2026
c60c511
feat(rust/ops): real elementwise op layer over the Device trait
TheTom Jun 4, 2026
d171310
feat(rust/ops): dispatch real registered metaltile kernels (rms_norm,…
TheTom Jun 4, 2026
d7abda6
feat(rust/ops): full elementwise+reduction op set on CUDA via the reg…
TheTom Jun 4, 2026
5b7f43a
feat(rust/ops): decode-time attention (sdpa_decode) on CUDA
TheTom Jun 4, 2026
0eef182
feat(rust/models): transformer-LLM decode layer on the shared op layer
TheTom Jun 4, 2026
27781cd
feat(rust/models): full transformer forward → logits on the shared layer
TheTom Jun 4, 2026
55a465b
feat(rust): real Qwen3-0.6B runs on the shared engine, matches HF
TheTom Jun 4, 2026
350191e
feat(rust/metal): real Metal backend — same model runs on Apple GPU
TheTom Jun 4, 2026
a22bfa2
test(metal): Qwen3-0.6B verified on Apple GPU — matches HF + CUDA
TheTom Jun 4, 2026
c4bc34d
feat(rust/models): QKV bias + auto-config load_hf — dense LLM family
TheTom Jun 4, 2026
ab7bf0e
test: dense-LLM family verified on BOTH platforms + verification matrix
TheTom Jun 4, 2026
ad91b7f
feat(rust/ops): strided rms_norm fallback — SmolLM2 (non-Qwen) verifi…
TheTom Jun 4, 2026
49f7b00
feat(rust/models): MoE feed-forward builder — verified both platforms
TheTom Jun 4, 2026
8d60e67
docs: DeepSeek-V4-Flash port spec (groundwork) + verification matrix
TheTom Jun 4, 2026
ecc307e
feat(rust/ops): DSv4 partial RoPE — first MLA primitive, both platforms
TheTom Jun 4, 2026
4d37700
feat(rust/ops): DSv4 d512 sink-SDPA — second MLA primitive, both plat…
TheTom Jun 4, 2026
adcd489
feat(rust/ops): DSv4 MoE ops — swiglu_limit (GPU) + sqrtsoftplus rout…
TheTom Jun 4, 2026
e225510
feat(rust/ops): DSv4 mHC collapse + expand — both platforms
TheTom Jun 4, 2026
7ebed13
feat(rust/ops): DSv4 mHC sinkhorn split (host) — DSv4 op set complete…
TheTom Jun 4, 2026
ae2d3f6
feat(rust/models): DSv4 MLA attention composite — verified both platf…
TheTom Jun 4, 2026
84b82a7
feat(rust/models): DSv4 MoE feed-forward composite — both platforms
TheTom Jun 4, 2026
66670a5
feat(rust/models): full DSv4 attention layer (mHC + MLA) — both platf…
TheTom Jun 4, 2026
f0dfe7c
feat(rust/loader): GGUF v3 reader + Q8_0/F16/F32 dequant (mmap)
TheTom Jun 4, 2026
51a109c
feat(rust/loader): Q2_K dequant — matches gguf-py
TheTom Jun 4, 2026
44605a9
feat(rust/loader): IQ2_XXS dequant — GGUF loader complete for DSv4
TheTom Jun 4, 2026
fe428ce
test(moe): real Qwen2-MoE block verified vs HF on both platforms
TheTom Jun 4, 2026
7f9907e
feat(rust/ops): Mamba2 SSD selective-scan step (ssm_step) — both plat…
TheTom Jun 4, 2026
e3870e7
feat(rust/ops): causal conv1d step — both platforms (Mamba2 short-con…
TheTom Jun 4, 2026
c3dc3bb
feat: full real Mamba2-130m forward verified vs HF — both platforms
TheTom Jun 4, 2026
1304bb7
MoE: full real OLMoE-1B-7B verified vs HF on Metal+CUDA (argmax 310)
TheTom Jun 4, 2026
74a443a
VLM: full real SigLIP vision tower verified vs HF on Metal+CUDA
TheTom Jun 4, 2026
705b382
Audio: full real Whisper-tiny encoder verified vs HF on Metal+CUDA
TheTom Jun 4, 2026
5e077de
Verify 3 distinct LLM arch paths vs HF: GPT-2, Pythia, Gemma-2
TheTom Jun 4, 2026
276436f
Verify Phi-1.5 + Falcon-H1 (hybrid) vs HF on both platforms
TheTom Jun 4, 2026
acd016d
Audio end-to-end: full Whisper-base STT (encoder→decoder) vs HF both …
TheTom Jun 4, 2026
7538a5e
VLM stitch: SmolVLM (Idefics3) connector verified vs HF both platforms
TheTom Jun 4, 2026
b014de0
Causal multi-token prefill verified vs HF (GPT-2, 8 tokens)
TheTom Jun 4, 2026
3ea1a26
Llama RoPE-at-position causal prefill verified vs HF (SmolVLM text mo…
TheTom Jun 4, 2026
e2959eb
GPT-2 greedy decode (30 tok) matches HF generate() exactly — decode +…
TheTom Jun 4, 2026
e857155
Perf: incremental KV-cache decode + device-resident weights + tok/s
TheTom Jun 4, 2026
8c4f33a
Coverage +3 architectures vs HF (Metal): OLMo-2, StableLM-2, GPT-Neo
TheTom Jun 4, 2026
28e1d9e
All 17 archs + prefill/decode/kvcache confirmed on CUDA; record tok/s
TheTom Jun 4, 2026
69c1a42
Decode: device-resident activations (on-device residuals) + bottlenec…
TheTom Jun 4, 2026
08493be
docs/PERF.md: root-cause the decode bottleneck (Metal host-shadow shim)
TheTom Jun 4, 2026
6f8bf0f
Merge remote-tracking branch 'origin/tom/feat/tq-plus-port-aura' into…
TheTom Jun 4, 2026
592d0fc
bench: per-dispatch overhead micro-bench (Rust -> Metal) + decompose …
TheTom Jun 4, 2026
ca74c1f
perf(metal): resident-buffer fast path — weights upload once, not per…
TheTom Jun 4, 2026
a2195aa
bench: Swift vs Rust per-dispatch on Metal — settles the FFI-overhead…
TheTom Jun 4, 2026
0adca95
cleanup: apply Rust-playbook hot-path + hygiene wins
TheTom Jun 4, 2026
c94d9b6
docs(dsv4): reverse-engineered real-checkpoint layout + CUDA verifica…
TheTom Jun 4, 2026
9657bf8
dedup: shared ffai-modeltests crate — model forward written ONCE, bot…
TheTom Jun 4, 2026
6284b82
dedup: all 10 model forwards shared in ffai-modeltests; one test file…
TheTom Jun 4, 2026
7ee9a1d
share: ffai-runtime — sampling + generation loop, pure & backend-free
TheTom Jun 4, 2026
418422b
feat(loader): capability probe — derive what a model CAN do from tens…
TheTom Jun 5, 2026
79d8a0b
NemotronH-Nano 30B-A3B resident decode on CUDA/GB10
TheTom Jun 5, 2026
743c0e2
docs(readme): embed architecture diagram
TheTom Jun 5, 2026
905bc76
docs(readme): refresh architecture diagram (mark GB10 backend new)
TheTom Jun 5, 2026
8dac513
docs(readme): add scope & naming note (multi-backend rename under dis…
TheTom Jun 5, 2026
27b3973
feat(nemotron): graph decode, Q4 weight cache, MoE rpt=2
TheTom Jun 5, 2026
32af4a6
feat(nemotron): batched prefill forward — 80->432 tok/s @ S=512 (~5x)
TheTom Jun 6, 2026
8332cef
feat(nemotron): grouped-MoE GEMM (NEMOTRON_GROUPED_GEMM) — +6-17% lar…
TheTom Jun 6, 2026
64aabf2
feat(nemotron): W4A16 Marlin-style MoE GEMM (NEMOTRON_W4A16) — beats …
TheTom Jun 6, 2026
fe00e55
feat(nemotron): Marlin permuted-Q4 W4A16 MoE GEMM (NEMOTRON_W4A16_MAR…
TheTom Jun 6, 2026
0e72ef8
feat(nemotron): 128x128 W4A16 MoE tile (large-S) + occupancy finding
TheTom Jun 6, 2026
9d7ccf7
fix(nemotron): deep-context KV-cap IMA + gated prefill infra
TheTom Jun 6, 2026
56efa25
fix(nemotron): conv-state double-shift in all-device decode step
TheTom Jun 6, 2026
d80ea21
feat(nemotron): tensor-core flash-attn + SSD matmul scan (both gated)
TheTom Jun 6, 2026
50d37ce
merge: tensor-core flash-attn + SSD matmul scan levers
TheTom Jun 6, 2026
7762f01
feat(nemotron): depth auto-select for tensor-core flash-attn
TheTom Jun 6, 2026
6d950a9
feat(nemotron): batched prefill on Metal backend (backend-gated) + ML…
TheTom Jun 6, 2026
06ffdc7
merge: Nemotron batched prefill on Metal backend (backend-gated, +MLX…
TheTom Jun 6, 2026
3962791
perf(nemotron): fix 1-thread-per-block launch geometry on prefill ele…
TheTom Jun 6, 2026
fb179de
merge: fix 1-thread-per-block launch geometry on prefill kernels (+7-…
TheTom Jun 6, 2026
28b98e1
feat(nemotron): portable SSD matmul scan (NEMOTRON_SSD_PORTABLE)
TheTom Jun 6, 2026
35ce749
merge: portable SSD matmul scan (runs on all 4 backends)
TheTom Jun 6, 2026
e9009ab
perf(nemotron): NEMOTRON_FEWER_SYNCS — collapse MoE host readbacks (1…
TheTom Jun 6, 2026
09297ee
merge: NEMOTRON_FEWER_SYNCS — collapse MoE readbacks (+26% prefill)
TheTom Jun 6, 2026
45efa4a
feat(ffai-vulkan): real VulkanDevice — op-validated on RDNA4 (4176/41…
TheTom Jun 6, 2026
94967fe
feat(nemotron): f16 expert compute default-on for Metal prefill
TheTom Jun 6, 2026
0e6499b
merge: f16 expert compute default-on for Metal prefill (+15-27%)
TheTom Jun 6, 2026
c90ab67
feat(ffai): load + run standard GGUF small models (Qwen2.5) on the sh…
TheTom Jun 6, 2026
3620e8f
feat(nemotron): deterministic moe_scatter_add (no atomics), default-o…
TheTom Jun 7, 2026
5788981
feat(gguf): resident-Q8 decode (gemv_q8) — 3.56x smaller upload, fast…
TheTom Jun 7, 2026
5bf7f09
feat(gguf): Q4_K/Q5_K/Q6_K dequant + split-GGUF + f16-subnormal fix
TheTom Jun 7, 2026
bbcc745
feat(nemotron): FEWER_SYNCS default-on (race fixed, deterministic, +26%)
TheTom Jun 7, 2026
df47226
perf(nemotron): dequant_q4 dispatch for 1-thread-per-word kernel (n_w…
TheTom Jun 7, 2026
4a0a311
perf(kernel): launch-geometry fixes for 4 Grid3D kernels (byte-identi…
TheTom Jun 7, 2026
eb9abdd
feat(gguf): Phi-3 support (fused-tensor split + SentencePiece tokenizer)
TheTom Jun 7, 2026
a9992e6
test(gguf): single regression gate over all models x {f32,resident-Q8…
TheTom Jun 7, 2026
d7943fc
fix(ffai-ops): defensive bounds check in dequant_q4_off (catch caller…
TheTom Jun 7, 2026
cd82d9b
fix(ffai-ops): bounds checks for host-offset sub-slab ops (slice, str…
TheTom Jun 7, 2026
766c82e
fix(cuda): pin CUDA context in CachedModule so cuModuleUnload outlive…
TheTom Jun 7, 2026
ceeb478
feat(ffai-vulkan): resident-buffer consumer (device-local weights) — …
TheTom Jun 7, 2026
9a7f10f
perf(nemotron): grouped Q4 MoE GEMM for Metal prefill (NEMOTRON_MOE_G…
TheTom Jun 7, 2026
8c080bf
perf(nemotron): on-device MoE gather/scatter for CUDA prefill (defaul…
TheTom Jun 7, 2026
8cb67f4
perf(nemotron): on-device MoE gather/scatter default-on for CUDA (+70…
TheTom Jun 7, 2026
699a34e
perf(ssd): wire fused SSD scan (drop gather_bc 8x + mmask round-trip)
TheTom Jun 7, 2026
672346d
perf(nemotron): SSD chunked-matmul scan DEFAULT-ON for CUDA + fused p…
TheTom Jun 7, 2026
07ebb1b
feat(ffai-vulkan): coopmat-accelerated Q8 prefill GEMM path — +166-17…
TheTom Jun 7, 2026
6bd622f
fix(nemotron): stop Marlin layout aliasing standard MoE weights (W4A1…
TheTom Jun 7, 2026
16ccd30
test(moe): straddle + n_out-not-div128 guard — moe_w4a16/bgemm bit-ex…
TheTom Jun 7, 2026
09764f6
perf(nemotron): default-on TC flash-attn for prefill (s>=512) — +7% d…
TheTom Jun 7, 2026
01005e1
feat(ffai-ops): varlen tensor-core FlashAttention for packed prefill
TheTom Jun 7, 2026
0025c22
feat(ffai-ops): varlen SSD prefill scan — per-segment state reset
TheTom Jun 7, 2026
4173a4b
feat(nemotron): packed multi-sequence prefill wiring (NEMOTRON_PACKED=N)
TheTom Jun 7, 2026
5262103
feat(ffai-ops): segment-skip in varlen flash-attn — O((NL)^2)->O(NL^2)
TheTom Jun 7, 2026
2c723b7
perf(nemotron): f16-direct MoE scatter — drop per-layer cast_f16_f32
TheTom Jun 7, 2026
b35b1e3
perf(nemotron): fused f32-output projection GEMM (drop per-proj cast)
TheTom Jun 7, 2026
f0ce442
feat(nemotron): fused single-kernel causal FlashAttention (v1, scalar…
TheTom Jun 8, 2026
3b6c8a7
feat(nemotron): wmma tensor-core fused FlashAttention (v2)
TheTom Jun 8, 2026
54808c4
perf(nemotron): FlashAttention-2 register-O kernel beats cuBLAS atten…
TheTom Jun 8, 2026
07dd8b9
feat(nemotron): grouped MoE GEMM kernel (mma.sync, v1 foundation)
TheTom Jun 8, 2026
d078ee5
perf(nemotron): grouped MoE GEMM v2 — multi-warp + BK=64 K-staging
TheTom Jun 8, 2026
b8ef243
perf(nemotron): grouped MoE GEMM v3 — cp.async double-buffered (+44%)
TheTom Jun 8, 2026
16f1302
feat(nemotron): CUTLASS grouped MoE GEMM callable from ffai stack
TheTom Jun 8, 2026
45e858d
feat(nemotron): wire CUTLASS grouped MoE GEMM into model (NEMOTRON_CU…
TheTom Jun 8, 2026
95392ed
feat(nemotron): on-device MoE route + counting-sort (kills host triples)
TheTom Jun 8, 2026
b345458
perf(nemotron): on-device MoE router+sort in prefill (default-on, +10%)
TheTom Jun 8, 2026
d4ecdf0
perf(nemotron): STABLE on-device MoE scatter, drop host re-sort
TheTom Jun 8, 2026
dfe1ddc
feat(nemotron): Q4-native grouped MoE GEMM (moe_q4_grouped_mma) — +35…
TheTom Jun 8, 2026
3dde2c9
perf(nemotron): Q4 grouped MoE default-on via exact-early hybrid — +2…
TheTom Jun 8, 2026
4348df9
perf(nemotron): magic-0x6400 all-f16 dequant in Q4 grouped MoE GEMM (…
TheTom Jun 8, 2026
6685663
feat(nemotron): on-device Q4 MoE tile descriptors (graph-safe foundat…
TheTom Jun 8, 2026
14ee882
feat(nemotron): fully on-device Q4 MoE (device gather+scatter) — host…
TheTom Jun 8, 2026
3f321ee
wip(nemotron): host-sync-free Q4 MoE (skip triples) — capture prereq,…
TheTom Jun 8, 2026
2858d92
feat(nemotron): deterministic on-device MoE scatter (moe_scatter_add_…
TheTom Jun 8, 2026
a1bdb99
feat(nemotron): host-free shared-expert under devdesc (graph-capture …
TheTom Jun 8, 2026
ccd9eb8
wip(nemotron): CUDA-graph capture of the full prefill forward — captu…
TheTom Jun 8, 2026
0f8f4b6
perf(nemotron): on-device MoE prefill DEFAULT-ON — ~2x (kills host-br…
TheTom Jun 8, 2026
9f36425
feat(nemotron): quantize_nvfp4 — NVFP4 (E2M1+UE4M3 block-16+fp32 glob…
TheTom Jun 8, 2026
d0a841c
feat(nemotron): NEMOTRON_FP4_SIM + dequantize_nvfp4 — argmax 1104 sur…
TheTom Jun 8, 2026
d43dfc5
docs(nemotron): NVFP4 reference harnesses — validated mxf4nvf4 mma la…
TheTom Jun 8, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
140 changes: 140 additions & 0 deletions DECODE_OPTIMIZATION_RESEARCH.md

Large diffs are not rendered by default.

66 changes: 66 additions & 0 deletions PROFILING_32K_DECODE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
# Nemotron-Nano-30B-A3B — 32K Decode Profiling Map (GB10 / ASUS GX10)

**Goal:** 75+ tok/s decode @ 32,768 ctx. **Current:** 68.2 tok/s graph-batched (14.63 ms/token), full-quality Q4, argmax 1234.

**Config:** `NEMOTRON_FAKECTX=32768 NEMOTRON_GRAPH=1 NEMOTRON_DEVROUTER=1 NEMOTRON_Q4CACHE=1 NEMOTRON_F16KV=1`, GB10 sm_121, LPDDR5X 273 GB/s peak.

**Method:** per-op CUDA-event profiler (`NEMOTRON_PROFILE=1`, eager, sync-around-each-op). ⚠️ The profiler synchronizes per op, so **ms/tok is sync-inflated for tiny elementwise ops** (silu/rope/rms_norm/conv read ~0 bytes → their "8 ms" is sync overhead, real ≈0). **Trust the `GB/s` column and the ablation, not the per-op ms.**

## The reference: achievable bandwidth = ~189 GB/s
`lm_head` is a big contiguous Q4 GEMV and runs at **187 GB/s ≈ 99% of the achievable roofline**. So **189 GB/s is reachable on this hardware** — any kernel below it has headroom that is *kernel efficiency*, not a hardware wall.

## Per-kernel map (real, bandwidth-bound work)

| kernel | GB read/tok | eff GB/s | % of 189 achievable | calls/tok | verdict |
|---|---|---|---|---|---|
| **lm_head** | 0.176 | **187** | **99%** | 1 | ✅ saturated — no headroom |
| m_in_proj (Mamba) | 0.319 | 144 | 76% | 23 | 🟡 mild headroom |
| moe_gather_up | 0.345 | 140 | 74% | 23 | 🟡 **scatter penalty** |
| moe_gather_down | 0.346 | 134 | 71% | 23 | 🟡 **scatter penalty** |
| m_out_proj (Mamba) | 0.127 | 120 | 64% | 23 | 🟡 headroom |
| shared_up_q4 | 0.115 | 101 | 53% | 23 | 🔴 **single-warp, big headroom** |
| shared_down_acc | 0.115 | 90 | 48% | 23 | 🔴 **single-warp, big headroom** |
| sdpa_2pass | 0.202 | 78 | 41% | 6 | 🟡 KV-read, latency-bound |
| silu/rope/rms_norm/conv/ssm/router | ~0 | — | — | 12–52 | sync-artifact, real ≈0 ms |

**host overhead:** 0.75 ms/tok (eager; graphs remove most of it).

**Confirmation run (warmer box) — ranking reproduced, efficiencies thermal-sensitive:**
`lm_head` **186.8 GB/s = 99%** (stable), moe_gather_up/down **119/115 = 63%/61%**, m_in_proj **121 = 64%**, m_out_proj **118 = 63%**, shared_up_q4 **77 = 41%**, shared_down_acc **68 = 36%**, sdpa_2pass **58 = 31%**. Absolute % drifts with temperature but `lm_head`≈99% and the under-performer ranking are invariant. **Headroom is real and likely *larger* than the first table suggests.**

## Category ablation (skip-based ground truth)
- **MoE total: 7.3 ms (45.6%)** — gather up/down + gate + shared experts
- **Attn + lm_head + norms: 6.3 ms (39.4%)** — q/k/v/o proj, sdpa, lm_head
- **Mamba: 2.4 ms (15.0%)** — in/out proj, conv1d, ssm_step

## KEY INSIGHT — this is NOT a hardware wall
`lm_head` proves 189 GB/s is achievable. The dominant kernels run well under it:
- **shared-expert (up 53% / down 48%)**: weakest. Cause = single-warp-per-row (no multi-warp `rows_per_tg`). Fix = mirror the multi-warp coalesced kernel → target 80%+.
- **moe_gather up/down (74%/71%)**: the top-6-of-128 **dynamic scatter** (6 experts at different offsets) costs ~25% vs lm_head's contiguous read. Levers: cache-streaming hints (`ld.global.cs`), better expert-block locality, or compaction.
- **m_out_proj (64%) / m_in_proj (76%)**: mild headroom.

## Path-to-75 math (clean, no precision loss)
Token = 14.63 ms. Pulling the underperformers toward the proven 189 GB/s:
- shared-expert 0.23 GB @ ~95 → @ 170 GB/s: saves ~1.0 ms
- moe_gather 0.69 GB @ ~137 → @ 175 GB/s: saves ~1.0 ms
- → ~12.6 ms = **~79 tok/s**. **75 is reachable on efficiency alone.**

## Ruled out (measured, no gain)
- SDPA 2-pass BC4 / TILED variants: **much worse** (56 ms vs 14 ms)
- SDPA split-K block sweep (64–512): flat
- `MT_MOE_RPT` 1–4 on gather: flat (gather bottleneck is scatter, not warp count)
- `--use_fast_math`: no change
- `MT_GEMV_2ROW`, `MT_GEMV_VEC`: crash
- uint4 vectorized loads: −34% (starves the pipe)
- f16-KV: +2 tok/s (banked)

## Ranked opportunities (next work)
1. **shared_up_q4 + shared_down_acc → multi-warp (`rows_per_tg`)** — 48–53% → 80%+, est. **−0.4–0.5 ms/tok**. Lowest risk, clearest headroom. *(blocked earlier by f32-vs-f16 scale type mismatch in the accum fusion — fix the dtype.)*
2. **moe_gather scatter efficiency** — 71–74% → 85%+. Try `ld.global.cs` cache-streaming hints (inline PTX) + expert-block locality. Biggest share of the token (45%), so highest absolute payoff if the scatter penalty is partly cache-pollution.
3. **m_out_proj** (64%) — multi-warp / config tune.

## Banked optimizations (in the 68.2)
CUDA graphs (+6.5%), Q4 disk cache (setup 120 s→20 s), MoE rpt2 default, parallel dequant, f16-KV, FMAD-on, `__ldg`/`__restrict__`/`__expf` codegen.

---
*Generated from the in-tree `NEMOTRON_PROFILE=1` per-op profiler (ffai-modeltests/src/lib.rs). Re-run: `NEMOTRON_PROFILE=1 NEMOTRON_FAKECTX=32768 NEMOTRON_DECODE=24 ... nemotron_decode_bench`.*
26 changes: 26 additions & 0 deletions PROFILING_PREFILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# Nemotron-Nano-30B BATCHED PREFILL — per-op profiling map

- Device: GB10 sm_121 (GB10 Blackwell)
- S (prompt tokens): 2048
- Clean batched throughput: **74.3 tok/s** (13.46 ms/tok)
- Profiled pass wall (sync-bracketed, inflated): 28.465s; summed op time: 13.536s
- vLLM-on-GB10 reference: pp2048@d0=6395, @d8192=4993, @d32768=2734 tok/s
- Tensor-core peak assumed: 1000 TFLOP/s (bf16 dense)

| op | ms | % | calls | TFLOP/s | %peak |
|---|---:|---:|---:|---:|---:|
| moe_experts | 7137.65 | 52.7% | 69 | 0.790 | 0.08% |
| proj_gemm | 3764.71 | 27.8% | 70 | 1.121 | 0.11% |
| moe_shared | 1680.60 | 12.4% | 46 | 1.119 | 0.11% |
| ssm_scan | 673.69 | 5.0% | 23 | 0.147 | 0.01% |
| sdpa_prefill | 202.75 | 1.5% | 6 | 1.017 | 0.10% |
| moe_router | 32.37 | 0.2% | 23 | 1.001 | 0.10% |
| slice/cast | 31.36 | 0.2% | 140 | — | — |
| rms_norm | 12.01 | 0.1% | 52 | — | — |
| lm_head | 1.01 | 0.0% | 1 | 0.699 | 0.07% |

## Gap analysis
- `proj_gemm`/`lm_head` running far below %peak → projection GEMMs not at tensor-core roofline (f32 matmul, not bf16-MMA; dequant overhead separate).
- `ssm_scan` high % → the sequential-in-T `ssm_step_record` is the Mamba bottleneck → Milestone B: chunked/parallel SSD scan.
- `moe_experts`/`moe_shared` high % with many calls → per-token MoE gather loop → Milestone B: Q4 batched-expert GEMM over S.
- `host_conv` time is CPU (host-bridged) → move causal conv on-device for S-batched.
10 changes: 10 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,16 @@ A minimal, dependency-light LLM inference library for Apple Silicon, built on pr

**Just really f*cking fast AI on your Mac!** 🚀

## Architecture

FFAI is a Rust + Swift inference engine spanning 35 model families, with resident decode across four GPU backends (Apple Metal, NVIDIA CUDA, AMD HIP, and Vulkan) via [metaltile](https://github.com/thewafflehaus/metaltile)'s `#[kernel]` DSL. The diagram below traces the engine stack from model loading through the per-token dispatch loop down to the shared kernel layer.

![FFAI architecture](docs/architecture.png)

## Scope & naming

FFAI began as an Apple/Metal-focused inference engine. It now runs across NVIDIA (GB10), AMD, and Vulkan-class GPUs via metaltile, so the Apple-specific name no longer reflects the multi-backend scope. A rename is under discussion to match the broadened reach — the candidate name is still TBD and nothing is decided. The current name (FFAI) continues to apply until any rename is settled.

## Status

Early bootstrap — the dense-text, hybrid, vision-language, and audio model waves have all landed; end-to-end inference runs real HuggingFace checkpoints across every shipped family.
Expand Down
174 changes: 174 additions & 0 deletions Sources/FFAI/Device.swift
Original file line number Diff line number Diff line change
Expand Up @@ -46,14 +46,188 @@ public final class Device: @unchecked Sendable {
self.commandQueue = commandQueue
}

// ─── Scratch slab — generic transient-buffer allocator ────────────
//
// `Device.makeBuffer` is the default path for persistent buffers.
// For transients that live for the duration of a forward sub-block
// — and would otherwise hammer Metal's internal driver pool with
// hundreds of `makeBuffer(length:)` calls per token — there's a
// **scratch slab**: a single pre-allocated `MTLBuffer` that callers
// slice into via offset bumps. `device.allocScratch(bytes:)` returns
// `(buffer, offset)`; `Tensor.scratch(shape:dtype:)` wraps the slice
// as a Tensor; `device.resetScratch()` rewinds the offset to 0.
//
// Wrap a sub-block in `device.withScratch { ... }`: it flips
// `scratchModeActive` on (so plain `Tensor.empty` routes through the
// slab) and rewinds the offset at scope exit. State that CARRIES
// OVER between scratch scopes (e.g., the mHC 4-channel residual)
// must NOT live in scratch — allocate it with the default
// `Device.makeBuffer` instead.
public var scratchSlabBytes: Int = 256 * 1024 * 1024 // 256 MB cap
private var scratchBuffer: MTLBuffer?
private var scratchOffset: Int = 0

/// When `true`, `Tensor.empty(...)` routes through the scratch slab
/// instead of allocating a fresh MTLBuffer. Set by
/// `withScratch { ... }` so callers don't need to switch every
/// allocation site over to `Tensor.scratch` explicitly.
public var scratchModeActive: Bool = false

// ─── Allocation counters (diagnostic) ────────────────────────────
public var bufferAllocCount: Int = 0
public var bufferAllocBytes: Int = 0
public var scratchAllocCount: Int = 0
public var scratchAllocBytes: Int = 0

// ─── Dequant-intermediate scratch (persistent reusable buffer) ────
//
// GGUF dequant kernels need 1-2 large transient buffers per call
// (e.g., IQ2_XXS expert tensor: ~524 MB qs intermediate + ~32 MB
// d_f32 scales). Caller commits + waits the dequant cmd buffer
// BEFORE returning, so the intermediate is safely reusable
// across calls. These slabs grow lazily to the largest size
// requested.
private var dequantIntermediateBuffers: [String: MTLBuffer] = [:]
private let scratchLock = NSLock()

/// Returns a pre-allocated MTLBuffer ≥ `minBytes` keyed by `tag`.
/// Thread-safe: multiple parallel staging tasks may call with
/// distinct slot-keyed tags concurrently.
public func intermediateScratch(tag: String, minBytes: Int) -> MTLBuffer {
scratchLock.lock()
defer { scratchLock.unlock() }
let need = max(minBytes, 64)
if let buf = dequantIntermediateBuffers[tag], buf.length >= need {
return buf
}
let alloc = max(need, (dequantIntermediateBuffers[tag]?.length ?? 0) * 2)
guard let buf = mtlDevice.makeBuffer(length: alloc, options: .storageModeShared) else {
fatalError("Device.intermediateScratch: failed to allocate \(alloc)-byte slab")
}
dequantIntermediateBuffers[tag] = buf
return buf
}

/// Process RSS in KB via a `ps` shell-out. Slow (~10 ms per call)
/// but works without entitlements. Use sparingly — only at
/// per-sub-block instrumentation points.
public static func currentRssKB() -> Int {
let pid = ProcessInfo.processInfo.processIdentifier
let task = Process()
task.launchPath = "/bin/ps"
task.arguments = ["-o", "rss=", "-p", "\(pid)"]
let pipe = Pipe()
task.standardOutput = pipe
do { try task.run() } catch { return -1 }
task.waitUntilExit()
let data = pipe.fileHandleForReading.readDataToEndOfFile()
let s =
String(data: data, encoding: .utf8)?
.trimmingCharacters(in: .whitespacesAndNewlines) ?? "0"
return Int(s) ?? 0
}

/// Allocate `bytes` from the scratch slab (lazily creating the slab
/// on first use). 16-byte aligned. Fatal if the slab overflows —
/// caller should size `scratchSlabBytes` to fit one sub-block of
/// transients.
public func allocScratch(bytes: Int) -> (buffer: MTLBuffer, offset: Int) {
if scratchBuffer == nil {
scratchBuffer = mtlDevice.makeBuffer(
length: scratchSlabBytes, options: .storageModeShared)
guard scratchBuffer != nil else {
fatalError("Device.allocScratch: failed to allocate \(scratchSlabBytes)-byte slab")
}
}
let aligned = (scratchOffset + 15) & ~15
if aligned + bytes > scratchSlabBytes {
fatalError(
"Device.allocScratch: slab overflow — needed \(aligned + bytes), have \(scratchSlabBytes). Caller should resetScratch() between sub-blocks or grow scratchSlabBytes."
)
}
scratchOffset = aligned + bytes
scratchAllocCount += 1
scratchAllocBytes += bytes
return (scratchBuffer!, aligned)
}

/// Reset the scratch slab offset to 0. **Every Tensor sliced into
/// the slab via `Tensor.scratch(...)` becomes invalid after this
/// call** — all sub-block-local transients must be done with.
public func resetScratch() {
scratchOffset = 0
}

/// Convenience scope wrapper — runs the body with
/// `scratchModeActive = true` (so `Tensor.empty` transparently
/// uses the scratch slab), then resets the slab at scope exit.
/// Any Tensor sliced into the slab inside the body is INVALID
/// once `body` returns — carry-over state must be copied to a
/// persistent buffer (or allocated via `Tensor.empty` while
/// `scratchModeActive == false`) before the scope exits.
public func withScratch<T>(_ body: () throws -> T) rethrows -> T {
let wasActive = scratchModeActive
scratchModeActive = true
defer {
if !wasActive {
scratchModeActive = false
resetScratch()
}
}
return try body()
}

/// Allocate a fresh shared-storage MTLBuffer of the given byte length.
public func makeBuffer(length: Int) -> MTLBuffer {
guard let buf = mtlDevice.makeBuffer(length: length, options: .storageModeShared) else {
fatalError("Device.makeBuffer(length: \(length)) returned nil")
}
bufferAllocCount += 1
bufferAllocBytes += length
return buf
}

/// Ensure the scratch slab is at least `bytes`, reallocating if needed.
/// SAFE ONLY when no scratch slices are live (`scratchOffset == 0`) —
/// call at the top of a forward pass before any `allocScratch`. The slab
/// is a single reused buffer (not a per-call allocation), so growing it
/// for a large prefill chunk is bounded, not a leak. Decode keeps 256 MB.
public func ensureScratchSlab(_ bytes: Int) {
if let buf = scratchBuffer, buf.length >= bytes { return }
precondition(
scratchOffset == 0,
"ensureScratchSlab: cannot resize with \(scratchOffset) bytes of live slices")
scratchSlabBytes = bytes
scratchBuffer = mtlDevice.makeBuffer(length: bytes, options: .storageModeShared)
guard scratchBuffer != nil else {
fatalError("ensureScratchSlab: failed to allocate \(bytes)-byte slab")
}
}

// Cache of 4-byte scalar-argument buffers, keyed by value. Kernel
// scalar args (rmsNorm eps, RoPE start/step, …) were allocating a
// fresh 4-byte MTLBuffer on EVERY op call — ~5 rmsNorms/layer ×
// 43 layers = ~220 tiny allocations per token. Over a long
// (e.g. 32k) decode that churned millions of buffers and eventually
// tripped `makeBuffer returned nil`. Scalars are ~constant, so cache
// one reusable buffer per value.
nonisolated(unsafe) private var scalarBufCache: [Float: MTLBuffer] = [:]
private let scalarBufLock = NSLock()
public func scalarBuffer(_ value: Float) -> MTLBuffer {
scalarBufLock.lock()
defer { scalarBufLock.unlock() }
if let b = scalarBufCache[value] { return b }
guard let b = mtlDevice.makeBuffer(length: 4, options: .storageModeShared) else {
fatalError("Device.scalarBuffer: makeBuffer(4) returned nil")
}
var v = value
memcpy(b.contents(), &v, 4)
scalarBufCache[value] = b
bufferAllocCount += 1
bufferAllocBytes += 4
return b
}

/// Make a new MTLCommandBuffer.
public func makeCommandBuffer() -> MTLCommandBuffer {
guard let cb = commandQueue.makeCommandBuffer() else {
Expand Down
52 changes: 51 additions & 1 deletion Sources/FFAI/KVCache/AURACodebook.swift
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@
// the coordinate distribution of unit-sphere vectors converges to a
// near-Gaussian, so a fixed Lloyd-Max table is near-optimal.
//
// The reference values here are mined from llama.cpp's `k_quants`
// The reference values here are mined from the reference C++ `k_quants`
// tables (empirically optimal for unit-norm Gaussian data at d=128)
// and scaled to other head dims by √(128 / dim) — a heuristic that
// approximates the analytic 1/√d Beta-variance scaling from the
Expand Down Expand Up @@ -246,6 +246,56 @@ public enum AURACodebook {
return base.map { $0 * scale }
}

/// Allocate a codebook tensor in the requested activation dtype.
/// AURA cache stores codebook in the same dtype as the model
/// activations so both encode + decode kernels (which take
/// `Tensor<T>` for the codebook) read directly with no per-call
/// cast. The Lloyd-Max values themselves are computed in Float;
/// narrow dtypes (`bf16`/`f16`) round at the CPU-side host conversion.
public static func centroidsTensor(
dim: Int, bits: Int, dtype: DType, device: Device = .shared
) -> Tensor {
let values = centroids(dim: dim, bits: bits)
return writeFloatsToTensor(values, shape: [values.count], dtype: dtype, device: device)
}

/// Allocate a boundaries tensor in the requested activation dtype.
/// Post-metaltile #226, `aura_encode` takes `boundaries: Tensor<T>`
/// — kernel-side bandwidth win (Π + boundaries dominate the encode
/// kernel's memory traffic). Lloyd-Max boundary values are computed
/// in Float; narrow dtypes (bf16/f16) round at the host-side
/// conversion. The bf16/f16 rounding (~1e-3) sits well below the
/// 2-4-bit quant bin so the matched-norm correction stays stable.
public static func boundariesTensor(
dim: Int, bits: Int, dtype: DType, device: Device = .shared
) -> Tensor {
let values = boundaries(dim: dim, bits: bits)
return writeFloatsToTensor(values, shape: [values.count], dtype: dtype, device: device)
}

/// CPU-side host conversion from `[Float]` into a tensor of the
/// requested float dtype. Used by `centroidsTensor` and any caller
/// that needs Lloyd-Max-precise values landed into narrow storage.
private static func writeFloatsToTensor(
_ values: [Float], shape: [Int],
dtype: DType, device: Device
) -> Tensor {
let t = Tensor.empty(shape: shape, dtype: dtype, device: device)
switch dtype {
case .f32:
t.copyIn(from: values)
case .f16:
t.copyIn(from: values.map { Float16($0) })
case .bf16:
t.copyIn(from: values.map { UInt16(truncatingIfNeeded: $0.bitPattern >> 16) })
default:
fatalError(
"AURACodebook.centroidsTensor: unsupported dtype \(dtype); "
+ "AURA cache supports f32 / f16 / bf16")
}
return t
}

/// Bytes-per-token after AURA packing at this bit width and dim.
/// `ceil(dim * bits / 32) * 4` for the packed u32 array, plus 4
/// bytes for the f32 per-token norm. Excludes any per-vector DC
Expand Down
Loading