thewafflehaus · TheTom · May 26, 2026
diff --git a/DECODE_OPTIMIZATION_RESEARCH.md b/DECODE_OPTIMIZATION_RESEARCH.md
diff --git a/PROFILING_32K_DECODE.md b/PROFILING_32K_DECODE.md
@@ -0,0 +1,66 @@
+# Nemotron-Nano-30B-A3B — 32K Decode Profiling Map (GB10 / ASUS GX10)
+
+**Goal:** 75+ tok/s decode @ 32,768 ctx. **Current:** 68.2 tok/s graph-batched (14.63 ms/token), full-quality Q4, argmax 1234.
+
+**Config:** `NEMOTRON_FAKECTX=32768 NEMOTRON_GRAPH=1 NEMOTRON_DEVROUTER=1 NEMOTRON_Q4CACHE=1 NEMOTRON_F16KV=1`, GB10 sm_121, LPDDR5X 273 GB/s peak.
+
+**Method:** per-op CUDA-event profiler (`NEMOTRON_PROFILE=1`, eager, sync-around-each-op). ⚠️ The profiler synchronizes per op, so **ms/tok is sync-inflated for tiny elementwise ops** (silu/rope/rms_norm/conv read ~0 bytes → their "8 ms" is sync overhead, real ≈0). **Trust the `GB/s` column and the ablation, not the per-op ms.**
+
+## The reference: achievable bandwidth = ~189 GB/s
+`lm_head` is a big contiguous Q4 GEMV and runs at **187 GB/s ≈ 99% of the achievable roofline**. So **189 GB/s is reachable on this hardware** — any kernel below it has headroom that is *kernel efficiency*, not a hardware wall.
+
+## Per-kernel map (real, bandwidth-bound work)
+
+| kernel | GB read/tok | eff GB/s | % of 189 achievable | calls/tok | verdict |
+|---|---|---|---|---|---|
+| **lm_head** | 0.176 | **187** | **99%** | 1 | ✅ saturated — no headroom |
+| m_in_proj (Mamba) | 0.319 | 144 | 76% | 23 | 🟡 mild headroom |
+| moe_gather_up | 0.345 | 140 | 74% | 23 | 🟡 **scatter penalty** |
+| moe_gather_down | 0.346 | 134 | 71% | 23 | 🟡 **scatter penalty** |
+| m_out_proj (Mamba) | 0.127 | 120 | 64% | 23 | 🟡 headroom |
+| shared_up_q4 | 0.115 | 101 | 53% | 23 | 🔴 **single-warp, big headroom** |
+| shared_down_acc | 0.115 | 90 | 48% | 23 | 🔴 **single-warp, big headroom** |
+| sdpa_2pass | 0.202 | 78 | 41% | 6 | 🟡 KV-read, latency-bound |
+| silu/rope/rms_norm/conv/ssm/router | ~0 | — | — | 12–52 | sync-artifact, real ≈0 ms |
+
+**host overhead:** 0.75 ms/tok (eager; graphs remove most of it).
+
+**Confirmation run (warmer box) — ranking reproduced, efficiencies thermal-sensitive:**
+`lm_head` **186.8 GB/s = 99%** (stable), moe_gather_up/down **119/115 = 63%/61%**, m_in_proj **121 = 64%**, m_out_proj **118 = 63%**, shared_up_q4 **77 = 41%**, shared_down_acc **68 = 36%**, sdpa_2pass **58 = 31%**. Absolute % drifts with temperature but `lm_head`≈99% and the under-performer ranking are invariant. **Headroom is real and likely *larger* than the first table suggests.**
+
+## Category ablation (skip-based ground truth)
+- **MoE total: 7.3 ms (45.6%)** — gather up/down + gate + shared experts
+- **Attn + lm_head + norms: 6.3 ms (39.4%)** — q/k/v/o proj, sdpa, lm_head
+- **Mamba: 2.4 ms (15.0%)** — in/out proj, conv1d, ssm_step
+
+## KEY INSIGHT — this is NOT a hardware wall
+`lm_head` proves 189 GB/s is achievable. The dominant kernels run well under it:
+- **shared-expert (up 53% / down 48%)**: weakest. Cause = single-warp-per-row (no multi-warp `rows_per_tg`). Fix = mirror the multi-warp coalesced kernel → target 80%+.
+- **moe_gather up/down (74%/71%)**: the top-6-of-128 **dynamic scatter** (6 experts at different offsets) costs ~25% vs lm_head's contiguous read. Levers: cache-streaming hints (`ld.global.cs`), better expert-block locality, or compaction.
+- **m_out_proj (64%) / m_in_proj (76%)**: mild headroom.
+
+## Path-to-75 math (clean, no precision loss)
+Token = 14.63 ms. Pulling the underperformers toward the proven 189 GB/s:
+- shared-expert 0.23 GB @ ~95 → @ 170 GB/s: saves ~1.0 ms
+- moe_gather 0.69 GB @ ~137 → @ 175 GB/s: saves ~1.0 ms
+- → ~12.6 ms = **~79 tok/s**. **75 is reachable on efficiency alone.**
+
+## Ruled out (measured, no gain)
+- SDPA 2-pass BC4 / TILED variants: **much worse** (56 ms vs 14 ms)
+- SDPA split-K block sweep (64–512): flat
+- `MT_MOE_RPT` 1–4 on gather: flat (gather bottleneck is scatter, not warp count)
+- `--use_fast_math`: no change
+- `MT_GEMV_2ROW`, `MT_GEMV_VEC`: crash
+- uint4 vectorized loads: −34% (starves the pipe)
+- f16-KV: +2 tok/s, the one gain in this list (already banked in the 68.2, so nothing further to collect)
+
+## Ranked opportunities (next work)
+1. **shared_up_q4 + shared_down_acc → multi-warp (`rows_per_tg`)** — 48–53% → 80%+, est. **−0.4–0.5 ms/tok**. Lowest risk, clearest headroom. *(blocked earlier by f32-vs-f16 scale type mismatch in the accum fusion — fix the dtype.)*
+2. **moe_gather scatter efficiency** — 71–74% → 85%+. Try `ld.global.cs` cache-streaming hints (inline PTX) + expert-block locality. Biggest share of the token (45%), so highest absolute payoff if the scatter penalty is partly cache-pollution.
+3. **m_out_proj** (64%) — multi-warp / config tune.
+
+## Banked optimizations (in the 68.2)
+CUDA graphs (+6.5%), Q4 disk cache (setup 120 s→20 s), MoE rpt2 default, parallel dequant, f16-KV, FMAD-on, `__ldg`/`__restrict__`/`__expf` codegen.
+
+---
+*Generated from the in-tree `NEMOTRON_PROFILE=1` per-op profiler (ffai-modeltests/src/lib.rs). Re-run: `NEMOTRON_PROFILE=1 NEMOTRON_FAKECTX=32768 NEMOTRON_DECODE=24 ... nemotron_decode_bench`.*
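The path-to-75 arithmetic in `PROFILING_32K_DECODE.md` can be re-derived with a small Rust sketch. The `Group` type and the function names below are illustrative and are not part of the in-tree profiler; the byte totals, bandwidths, and the 14.63 ms token time are the ones quoted in the note. The model assumes each kernel group is purely bandwidth-bound, so its time is bytes divided by effective bandwidth.

```rust
/// One bandwidth-bound kernel group (hypothetical type, for illustration only).
pub struct Group {
    /// Gigabytes read per decoded token.
    pub gb_per_tok: f64,
    /// Effective bandwidth measured today, in GB/s.
    pub measured_gbps: f64,
    /// Effective bandwidth targeted after the fix, in GB/s.
    pub target_gbps: f64,
}

/// Milliseconds needed to stream `gb` gigabytes at `gbps` GB/s.
pub fn stream_ms(gb: f64, gbps: f64) -> f64 {
    gb / gbps * 1000.0
}

/// Milliseconds saved per token if the group reaches its target bandwidth.
pub fn saved_ms(g: &Group) -> f64 {
    stream_ms(g.gb_per_tok, g.measured_gbps) - stream_ms(g.gb_per_tok, g.target_gbps)
}

/// Projected decode rate (tok/s) after subtracting every group's saving
/// from the current per-token time.
pub fn projected_tok_per_s(token_ms: f64, groups: &[Group]) -> f64 {
    let saved: f64 = groups.iter().map(saved_ms).sum();
    1000.0 / (token_ms - saved)
}

/// The two groups from the "Path-to-75 math" section:
/// shared-expert (0.23 GB @ ~95 -> 170 GB/s) and moe_gather (0.69 GB @ ~137 -> 175 GB/s).
pub fn doc_groups() -> [Group; 2] {
    [
        Group { gb_per_tok: 0.23, measured_gbps: 95.0, target_gbps: 170.0 },
        Group { gb_per_tok: 0.69, measured_gbps: 137.0, target_gbps: 175.0 },
    ]
}
```

Calling `projected_tok_per_s(14.63, &doc_groups())` with unrounded savings lands slightly above the note's ~79 tok/s, because the note rounds each saving down to ~1.0 ms. Either way the projection clears 75 tok/s under the bandwidth-bound assumption.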
diff --git a/PROFILING_PREFILL.md b/PROFILING_PREFILL.md
@@ -0,0 +1,26 @@
+# Nemotron-Nano-30B BATCHED PREFILL — per-op profiling map
+
+- Device: GB10 (Blackwell, sm_121)
+- S (prompt tokens): 2048
+- Clean batched throughput: **74.3 tok/s** (13.46 ms/tok)
+- Profiled pass wall (sync-bracketed, inflated): 28.465s; summed op time: 13.536s
+- vLLM-on-GB10 reference: pp2048@d0=6395, @d8192=4993, @d32768=2734 tok/s
+- Tensor-core peak assumed: 1000 TFLOP/s (bf16 dense)
+
+| op | ms | % | calls | TFLOP/s | %peak |
+|---|---:|---:|---:|---:|---:|
+| moe_experts | 7137.65 | 52.7% | 69 | 0.790 | 0.08% |
+| proj_gemm | 3764.71 | 27.8% | 70 | 1.121 | 0.11% |
+| moe_shared | 1680.60 | 12.4% | 46 | 1.119 | 0.11% |
+| ssm_scan | 673.69 | 5.0% | 23 | 0.147 | 0.01% |
+| sdpa_prefill | 202.75 | 1.5% | 6 | 1.017 | 0.10% |
+| moe_router | 32.37 | 0.2% | 23 | 1.001 | 0.10% |
+| slice/cast | 31.36 | 0.2% | 140 | — | — |
+| rms_norm | 12.01 | 0.1% | 52 | — | — |
+| lm_head | 1.01 | 0.0% | 1 | 0.699 | 0.07% |
+
+## Gap analysis
+- `proj_gemm`/`lm_head` run at roughly 0.1% of the assumed peak → projection GEMMs are nowhere near the tensor-core roofline (f32 matmul, not bf16-MMA; dequant overhead separate).
+- `ssm_scan` takes 5.0% of op time at the lowest throughput in the table (0.147 TFLOP/s) → the sequential-in-T `ssm_step_record` is the Mamba bottleneck → Milestone B: chunked/parallel SSD scan.
+- `moe_experts`/`moe_shared` together take 65.1% of op time across many calls → per-token MoE gather loop → Milestone B: Q4 batched-expert GEMM over S.
+- `host_conv` time is CPU (host-bridged) and has no row in the table above → move causal conv on-device for S-batched.
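The derived columns of the prefill table (share of op time, % of peak) and the headline tok/s figure follow from three one-line formulas. The sketch below is a minimal Rust restatement with hypothetical helper names, using the 1000 TFLOP/s peak that the note itself labels as an assumption.

```rust
/// Tokens per second for a given per-token latency in milliseconds.
pub fn tok_per_s(ms_per_tok: f64) -> f64 {
    1000.0 / ms_per_tok
}

/// Achieved throughput as a percentage of the assumed tensor-core peak.
pub fn pct_of_peak(tflops: f64, peak_tflops: f64) -> f64 {
    tflops / peak_tflops * 100.0
}

/// One op's share of the summed op time, as a percentage.
pub fn share_pct(op_ms: f64, total_ms: f64) -> f64 {
    op_ms / total_ms * 100.0
}
```

For example, `share_pct(7137.65, 13536.15)` reproduces the `moe_experts` share and `pct_of_peak(0.790, 1000.0)` its %peak entry; the total passed here is the sum of the table's `ms` column.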
diff --git a/README.md b/README.md
@@ -6,6 +6,16 @@ A minimal, dependency-light LLM inference library for Apple Silicon, built on pr
 
 **Just really f*cking fast AI on your Mac!** 🚀
 
+## Architecture
+
+FFAI is a Rust + Swift inference engine spanning 35 model families, with resident decode across four GPU backends (Apple Metal, NVIDIA CUDA, AMD HIP, and Vulkan) via [metaltile](https://github.com/thewafflehaus/metaltile)'s `#[kernel]` DSL. The diagram below traces the engine stack from model loading through the per-token dispatch loop down to the shared kernel layer.
+
+![FFAI architecture](docs/architecture.png)
+
+## Scope & naming
+
+FFAI began as an Apple/Metal-focused inference engine. It now runs across NVIDIA (GB10), AMD, and Vulkan-class GPUs via metaltile, so the Apple-specific name no longer reflects the multi-backend scope. A rename is under discussion to match the broadened reach — the candidate name is still TBD and nothing is decided. The current name (FFAI) continues to apply until any rename is settled.
+
 ## Status
 
 Early bootstrap — the dense-text, hybrid, vision-language, and audio model waves have all landed; end-to-end inference runs real HuggingFace checkpoints across every shipped family.

diff --git a/Sources/FFAI/Device.swift b/Sources/FFAI/Device.swift
@@ -46,14 +46,188 @@ public final class Device: @unchecked Sendable {
         self.commandQueue = commandQueue
     }
 
+    // ─── Scratch slab — generic transient-buffer allocator ────────────
+    //
+    // `Device.makeBuffer` is the default path for persistent buffers.
+    // For transients that live for the duration of a forward sub-block
+    // — and would otherwise hammer Metal's internal driver pool with
+    // hundreds of `makeBuffer(length:)` calls per token — there's a
+    // **scratch slab**: a single pre-allocated `MTLBuffer` that callers
+    // slice into via offset bumps. `device.allocScratch(bytes:)` returns
+    // `(buffer, offset)`; `Tensor.scratch(shape:dtype:)` wraps the slice
+    // as a Tensor; `device.resetScratch()` rewinds the offset to 0.
+    //
+    // Wrap a sub-block in `device.withScratch { ... }`: it flips
+    // `scratchModeActive` on (so plain `Tensor.empty` routes through the
+    // slab) and rewinds the offset at scope exit. State that CARRIES
+    // OVER between scratch scopes (e.g., the mHC 4-channel residual)
+    // must NOT live in scratch — allocate it with the default
+    // `Device.makeBuffer` instead.
+    public var scratchSlabBytes: Int = 256 * 1024 * 1024  // 256 MiB default; grown via ensureScratchSlab
+    private var scratchBuffer: MTLBuffer?
+    private var scratchOffset: Int = 0
+
+    /// When `true`, `Tensor.empty(...)` routes through the scratch slab
+    /// instead of allocating a fresh MTLBuffer. Set by
+    /// `withScratch { ... }` so callers don't need to switch every
+    /// allocation site over to `Tensor.scratch` explicitly.
+    public var scratchModeActive: Bool = false
+
+    // ─── Allocation counters (diagnostic) ────────────────────────────
+    public var bufferAllocCount: Int = 0
+    public var bufferAllocBytes: Int = 0
+    public var scratchAllocCount: Int = 0
+    public var scratchAllocBytes: Int = 0
+
+    // ─── Dequant-intermediate scratch (persistent reusable buffer) ────
+    //
+    // GGUF dequant kernels need 1-2 large transient buffers per call
+    // (e.g., IQ2_XXS expert tensor: ~524 MB qs intermediate + ~32 MB
+    // d_f32 scales). Caller commits + waits the dequant cmd buffer
+    // BEFORE returning, so the intermediate is safely reusable
+    // across calls. These slabs grow lazily to the largest size
+    // requested.
+    private var dequantIntermediateBuffers: [String: MTLBuffer] = [:]
+    private let scratchLock = NSLock()
+
+    /// Returns a pre-allocated MTLBuffer ≥ `minBytes` keyed by `tag`.
+    /// Thread-safe: multiple parallel staging tasks may call with
+    /// distinct slot-keyed tags concurrently.
+    public func intermediateScratch(tag: String, minBytes: Int) -> MTLBuffer {
+        scratchLock.lock()
+        defer { scratchLock.unlock() }
+        let need = max(minBytes, 64)
+        if let buf = dequantIntermediateBuffers[tag], buf.length >= need {
+            return buf
+        }
+        let alloc = max(need, (dequantIntermediateBuffers[tag]?.length ?? 0) * 2)
+        guard let buf = mtlDevice.makeBuffer(length: alloc, options: .storageModeShared) else {
+            fatalError("Device.intermediateScratch: failed to allocate \(alloc)-byte slab")
+        }
+        dequantIntermediateBuffers[tag] = buf
+        return buf
+    }
+
+    /// Process RSS in KB via a `ps` shell-out. Slow (~10 ms per call)
+    /// but works without entitlements. Returns -1 if `ps` cannot be launched
+    /// and 0 if its output cannot be parsed. Use sparingly, at per-sub-block instrumentation points only.
+    public static func currentRssKB() -> Int {
+        let pid = ProcessInfo.processInfo.processIdentifier
+        let task = Process()
+        task.executableURL = URL(fileURLWithPath: "/bin/ps")
+        task.arguments = ["-o", "rss=", "-p", "\(pid)"]
+        let pipe = Pipe()
+        task.standardOutput = pipe
+        do { try task.run() } catch { return -1 }
+        task.waitUntilExit()
+        let data = pipe.fileHandleForReading.readDataToEndOfFile()
+        let s =
+            String(data: data, encoding: .utf8)?
+            .trimmingCharacters(in: .whitespacesAndNewlines) ?? "0"
+        return Int(s) ?? 0
+    }
+
+    /// Allocate `bytes` from the scratch slab (lazily creating the slab
+    /// on first use). 16-byte aligned. Fatal if the slab overflows —
+    /// caller should size `scratchSlabBytes` to fit one sub-block of
+    /// transients.
+    public func allocScratch(bytes: Int) -> (buffer: MTLBuffer, offset: Int) {
+        if scratchBuffer == nil {
+            scratchBuffer = mtlDevice.makeBuffer(
+                length: scratchSlabBytes, options: .storageModeShared)
+            guard scratchBuffer != nil else {
+                fatalError("Device.allocScratch: failed to allocate \(scratchSlabBytes)-byte slab")
+            }
+        }
+        let aligned = (scratchOffset + 15) & ~15
+        if aligned + bytes > scratchSlabBytes {
+            fatalError(
+                "Device.allocScratch: slab overflow — needed \(aligned + bytes), have \(scratchSlabBytes). Caller should resetScratch() between sub-blocks or grow scratchSlabBytes."
+            )
+        }
+        scratchOffset = aligned + bytes
+        scratchAllocCount += 1
+        scratchAllocBytes += bytes
+        return (scratchBuffer!, aligned)
+    }
+
+    /// Reset the scratch slab offset to 0. **Every Tensor sliced into
+    /// the slab via `Tensor.scratch(...)` becomes invalid after this
+    /// call** — all sub-block-local transients must be done with.
+    public func resetScratch() {
+        scratchOffset = 0
+    }
+
+    /// Convenience scope wrapper — runs the body with
+    /// `scratchModeActive = true` (so `Tensor.empty` transparently
+    /// uses the scratch slab), then resets the slab at scope exit.
+    /// Any Tensor sliced into the slab inside the body is INVALID
+    /// once `body` returns — carry-over state must be copied to a
+    /// persistent buffer (or allocated via `Tensor.empty` while
+    /// `scratchModeActive == false`) before the scope exits.
+    public func withScratch<T>(_ body: () throws -> T) rethrows -> T {
+        let wasActive = scratchModeActive
+        scratchModeActive = true
+        defer {
+            if !wasActive {
+                scratchModeActive = false
+                resetScratch()
+            }
+        }
+        return try body()
+    }
+
     /// Allocate a fresh shared-storage MTLBuffer of the given byte length.
     public func makeBuffer(length: Int) -> MTLBuffer {
         guard let buf = mtlDevice.makeBuffer(length: length, options: .storageModeShared) else {
             fatalError("Device.makeBuffer(length: \(length)) returned nil")
         }
+        bufferAllocCount += 1
+        bufferAllocBytes += length
         return buf
     }
 
+    /// Ensure the scratch slab is at least `bytes`, reallocating if needed.
+    /// SAFE ONLY when no scratch slices are live (`scratchOffset == 0`) —
+    /// call at the top of a forward pass before any `allocScratch`. The slab
+    /// is a single reused buffer (not a per-call allocation), so growing it
+    /// for a large prefill chunk is bounded, not a leak. Decode keeps 256 MB.
+    public func ensureScratchSlab(_ bytes: Int) {
+        if let buf = scratchBuffer, buf.length >= bytes { return }
+        precondition(
+            scratchOffset == 0,
+            "ensureScratchSlab: cannot resize with \(scratchOffset) bytes of live slices")
+        scratchSlabBytes = bytes
+        scratchBuffer = mtlDevice.makeBuffer(length: bytes, options: .storageModeShared)
+        guard scratchBuffer != nil else {
+            fatalError("ensureScratchSlab: failed to allocate \(bytes)-byte slab")
+        }
+    }
+
+    // Cache of 4-byte scalar-argument buffers, keyed by value. Kernel
+    // scalar args (rmsNorm eps, RoPE start/step, …) were allocating a
+    // fresh 4-byte MTLBuffer on EVERY op call — ~5 rmsNorms/layer ×
+    // 43 layers = ~220 tiny allocations per token. Over a long
+    // (e.g. 32k) decode that churned millions of buffers and eventually
+    // tripped `makeBuffer returned nil`. Scalars are ~constant, so cache
+    // one reusable buffer per value.
+    nonisolated(unsafe) private var scalarBufCache: [Float: MTLBuffer] = [:]
+    private let scalarBufLock = NSLock()
+    public func scalarBuffer(_ value: Float) -> MTLBuffer {
+        scalarBufLock.lock()
+        defer { scalarBufLock.unlock() }
+        if let b = scalarBufCache[value] { return b }
+        guard let b = mtlDevice.makeBuffer(length: 4, options: .storageModeShared) else {
+            fatalError("Device.scalarBuffer: makeBuffer(4) returned nil")
+        }
+        var v = value
+        memcpy(b.contents(), &v, 4)
+        scalarBufCache[value] = b
+        bufferAllocCount += 1
+        bufferAllocBytes += 4
+        return b
+    }
+
     /// Make a new MTLCommandBuffer.
     public func makeCommandBuffer() -> MTLCommandBuffer {
         guard let cb = commandQueue.makeCommandBuffer() else {

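The offset arithmetic behind `allocScratch` / `resetScratch` is a plain bump allocator. The Rust sketch below models only the bookkeeping: there is no `MTLBuffer`, and `ScratchSlab` with its methods is a hypothetical name, not an FFAI type. It also returns `None` on overflow where the Swift code calls `fatalError`, purely so the behaviour is easy to exercise.

```rust
/// Offset-only model of a scratch slab: a fixed capacity and a bump pointer.
pub struct ScratchSlab {
    capacity: usize,
    offset: usize,
}

impl ScratchSlab {
    /// Create a slab model with `capacity` bytes and nothing allocated.
    pub fn new(capacity: usize) -> Self {
        ScratchSlab { capacity, offset: 0 }
    }

    /// Reserve `bytes` at the next 16-byte-aligned offset.
    /// Returns the slice's start offset, or `None` if it would not fit.
    pub fn alloc(&mut self, bytes: usize) -> Option<usize> {
        // Round the current offset up to a multiple of 16.
        let aligned = (self.offset + 15) & !15;
        let end = aligned.checked_add(bytes)?;
        if end > self.capacity {
            return None;
        }
        self.offset = end;
        Some(aligned)
    }

    /// Rewind to the start; every previously returned offset is now stale.
    pub fn reset(&mut self) {
        self.offset = 0;
    }
}
```

On a 64-byte slab, a 10-byte allocation followed by a 4-byte one yields offsets 0 and 16: the second slice skips the 6 padding bytes. A further 64-byte request fails until `reset()` is called, which is why carry-over state has to live outside the slab.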
diff --git a/Sources/FFAI/KVCache/AURACodebook.swift b/Sources/FFAI/KVCache/AURACodebook.swift
@@ -20,7 +20,7 @@
 // the coordinate distribution of unit-sphere vectors converges to a
 // near-Gaussian, so a fixed Lloyd-Max table is near-optimal.
 //
-// The reference values here are mined from llama.cpp's `k_quants`
+// The reference values here are mined from the reference C++ `k_quants`
 // tables (empirically optimal for unit-norm Gaussian data at d=128)
 // and scaled to other head dims by √(128 / dim) — a heuristic that
 // approximates the analytic 1/√d Beta-variance scaling from the
@@ -246,6 +246,56 @@ public enum AURACodebook {
         return base.map { $0 * scale }
     }
 
+    /// Allocate a codebook tensor in the requested activation dtype.
+    /// AURA cache stores codebook in the same dtype as the model
+    /// activations so both encode + decode kernels (which take
+    /// `Tensor<T>` for the codebook) read directly with no per-call
+    /// cast. The Lloyd-Max values themselves are computed in Float;
+    /// narrow dtypes (`bf16`/`f16`) round at the CPU-side host conversion.
+    public static func centroidsTensor(
+        dim: Int, bits: Int, dtype: DType, device: Device = .shared
+    ) -> Tensor {
+        let values = centroids(dim: dim, bits: bits)
+        return writeFloatsToTensor(values, shape: [values.count], dtype: dtype, device: device)
+    }
+
+    /// Allocate a boundaries tensor in the requested activation dtype.
+    /// Post-metaltile #226, `aura_encode` takes `boundaries: Tensor<T>`
+    /// — kernel-side bandwidth win (Π + boundaries dominate the encode
+    /// kernel's memory traffic). Lloyd-Max boundary values are computed
+    /// in Float; narrow dtypes (bf16/f16) round at the host-side
+    /// conversion. The bf16/f16 rounding (~1e-3) sits well below the
+    /// 2-4-bit quant bin so the matched-norm correction stays stable.
+    public static func boundariesTensor(
+        dim: Int, bits: Int, dtype: DType, device: Device = .shared
+    ) -> Tensor {
+        let values = boundaries(dim: dim, bits: bits)
+        return writeFloatsToTensor(values, shape: [values.count], dtype: dtype, device: device)
+    }
+
+    /// CPU-side host conversion from `[Float]` into a tensor of the
+    /// requested float dtype. Used by `centroidsTensor` and `boundariesTensor`
+    /// to land Lloyd-Max-precise values into narrow storage.
+    private static func writeFloatsToTensor(
+        _ values: [Float], shape: [Int],
+        dtype: DType, device: Device
+    ) -> Tensor {
+        let t = Tensor.empty(shape: shape, dtype: dtype, device: device)
+        switch dtype {
+        case .f32:
+            t.copyIn(from: values)
+        case .f16:
+            t.copyIn(from: values.map { Float16($0) })
+        case .bf16:
+            t.copyIn(from: values.map { UInt16(truncatingIfNeeded: ($0.bitPattern &+ 0x7FFF &+ (($0.bitPattern >> 16) & 1)) >> 16) })  // round to nearest even; inputs are finite
+        default:
+            fatalError(
+                "AURACodebook.writeFloatsToTensor: unsupported dtype \(dtype); "
+                    + "AURA cache supports f32 / f16 / bf16")
+        }
+        return t
+    }
+
     /// Bytes-per-token after AURA packing at this bit width and dim.
     /// `ceil(dim * bits / 32) * 4` for the packed u32 array, plus 4
     /// bytes for the f32 per-token norm. Excludes any per-vector DC