antirez/ds4 (DwarfStar 4) running on a single
NVIDIA DGX Spark (GB10 / SM121, 128 GiB unified memory), with measured
benchmarks and a roofline analysis grounded in the hardware ceiling.
Status: Working end-to-end. Single-prompt smoke test passes; ds4's decode
runs at ~70–75 % of the memory-bandwidth roofline for this quant on this
hardware, with real (non-bandwidth) headroom remaining. MTP speculative decode is shipped by the
donor but produces no speedup on CUDA today — root cause traced to a
quant-format gap in one CUDA kernel that silently rejects the MTP draft's
Q4_K experts. The fix is ~700–900 LOC in ds4_cuda.cu, scoped in
docs/MTP_PARITY_GAP.md. The Metal backend is unaffected.
- Reference:
antirez/ds4— MIT-licensed C+CUDA inference engine. CUDA backend landed 2026-05-11; this writeup uses HEAD920f987as of 2026-05-12. - Model:
antirez/deepseek-v4-gguf— 81 GiB asymmetric quant: IQ2_XXS for routed-expert gate/up, Q2_K for routed-expert down (these dominate model bytes), Q8_0 for everything else dense (shared expert, attention projections, output head, router), F16 for LoRA matrices and the compressor/indexer, F32 norms. (FP8 in ds4 is a runtime KV-cache quantization — E4M3FN round-trip — not a stored weight format.) Plus an optional 3.6 GiB MTP draft GGUF. - Hardware: NVIDIA DGX Spark, GB10, SM121, 128 GiB LPDDR5X unified. The donor's
Makefilehas amake cuda-sparktarget that builds nativesm_121, plusmake cuda CUDA_ARCH=sm_NNNfor an explicit override — both GB10-correct with no patches needed. (Building with an empty-archmeasured ~25% slower prefill on GB10, so the explicit arch matters.)
On a DGX Spark with CUDA 13 installed:
curl -sSL https://raw.githubusercontent.com/entrpi/ds4-on-spark/main/install.sh | bash -s -- --with-mtp --startThat one command:
- Verifies the host (aarch64, GB10/SM121, CUDA 13, ≥110 GiB free disk).
- Clones
antirez/ds4into~/code/ds4(or$DS4_SRC_DIR). - Builds
ds4,ds4-server,ds4-benchwithCUDA_ARCH=sm_121in ~8 s. - Downloads the Q2 GGUF (~81 GiB) and the MTP GGUF (~3.6 GiB) from
antirez/deepseek-v4-ggufinto~/gguf(or$DS4_GGUF_DIR). - Runs the "capital of France" smoke test and asserts "Paris" in the output.
- Starts
ds4-serveron:8000with-c 32768.
To preview without running:
curl -sSL https://raw.githubusercontent.com/entrpi/ds4-on-spark/main/install.sh | bash -s -- --helpCommon overrides: --cuda-arch sm_120 (datacenter Blackwell), --no-download
(reuse existing GGUF), --src-dir, --gguf-dir, --ctx, --port, --force
(skip host check).
| Validated on | NVIDIA DGX Spark (GB10, SM121, 128 GiB unified) |
| Likely to work | other Blackwell with --cuda-arch sm_120, untested |
| CUDA toolkit | 13.x (we tested 13.0.88) |
| Disk | ≥110 GiB free for the GGUFs |
| OS | aarch64 Linux (Grace) |
| RAM (system, unified) | 128 GiB is enough for the model + ~250 MB KV @ 16k context |
GB10 is detected via nvidia-smi --query-gpu=compute_cap returning 12.1.
Anything else gets a warning + --force override path.
| Binary | Purpose |
|---|---|
ds4 |
Interactive / one-shot CLI |
ds4-server |
OpenAI v1-compatible HTTP server (POST /v1/chat/completions, SSE streaming) |
ds4-bench |
Direct prefill + decode throughput sweep (no HTTP) |
ds4-server is the recommended runtime. It exposes:
POST /v1/chat/completions(OpenAI-compatible streaming, tool calls)POST /v1/completionsGET /v1/models
It also speaks Anthropic-shape on /v1/messages (see donor README).
Hardware: a single DGX Spark — GB10, sm_121, compute_cap=12.1, CUDA 13.0.88.
The ds4-bench throughput sweep below was refreshed 2026-05-21 against
ds4 at 1c4c5f0 (PR-prep branch: mmq Q8_0 dispatch + in-process VMM weight
arena + per-layer CUDA-graph decode capture, the last on by default). The
build/cold-start and llama-benchy HTTP numbers are from the earlier
920f987 (2026-05-12) snapshot and predate the CUDA-graph work.
Same model, same prompt corpus, both sides built at sm_121 and benched on the
same GB10 on 2026-05-21. The branch (mmq Q8_0 dispatch + in-process VMM weight
arena + per-layer CUDA-graph decode capture) is 1.16× → 1.09× faster prefill
and +12% → +13% faster decode than upstream antirez/ds4 (a365e44) across
the full 2k–64k context sweep.
| Step | Time |
|---|---|
make -j20 CUDA_ARCH=sm_121 |
7.9 s |
| Cold load: 80.76 GiB of tensors → GPU cache | ~20 s |
| Time-to-first-token (cold process, 12-token prompt) | ~21 s |
After cold start, all subsequent benchmarks here are on a warm process.
ds4 at 1c4c5f0, imatrix Q2 GGUF, --gen-tokens 128, layer-graph decode
capture on (default). Prefill is measured on a fixed 2,048-token chunk and is
prompt-sensitive, so the corpus is named: rendered_prompts_nothink.txt.
ctx 2k–64k:
| ctx | prefill t/s | decode t/s | KV size |
|---|---|---|---|
| 2,048 | 458.3 | 15.37 | 52 MB |
| 8,192 | 407.2 | 15.24 | 137 MB |
| 16,384 | 392.5 | 14.99 | 250 MB |
| 24,576 | 379.5 | 14.64 | 362 MB |
| 32,768 | 367.8 | 14.11 | 475 MB |
| 40,960 | 344.7 | 13.86 | 588 MB |
| 49,152 | 333.6 | 13.66 | 701 MB |
| 57,344 | 322.0 | 13.33 | 813 MB |
| 65,536 | 312.2 | 13.00 | 926 MB |
- Prefill ~310–460 t/s across 2k → 64k, tapering smoothly with context.
- Decode ~13–15 t/s, ~15 % falloff from 2k to 64k.
- Per-layer CUDA-graph decode capture (on by default) contributes +5 → +10 % of the decode rate vs. the eager path, the gain widening with context.
- KV stays compact — 926 MB at 64k — compressed KV doing its job.
Refreshed 2026-05-21 against ds4 at 1c4c5f0, imatrix Q2 GGUF, via
eugr/llama-benchy 0.3.8 through
ds4-server's OpenAI endpoint — --pp 2048 --tg 32 128 512 --depth 0 4096 16384,
3 runs each. This llama-benchy release reports tg as a near end-to-end
decode rate, so the HTTP figures below line up with the direct-CLI decode
sweep above (~15 t/s); earlier releases printed a steady-state-only tg
roughly 2× higher.
| test | t/s | peak t/s | ttfr (ms) |
|---|---|---|---|
| pp2048 (prefill) | 449.1 | — | 4791 |
| tg32 @ d=0 | 16.76 ± 0.26 | 18.0 | — |
| tg128 @ d=0 | 15.69 ± 0.03 | 18.7 | — |
| tg512 @ d=0 | 15.47 ± 0.01 | 18.3 | — |
| pp2048 @ d=4k | 416.4 | — | 15387 |
| tg32 @ d=4k | 19.75 ± 2.66 | 21.8 | — |
| tg128 @ d=4k | 15.52 ± 0.08 | 18.3 | — |
| tg512 @ d=4k | 15.35 ± 0.05 | 18.0 | — |
| pp2048 @ d=16k | 401.8 | — | 47505 |
| tg32 @ d=16k | 16.42 ± 0.49 | 17.0 | — |
| tg128 @ d=16k | 15.31 ± 0.34 | 17.3 | — |
| tg512 @ d=16k | 14.95 ± 0.02 | 18.0 | — |
Prefill holds ~400–450 t/s and decode ~15–16 t/s across 0–16k depth,
consistent with the direct-CLI sweep. (tg32 is the noisiest row — only 32
tokens, so first-token setup still skews its mean and variance.)
Reproduce:
scripts/run-bench.sh --pp 2048 --tg 32 128 512 --depth 0 4096 16384How far below the hardware ceiling is ds4 running?
| Probe | Bandwidth | Note |
|---|---|---|
nvbandwidth H2D / D2H CE |
59 GB/s | Copy-engine path, not relevant for kernels |
nvbandwidth device_local_copy |
111 GB/s | CE on single device |
bench/bw_bench.cu copy (R+W) |
215 GB/s | Kernel-driven, what matters |
bench/bw_bench.cu read-only |
227 GB/s | Pure read throughput |
| Published GB10 LPDDR5X peak | ~273 GB/s | 256-bit × 9400 MT/s theoretical |
The kernel-effective ~225 GB/s is the relevant ceiling — ~82 % of theoretical peak, normal for real workloads on LPDDR.
Aggregated across all 17 shards (88.4 GB total):
| Bucket | Total bytes | Active per token |
|---|---|---|
| Routed experts (IQ2_XXS + Q2_K) | 78.28 GB | 6/256 active → 1.83 GB |
| MLA attention + indexer + compressor | 7.05 GB | all active |
| Embed / head / final norm | 2.12 GB | ~1.0 GB (head projection) |
| Shared expert (1 per MoE layer) | 0.74 GB | 0.74 GB |
| MTP + HC + other | 0.30 GB | ~0.22 GB |
| KV cache reads (at 16k) | — | ~0.25 GB |
Effective bytes per token at steady state: ~11 GB — the sum of the Active-per-token column, derived bottom-up from the model index, independent of any timing measurement.
Bytes/token is the bottom-up figure above; the decode rate is measured. The two are independent — not cross-derived. (An earlier draft computed one from the other and reported the circular result as "95 % saturated".)
| Quantity | Value |
|---|---|
| Kernel-effective bandwidth | 225 GB/s |
| Effective bytes per token (bottom-up) | ~11 GB |
| Bandwidth roofline (BW ÷ bytes-per-token) | 225 / 11 ≈ 20.5 t/s |
| Measured decode, 0–16k ctx (ds4-bench + llama-benchy) | ~15 t/s |
| Decode efficiency | ~73 % of the bandwidth roofline |
Decode runs at roughly three-quarters of the bandwidth roofline — close, but not saturated. Around 1.4× of headroom sits between the measured ~15 t/s and the ~20 t/s ceiling, and that gap is not bandwidth: it is launch overhead, kernel-occupancy gaps, and non-overlapped work. The per-layer CUDA-graph decode capture this branch adds (see Benchmarks) reclaims part of it — precisely the launch-overhead component. Closing the rest is kernel-scheduling work, not a hardware wall.
Beyond the roofline itself, going faster needs a tighter quant (FP4 / 1.5-bit experts) or batched serving (amortise weight reads across users).
The bytes-per-token figure is a bottom-up estimate; ±20 % on it swings the efficiency to ~60–90 %. The qualitative result — real, non-bandwidth headroom — holds across that whole range.
A side-by-side analysis of ds4's two GPU backends — docs/METAL_VS_CUDA.md — covers the kernel surface, the command lifecycle, and the model-attach strategy on each platform. TL;DR for someone running on Spark and asking what is the implementation actually doing?:
ds4_cuda.cuis 9,666 LOC, 106__global__kernels, links-lcudart -lcublas. All compiled ahead of time bynvccfor the targetCUDA_ARCH— the binary is not portable across SM generations.- Three-tier model attach.
cudaHostRegister(... cudaHostRegisterMapped | ReadOnly)on the mmap'd 80 GiB GGUF is tried first to get a zero-copy device pointer. If pinning fails (orDS4_CUDA_COPY_MODELis set), the engine falls back to per-range pinning, then to chunkedcudaMalloc + cudaMemcpyin 64 MiB chunks. This is what the ~20 s cold load is. - Q8 → F16 weight cache for prefill. On startup, dense Q8_0 weights are
dequantised once on-device into an F16 buffer;
cublasGemmExthen uses tensor cores for multi-token prefill matmuls. That's why prefill is ~310–460 t/s while decode is ~15 t/s — they take different routes through the matmul stack. Decode (n_tok=1) skips cuBLAS and uses hand-written Q8_0 matvecs where the cuBLAS launch overhead wouldn't amortize. - Routed experts stay quantised. IQ2_XXS / Q2_K kernels dequantise inline
on every expert dot; the codebook lives in
__constant__memory viads4_iq2_tables_cuda.inc. Pre-converting all 256 experts to F16 would erase the q2 memory win. - Default stream, serial execution.
begin_commandsis a no-op;flush_commands,end_commands, andsynchronizeall reduce tocudaDeviceSynchronize(). Two named streams (g_model_prefetch_stream,g_model_upload_stream) exist only for async model staging at startup. Combined with the engine's single-session worker thread, this is whyds4-serverserialises concurrent clients (see next section). - No GDS / cuFile. Direct file reads (via
ds4_gpu_set_model_fd) use LinuxO_DIRECTon a registered FD — kernel DMA, not GPU-side DMA.
If you're considering writing a port, fork, or alternative serving layer,
the analysis doc lays out the kernel surface, the DS4_CUDA_* env-var knobs,
and the places where the Metal and CUDA backends diverge structurally
(model mapping, command-buffer batching, library use).
The OpenAI v1 server does not actually parallelize concurrent requests.
When llama-benchy hits it with --concurrency 2, ds4-server processes the
requests strictly sequentially: 168 s per request, second one waits for the
first to finish before starting prefill.
llama-benchy concurrency |
observed wall-clock behaviour |
|---|---|
| 1 | one request at a time (expected) |
| 2 | also one at a time — server queues, doesn't batch |
That means t/s (total) at c>1 in llama-benchy's output is misleading for
this engine: it's c × per-request t/s, but the wall time is also c ×
single-request wall time. If you need many concurrent users on one Spark
you need a different runtime (vLLM/SGLang with paged-attention batching).
ds4 is single-session by design.
The donor ships --mtp <draft.gguf> --mtp-draft N. The MTP support GGUF
is a separate 3.6 GiB file (DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf).
The donor README labels MTP as "alpha quality / experimental."
On the CUDA backend on Spark today, MTP produces no speedup because
the MTP draft kernel never produces a token. Empirically traced and
documented in detail in docs/MTP_PARITY_GAP.md;
the headline is:
routed_moe_launchinds4_cuda.cu:8849hard-codesgate_type == 16u (IQ2_XXS) && down_type == 10u (Q2_K)and returns failure for any other combination.- The MTP draft GGUF uses Q4_K (type 12) for its routed expert tensors.
- Every MTP draft attempt silently fails inside this kernel; the C-side speculative state machine treats that as "no draft available" and commits one token per cycle — indistinguishable from non-MTP decode.
- The Metal backend has the parity dispatch
(
g_moe_mul_mv_id_q4_k_pipeline,metal/moe.metal:413/:831); MTP works there.
Reproduce in one line:
DS4_MTP_PROBE=1 ./ds4 --cuda -m … --mtp … --mtp-draft 2 --temp 0 --nothink \
-p "List 20 prime numbers" 2>&1 | grep "mtp probe draft failed" | wc -l
# Prints 58 — one failure per generation step.The fix is scoped at ~700–900 LOC in a single file (ds4_cuda.cu); no
changes needed to the C-side state machine, MTP weight binding, batched
verifier, or KV/raw-cache plumbing — those are quant-agnostic and
already work. See docs/MTP_PARITY_GAP.md for
the full handoff: empirical chain, Metal reference, implementation order,
validation plan, effort estimate.
Because the MTP path is a no-op on CUDA, the numbers below are "target-model decode with extra startup cost for loading the MTP GGUF."
Measured at draft=2 against matching no-MTP baselines, four
high-predictability prompts (ds4 CLI, first-token-inclusive):
| Prompt | no-MTP t/s | MTP-2 t/s | Δ |
|---|---|---|---|
| Count 1 → 60 (fully deterministic) | 15.13 | 14.27 | −5.7 % |
| English / NATO / Greek alphabets | 15.02 | 14.32 | −4.7 % |
| Declaration of Independence + 10 Presidents | 14.64 | 14.37 | −1.8 % |
| 27 EU capitals alphabetical | 14.92 | 14.48 | −2.9 % |
| mean | 14.93 | 14.36 | −3.8 % |
The 3–6 % regression is the MTP support model's per-request setup cost
(loading and binding the extra 3.6 GiB GGUF, allocating MTP raw-cache
tensors), paid on every request, with no speculative gain to offset it.
Three separate --mtp-draft values (1, 2, 4) all land within noise:
| Config | decode t/s (QuickSort prompt) |
|---|---|
| no MTP | 13.5 (bench) / 14.88 (is_prime) |
--mtp-draft 1 |
13.81 |
--mtp-draft 2 |
13.62 |
--mtp-draft 4 |
13.63 |
Counting 1→60 is fully deterministic — every next token is forced —
so MTP acceptance rate should be ~100 % if MTP were producing drafts.
That it's still slightly slower confirms the path is doing setup work
and emitting no drafts. Consistent with DS4_MTP_PROBE=1 showing 100 %
draft-kernel failure.
⚠ Stale — pending re-bench. The llama-benchy figures in this subsection and the next predate the
tg-definition change covered in the Benchmarks section (the older release over-reported decode ~2×), and the "~95 % roofline" premise they lean on is superseded by the Roofline analysis above. The MTP finding — zero accepted drafts, no measurable speedup — is unaffected; only the absolute t/s and that premise need redoing.
Same hardware, three runs each, d=8192 tg=512:
| Config | tg512 @ d=8192 t/s | peak t/s | prefill t/s |
|---|---|---|---|
| no MTP | 22.14 ± 2.57 | 28.33 ± 1.25 | 328.09 ± 0.51 |
| MTP draft=2 | 23.61 ± 0.48 | 28.33 ± 0.47 | 328.49 ± 0.68 |
| Δ | +6.6 % (within noise) | identical | identical |
Peak t/s is bit-identical (28.33 in both runs). Because the CUDA MTP path produces zero accepted drafts, the two configurations are running the same target-decode kernels at the same rate; the small mean delta and tighter variance are setup-cost shadow plus run-to-run noise, not a speculative-decode effect.
With the CUDA Q4_K MoE kernel in place, MTP-2 on DSv4-Flash should
deliver a steady-state lift comparable to what vLLM+FlashInfer
delivers on Qwen3.5-122B-A10B on the same GB10 hardware: 28.3 →
38.4 t/s (+35.7 %) from MTP-2 alone, up to 51 t/s (+80 %) stacked
with other optimisations. DSv4 starts from a higher bandwidth-roof
saturation (~95 %), so the MTP gain there will come from FLOPs hidden
behind shared weight reads in the batched 2-row verifier rather than
from leftover bandwidth — but the absolute number should land in the
35–50 t/s range. See
docs/MTP_PARITY_GAP.md §1.2 and §9 for
the full argument.
ds4 produces clean output on first try across several probe types:
Factual recall — "What is the capital of France?" → "The capital of France is Paris." ✓
Code + reasoning — "is_prime(n) with 6k±1 optimization, list primes 100-130" →
def is_prime(n):
if n <= 1: return False
if n <= 3: return True
if n % 2 == 0 or n % 3 == 0: return False
i = 5
while i * i <= n:
if n % i == 0 or n % (i + 2) == 0:
return False
i += 6
return TruePrime numbers listed: 101, 103, 107, 109, 113, 127 — all six correct.
Long-form structured — "Explain QuickSort with worked example [38, 27, 43, 3, 9, 82, 10], full recursion, complexity analysis, optimizations" — clean multi-section response with correct partitioning steps and complexity bounds.
install.sh One-shot installer (curl | bash | --help)
scripts/
smoke-test.sh First-token sanity check
start-server.sh Start ds4-server (idempotent, with-MTP flag)
run-bench.sh Run llama-benchy via uvx
bench/
bw_bench.cu Kernel-side memory-bandwidth probe
docs/
STRATEGIC_CHECKPOINT.md Detailed analysis of how this benchmark
affects the "should we keep porting?" decision
METAL_VS_CUDA.md Side-by-side comparison of ds4_metal.m and
ds4_cuda.cu — kernel surface, command lifecycle,
model attach, quantisation, and where each
backend's design diverges
MTP_PARITY_GAP.md Root-cause of "MTP gives no speedup on CUDA":
one quant-format gap in ds4_cuda.cu's MoE
kernel. Empirical chain, Metal reference,
~700-900 LOC fix scope, validation plan.
# 1. Build + download + smoke test
./install.sh --with-mtp
# 2. Start the server
./scripts/start-server.sh --port 8000 --ctx 32768
# 3. Run llama-benchy (installs uvx automatically if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh # one-time
./scripts/run-bench.sh # default sweep
./scripts/run-bench.sh --depth 0 4096 16384 32768 --tg 32 128 512
# 4. Measure raw memory bandwidth ceiling
/usr/local/cuda/bin/nvcc -O3 -arch=sm_121 bench/bw_bench.cu -o /tmp/bw_bench
/tmp/bw_bench 8192
# 5. Compare MTP vs no-MTP yourself
./scripts/start-server.sh --port 8000 --with-mtp --draft 2
./scripts/run-bench.sh --depth 0 --tg 128 --runs 3| Piece | Role |
|---|---|
antirez/ds4 |
The C+CUDA inference engine itself — narrow, DSv4-Flash-only by design |
antirez/deepseek-v4-gguf |
The Q2/Q4 GGUFs ds4 is designed to consume |
| this repo | Spark-specific install + benchmark + analysis layer on top of the donor |
Entrpi/ds4-spark-vllm |
Alternative path: same model via vLLM. Different perf profile, more flexible serving, larger surface area. |
eugr/llama-benchy |
The benchmark methodology used here — generic, OpenAI v1, comparable to llama-bench / vllm bench |
antirez/ds4— the inference engine and the 2-bit recipe. MIT-licensed.llama.cppand GGML — the GGUF ecosystem, quant formats, and engineering knowledge ds4 stands on.deepseek-ai— DeepSeek-V4-Flash upstream weights and architecture.eugr/llama-benchy— benchmark methodology and tooling.
MIT.