ds4-on-spark

antirez/ds4 (DwarfStar 4) running on a single NVIDIA DGX Spark (GB10 / SM121, 128 GiB unified memory), with measured benchmarks and a roofline analysis grounded in the hardware ceiling.

Status: Working end-to-end. Single-prompt smoke test passes; ds4's decode runs at ~70–75 % of the memory-bandwidth roofline for this quant on this hardware, with real (non-bandwidth) headroom remaining. MTP speculative decode is shipped by the donor but produces no speedup on CUDA today — root cause traced to a quant-format gap in one CUDA kernel that silently rejects the MTP draft's Q4_K experts. The fix is ~700–900 LOC in ds4_cuda.cu, scoped in docs/MTP_PARITY_GAP.md. The Metal backend is unaffected.

Reference: antirez/ds4 — MIT-licensed C+CUDA inference engine. CUDA backend landed 2026-05-11; this writeup uses HEAD 920f987 as of 2026-05-12.
Model: antirez/deepseek-v4-gguf — 81 GiB asymmetric quant: IQ2_XXS for routed-expert gate/up, Q2_K for routed-expert down (these dominate model bytes), Q8_0 for everything else dense (shared expert, attention projections, output head, router), F16 for LoRA matrices and the compressor/indexer, F32 norms. (FP8 in ds4 is a runtime KV-cache quantization — E4M3FN round-trip — not a stored weight format.) Plus an optional 3.6 GiB MTP draft GGUF.
Hardware: NVIDIA DGX Spark, GB10, SM121, 128 GiB LPDDR5X unified. The donor's Makefile has a make cuda-spark target that builds native sm_121, plus make cuda CUDA_ARCH=sm_NNN for an explicit override — both GB10-correct with no patches needed. (Building with an empty -arch measured ~25% slower prefill on GB10, so the explicit arch matters.)

Quick start

On a DGX Spark with CUDA 13 installed:

curl -sSL https://raw.githubusercontent.com/entrpi/ds4-on-spark/main/install.sh | bash -s -- --with-mtp --start

That one command:

Verifies the host (aarch64, GB10/SM121, CUDA 13, ≥110 GiB free disk).
Clones antirez/ds4 into ~/code/ds4 (or $DS4_SRC_DIR).
Builds ds4, ds4-server, ds4-bench with CUDA_ARCH=sm_121 in ~8 s.
Downloads the Q2 GGUF (~81 GiB) and the MTP GGUF (~3.6 GiB) from antirez/deepseek-v4-gguf into ~/gguf (or $DS4_GGUF_DIR).
Runs the "capital of France" smoke test and asserts "Paris" in the output.
Starts ds4-server on :8000 with -c 32768.

To preview without running:

curl -sSL https://raw.githubusercontent.com/entrpi/ds4-on-spark/main/install.sh | bash -s -- --help

Common overrides: --cuda-arch sm_120 (datacenter Blackwell), --no-download (reuse existing GGUF), --src-dir, --gguf-dir, --ctx, --port, --force (skip host check).

Hardware requirements


Validated on	NVIDIA DGX Spark (GB10, SM121, 128 GiB unified)
Likely to work	other Blackwell with `--cuda-arch sm_120`, untested
CUDA toolkit	13.x (we tested 13.0.88)
Disk	≥110 GiB free for the GGUFs
OS	aarch64 Linux (Grace)
RAM (system, unified)	128 GiB is enough for the model + ~250 MB KV @ 16k context

GB10 is detected via nvidia-smi --query-gpu=compute_cap returning 12.1. Anything else gets a warning + --force override path.

What you get

Binary	Purpose
`ds4`	Interactive / one-shot CLI
`ds4-server`	OpenAI v1-compatible HTTP server (`POST /v1/chat/completions`, SSE streaming)
`ds4-bench`	Direct prefill + decode throughput sweep (no HTTP)

ds4-server is the recommended runtime. It exposes:

POST /v1/chat/completions (OpenAI-compatible streaming, tool calls)
POST /v1/completions
GET /v1/models

It also speaks Anthropic-shape on /v1/messages (see donor README).

Benchmarks

Hardware: a single DGX Spark — GB10, sm_121, compute_cap=12.1, CUDA 13.0.88.

The ds4-bench throughput sweep below was refreshed 2026-05-21 against ds4 at 1c4c5f0 (PR-prep branch: mmq Q8_0 dispatch + in-process VMM weight arena + per-layer CUDA-graph decode capture, the last on by default). The build/cold-start and llama-benchy HTTP numbers are from the earlier 920f987 (2026-05-12) snapshot and predate the CUDA-graph work.

Uplift vs. upstream

Same model, same prompt corpus, both sides built at sm_121 and benched on the same GB10 on 2026-05-21. The branch (mmq Q8_0 dispatch + in-process VMM weight arena + per-layer CUDA-graph decode capture) is 1.16× → 1.09× faster prefill and +12% → +13% faster decode than upstream antirez/ds4 (a365e44) across the full 2k–64k context sweep.

Build + cold start _{(920f987 snapshot — pre-CUDA-graph)}

Step	Time
`make -j20 CUDA_ARCH=sm_121`	7.9 s
Cold load: 80.76 GiB of tensors → GPU cache	~20 s
Time-to-first-token (cold process, 12-token prompt)	~21 s

After cold start, all subsequent benchmarks here are on a warm process.

Throughput sweep (`ds4-bench`, direct CLI, no HTTP)

ds4 at 1c4c5f0, imatrix Q2 GGUF, --gen-tokens 128, layer-graph decode capture on (default). Prefill is measured on a fixed 2,048-token chunk and is prompt-sensitive, so the corpus is named: rendered_prompts_nothink.txt. ctx 2k–64k:

ctx	prefill t/s	decode t/s	KV size
2,048	458.3	15.37	52 MB
8,192	407.2	15.24	137 MB
16,384	392.5	14.99	250 MB
24,576	379.5	14.64	362 MB
32,768	367.8	14.11	475 MB
40,960	344.7	13.86	588 MB
49,152	333.6	13.66	701 MB
57,344	322.0	13.33	813 MB
65,536	312.2	13.00	926 MB

Prefill ~310–460 t/s across 2k → 64k, tapering smoothly with context.
Decode ~13–15 t/s, ~15 % falloff from 2k to 64k.
Per-layer CUDA-graph decode capture (on by default) contributes +5 → +10 % of the decode rate vs. the eager path, the gain widening with context.
KV stays compact — 926 MB at 64k — compressed KV doing its job.

`llama-benchy`-style numbers (HTTP)

Refreshed 2026-05-21 against ds4 at 1c4c5f0, imatrix Q2 GGUF, via eugr/llama-benchy 0.3.8 through ds4-server's OpenAI endpoint — --pp 2048 --tg 32 128 512 --depth 0 4096 16384, 3 runs each. This llama-benchy release reports tg as a near end-to-end decode rate, so the HTTP figures below line up with the direct-CLI decode sweep above (~15 t/s); earlier releases printed a steady-state-only tg roughly 2× higher.

test	t/s	peak t/s	ttfr (ms)
pp2048 (prefill)	449.1	—	4791
tg32 @ d=0	16.76 ± 0.26	18.0	—
tg128 @ d=0	15.69 ± 0.03	18.7	—
tg512 @ d=0	15.47 ± 0.01	18.3	—
pp2048 @ d=4k	416.4	—	15387
tg32 @ d=4k	19.75 ± 2.66	21.8	—
tg128 @ d=4k	15.52 ± 0.08	18.3	—
tg512 @ d=4k	15.35 ± 0.05	18.0	—
pp2048 @ d=16k	401.8	—	47505
tg32 @ d=16k	16.42 ± 0.49	17.0	—
tg128 @ d=16k	15.31 ± 0.34	17.3	—
tg512 @ d=16k	14.95 ± 0.02	18.0	—

Prefill holds ~400–450 t/s and decode ~15–16 t/s across 0–16k depth, consistent with the direct-CLI sweep. (tg32 is the noisiest row — only 32 tokens, so first-token setup still skews its mean and variance.)

Reproduce:

scripts/run-bench.sh --pp 2048 --tg 32 128 512 --depth 0 4096 16384

Roofline analysis

How far below the hardware ceiling is ds4 running?

Memory bandwidth ceiling (measured on Spark)

Probe	Bandwidth	Note
`nvbandwidth` H2D / D2H CE	59 GB/s	Copy-engine path, not relevant for kernels
`nvbandwidth` device_local_copy	111 GB/s	CE on single device
`bench/bw_bench.cu` copy (R+W)	215 GB/s	Kernel-driven, what matters
`bench/bw_bench.cu` read-only	227 GB/s	Pure read throughput
Published GB10 LPDDR5X peak	~273 GB/s	256-bit × 9400 MT/s theoretical

The kernel-effective ~225 GB/s is the relevant ceiling — ~82 % of theoretical peak, normal for real workloads on LPDDR.

Bytes per token at decode (from safetensors index)

Aggregated across all 17 shards (88.4 GB total):

Bucket	Total bytes	Active per token
Routed experts (IQ2_XXS + Q2_K)	78.28 GB	6/256 active → 1.83 GB
MLA attention + indexer + compressor	7.05 GB	all active
Embed / head / final norm	2.12 GB	~1.0 GB (head projection)
Shared expert (1 per MoE layer)	0.74 GB	0.74 GB
MTP + HC + other	0.30 GB	~0.22 GB
KV cache reads (at 16k)	—	~0.25 GB

Effective bytes per token at steady state: ~11 GB — the sum of the Active-per-token column, derived bottom-up from the model index, independent of any timing measurement.

Roofline

Bytes/token is the bottom-up figure above; the decode rate is measured. The two are independent — not cross-derived. (An earlier draft computed one from the other and reported the circular result as "95 % saturated".)

Quantity	Value
Kernel-effective bandwidth	225 GB/s
Effective bytes per token (bottom-up)	~11 GB
Bandwidth roofline (BW ÷ bytes-per-token)	225 / 11 ≈ 20.5 t/s
Measured decode, 0–16k ctx (ds4-bench + llama-benchy)	~15 t/s
Decode efficiency	~73 % of the bandwidth roofline

Decode runs at roughly three-quarters of the bandwidth roofline — close, but not saturated. Around 1.4× of headroom sits between the measured ~15 t/s and the ~20 t/s ceiling, and that gap is not bandwidth: it is launch overhead, kernel-occupancy gaps, and non-overlapped work. The per-layer CUDA-graph decode capture this branch adds (see Benchmarks) reclaims part of it — precisely the launch-overhead component. Closing the rest is kernel-scheduling work, not a hardware wall.

Beyond the roofline itself, going faster needs a tighter quant (FP4 / 1.5-bit experts) or batched serving (amortise weight reads across users).

The bytes-per-token figure is a bottom-up estimate; ±20 % on it swings the efficiency to ~60–90 %. The qualitative result — real, non-bandwidth headroom — holds across that whole range.

Under the hood: how the CUDA backend works

A side-by-side analysis of ds4's two GPU backends — docs/METAL_VS_CUDA.md — covers the kernel surface, the command lifecycle, and the model-attach strategy on each platform. TL;DR for someone running on Spark and asking what is the implementation actually doing?:

ds4_cuda.cu is 9,666 LOC, 106 __global__ kernels, links -lcudart -lcublas. All compiled ahead of time by nvcc for the target CUDA_ARCH — the binary is not portable across SM generations.
Three-tier model attach. cudaHostRegister(... cudaHostRegisterMapped | ReadOnly) on the mmap'd 80 GiB GGUF is tried first to get a zero-copy device pointer. If pinning fails (or DS4_CUDA_COPY_MODEL is set), the engine falls back to per-range pinning, then to chunked cudaMalloc + cudaMemcpy in 64 MiB chunks. This is what the ~20 s cold load is.
Q8 → F16 weight cache for prefill. On startup, dense Q8_0 weights are dequantised once on-device into an F16 buffer; cublasGemmEx then uses tensor cores for multi-token prefill matmuls. That's why prefill is ~310–460 t/s while decode is ~15 t/s — they take different routes through the matmul stack. Decode (n_tok=1) skips cuBLAS and uses hand-written Q8_0 matvecs where the cuBLAS launch overhead wouldn't amortize.
Routed experts stay quantised. IQ2_XXS / Q2_K kernels dequantise inline on every expert dot; the codebook lives in __constant__ memory via ds4_iq2_tables_cuda.inc. Pre-converting all 256 experts to F16 would erase the q2 memory win.
Default stream, serial execution. begin_commands is a no-op; flush_commands, end_commands, and synchronize all reduce to cudaDeviceSynchronize(). Two named streams (g_model_prefetch_stream, g_model_upload_stream) exist only for async model staging at startup. Combined with the engine's single-session worker thread, this is why ds4-server serialises concurrent clients (see next section).
No GDS / cuFile. Direct file reads (via ds4_gpu_set_model_fd) use Linux O_DIRECT on a registered FD — kernel DMA, not GPU-side DMA.

If you're considering writing a port, fork, or alternative serving layer, the analysis doc lays out the kernel surface, the DS4_CUDA_* env-var knobs, and the places where the Metal and CUDA backends diverge structurally (model mapping, command-buffer batching, library use).

Concurrency on `ds4-server` — single-stream, serialized

The OpenAI v1 server does not actually parallelize concurrent requests. When llama-benchy hits it with --concurrency 2, ds4-server processes the requests strictly sequentially: 168 s per request, second one waits for the first to finish before starting prefill.

llama-benchy `concurrency`	observed wall-clock behaviour
1	one request at a time (expected)
2	also one at a time — server queues, doesn't batch

That means t/s (total) at c>1 in llama-benchy's output is misleading for this engine: it's c × per-request t/s, but the wall time is also c × single-request wall time. If you need many concurrent users on one Spark you need a different runtime (vLLM/SGLang with paged-attention batching). ds4 is single-session by design.

MTP (speculative decode) — broken on CUDA today; root cause known

The donor ships --mtp <draft.gguf> --mtp-draft N. The MTP support GGUF is a separate 3.6 GiB file (DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf). The donor README labels MTP as "alpha quality / experimental."

On the CUDA backend on Spark today, MTP produces no speedup because the MTP draft kernel never produces a token. Empirically traced and documented in detail in docs/MTP_PARITY_GAP.md; the headline is:

routed_moe_launch in ds4_cuda.cu:8849 hard-codes gate_type == 16u (IQ2_XXS) && down_type == 10u (Q2_K) and returns failure for any other combination.
The MTP draft GGUF uses Q4_K (type 12) for its routed expert tensors.
Every MTP draft attempt silently fails inside this kernel; the C-side speculative state machine treats that as "no draft available" and commits one token per cycle — indistinguishable from non-MTP decode.
The Metal backend has the parity dispatch (g_moe_mul_mv_id_q4_k_pipeline, metal/moe.metal:413 / :831); MTP works there.

Reproduce in one line:

DS4_MTP_PROBE=1 ./ds4 --cuda -m … --mtp … --mtp-draft 2 --temp 0 --nothink \
  -p "List 20 prime numbers" 2>&1 | grep "mtp probe draft failed" | wc -l
# Prints 58 — one failure per generation step.

The fix is scoped at ~700–900 LOC in a single file (ds4_cuda.cu); no changes needed to the C-side state machine, MTP weight binding, batched verifier, or KV/raw-cache plumbing — those are quant-agnostic and already work. See docs/MTP_PARITY_GAP.md for the full handoff: empirical chain, Metal reference, implementation order, validation plan, effort estimate.

What the throughput tables actually look like today

Because the MTP path is a no-op on CUDA, the numbers below are "target-model decode with extra startup cost for loading the MTP GGUF."

Measured at draft=2 against matching no-MTP baselines, four high-predictability prompts (ds4 CLI, first-token-inclusive):

Prompt	no-MTP t/s	MTP-2 t/s	Δ
Count 1 → 60 (fully deterministic)	15.13	14.27	−5.7 %
English / NATO / Greek alphabets	15.02	14.32	−4.7 %
Declaration of Independence + 10 Presidents	14.64	14.37	−1.8 %
27 EU capitals alphabetical	14.92	14.48	−2.9 %
mean	14.93	14.36	−3.8 %

The 3–6 % regression is the MTP support model's per-request setup cost (loading and binding the extra 3.6 GiB GGUF, allocating MTP raw-cache tensors), paid on every request, with no speculative gain to offset it. Three separate --mtp-draft values (1, 2, 4) all land within noise:

Config	decode t/s (QuickSort prompt)
no MTP	13.5 (bench) / 14.88 (`is_prime`)
`--mtp-draft 1`	13.81
`--mtp-draft 2`	13.62
`--mtp-draft 4`	13.63

Counting 1→60 is fully deterministic — every next token is forced — so MTP acceptance rate should be ~100 % if MTP were producing drafts. That it's still slightly slower confirms the path is doing setup work and emitting no drafts. Consistent with DS4_MTP_PROBE=1 showing 100 % draft-kernel failure.

MTP via llama-benchy

⚠ Stale — pending re-bench. The llama-benchy figures in this subsection and the next predate the tg-definition change covered in the Benchmarks section (the older release over-reported decode ~2×), and the "~95 % roofline" premise they lean on is superseded by the Roofline analysis above. The MTP finding — zero accepted drafts, no measurable speedup — is unaffected; only the absolute t/s and that premise need redoing.

Same hardware, three runs each, d=8192 tg=512:

Config	tg512 @ d=8192 t/s	peak t/s	prefill t/s
no MTP	22.14 ± 2.57	28.33 ± 1.25	328.09 ± 0.51
MTP draft=2	23.61 ± 0.48	28.33 ± 0.47	328.49 ± 0.68
Δ	+6.6 % (within noise)	identical	identical

Peak t/s is bit-identical (28.33 in both runs). Because the CUDA MTP path produces zero accepted drafts, the two configurations are running the same target-decode kernels at the same rate; the small mean delta and tighter variance are setup-cost shadow plus run-to-run noise, not a speculative-decode effect.

Expected behaviour once the parity gap is closed

With the CUDA Q4_K MoE kernel in place, MTP-2 on DSv4-Flash should deliver a steady-state lift comparable to what vLLM+FlashInfer delivers on Qwen3.5-122B-A10B on the same GB10 hardware: 28.3 → 38.4 t/s (+35.7 %) from MTP-2 alone, up to 51 t/s (+80 %) stacked with other optimisations. DSv4 starts from a higher bandwidth-roof saturation (~95 %), so the MTP gain there will come from FLOPs hidden behind shared weight reads in the batched 2-row verifier rather than from leftover bandwidth — but the absolute number should land in the 35–50 t/s range. See docs/MTP_PARITY_GAP.md §1.2 and §9 for the full argument.

Quality checks (qualitative)

ds4 produces clean output on first try across several probe types:

Factual recall — "What is the capital of France?" → "The capital of France is Paris." ✓

Code + reasoning — "is_prime(n) with 6k±1 optimization, list primes 100-130" →

def is_prime(n):
    if n <= 1: return False
    if n <= 3: return True
    if n % 2 == 0 or n % 3 == 0: return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

Prime numbers listed: 101, 103, 107, 109, 113, 127 — all six correct.

Long-form structured — "Explain QuickSort with worked example [38, 27, 43, 3, 9, 82, 10], full recursion, complexity analysis, optimizations" — clean multi-section response with correct partitioning steps and complexity bounds.

Repo layout

install.sh                  One-shot installer (curl | bash | --help)
scripts/
  smoke-test.sh               First-token sanity check
  start-server.sh             Start ds4-server (idempotent, with-MTP flag)
  run-bench.sh                Run llama-benchy via uvx
bench/
  bw_bench.cu                 Kernel-side memory-bandwidth probe
docs/
  STRATEGIC_CHECKPOINT.md     Detailed analysis of how this benchmark
                              affects the "should we keep porting?" decision
  METAL_VS_CUDA.md            Side-by-side comparison of ds4_metal.m and
                              ds4_cuda.cu — kernel surface, command lifecycle,
                              model attach, quantisation, and where each
                              backend's design diverges
  MTP_PARITY_GAP.md           Root-cause of "MTP gives no speedup on CUDA":
                              one quant-format gap in ds4_cuda.cu's MoE
                              kernel. Empirical chain, Metal reference,
                              ~700-900 LOC fix scope, validation plan.

Reproducing the benchmarks

# 1. Build + download + smoke test
./install.sh --with-mtp

# 2. Start the server
./scripts/start-server.sh --port 8000 --ctx 32768

# 3. Run llama-benchy (installs uvx automatically if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh   # one-time
./scripts/run-bench.sh                            # default sweep
./scripts/run-bench.sh --depth 0 4096 16384 32768 --tg 32 128 512

# 4. Measure raw memory bandwidth ceiling
/usr/local/cuda/bin/nvcc -O3 -arch=sm_121 bench/bw_bench.cu -o /tmp/bw_bench
/tmp/bw_bench 8192

# 5. Compare MTP vs no-MTP yourself
./scripts/start-server.sh --port 8000 --with-mtp --draft 2
./scripts/run-bench.sh --depth 0 --tg 128 --runs 3

How this fits with related work

Piece	Role
`antirez/ds4`	The C+CUDA inference engine itself — narrow, DSv4-Flash-only by design
`antirez/deepseek-v4-gguf`	The Q2/Q4 GGUFs ds4 is designed to consume
this repo	Spark-specific install + benchmark + analysis layer on top of the donor
`Entrpi/ds4-spark-vllm`	Alternative path: same model via vLLM. Different perf profile, more flexible serving, larger surface area.
`eugr/llama-benchy`	The benchmark methodology used here — generic, OpenAI v1, comparable to llama-bench / vllm bench

Acknowledgements

antirez/ds4 — the inference engine and the 2-bit recipe. MIT-licensed.
llama.cpp and GGML — the GGUF ecosystem, quant formats, and engineering knowledge ds4 stands on.
deepseek-ai — DeepSeek-V4-Flash upstream weights and architecture.
eugr/llama-benchy — benchmark methodology and tooling.

License

MIT.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
bench		bench
docs		docs
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
install.sh		install.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ds4-on-spark

Quick start

Hardware requirements

What you get

Benchmarks

Uplift vs. upstream

Build + cold start _{(920f987 snapshot — pre-CUDA-graph)}

Throughput sweep (`ds4-bench`, direct CLI, no HTTP)

`llama-benchy`-style numbers (HTTP)

Roofline analysis

Memory bandwidth ceiling (measured on Spark)

Bytes per token at decode (from safetensors index)

Roofline

Under the hood: how the CUDA backend works

Concurrency on `ds4-server` — single-stream, serialized

MTP (speculative decode) — broken on CUDA today; root cause known

What the throughput tables actually look like today

MTP via llama-benchy

Expected behaviour once the parity gap is closed

Quality checks (qualitative)

Repo layout

Reproducing the benchmarks

How this fits with related work

Acknowledgements

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ds4-on-spark

Quick start

Hardware requirements

What you get

Benchmarks

Uplift vs. upstream

Build + cold start (920f987 snapshot — pre-CUDA-graph)

Throughput sweep (ds4-bench, direct CLI, no HTTP)

llama-benchy-style numbers (HTTP)

Roofline analysis

Memory bandwidth ceiling (measured on Spark)

Bytes per token at decode (from safetensors index)

Roofline

Under the hood: how the CUDA backend works

Concurrency on ds4-server — single-stream, serialized

MTP (speculative decode) — broken on CUDA today; root cause known

What the throughput tables actually look like today

MTP via llama-benchy

Expected behaviour once the parity gap is closed

Quality checks (qualitative)

Repo layout

Reproducing the benchmarks

How this fits with related work

Acknowledgements

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Build + cold start _{(920f987 snapshot — pre-CUDA-graph)}

Throughput sweep (`ds4-bench`, direct CLI, no HTTP)

`llama-benchy`-style numbers (HTTP)

Concurrency on `ds4-server` — single-stream, serialized

Packages