Skip to content

Entrpi/ds4-on-spark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ds4-on-spark

antirez/ds4 (DwarfStar 4) running on a single NVIDIA DGX Spark (GB10 / SM121, 128 GiB unified memory), with measured benchmarks and a roofline analysis grounded in the hardware ceiling.

Status: Working end-to-end. Single-prompt smoke test passes; ds4's decode runs at ~70–75 % of the memory-bandwidth roofline for this quant on this hardware, with real (non-bandwidth) headroom remaining. MTP speculative decode is shipped by the donor but produces no speedup on CUDA today — root cause traced to a quant-format gap in one CUDA kernel that silently rejects the MTP draft's Q4_K experts. The fix is ~700–900 LOC in ds4_cuda.cu, scoped in docs/MTP_PARITY_GAP.md. The Metal backend is unaffected.

  • Reference: antirez/ds4 — MIT-licensed C+CUDA inference engine. CUDA backend landed 2026-05-11; this writeup uses HEAD 920f987 as of 2026-05-12.
  • Model: antirez/deepseek-v4-gguf — 81 GiB asymmetric quant: IQ2_XXS for routed-expert gate/up, Q2_K for routed-expert down (these dominate model bytes), Q8_0 for everything else dense (shared expert, attention projections, output head, router), F16 for LoRA matrices and the compressor/indexer, F32 norms. (FP8 in ds4 is a runtime KV-cache quantization — E4M3FN round-trip — not a stored weight format.) Plus an optional 3.6 GiB MTP draft GGUF.
  • Hardware: NVIDIA DGX Spark, GB10, SM121, 128 GiB LPDDR5X unified. The donor's Makefile has a make cuda-spark target that builds native sm_121, plus make cuda CUDA_ARCH=sm_NNN for an explicit override — both GB10-correct with no patches needed. (Building with an empty -arch measured ~25% slower prefill on GB10, so the explicit arch matters.)

Quick start

On a DGX Spark with CUDA 13 installed:

curl -sSL https://raw.githubusercontent.com/entrpi/ds4-on-spark/main/install.sh | bash -s -- --with-mtp --start

That one command:

  1. Verifies the host (aarch64, GB10/SM121, CUDA 13, ≥110 GiB free disk).
  2. Clones antirez/ds4 into ~/code/ds4 (or $DS4_SRC_DIR).
  3. Builds ds4, ds4-server, ds4-bench with CUDA_ARCH=sm_121 in ~8 s.
  4. Downloads the Q2 GGUF (~81 GiB) and the MTP GGUF (~3.6 GiB) from antirez/deepseek-v4-gguf into ~/gguf (or $DS4_GGUF_DIR).
  5. Runs the "capital of France" smoke test and asserts "Paris" in the output.
  6. Starts ds4-server on :8000 with -c 32768.

To preview without running:

curl -sSL https://raw.githubusercontent.com/entrpi/ds4-on-spark/main/install.sh | bash -s -- --help

Common overrides: --cuda-arch sm_120 (datacenter Blackwell), --no-download (reuse existing GGUF), --src-dir, --gguf-dir, --ctx, --port, --force (skip host check).

Hardware requirements

Validated on NVIDIA DGX Spark (GB10, SM121, 128 GiB unified)
Likely to work other Blackwell with --cuda-arch sm_120, untested
CUDA toolkit 13.x (we tested 13.0.88)
Disk ≥110 GiB free for the GGUFs
OS aarch64 Linux (Grace)
RAM (system, unified) 128 GiB is enough for the model + ~250 MB KV @ 16k context

GB10 is detected via nvidia-smi --query-gpu=compute_cap returning 12.1. Anything else gets a warning + --force override path.

What you get

Binary Purpose
ds4 Interactive / one-shot CLI
ds4-server OpenAI v1-compatible HTTP server (POST /v1/chat/completions, SSE streaming)
ds4-bench Direct prefill + decode throughput sweep (no HTTP)

ds4-server is the recommended runtime. It exposes:

  • POST /v1/chat/completions (OpenAI-compatible streaming, tool calls)
  • POST /v1/completions
  • GET /v1/models

It also speaks Anthropic-shape on /v1/messages (see donor README).

Benchmarks

Hardware: a single DGX Spark — GB10, sm_121, compute_cap=12.1, CUDA 13.0.88.

The ds4-bench throughput sweep below was refreshed 2026-05-21 against ds4 at 1c4c5f0 (PR-prep branch: mmq Q8_0 dispatch + in-process VMM weight arena + per-layer CUDA-graph decode capture, the last on by default). The build/cold-start and llama-benchy HTTP numbers are from the earlier 920f987 (2026-05-12) snapshot and predate the CUDA-graph work.

Uplift vs. upstream

GB10 DGX Spark throughput, upstream antirez/ds4 vs the ds4-on-spark branch: prefill 1.16x to 1.09x faster, generation +12% to +13% faster, gains across the full 2k-64k context sweep

Same model, same prompt corpus, both sides built at sm_121 and benched on the same GB10 on 2026-05-21. The branch (mmq Q8_0 dispatch + in-process VMM weight arena + per-layer CUDA-graph decode capture) is 1.16× → 1.09× faster prefill and +12% → +13% faster decode than upstream antirez/ds4 (a365e44) across the full 2k–64k context sweep.

Build + cold start (920f987 snapshot — pre-CUDA-graph)

Step Time
make -j20 CUDA_ARCH=sm_121 7.9 s
Cold load: 80.76 GiB of tensors → GPU cache ~20 s
Time-to-first-token (cold process, 12-token prompt) ~21 s

After cold start, all subsequent benchmarks here are on a warm process.

Throughput sweep (ds4-bench, direct CLI, no HTTP)

ds4 at 1c4c5f0, imatrix Q2 GGUF, --gen-tokens 128, layer-graph decode capture on (default). Prefill is measured on a fixed 2,048-token chunk and is prompt-sensitive, so the corpus is named: rendered_prompts_nothink.txt. ctx 2k–64k:

ctx prefill t/s decode t/s KV size
2,048 458.3 15.37 52 MB
8,192 407.2 15.24 137 MB
16,384 392.5 14.99 250 MB
24,576 379.5 14.64 362 MB
32,768 367.8 14.11 475 MB
40,960 344.7 13.86 588 MB
49,152 333.6 13.66 701 MB
57,344 322.0 13.33 813 MB
65,536 312.2 13.00 926 MB
  • Prefill ~310–460 t/s across 2k → 64k, tapering smoothly with context.
  • Decode ~13–15 t/s, ~15 % falloff from 2k to 64k.
  • Per-layer CUDA-graph decode capture (on by default) contributes +5 → +10 % of the decode rate vs. the eager path, the gain widening with context.
  • KV stays compact — 926 MB at 64k — compressed KV doing its job.

llama-benchy-style numbers (HTTP)

Refreshed 2026-05-21 against ds4 at 1c4c5f0, imatrix Q2 GGUF, via eugr/llama-benchy 0.3.8 through ds4-server's OpenAI endpoint — --pp 2048 --tg 32 128 512 --depth 0 4096 16384, 3 runs each. This llama-benchy release reports tg as a near end-to-end decode rate, so the HTTP figures below line up with the direct-CLI decode sweep above (~15 t/s); earlier releases printed a steady-state-only tg roughly 2× higher.

test t/s peak t/s ttfr (ms)
pp2048 (prefill) 449.1 4791
tg32 @ d=0 16.76 ± 0.26 18.0
tg128 @ d=0 15.69 ± 0.03 18.7
tg512 @ d=0 15.47 ± 0.01 18.3
pp2048 @ d=4k 416.4 15387
tg32 @ d=4k 19.75 ± 2.66 21.8
tg128 @ d=4k 15.52 ± 0.08 18.3
tg512 @ d=4k 15.35 ± 0.05 18.0
pp2048 @ d=16k 401.8 47505
tg32 @ d=16k 16.42 ± 0.49 17.0
tg128 @ d=16k 15.31 ± 0.34 17.3
tg512 @ d=16k 14.95 ± 0.02 18.0

Prefill holds ~400–450 t/s and decode ~15–16 t/s across 0–16k depth, consistent with the direct-CLI sweep. (tg32 is the noisiest row — only 32 tokens, so first-token setup still skews its mean and variance.)

Reproduce:

scripts/run-bench.sh --pp 2048 --tg 32 128 512 --depth 0 4096 16384

Roofline analysis

How far below the hardware ceiling is ds4 running?

Memory bandwidth ceiling (measured on Spark)

Probe Bandwidth Note
nvbandwidth H2D / D2H CE 59 GB/s Copy-engine path, not relevant for kernels
nvbandwidth device_local_copy 111 GB/s CE on single device
bench/bw_bench.cu copy (R+W) 215 GB/s Kernel-driven, what matters
bench/bw_bench.cu read-only 227 GB/s Pure read throughput
Published GB10 LPDDR5X peak ~273 GB/s 256-bit × 9400 MT/s theoretical

The kernel-effective ~225 GB/s is the relevant ceiling — ~82 % of theoretical peak, normal for real workloads on LPDDR.

Bytes per token at decode (from safetensors index)

Aggregated across all 17 shards (88.4 GB total):

Bucket Total bytes Active per token
Routed experts (IQ2_XXS + Q2_K) 78.28 GB 6/256 active → 1.83 GB
MLA attention + indexer + compressor 7.05 GB all active
Embed / head / final norm 2.12 GB ~1.0 GB (head projection)
Shared expert (1 per MoE layer) 0.74 GB 0.74 GB
MTP + HC + other 0.30 GB ~0.22 GB
KV cache reads (at 16k) ~0.25 GB

Effective bytes per token at steady state: ~11 GB — the sum of the Active-per-token column, derived bottom-up from the model index, independent of any timing measurement.

Roofline

Bytes/token is the bottom-up figure above; the decode rate is measured. The two are independent — not cross-derived. (An earlier draft computed one from the other and reported the circular result as "95 % saturated".)

Quantity Value
Kernel-effective bandwidth 225 GB/s
Effective bytes per token (bottom-up) ~11 GB
Bandwidth roofline (BW ÷ bytes-per-token) 225 / 11 ≈ 20.5 t/s
Measured decode, 0–16k ctx (ds4-bench + llama-benchy) ~15 t/s
Decode efficiency ~73 % of the bandwidth roofline

Decode runs at roughly three-quarters of the bandwidth roofline — close, but not saturated. Around 1.4× of headroom sits between the measured ~15 t/s and the ~20 t/s ceiling, and that gap is not bandwidth: it is launch overhead, kernel-occupancy gaps, and non-overlapped work. The per-layer CUDA-graph decode capture this branch adds (see Benchmarks) reclaims part of it — precisely the launch-overhead component. Closing the rest is kernel-scheduling work, not a hardware wall.

Beyond the roofline itself, going faster needs a tighter quant (FP4 / 1.5-bit experts) or batched serving (amortise weight reads across users).

The bytes-per-token figure is a bottom-up estimate; ±20 % on it swings the efficiency to ~60–90 %. The qualitative result — real, non-bandwidth headroom — holds across that whole range.

Under the hood: how the CUDA backend works

A side-by-side analysis of ds4's two GPU backends — docs/METAL_VS_CUDA.md — covers the kernel surface, the command lifecycle, and the model-attach strategy on each platform. TL;DR for someone running on Spark and asking what is the implementation actually doing?:

  • ds4_cuda.cu is 9,666 LOC, 106 __global__ kernels, links -lcudart -lcublas. All compiled ahead of time by nvcc for the target CUDA_ARCH — the binary is not portable across SM generations.
  • Three-tier model attach. cudaHostRegister(... cudaHostRegisterMapped | ReadOnly) on the mmap'd 80 GiB GGUF is tried first to get a zero-copy device pointer. If pinning fails (or DS4_CUDA_COPY_MODEL is set), the engine falls back to per-range pinning, then to chunked cudaMalloc + cudaMemcpy in 64 MiB chunks. This is what the ~20 s cold load is.
  • Q8 → F16 weight cache for prefill. On startup, dense Q8_0 weights are dequantised once on-device into an F16 buffer; cublasGemmEx then uses tensor cores for multi-token prefill matmuls. That's why prefill is ~310–460 t/s while decode is ~15 t/s — they take different routes through the matmul stack. Decode (n_tok=1) skips cuBLAS and uses hand-written Q8_0 matvecs where the cuBLAS launch overhead wouldn't amortize.
  • Routed experts stay quantised. IQ2_XXS / Q2_K kernels dequantise inline on every expert dot; the codebook lives in __constant__ memory via ds4_iq2_tables_cuda.inc. Pre-converting all 256 experts to F16 would erase the q2 memory win.
  • Default stream, serial execution. begin_commands is a no-op; flush_commands, end_commands, and synchronize all reduce to cudaDeviceSynchronize(). Two named streams (g_model_prefetch_stream, g_model_upload_stream) exist only for async model staging at startup. Combined with the engine's single-session worker thread, this is why ds4-server serialises concurrent clients (see next section).
  • No GDS / cuFile. Direct file reads (via ds4_gpu_set_model_fd) use Linux O_DIRECT on a registered FD — kernel DMA, not GPU-side DMA.

If you're considering writing a port, fork, or alternative serving layer, the analysis doc lays out the kernel surface, the DS4_CUDA_* env-var knobs, and the places where the Metal and CUDA backends diverge structurally (model mapping, command-buffer batching, library use).

Concurrency on ds4-server — single-stream, serialized

The OpenAI v1 server does not actually parallelize concurrent requests. When llama-benchy hits it with --concurrency 2, ds4-server processes the requests strictly sequentially: 168 s per request, second one waits for the first to finish before starting prefill.

llama-benchy concurrency observed wall-clock behaviour
1 one request at a time (expected)
2 also one at a time — server queues, doesn't batch

That means t/s (total) at c>1 in llama-benchy's output is misleading for this engine: it's c × per-request t/s, but the wall time is also c × single-request wall time. If you need many concurrent users on one Spark you need a different runtime (vLLM/SGLang with paged-attention batching). ds4 is single-session by design.

MTP (speculative decode) — broken on CUDA today; root cause known

The donor ships --mtp <draft.gguf> --mtp-draft N. The MTP support GGUF is a separate 3.6 GiB file (DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf). The donor README labels MTP as "alpha quality / experimental."

On the CUDA backend on Spark today, MTP produces no speedup because the MTP draft kernel never produces a token. Empirically traced and documented in detail in docs/MTP_PARITY_GAP.md; the headline is:

  • routed_moe_launch in ds4_cuda.cu:8849 hard-codes gate_type == 16u (IQ2_XXS) && down_type == 10u (Q2_K) and returns failure for any other combination.
  • The MTP draft GGUF uses Q4_K (type 12) for its routed expert tensors.
  • Every MTP draft attempt silently fails inside this kernel; the C-side speculative state machine treats that as "no draft available" and commits one token per cycle — indistinguishable from non-MTP decode.
  • The Metal backend has the parity dispatch (g_moe_mul_mv_id_q4_k_pipeline, metal/moe.metal:413 / :831); MTP works there.

Reproduce in one line:

DS4_MTP_PROBE=1 ./ds4 --cuda -m … --mtp … --mtp-draft 2 --temp 0 --nothink \
  -p "List 20 prime numbers" 2>&1 | grep "mtp probe draft failed" | wc -l
# Prints 58 — one failure per generation step.

The fix is scoped at ~700–900 LOC in a single file (ds4_cuda.cu); no changes needed to the C-side state machine, MTP weight binding, batched verifier, or KV/raw-cache plumbing — those are quant-agnostic and already work. See docs/MTP_PARITY_GAP.md for the full handoff: empirical chain, Metal reference, implementation order, validation plan, effort estimate.

What the throughput tables actually look like today

Because the MTP path is a no-op on CUDA, the numbers below are "target-model decode with extra startup cost for loading the MTP GGUF."

Measured at draft=2 against matching no-MTP baselines, four high-predictability prompts (ds4 CLI, first-token-inclusive):

Prompt no-MTP t/s MTP-2 t/s Δ
Count 1 → 60 (fully deterministic) 15.13 14.27 −5.7 %
English / NATO / Greek alphabets 15.02 14.32 −4.7 %
Declaration of Independence + 10 Presidents 14.64 14.37 −1.8 %
27 EU capitals alphabetical 14.92 14.48 −2.9 %
mean 14.93 14.36 −3.8 %

The 3–6 % regression is the MTP support model's per-request setup cost (loading and binding the extra 3.6 GiB GGUF, allocating MTP raw-cache tensors), paid on every request, with no speculative gain to offset it. Three separate --mtp-draft values (1, 2, 4) all land within noise:

Config decode t/s (QuickSort prompt)
no MTP 13.5 (bench) / 14.88 (is_prime)
--mtp-draft 1 13.81
--mtp-draft 2 13.62
--mtp-draft 4 13.63

Counting 1→60 is fully deterministic — every next token is forced — so MTP acceptance rate should be ~100 % if MTP were producing drafts. That it's still slightly slower confirms the path is doing setup work and emitting no drafts. Consistent with DS4_MTP_PROBE=1 showing 100 % draft-kernel failure.

MTP via llama-benchy

Stale — pending re-bench. The llama-benchy figures in this subsection and the next predate the tg-definition change covered in the Benchmarks section (the older release over-reported decode ~2×), and the "~95 % roofline" premise they lean on is superseded by the Roofline analysis above. The MTP finding — zero accepted drafts, no measurable speedup — is unaffected; only the absolute t/s and that premise need redoing.

Same hardware, three runs each, d=8192 tg=512:

Config tg512 @ d=8192 t/s peak t/s prefill t/s
no MTP 22.14 ± 2.57 28.33 ± 1.25 328.09 ± 0.51
MTP draft=2 23.61 ± 0.48 28.33 ± 0.47 328.49 ± 0.68
Δ +6.6 % (within noise) identical identical

Peak t/s is bit-identical (28.33 in both runs). Because the CUDA MTP path produces zero accepted drafts, the two configurations are running the same target-decode kernels at the same rate; the small mean delta and tighter variance are setup-cost shadow plus run-to-run noise, not a speculative-decode effect.

Expected behaviour once the parity gap is closed

With the CUDA Q4_K MoE kernel in place, MTP-2 on DSv4-Flash should deliver a steady-state lift comparable to what vLLM+FlashInfer delivers on Qwen3.5-122B-A10B on the same GB10 hardware: 28.3 → 38.4 t/s (+35.7 %) from MTP-2 alone, up to 51 t/s (+80 %) stacked with other optimisations. DSv4 starts from a higher bandwidth-roof saturation (~95 %), so the MTP gain there will come from FLOPs hidden behind shared weight reads in the batched 2-row verifier rather than from leftover bandwidth — but the absolute number should land in the 35–50 t/s range. See docs/MTP_PARITY_GAP.md §1.2 and §9 for the full argument.

Quality checks (qualitative)

ds4 produces clean output on first try across several probe types:

Factual recall — "What is the capital of France?" → "The capital of France is Paris." ✓

Code + reasoning — "is_prime(n) with 6k±1 optimization, list primes 100-130" →

def is_prime(n):
    if n <= 1: return False
    if n <= 3: return True
    if n % 2 == 0 or n % 3 == 0: return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

Prime numbers listed: 101, 103, 107, 109, 113, 127 — all six correct.

Long-form structured — "Explain QuickSort with worked example [38, 27, 43, 3, 9, 82, 10], full recursion, complexity analysis, optimizations" — clean multi-section response with correct partitioning steps and complexity bounds.

Repo layout

install.sh                  One-shot installer (curl | bash | --help)
scripts/
  smoke-test.sh               First-token sanity check
  start-server.sh             Start ds4-server (idempotent, with-MTP flag)
  run-bench.sh                Run llama-benchy via uvx
bench/
  bw_bench.cu                 Kernel-side memory-bandwidth probe
docs/
  STRATEGIC_CHECKPOINT.md     Detailed analysis of how this benchmark
                              affects the "should we keep porting?" decision
  METAL_VS_CUDA.md            Side-by-side comparison of ds4_metal.m and
                              ds4_cuda.cu — kernel surface, command lifecycle,
                              model attach, quantisation, and where each
                              backend's design diverges
  MTP_PARITY_GAP.md           Root-cause of "MTP gives no speedup on CUDA":
                              one quant-format gap in ds4_cuda.cu's MoE
                              kernel. Empirical chain, Metal reference,
                              ~700-900 LOC fix scope, validation plan.

Reproducing the benchmarks

# 1. Build + download + smoke test
./install.sh --with-mtp

# 2. Start the server
./scripts/start-server.sh --port 8000 --ctx 32768

# 3. Run llama-benchy (installs uvx automatically if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh   # one-time
./scripts/run-bench.sh                            # default sweep
./scripts/run-bench.sh --depth 0 4096 16384 32768 --tg 32 128 512

# 4. Measure raw memory bandwidth ceiling
/usr/local/cuda/bin/nvcc -O3 -arch=sm_121 bench/bw_bench.cu -o /tmp/bw_bench
/tmp/bw_bench 8192

# 5. Compare MTP vs no-MTP yourself
./scripts/start-server.sh --port 8000 --with-mtp --draft 2
./scripts/run-bench.sh --depth 0 --tg 128 --runs 3

How this fits with related work

Piece Role
antirez/ds4 The C+CUDA inference engine itself — narrow, DSv4-Flash-only by design
antirez/deepseek-v4-gguf The Q2/Q4 GGUFs ds4 is designed to consume
this repo Spark-specific install + benchmark + analysis layer on top of the donor
Entrpi/ds4-spark-vllm Alternative path: same model via vLLM. Different perf profile, more flexible serving, larger surface area.
eugr/llama-benchy The benchmark methodology used here — generic, OpenAI v1, comparable to llama-bench / vllm bench

Acknowledgements

  • antirez/ds4 — the inference engine and the 2-bit recipe. MIT-licensed.
  • llama.cpp and GGML — the GGUF ecosystem, quant formats, and engineering knowledge ds4 stands on.
  • deepseek-ai — DeepSeek-V4-Flash upstream weights and architecture.
  • eugr/llama-benchy — benchmark methodology and tooling.

License

MIT.

About

antirez/ds4 (DwarfStar 4) on NVIDIA DGX Spark — install, benchmarks, and roofline analysis. Steady-state decode at ~95% of bandwidth ceiling; MTP and concurrency analyzed.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors