Offload prefill entirely to CUDA and use Apple Silicon for generation #304

@lobanov

I wonder if it's possible to disaggregate inference: use a DGX Spark to handle some or even all of the prefill, then a Mac for generation. The DGX Spark has a very fast GPU and the Mac has very fast memory, so offloading prefill to the former while keeping all generation on the latter could be a way to speed up large-context generation.

Edit: After some research and back-of-the-napkin calculations, here's what I've got.

Goal: use DGX Spark to run X% of DS4 Flash prefill layers, from partial layer offload up to 100% prefill, then have the Mac handle all decode/generation locally.

| Requirement | Likely status in DS4 |
| --- | --- |
| Same model weights, tokenizer, prompt template, context settings on Mac and DGX | ✅ DS4 is model-specific and uses its own DeepSeek V4 Flash GGUF layout. |
| CUDA backend on DGX Spark and Metal backend on Mac | ✅ DS4 targets Metal and has CUDA/DGX Spark work underway. |
| Distributed prefill by layer splitting | ✅ DS4 distributed mode splits layers across workers; each worker owns its layer slice and KV. |
| Activation transfer over network during prefill | ✅ DS4 sends activations worker-to-worker over TCP; default distributed prefill chunk is 4096 tokens. |
| Per-worker / per-layer KV ownership | ✅ DS4 docs say each worker maintains its own KV cache for assigned layers. |
| Compact KV/session state | ✅ DS4’s compressed KV and disk-persistent cache are core features. |
| Save/load session or layer payloads | ✅ DS4 exposes session/KV persistence and public API concepts around mutable session state; exact cross-backend layer-payload portability needs testing. |
| Move Spark-owned KV/cache back to Mac after prefill | ❌ Missing, but needed if Spark prefills any layers and the Mac alone decodes afterward. |
| Cross-backend cache compatibility: CUDA-produced cache usable by Metal | ⚠️ Key unknown. Same code/weights helps, but backend-neutral serialization must be verified. |
| Resume generation on Mac from merged Mac+Spark prefill state | ⚠️ Key missing/unknown workflow. Requires the Mac to hold complete KV/session state for all layers before decode. |
| Efficient async transfer of Spark cache during/after chunks | ❌ Missing. Useful to avoid prefill stalls. |
| Fast wired connection | ✅ DGX Spark has a 10GbE port and can be connected directly to a Mac with a Cat 6+ cable and a Thunderbolt USB-C to Ethernet adapter. 2.5GbE is practical and inexpensive. Activation traffic plus cache return should take tens of seconds over a 250k prefill, not minutes, if implemented cleanly. |
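For anyone who wants to redo the napkin math on the wired-connection row, here is a minimal Python sketch of the estimate. Everything numeric in it is a placeholder assumption, not a DS4 or DeepSeek V4 Flash value: the hidden size, bytes per activation, compressed KV bytes per token per layer, number of Spark-owned layers, and link efficiency all need to be replaced with real figures. It also assumes activations cross the Spark-to-Mac boundary once per token and that the Spark's KV is shipped back once, uncompressed on the wire.

```python
def transfer_seconds(num_bytes: float, link_gbps: float, efficiency: float = 0.9) -> float:
    """Seconds to move num_bytes over a link rated at link_gbps gigabits/s.

    efficiency is an assumed fraction of line rate left after protocol overhead.
    """
    return num_bytes * 8 / (link_gbps * 1e9 * efficiency)


def prefill_traffic_bytes(tokens, hidden_size, act_bytes, boundary_crossings,
                          kv_bytes_per_token_per_layer, spark_layers):
    """Return (activation_bytes, kv_return_bytes) for one full prefill.

    activation_bytes: hidden states sent across each worker boundary.
    kv_return_bytes:  Spark-owned KV shipped back to the Mac for decode.
    """
    activation_bytes = tokens * hidden_size * act_bytes * boundary_crossings
    kv_return_bytes = tokens * kv_bytes_per_token_per_layer * spark_layers
    return activation_bytes, kv_return_bytes


if __name__ == "__main__":
    # Placeholder inputs only; none of these are measured DS4 numbers.
    acts, kv = prefill_traffic_bytes(
        tokens=250_000,
        hidden_size=4096,                   # assumed
        act_bytes=2,                        # assumed 16-bit activations
        boundary_crossings=1,               # one Spark -> Mac hand-off
        kv_bytes_per_token_per_layer=1024,  # assumed compressed KV size
        spark_layers=20,                    # assumed layers owned by Spark
    )
    for gbps in (2.5, 10.0):
        total = transfer_seconds(acts + kv, gbps)
        print(f"{gbps:>4} GbE: {total:.1f} s for {(acts + kv) / 1e9:.2f} GB")
```

The point of splitting activations from KV return is that the two scale differently: activation traffic depends on how many worker boundaries there are, while the cache return grows with the number of layers the Spark owns, so a 100% Spark prefill is the worst case for the return trip.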

I'll do further research, but any insights into the unknowns above are appreciated in the meantime.
