I wonder if it's possible to disaggregate inference: use a DGX Spark to offload some or even all of the prefill, then a Mac for generation? The DGX Spark has a very fast GPU and the Mac has very fast memory, so offloading prefill to the former while keeping all generation on the latter could be a way to speed up large-context generation.
Edit: After some research and back-of-the-napkin calculations, here's what I've got.
Goal: use DGX Spark to run X% of DS4 Flash prefill layers, from partial layer offload up to 100% prefill, then have the Mac handle all decode/generation locally.
| Requirement | Likely status in DS4 |
|---|---|
| Same model weights, tokenizer, prompt template, context settings on Mac and DGX | ✅ DS4 is model-specific and uses its own DeepSeek V4 Flash GGUF layout. |
| CUDA backend on DGX Spark and Metal backend on Mac | ✅ DS4 targets Metal and has CUDA/DGX Spark work underway. |
| Distributed prefill by layer splitting | ✅ DS4 distributed mode splits layers across workers; each worker owns its layer slice and KV. |
| Activation transfer over network during prefill | ✅ DS4 sends activations worker-to-worker over TCP; default distributed prefill chunk is 4096 tokens. |
| Per-worker / per-layer KV ownership | ✅ DS4 docs say each worker maintains its own KV cache for assigned layers. |
| Compact KV/session state | ✅ DS4’s compressed KV and disk-persistent cache are core features. |
| Save/load session or layer payloads | ✅ DS4 exposes session/KV persistence and public API concepts around mutable session state; exact cross-backend layer-payload portability needs testing. |
| Move Spark-owned KV/cache back to Mac after prefill | ❌ Missing, but needed if Spark does any layer prefill but Mac alone decodes afterward. |
| Cross-backend cache compatibility: CUDA-produced cache usable by Metal | ⚠️ Key unknown. Same code/weights helps, but backend-neutral serialization must be verified. |
| Resume generation on Mac from merged Mac+Spark prefill state | ⚠️ Key missing/unknown workflow. Requires Mac to hold complete KV/session state for all layers before decode. |
| Efficient async transfer of Spark cache during/after chunks | ❌ Missing. Useful to avoid prefill stalls. |
| Fast wired connection | ✅ DGX Spark has 10GbE port and can be connected directly to Mac with a cat 6+ cable and Thunderbolt USB-C to Ethernet adapter. 2.5GbE is practical and inexpensive. Activation traffic plus cache return should be tens of seconds over a 250k prefill, not minutes, if implemented cleanly. |
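
For anyone who wants to redo the napkin math on the network side, here's a rough Python sketch of how I'd estimate it. Everything in it is an assumption on my part: the `OffloadPlan` fields, the one-contiguous-slice-per-machine split, the 0.8 link efficiency, and especially the model numbers in the demo (layer count, activation size, KV bytes per token per layer), which are placeholders and not DS4 Flash's real dimensions. Plug in measured values before trusting the result.

```python
from dataclasses import dataclass


@dataclass
class OffloadPlan:
    """Inputs for a rough estimate. Every value passed in is a placeholder."""
    n_layers: int                      # total layers in the model
    hidden_bytes_per_token: int        # bytes of one token's activation at a layer boundary
    kv_bytes_per_token_per_layer: int  # compressed KV bytes per token per layer
    prompt_tokens: int                 # prefill length in tokens
    spark_fraction: float              # share of layers prefilled on the Spark, 0.0 to 1.0
    link_gbps: float                   # nominal link rate in Gbit/s
    link_efficiency: float = 0.8       # assumed usable share of the nominal rate


def spark_layer_count(plan: OffloadPlan) -> int:
    """Number of layers the Spark would own for this split."""
    return round(plan.n_layers * plan.spark_fraction)


def activation_bytes(plan: OffloadPlan) -> int:
    """Activation traffic, assuming one contiguous layer slice per machine.

    A partial split has a single Mac/Spark boundary that every prompt token
    crosses once. With 0% or 100% on the Spark there is no layer boundary,
    so this model counts no activation traffic (token IDs are ignored).
    """
    owned = spark_layer_count(plan)
    crossings = 1 if 0 < owned < plan.n_layers else 0
    return plan.prompt_tokens * plan.hidden_bytes_per_token * crossings


def kv_return_bytes(plan: OffloadPlan) -> int:
    """KV the Spark must send back so the Mac holds every layer before decode."""
    return (plan.prompt_tokens
            * plan.kv_bytes_per_token_per_layer
            * spark_layer_count(plan))


def transfer_seconds(plan: OffloadPlan) -> float:
    """Time to move activations plus returned KV, with no overlap or compression."""
    total_bits = (activation_bytes(plan) + kv_return_bytes(plan)) * 8
    usable_bits_per_second = plan.link_gbps * 1e9 * plan.link_efficiency
    return total_bits / usable_bits_per_second


if __name__ == "__main__":
    # Placeholder model numbers, chosen only to exercise the functions.
    for gbps in (10.0, 2.5):
        plan = OffloadPlan(
            n_layers=40,
            hidden_bytes_per_token=8192,
            kv_bytes_per_token_per_layer=1024,
            prompt_tokens=250_000,
            spark_fraction=0.5,
            link_gbps=gbps,
        )
        print(f"{gbps} GbE: {transfer_seconds(plan):.1f} s on the wire")
```

This treats the transfer as purely serial, so it's a pessimistic bound if the cache return can be overlapped with later prefill chunks (the async-transfer row above), and it ignores TCP and serialization overhead beyond the single efficiency factor.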
I'll do further research, but any insights into the unknowns above are appreciated in the meantime.