metal: simdgroup MMA mini-GEMM for decode MoE [experimental] #306

Open
hexxyan wants to merge 1 commit into antirez:main from hexxyan:codex/decode-moe-minigemm

hexxyan commented May 31, 2026

Summary

Experimental optimization for the DeepSeek-V4 decode MoE path: replace scalar dot-product kernels with tiled simdgroup_multiply_accumulate mini-GEMM for the 6 selected experts.

New Metal kernels (metal/moe.metal):

  • kernel_mul_mm_selected_pair_swiglu — fused gate+up projection with shared RHS activation tile and SwiGLU activation. Template instantiations for Q2_K, Q4_K, IQ2_XXS × F32/F16.
  • kernel_mul_mm_selected_sum — down projection that accumulates all 6 experts directly into the output row, eliminating the separate sum kernel.

Host dispatch (ds4_metal.m):

  • Opt-in via DS4_METAL_DECODE_MOE_MINIGEMM=1 environment variable
  • Default decode path completely unchanged
  • Stage profiling via DS4_METAL_MOE_STAGE_PROFILE for A/B comparison
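The opt-in behavior described above can be pictured with a small C sketch. This is an illustration only: `flag_value_enabled`, `moe_decode_path`, and `choose_moe_decode_path` are hypothetical names, and treating only the exact string "1" as enabled is an assumption. The PR does not say how `ds4_metal.m` parses the variable.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: a flag counts as enabled only when its value is
 * exactly "1". Unset (NULL), empty, or any other value leaves it off. */
static bool flag_value_enabled(const char *value) {
    return value != NULL && strcmp(value, "1") == 0;
}

/* Hypothetical names for the two decode MoE paths. */
typedef enum { MOE_PATH_SCALAR, MOE_PATH_MINIGEMM } moe_decode_path;

/* Pick the decode path. The scalar path stays the default, so behavior
 * is unchanged unless the variable is explicitly set. */
static moe_decode_path choose_moe_decode_path(void) {
    const char *v = getenv("DS4_METAL_DECODE_MOE_MINIGEMM");
    return flag_value_enabled(v) ? MOE_PATH_MINIGEMM : MOE_PATH_SCALAR;
}
```

Reading the variable once at startup and caching the result would keep `getenv` out of the per-token hot path.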

Validation tool (tools/validate_metal.sh):

  • Offline Metal shader compilation check using xcrun metal
  • Mirrors the default source concatenation used by ds4_gpu_full_source() (base header + all .metal files)

Verification

  • C/ObjC build passes (make ds4)
  • ds4_test --server passes
  • test_q4k_dot passes
  • ds4-eval --self-test-extractors passes
  • Full concatenated Metal source (~9.6k lines) compiles cleanly (unused-function warnings only)
  • Apple Silicon runtime pipeline creation (pending)
  • Numerical correctness vs scalar path (pending)
  • tok/s benchmark, baseline vs mini-GEMM (pending)

How to Test

# Build
make ds4

# Validate Metal shaders offline
tools/validate_metal.sh

# Baseline (scalar path)
./ds4-server -m <model.gguf>

# Mini-GEMM prototype
DS4_METAL_DECODE_MOE_MINIGEMM=1 ./ds4-server -m <model.gguf>

# With stage profiling for A/B comparison
DS4_METAL_MOE_STAGE_PROFILE=1 ./ds4-server -m <model.gguf>
DS4_METAL_DECODE_MOE_MINIGEMM=1 DS4_METAL_MOE_STAGE_PROFILE=1 ./ds4-server -m <model.gguf>

Notes

This is an experimental prototype. The mini-GEMM path is disabled by default and only activates when the env var is set. No performance claims yet — Apple Silicon testing is needed to confirm whether the simdgroup MMA tiles outperform the existing fused scalar dot-product path for the decode MoE workload (6 experts × small batch).

The hypothesis is that for decode (small M, large K), simdgroup MMA tiles can better utilize the GPU matrix hardware than per-expert scalar dot products. But the tile overhead may dominate at small problem sizes, so empirical measurement is essential.
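As rough arithmetic behind the tile-overhead concern: Metal's simdgroup matrix operations work on 8×8 tiles, so a dimension of size m has to be padded up to a multiple of 8 before it can be fed to the MMA units. The sketch below computes the fraction of the padded dimension that holds real data. `tile_utilization` is a hypothetical helper, and since the PR does not state how its kernels lay out tiles, applying it to the token dimension with a tile size of 8 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of a tiled dimension that carries real data when m elements
 * are padded up to the next multiple of tile_dim. */
static double tile_utilization(size_t m, size_t tile_dim) {
    assert(m > 0 && tile_dim > 0);
    size_t padded = ((m + tile_dim - 1) / tile_dim) * tile_dim;
    return (double)m / (double)padded;
}
```

With m = 1 (a single decode token) and tile_dim = 8 this gives 0.125, and it reaches 1.0 only when m is a multiple of 8. That gap is one reason the measurement matters more than the hypothesis here.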

Experimental decode-path optimization: replace scalar dot-product MoE
with tiled simdgroup_multiply_accumulate mini-GEMM for the 6 selected
experts in DeepSeek-V4 MoE layers.

New kernels:
- kernel_mul_mm_selected_pair_swiglu: fused gate+up projection with
  shared RHS activation tile and SwiGLU, Q2_K/Q4_K/IQ2_XXS templates
- kernel_mul_mm_selected_sum: down projection accumulating all experts
  directly into the output row (eliminates separate sum6 kernel)

Host dispatch (ds4_metal.m):
- Opt-in via DS4_METAL_DECODE_MOE_MINIGEMM=1 env var
- Legacy path unchanged when env var is not set
- Stage profiling via DS4_METAL_MOE_STAGE_PROFILE for A/B comparison

Also adds tools/validate_metal.sh for offline Metal shader compilation
checking (requires Metal Toolchain component).

Status: compile-verified (C/ObjC + 9662 lines Metal shader pass).
Runtime correctness and tok/s benchmark pending Apple Silicon testing.
Not yet recommended for production use.

hexxyan commented May 31, 2026

@nhwaani would you be willing to give this experimental PR a quick spin on Apple Silicon?

This adds an opt-in decode MoE mini-GEMM path using simdgroup_multiply_accumulate. It is disabled by default, so the existing path should be unchanged.

The most useful checks would be:

  1. Does it build and run without crashing?

DS4_METAL_DECODE_MOE_MINIGEMM=1 ./ds4-server -m <model>

  2. Do you see any obvious numerical/output issues compared with the default path?

  3. If you have time, a quick tok/s comparison would be amazing:

./ds4-server -m <model>
DS4_METAL_DECODE_MOE_MINIGEMM=1 ./ds4-server -m <model>

Optional stage profile:

DS4_METAL_MOE_STAGE_PROFILE=1 ./ds4-server -m <model>
DS4_METAL_DECODE_MOE_MINIGEMM=1 DS4_METAL_MOE_STAGE_PROFILE=1 ./ds4-server -m <model>

No pressure at all. Thanks!
