[MoE] CuteDSL MoE with Nvfp4 DeepEP dispatch #27141
Conversation
Signed-off-by: Shu Wang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Shu Wang. <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
def has_flashinfer_cutedsl_grouped_gemm_nt_masked() -> bool:
    """Return ``True`` if the FlashInfer CuteDSL grouped GEMM path is available."""
    if not has_flashinfer_cutedsl():
        return False

    # Check if all required functions are available
    required_functions = [
        ("flashinfer.cute_dsl.blockscaled_gemm", "grouped_gemm_nt_masked"),
        ("flashinfer", "scaled_fp4_grouped_quantize"),
        ("flashinfer", "silu_and_scaled_nvfp4_experts_quantize"),
    ]

    for module_name, attr_name in required_functions:
        mod = _get_submodule(module_name)
        if not mod or not hasattr(mod, attr_name):
            return False
    return True
Fix typo in CuteDSL availability check
The new has_flashinfer_cutedsl_grouped_gemm_nt_masked guard always returns False because the third required symbol is spelled "silu_and_scaled_nvfp4_experts_quantize", but every other place in this commit (and in the FlashInfer API) refers to silu_and_mul_scaled_nvfp4_experts_quantize. As written the attribute lookup will fail even when the kernel is correctly installed, so the capability probe disables the entire CuteDSL path and the nvfp4 DeepEP dispatch can never be selected. Please rename the checked attribute to match the actual import.
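The failure mode described above — one misspelled attribute name silently disabling the whole capability — can be reproduced with a small, self-contained sketch of the same probe pattern. The helpers here (`_get_submodule`, `probe`) are illustrative stand-ins for vLLM's internals, not the actual implementation, and standard-library modules substitute for `flashinfer`:

```python
import importlib


def _get_submodule(module_name: str):
    """Import a module by dotted path, returning None if it is unavailable."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def probe(required: list) -> bool:
    """Return True only if every (module, attribute) pair resolves."""
    for module_name, attr_name in required:
        mod = _get_submodule(module_name)
        if mod is None or not hasattr(mod, attr_name):
            return False
    return True


# A correct spelling passes; a single typo in one attribute name
# makes the entire probe report the feature as unavailable.
print(probe([("math", "sqrt")]))   # True
print(probe([("math", "sqrtt")]))  # False: misspelled attribute
```

This is why the typo matters even though the kernel is installed: `hasattr` fails on the misspelled name and the probe returns `False` for the whole feature.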
Purpose
Adds dispatch with nvfp4 in DeepEP low-latency mode; the dispatch fuses quantization into the dispatch step.
Depends on deepseek-ai/DeepEP#341 and #25990; should be rebased after #25990 is merged.
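For intuition about the quantization half of that fusion: nvfp4-style formats quantize values in small blocks, each sharing one scale factor. The sketch below is illustrative only — pure Python, a hypothetical block size, and a coarse integer grid standing in for the real e2m1 code set; the actual kernel performs this math fused into the DeepEP send path on the GPU:

```python
# Illustrative sketch only: per-block scaled "fp4-style" quantization.
# BLOCK, FP4_MAX, and all helper names are assumptions for exposition,
# not the real nvfp4 implementation fused into DeepEP dispatch.

BLOCK = 16          # elements sharing one scale factor (hypothetical)
FP4_MAX = 6.0       # largest magnitude representable in e2m1 fp4


def quantize_block(values):
    """Quantize one block to small integer codes plus one shared scale."""
    amax = max(abs(v) for v in values)
    scale = (amax / FP4_MAX) if amax else 1.0
    # Round onto a coarse integer grid standing in for the e2m1 code set.
    codes = [int(max(-FP4_MAX, min(FP4_MAX, round(v / scale)))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    return [c * scale for c in codes]


block = [0.1 * i for i in range(BLOCK)]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
# Round-trip error is bounded by half a scale step on this coarse grid.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(block, restored))
```

Fusing this rounding into the dispatch kernel avoids materializing the quantized tensor in a separate pass before the all-to-all send.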
Test Plan
NVFP4 dispatch:
VLLM_DEEPEPLL_NVFP4_DISPATCH=1 \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_USE_STANDALONE_COMPILE=0 \
VLLM_FLASHINFER_MOE_BACKEND="cutedsl" \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_ALL2ALL_BACKEND="deepep_low_latency" \
lm_eval --model vllm --model_args pretrained=nvidia/DeepSeek-R1-0528-FP4,data_parallel_size=4,enable_expert_parallel=True,tensor_parallel_size=1,enforce_eager=True,max_model_len=2048 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
Test Result
BF16 dispatch:
with
VLLM_DEEPEPLL_NVFP4_DISPATCH=0