[Kernel] Integrate CUTLASS MoE kernel with PPLX #18762
Conversation
NUM_EXPERTS = [40, 64]
TOP_KS = [6, 8]

P = ParamSpec("P")
We should probably put all these multiprocess utilities in a separate file now since they are also used by test_pplx_moe.py
    (lambda x: x is None or not ops.cutlass_group_gemm_supported(x.to_int()))(
        current_platform.get_device_capability()),
    reason="Grouped gemm is not supported on this GPU type.")
def test_cutlass_moe_pptx(
Typo: pptx -> pplx
@@ -812,6 +847,7 @@ def __init__(
        assert quant_method is not None
        assert isinstance(quant_method, FusedMoEMethodBase)
        self.quant_method = quant_method
+       self.quant_method.moe = moe
This seems a bit sketchy to me. If the quant_method needs a MoEConfig, it should be part of the constructor or passed around as an argument.
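A minimal sketch of the constructor-based alternative (only FusedMoEMethodBase and MoEConfig come from the diff; everything else is hypothetical):

# Hypothetical sketch: make the MoEConfig an explicit constructor dependency
# instead of patching it onto the quant method after construction.
class FusedMoEMethodBase:
    def __init__(self, moe: "MoEConfig"):
        self.moe = moe  # the config is now a visible, required dependency

# The layer would then build the quant method with the config it already has:
# quant_method = SomeFp8MoEMethod(moe)   # placeholder subclass name
# self.quant_method = quant_method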
@@ -140,7 +153,7 @@ def finalize(
        topk_weights = torch.ones_like(topk_weights)

        self.a2a.combine(out_tokens=output,
-                        indices=topk_ids,
+                        indices=topk_ids.to(torch.uint32),
Ditto using indices_type
@@ -110,9 +121,11 @@ def prepare(
            out_expert_x_scale=expert_x_scale,
            dp_x=a1q,
            dp_x_scale=a1q_scale,
-           indices=rank_topk_ids,
+           indices=rank_topk_ids.to(torch.uint32),
You shouldn't need this cast anymore. The type of the topk_ids can be controlled by passing torch.uint32 via indices_type to select_experts.
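A small sketch of what the call sites could look like, assuming the indices_type argument on select_experts described above (the other argument names are placeholders):

import torch

# Hypothetical: request uint32 expert indices from the router up front ...
topk_weights, topk_ids = select_experts(   # assumed to accept indices_type, per the comment above
    hidden_states, router_logits, top_k=top_k,
    indices_type=torch.uint32)

# ... so dispatch()/combine() can consume topk_ids directly, with no per-call
# .to(torch.uint32) cast.
assert topk_ids.dtype == torch.uint32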
        if expert_x_scale is not None:
            expert_x_scale = expert_x_scale[:, :, 0:1]
Is this the same as expert_x_scale.view(-1, -1, 1)?
No, this is taking only one slice from expert_x_scale's last dim. This is related to the scale format required by dispatch().
Can you elaborate on that? I copied the setup of expert_x_scales directly from pplx's test_all_to_all.py test, so I assumed it would have the proper format already.
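For reference, a tiny standalone example of why the slice is not a reshape:

import torch

expert_x_scale = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)

sliced = expert_x_scale[:, :, 0:1]   # shape (2, 3, 1): keeps only the first scale along the last dim
print(sliced.shape)                  # torch.Size([2, 3, 1])

# A view/reshape cannot drop the other 18 elements; it only reinterprets the
# same storage (and torch.Tensor.view allows at most one -1 dimension anyway).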
        if self.per_act_token:
            repeat_rows = 1
            a1q, a1q_scale = moe_kernel_quantize_input(a1, None,
                                                       self.quant_dtype,
                                                       self.per_act_token,
                                                       self.block_shape)
        else:
            repeat_rows = a1.shape[0]
            a1q, a1q_scale = moe_kernel_quantize_input(a1, a1_scale,
                                                       self.quant_dtype,
                                                       self.per_act_token,
                                                       self.block_shape)
nit: can you collapse these branches together? e.g.

repeat_rows = 1 if self.per_act_token else a1.shape[0]
a1q, a1q_scale = moe_kernel_quantize_input(
    a1,
    a1_scale if not self.per_act_token else None,
    self.quant_dtype,
    self.per_act_token,
    self.block_shape)

Or even simpler: if a1_scale is None iff self.per_act_token, you can just pass a1_scale directly.
Left a few comments, but this looks good overall -- let's try to get it landed once those and Bill's comments are addressed!
    int expert_idx_in = non_zero_expert_idxs[expert_idx_out];
    expert_offsets[expert_idx_out] = expert_idx_in * padded_m;
-    int expert_idx_in = non_zero_expert_idxs[expert_idx_out];
-    expert_offsets[expert_idx_out] = expert_idx_in * padded_m;
+    int expert_idx_in = static_cast<int32_t>(non_zero_expert_idxs[expert_idx_out]);
+    expert_offsets[expert_idx_out] = expert_idx_in * padded_m;
Could you double-check and try not to add any warnings to the build? (The implicit down-conversion here looks safe enough to me, but it's best to avoid implicit conversions.)
I don't see this producing any new warnings, I'll make the change
#if (defined ENABLE_CUTLASS_MOE_SM90 && ENABLE_CUTLASS_MOE_SM90) || \
    (defined ENABLE_SCALED_MM_SM100 && ENABLE_SCALED_MM_SM90)
Should this be the following?
-#if (defined ENABLE_CUTLASS_MOE_SM90 && ENABLE_CUTLASS_MOE_SM90) || \
-    (defined ENABLE_SCALED_MM_SM100 && ENABLE_SCALED_MM_SM90)
+#if (defined ENABLE_CUTLASS_MOE_SM90 && ENABLE_CUTLASS_MOE_SM90)
I'm not sure why we're looking at ENABLE_SCALED_MM_SM90, but the check for ENABLE_SCALED_MM_SM100 definitely looks wrong.
Copy-paste issue; I didn't notice that the old MoE data kernel I copied this from now has SM100 support for fp4.
from vllm.platforms import current_platform

try:
    from pplx_kernels import AllToAll  # or AllToAllInternode?
IIUC, AllToAll dispatches to AllToAllInternode under the hood, so we shouldn't need to interact with it directly.
    ((self.moe.num_experts + prepare_finalize.world_size - 1) //
     prepare_finalize.world_size), self.moe.in_dtype,
I think it would be nice to factor out (self.moe.num_experts + prepare_finalize.world_size - 1) // prepare_finalize.world_size into a local variable for clarity, as it would help explain what it's doing.
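Something like the following, for instance (the variable names are just suggestions):

# Ceiling division: number of experts assigned to each rank.
world_size = prepare_finalize.world_size
num_local_experts = (self.moe.num_experts + world_size - 1) // world_size
# ... then pass num_local_experts and self.moe.in_dtype as before.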
            padded_M = (self.prepare_finalize.max_num_tokens *
                        self.prepare_finalize.world_size)
Should this be max_num_tokens * (world_size // dp_size)?
It should, thanks! I'll be getting padded_M directly from a1q now though.
        # compute padded_M for PplxPrepareAndFinalize
        if (hasattr(self.prepare_finalize, 'max_num_tokens')
                and hasattr(self.prepare_finalize, 'world_size')):
            padded_M = (self.prepare_finalize.max_num_tokens *
                        self.prepare_finalize.world_size)
        else:
            padded_M = M
I think the logic to get padded_M would be better as an abstract method on FusedMoePrepareAndFinalize, specialized for pplx and returning M otherwise.

Another alternative might be to move the workspace allocation after the call to prepare. That way a1q would have the proper padded_M shape information, although some of the existing workspace_shape methods might need some adjustment then.
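A rough sketch of the first alternative, assuming the base class is FusedMoePrepareAndFinalize as named above (the method name, the pplx subclass name, and its attributes are hypothetical):

from abc import ABC

class FusedMoePrepareAndFinalize(ABC):
    def padded_num_tokens(self, M: int) -> int:
        # Default: no padding, workspaces are sized for M tokens.
        return M

class PplxPrepareAndFinalize(FusedMoePrepareAndFinalize):
    def padded_num_tokens(self, M: int) -> int:
        # pplx pads per-rank buffers, using the sizes discussed above.
        return self.max_num_tokens * (self.world_size // self.dp_size)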
I'll go for the second suggestion (moving the workspace allocation after the call to prepare).
@@ -192,6 +195,7 @@ class MoEConfig:
    moe_parallel_config: FusedMoEParallelConfig

    in_dtype: torch.dtype  # The activation type.
+   quant_dtype: torch.dtype = None
in_dtype was intended to be the post-quantization activation type (this wasn't quite clear though). I'm fine with adding another field as long as we still need both types. Otherwise, we should just keep one.
Disregard this comment. I think we'll need both of these types.
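To summarize the discussion, a partial sketch of how the two fields could be documented (the field comments are an interpretation, not the final wording):

from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class MoEConfig:  # partial sketch; other fields omitted
    in_dtype: torch.dtype                      # activation dtype entering the MoE layer
    quant_dtype: Optional[torch.dtype] = None  # dtype activations are quantized to (None = unquantized)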
        if expert_num_tokens is not None:
            non_zero_mask = expert_num_tokens[:] != 0
            masked_local_E = int(non_zero_mask.sum().item())
Is this going to interfere with cudagraphs?
Potentially... I think I can circumvent it with a custom CUDA kernel and an extra mapping for expert_offsets and problem_sizes1/2 if needed.
        if expert_num_tokens is not None:
            non_zero_mask = expert_num_tokens[:] != 0
            masked_local_E = int(non_zero_mask.sum().item())
            non_zero_expert_idxs = torch.nonzero(non_zero_mask).flatten()
I think nonzero might cause trouble also.
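For context, a small illustration of why these two calls are problematic under CUDA graph capture (illustrative only; the actual fix discussed above would be a custom kernel plus extra index mappings):

import torch

expert_num_tokens = torch.tensor([3, 0, 5, 0], device="cuda")

non_zero_mask = expert_num_tokens != 0
masked_local_E = int(non_zero_mask.sum().item())               # .item() forces a device-to-host sync
non_zero_expert_idxs = torch.nonzero(non_zero_mask).flatten()  # output shape depends on the data

# Graph capture requires fixed shapes and no host synchronization, so both
# lines above would have to be replaced with device-side, fixed-shape logic.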
    ws1 = a.size(0) * topk_ids.size(1) * max(w1_q.size(1), w2_q.size(1))
    ws2 = a.size(0) * topk_ids.size(1) * w2_q.size(2)
    workspace13 = torch.zeros(ws1, device=a.device, dtype=out_dtype)
    workspace2 = torch.zeros(ws2, device=a.device, dtype=out_dtype)

    if apply_router_weight_on_input:
        assert topk_ids.shape[
            1] == 1, "topk_ids must be 1 for apply_router_weight_on_input"
        a = a * topk_weights.to(a.dtype)

    from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize
    a1q, a1q_scale = _fp8_quantize(a, a1_scale, per_act_token)
Why is this bit pulled out? It should be handled by whatever PrepareAndFinalize object is used with CutlassExpertsFp8.
This should be the function that runs the old version of the cutlass MoE when no PrepareAndFinalize is being run. I changed the structure of the functions/classes in this file; is it less messy now?
I was thinking more in terms of not duplicating code and using the new modular classes to serve as the implementation of cutlass_moe_fp8. MoEPrepareAndFinalizeNoEP should be able to do all the preparation/finalization and doesn't do any communication.
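A rough sketch of the composition being suggested; MoEPrepareAndFinalizeNoEP and CutlassExpertsFp8 come from the discussion above, while the wrapper class name and constructor arguments are assumptions:

# Hypothetical: build cutlass_moe_fp8 out of the modular pieces instead of a
# hand-rolled quantize/compute/finalize path.
def cutlass_moe_fp8(a, w1_q, w2_q, topk_weights, topk_ids, **kwargs):
    prepare_finalize = MoEPrepareAndFinalizeNoEP()   # quantization + finalization, no communication
    experts = CutlassExpertsFp8()                    # grouped-gemm expert computation
    fused = FusedMoEModularKernel(prepare_finalize, experts)  # wrapper name is an assumption
    return fused(a, w1_q, w2_q, topk_weights, topk_ids, **kwargs)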
    if not apply_router_weight_on_input:
        out = out * topk_weights.view(topk_weights.shape[0],
                                      topk_weights.shape[1], 1).to(out_dtype)
ditto
@@ -92,7 +95,7 @@ def prepare(
                     else 1) * float32_size
        expert_x_scale = torch.empty(
I don't know if you've got correctness yet but I had to use torch.zeros here to get some of my fp8 + pplx tests working. Probably due to the alignment/padding requirements of the scale bytes.
I didn't see any issues with keeping empty here, but I'll try testing with more param configurations and see if any errors show up.
        input_activations = get_quant_config_input_activations(
            quant_config)
Is this guaranteed to always exist? I've tried running Qwen/Qwen3-30B-A3B-FP8 and it didn't appear to have an input_activations field in its quant_config.
I think this is probably fine for now, but maybe with a None check. There's a more general problem of finding the proper quantization info for each MoE layer that needs to be solved.
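Something along these lines, perhaps (get_quant_config_input_activations comes from the diff above; the fallback behaviour is left open):

input_activations = get_quant_config_input_activations(quant_config)
if input_activations is None:
    # Some checkpoints (e.g. Qwen/Qwen3-30B-A3B-FP8) have no input_activations
    # entry in their quant_config, so don't assume the field exists.
    ...  # hypothetical fallback, e.g. default to per-tensor activation scales
else:
    ...  # existing path that reads the scheme from input_activations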
Integrate CUTLASS MoE fp8 kernels with PPLX.
Unit tests:
E2E testing: