
Conversation

@amirkl94 (Contributor) commented Oct 23, 2025

📌 Description

Fixed a few compilation issues for L40, and removed one GEMM tactic for sm == 89 that crashes with:

Assertion failed: GPU lacks the shared memory resources to run GroupedGEMM kernel

🧪 Tests

Ran pytest tests/moe/test_trtllm_cutlass_fused_moe.py manually on an L40 GPU and verified all tests passed.

Summary by CodeRabbit

  • New Features

    • Added official support for SM89 GPU target: build flags, JIT NVCC flags, and a public generation path to target it.
  • Bug Fixes / Compatibility

    • Clarified FP4/FP8 dispatch rules: FP8 enabled on SM89 where supported; FP4 remains restricted and requires explicit enablement to avoid unsupported runtime paths.
  • Performance

    • Preferential dispatch to SM89-optimized kernels for improved inference kernel selection.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @amirkl94, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for L40 GPUs (SM89 architecture) within the CUTLASS FusedMoE kernels. It resolves critical compilation issues and a runtime crash related to shared memory allocation for a specific GEMM tactic on SM89, ensuring stable and efficient operation. The changes also include updates to the Python JIT compilation infrastructure to enable proper module generation and loading for L40.

Highlights

  • L40 FusedMoE Support: Dedicated support has been added for NVIDIA L40 GPUs (SM89 architecture) within the CUTLASS FusedMoE path, enabling optimized performance for these devices.
  • GEMM Tactic Refinement: A specific GroupedGEMM kernel tactic that caused shared memory assertion failures on SM89 (L40) has been identified and removed from the candidate tile selection, enhancing stability.
  • Compilation and FP Handling Fixes: Several compilation issues related to the handling of FP4 and FP8 data types across different Streaming Multiprocessor (SM) architectures have been resolved, improving code robustness.
  • Python API Extension: The Python API has been extended to allow explicit selection and loading of SM89-specific FusedMoE modules, integrating the new L40 support into the existing framework.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review of the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai bot (Contributor) commented Oct 23, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Add explicit SM89 support and FP4/FP8 gating adjustments across CUTLASS heuristics, MOE GEMM dispatch, and Flashinfer JIT: new SM89 NVCC flags and JIT module generator, tightened ENABLE_FP4 guards, and split SM89 vs SM>=120 handling for grouped FP8 heuristics.

Changes

Cohort / File(s) Summary
CUTLASS Heuristic
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
Split the previous combined `(sm == 89 || sm >= 120)` handling into separate SM89 and SM >= 120 branches for the grouped FP8 heuristics; the SM89 candidate list drops CtaShape16x256x128_WarpShape16x64x128, which exceeds the shared memory available on L40.
MOE GEMM Dispatch Logic
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h
Reorganized SM80–90 dispatch: FP8 logic is now prioritized when use_fp8 or use_w4afp8 is true; the FP4 fallback is gated by ENABLE_FP4 and constrained to the appropriate SM ranges; static_assert/validation placement adjusted.
MOE Specialization Guards
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h
Added ENABLE_FP4 requirement to isValidSM120MOESpecialisation and isValidBlackwellMOESpecialisation guards so those return true only when CUTLASS arch macros AND ENABLE_FP4 are defined.
Flashinfer: fused_moe Backend Dispatch
flashinfer/fused_moe/core.py
Added support for backend "89" by importing gen_cutlass_fused_moe_sm89_module and dispatching to its build_and_load() when backend == "89".
Flashinfer JIT NVCC Flags
flashinfer/jit/core.py
Added sm89_nvcc_flags constant including compute_89 / sm_89 and FP8 E8M0 feature flags.
Flashinfer SM89 JIT Module Generator
flashinfer/jit/fused_moe.py
Added gen_cutlass_fused_moe_sm89_module(use_fast_build: bool = False) -> JitSpec; composes NVCC flags from sm89_nvcc_flags plus BF16/FP8 (and optional FP8_BLOCK_SCALE) then delegates to generic generator with device_arch="89"; imported sm89_nvcc_flags.
Build Manifest
CMakeLists.txt
Listed in the change manifest; no content changes summarized beyond its presence there.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant PyCore as flashinfer/fused_moe/core.py
    participant JITGen as fused_moe.gen_cutlass_fused_moe_sm89_module
    participant JITCore as flashinfer/jit/core.py
    participant CppDisp as moe_gemm_template_dispatch.h
    participant CUTLASS as cutlass_kernels

    User->>PyCore: get_cutlass_fused_moe_module(backend="89")
    PyCore->>JITGen: build_and_load()
    JITGen->>JITCore: use sm89_nvcc_flags + BF16/FP8 (+FP8_BLOCK_SCALE?)
    JITGen->>PyCore: return compiled module
    PyCore-->>User: loaded module

    Note over CppDisp,CUTLASS: SM dispatch for FP formats
    alt FP8 or grouped FP8
        CppDisp->>CppDisp: if sm == 89 (new explicit branch)
        CppDisp->>CUTLASS: dispatch SM89-specific tile configs
    else if sm >= 120
        CppDisp->>CUTLASS: dispatch SM>=120 tile configs
    else
        CppDisp->>CUTLASS: default/fallback path
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Potential attention points:

  • cutlass_heuristic.cpp: verify returned CutlassTileConfig differences for SM89 vs SM>=120.
  • moe_gemm_template_dispatch.h: ensure FP4/FP8 gating and static_assert placements preserve intended runtime behavior across SM ranges.
  • JIT additions (flashinfer/jit/*) and flashinfer/fused_moe/core.py: confirm NVCC flag composition (especially FP8_BLOCK_SCALE conditional) and that new backend dispatch integrates with build/test infra.
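
For reviewers checking the NVCC flag composition, here is a minimal, self-contained Python sketch of the behavior described in the walkthrough. The flag strings and the CUDA 12.8 condition come from this PR's diff; sm89_nvcc_flags and is_cuda_version_at_least are local stand-ins for illustration, not the real flashinfer helpers.

# Illustrative sketch only: mirrors the SM89 flag composition described above.
# sm89_nvcc_flags and is_cuda_version_at_least are assumed stand-ins, not the
# actual definitions in flashinfer/jit/core.py and flashinfer/jit/cpp_ext.py.

sm89_nvcc_flags = ["-gencode=arch=compute_89,code=sm_89"]  # assumed arch flags for Ada/L40


def is_cuda_version_at_least(version: str) -> bool:
    # Stub: pretend a recent CUDA toolkit is installed for this example.
    return True


def compose_sm89_moe_nvcc_flags() -> list:
    # Base SM89 flags plus the BF16/FP8 defines taken from the PR diff.
    flags = sm89_nvcc_flags + [
        "-DENABLE_BF16",
        "-DENABLE_FP8",
        "-DUSING_OSS_CUTLASS_MOE_GEMM",
    ]
    # FP8 block scale needs CUDA >= 12.8; append it conditionally rather than
    # inserting an empty string (see the review suggestion further down).
    if is_cuda_version_at_least("12.8"):
        flags.append("-DENABLE_FP8_BLOCK_SCALE")
    # FP4 defines are intentionally absent for SM89, matching the PR's gating.
    return flags


print(compose_sm89_moe_nvcc_flags())

In the actual generator, these flags are handed to the generic module generator with device_arch="89", per the walkthrough entry for flashinfer/jit/fused_moe.py.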

Suggested reviewers

  • yzh119
  • cyx-6
  • yongwww

Poem

🐰 I hopped through flags and dispatch lines so bright,
SM89 now wakens, ready for the fight.
FP8 and FP4 sorted, each with a gate,
JIT builds hum softly — kernels celebrate.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 12.50%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check (✅ Passed): The pull request title "Feature: Add support for L40 FusedMoE in cutlass path" directly relates to the changeset, which adds SM89 (L40 GPU architecture) support to the fused MoE implementation by introducing backend "89" dispatch, SM89-specific NVCC flags, and a new module generator function. While the implementation involves fixing compilation issues and removing a problematic GEMM tactic rather than purely additive features, the title accurately captures the end result: enabling FusedMoE functionality on the L40 GPU. The title is specific and clear enough for a developer scanning history to understand the primary goal of the changes.
  • Description Check (✅ Passed): The pull request description includes the core required sections: a clear 📌 Description section explaining what was fixed (compilation issues for L40 and removal of a problematic GEMM tactic with the specific assertion error), and a 🧪 Tests section documenting that pytest was run manually on an L40 GPU with all tests passing. However, the description is incomplete relative to the template: the 🔍 Related Issues section is missing (no linked issues provided despite PR objectives mentioning CI failures), and the ✅ Pre-commit Checks subsection of the checklist is not shown or verified. Despite these gaps, the essential information for understanding what changed and how it was validated is present.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for L40 GPUs (sm_89) in the CUTLASS FusedMoE path. The changes include fixing compilation issues, removing a problematic GEMM tactic for sm_89 that was causing crashes, and adding the necessary build configurations for this architecture. The changes are logical and well-implemented. I have one suggestion to improve code clarity when constructing compiler flags.

Comment on lines +81 to +86
nvcc_flags = sm89_nvcc_flags + [
    "-DENABLE_BF16",
    "-DENABLE_FP8",
    "-DENABLE_FP8_BLOCK_SCALE" if is_cuda_version_at_least("12.8") else "",
    "-DUSING_OSS_CUTLASS_MOE_GEMM",
]

medium

Conditionally adding an empty string to the list of compiler flags can be slightly confusing and less clean. It's better to conditionally append the flag to the list to avoid adding empty elements. This improves code readability and maintainability.

Suggested change

Before:
nvcc_flags = sm89_nvcc_flags + [
    "-DENABLE_BF16",
    "-DENABLE_FP8",
    "-DENABLE_FP8_BLOCK_SCALE" if is_cuda_version_at_least("12.8") else "",
    "-DUSING_OSS_CUTLASS_MOE_GEMM",
]

After:
nvcc_flags = sm89_nvcc_flags + [
    "-DENABLE_BF16",
    "-DENABLE_FP8",
    "-DUSING_OSS_CUTLASS_MOE_GEMM",
]
if is_cuda_version_at_least("12.8"):
    nvcc_flags.append("-DENABLE_FP8_BLOCK_SCALE")

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (1)

42-42: Fix misleading comment about SM120 vs SM100.

The comment mentions CUTLASS_ARCH_MMA_SM100_SUPPORTED but this function checks for CUTLASS_ARCH_MMA_SM120_SUPPORTED (line 35). Update the comment to accurately reflect the macro being checked.

Apply this diff:

-  return false;  // CUTLASS_ARCH_MMA_SM100_SUPPORTED is set when Blackwell kernels are enabled
+  return false;  // CUTLASS_ARCH_MMA_SM120_SUPPORTED is set when Blackwell kernels are enabled
🧹 Nitpick comments (2)
flashinfer/jit/fused_moe.py (1)

80-88: SM89 module generation is correctly implemented.

The function appropriately:

  • Uses sm89_nvcc_flags which excludes FP4 support for L40
  • Omits Hopper-specific TMA GEMM flags (correct for Ada architecture)
  • Includes conditional FP8 block scale support for CUDA ≥12.8

Optional: Consider iterable unpacking for cleaner syntax.

As suggested by Ruff, you could use iterable unpacking instead of concatenation:

-    nvcc_flags = sm89_nvcc_flags + [
+    nvcc_flags = [
+        *sm89_nvcc_flags,
         "-DENABLE_BF16",
         "-DENABLE_FP8",
         "-DENABLE_FP8_BLOCK_SCALE" if is_cuda_version_at_least("12.8") else "",
         "-DUSING_OSS_CUTLASS_MOE_GEMM",
     ]

This is a minor style improvement and can be deferred.

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

696-709: LGTM: Correctly reorganizes SM89 dispatch to avoid shared memory issues.

The reorganized control flow properly addresses the L40 (SM89) issue by:

  1. Routing FP8 workloads to Sm89 kernels with runtime validation (line 703)
  2. Routing non-FP8 workloads to Sm80 kernels (lines 707-708)

This aligns with the kernel implementation in moe_cutlass_kernel.h which shows SM89 architecture reusing Sm80 kernels for non-FP8 types, and prevents the "GPU lacks the shared memory resources" assertion mentioned in the PR objectives.

Optional suggestion: Consider adding a brief comment explaining why non-FP8 on SM89 uses the Sm80 path, to help future maintainers understand the shared memory constraint that motivated this design.

Apply this diff to add a clarifying comment:

     } else {
+      // Non-FP8 workloads on SM89 (L40) reuse Sm80 kernels to avoid
+      // Sm89-specific tactics that exceed shared memory limits
       dispatchMoeGemmToCutlass<T, WeightType, ScaleBiasType, cutlass::arch::Sm80, EpilogueTag>(
           inputs, multi_processor_count_);
     }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 739df61 and d490c3f.

📒 Files selected for processing (6)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1 hunks)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (2 hunks)
  • flashinfer/fused_moe/core.py (2 hunks)
  • flashinfer/jit/core.py (1 hunks)
  • flashinfer/jit/fused_moe.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
flashinfer/fused_moe/core.py (2)
flashinfer/jit/fused_moe.py (1)
  • gen_cutlass_fused_moe_sm89_module (80-87)
flashinfer/jit/core.py (1)
  • build_and_load (272-284)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (2)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_type_conversion.h (6)
  • __nv_fp8_e5m2 (91-93)
  • cutlass (114-116)
  • cutlass (120-122)
  • cutlass (127-129)
  • cutlass (132-134)
  • cutlass (140-142)
csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h (1)
  • cutlass (40-677)
flashinfer/jit/fused_moe.py (2)
flashinfer/jit/core.py (2)
  • JitSpec (185-284)
  • gen_jit_spec (287-353)
flashinfer/jit/cpp_ext.py (1)
  • is_cuda_version_at_least (86-87)
🪛 Ruff (0.14.1)
flashinfer/jit/fused_moe.py

81-86: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🔇 Additional comments (6)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)

161-168: LGTM! Targeted fix for L40 shared memory constraints.

The separation of SM89 from SM >= 120 handling correctly removes CtaShape16x256x128_WarpShape16x64x128 for L40 GPUs. This config's larger K dimension (128 vs 64) would exceed L40's shared memory capacity in GROUPED_GEMM mode, causing the assertion failure mentioned in the PR. The fix is minimal, well-scoped, and maintains full functionality for newer architectures.
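
For intuition, a rough back-of-the-envelope estimate (not figures taken from the code): a grouped GEMM mainloop stages roughly (M + N) × K bytes of FP8 operand tiles in shared memory per pipeline stage, so CtaShape16x256x128 needs about (16 + 256) × 128 ≈ 34.8 KB per stage, versus about 16.4 KB for CtaShape64x64x128 and about 24.6 KB for CtaShape128x256x64. With the same pipeline depth, the removed tile therefore has the largest operand footprint and is the first to exhaust the roughly 100 KB of shared memory an SM89 part such as the L40 exposes per SM, which is consistent with the assertion reported in the PR description.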

flashinfer/jit/core.py (1)

93-96: SM89 flags correctly omit FP4 support.

The implementation appropriately excludes -DFLASHINFER_ENABLE_FP4_E2M1 for SM89 (L40), unlike other architectures that use common_nvcc_flags. This aligns with the PR objective of fixing compilation issues for L40 by removing problematic GEMM tactics.

flashinfer/fused_moe/core.py (2)

40-40: LGTM!

Import added correctly to support SM89 module generation.


289-290: LGTM!

Backend dispatch for SM89 follows the established pattern and correctly invokes the new module generation function.
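
For context, a hedged sketch of that dispatch pattern: the import path, the default use_fast_build=False, and the build_and_load() call follow the walkthrough, but the function name and error handling here are illustrative, not the real core.py code.

def load_cutlass_fused_moe_module(backend: str):
    # New branch added by this PR: backend "89" builds and loads the SM89 JIT module.
    if backend == "89":
        from flashinfer.jit.fused_moe import gen_cutlass_fused_moe_sm89_module
        return gen_cutlass_fused_moe_sm89_module(use_fast_build=False).build_and_load()
    # Other backends are handled by the existing dispatch and are omitted in this sketch.
    raise NotImplementedError("only the '89' branch is sketched here")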

flashinfer/jit/fused_moe.py (1)

21-27: LGTM!

Import correctly updated to include sm89_nvcc_flags needed for the new SM89 module generation function.

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

692-694: LGTM: Proper FP4 validation for SM89.

The explicit compile-time check correctly prevents FP4 usage on SM < 90, consistent with the earlier validation for SM75-80 at lines 685-688.

Signed-off-by: Amir Klein <[email protected]>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d490c3f and 50a1bb0.

📒 Files selected for processing (1)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h (1)

35-36: LGTM: Guard is appropriate for FP4-only SM120 specialization.

The ENABLE_FP4 guard is correct here. Since isValidSM120MOESpecialisation only supports __nv_fp4_e2m1 types (lines 37-40), requiring ENABLE_FP4 at the preprocessor level prevents compilation errors when FP4 support is not available.

Comment on lines +51 to +52
#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) && \
defined(ENABLE_FP4) // TODO Is there a better choice
@coderabbitai bot (Contributor) commented Oct 23, 2025

⚠️ Potential issue | 🟠 Major

Overly restrictive guard blocks non-FP4 Blackwell specializations.

Requiring ENABLE_FP4 at the top level prevents ALL Blackwell MOE specializations from being valid when FP4 support is not compiled, including non-FP4 cases like FP8/FP8 or BF16/BF16 that should work regardless. Line 53 explicitly supports cutlass::platform::is_same<T, WeightType>::value, which matches any same-type pair.

The Hopper implementation (lines 68-89) uses a better pattern: conditional compilation guards only the FP4-specific parts inside the return statement, allowing non-FP4 specializations to work without ENABLE_FP4.

Apply this diff to follow the Hopper pattern:

-#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED) && \
-    defined(ENABLE_FP4)  // TODO Is there a better choice
+#if defined(CUTLASS_ARCH_MMA_SM100_SUPPORTED)
   return (cutlass::platform::is_same<T, WeightType>::value ||
+#ifdef ENABLE_FP4
           (cutlass::platform::is_same<T, __nv_fp8_e4m3>::value &&
-           cutlass::platform::is_same<WeightType, __nv_fp4_e2m1>::value)) &&
+           cutlass::platform::is_same<WeightType, __nv_fp4_e2m1>::value) ||
+#else
+          false ||
+#endif
+          false) &&
          cutlass::platform::is_same<EpilogueTag, cutlass_extensions::EpilogueOpDefault>::value &&
          Fusion == TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::NONE;

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_tma_warp_specialized_traits.h
around lines 51-52, the current top-level guard requires ENABLE_FP4 and blocks
all Blackwell (SM100) MOE specializations when FP4 is not enabled; change this
to only guard FP4-specific logic: keep the CUTLASS_ARCH_MMA_SM100_SUPPORTED
check on the specialization but remove ENABLE_FP4 from the #if, then wrap only
the FP4-specific condition(s) inside the specialization (the return expression)
with #if defined(ENABLE_FP4) / #endif as the Hopper implementation does (lines
~68-89), ensuring the same-type WeightType check remains unguarded so non-FP4
types (FP8/BF16/etc.) compile normally.

@amirkl94 (Contributor, Author)

The issue here is:

cutlass::platform::is_same<WeightType, __nv_fp4_e2m1>::value

which I think isn't defined unless ENABLE_FP4 is defined.

@coderabbitai bot (Contributor)

@amirkl94, you're absolutely right that __nv_fp4_e2m1 is only defined when ENABLE_FP4 is set! That's why my suggestion included the #ifdef ENABLE_FP4 guard specifically around the line that references that type:

#ifdef ENABLE_FP4
          (cutlass::platform::is_same<T, __nv_fp8_e4m3>::value &&
           cutlass::platform::is_same<WeightType, __nv_fp4_e2m1>::value) ||
#endif

The key insight is that the first condition in the return statement:

cutlass::platform::is_same<T, WeightType>::value

doesn't reference __nv_fp4_e2m1 at all—it just checks if T and WeightType are the same type (e.g., FP8/FP8, BF16/BF16). This condition should work regardless of whether ENABLE_FP4 is defined.

By moving the top-level ENABLE_FP4 requirement into the return expression (guarding only the FP4-specific logic), non-FP4 Blackwell specializations can compile and work correctly even when FP4 support is disabled. This matches the pattern already used in isValidHopperMOESpecialisation at lines 68-89.

Let me know if you'd like me to refine the suggestion further!

@djns99 left a comment

Looks good to me; mainly I just want to understand why we need to disable the tile shape.
CodeRabbit's comment about a more granular FP4 guard might make sense, but I assume that if we are compiling with Blackwell support we should also have FP4.

CutlassTileConfig::CtaShape64x64x128_WarpShape32x64x64,
CutlassTileConfig::CtaShape128x64x64_WarpShape64x32x64,
CutlassTileConfig::CtaShape128x256x64_WarpShape64x64x64,
CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64};

Are there any SM89 GPUs that can support the CtaShape16x256x128_WarpShape16x64x128, or is this an SM120 addition?
Having a small M value like this helps with low-latency cases, so I'd want to understand why it's not supported before disabling it.


At the very least, can you leave a comment saying what the difference between the two lists is, so people don't have to manually compare the items?

@amirkl94 (Contributor, Author)

Are there any SM89 GPUs that can support the CtaShape16x256x128_WarpShape16x64x128, or is this an SM120 addition? Having a small M value like this helps with low-latency cases, so I'd want to understand why it's not supported before disabling it.

This is a removal from the sm89 path. When I tested it on an L40 GPU I got `Assertion failed: GPU lacks the shared memory resources to run GroupedGEMM kernel`.
It might pass on other sm89 GPUs; the main issue is that this was the default tactic chosen when trying to use FusedMoE. I believe moving it to the end of the list would also fix my issue.

@amirkl94 (Contributor, Author) commented Oct 27, 2025

I tried moving this tile config to be the last one, and now the default tactic won't fail on L40. The issue is that if the autotuner is on, the tactics that use this tile config report an error with a stack trace, which looks bad.
@yzh119 do you think it would be OK to change the errors that happen in the autotuner to debug logs? Otherwise users will get spammed with error messages when they run autotuning on L40 FusedMoE.


I think the autotuner should still output warnings, but make them say something like "Skipping tactic X due to error. This tactic may not be supported by the current GPU architecture."
That said, I know there is a difference of opinion on whether we should proactively filter tactics as you have done here; the argument is that we should do the due diligence to determine which tactics are supported, so that we can raise an error when a tactic fails that shouldn't. So I can see either side.

@yzh119 (Collaborator) commented Oct 24, 2025

There are some compilation errors on CI, @amirkl94 would you mind taking a look?
https://ci.tlcpack.ai/blue/organizations/jenkins/flashinfer-ci/detail/PR-1973/4/pipeline

Signed-off-by: Amir Klein <[email protected]>
@amirkl94 amirkl94 requested a review from djmmoss as a code owner October 27, 2025 14:49
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

691-699: LGTM: FP8/W4AFP8 routing to SM89 is correct.

The restructured logic correctly routes FP8 and W4AFP8 workloads to the SM89 architecture for the SM 80-89 range, which is essential for L40 support. The runtime check at line 697 appropriately prevents FP8 usage on SM80-88 hardware that lacks FP8 support. The static_assert placement within the ENABLE_FP8 block is consistent with the codebase pattern (see lines 244-247).

Minor: Consider clarifying the error message.

The error message at line 697 mentions "fp8" but the condition includes use_w4afp8. For clarity, consider updating the message to reflect both supported types.

-  TLLM_CHECK_WITH_INFO(sm_ == 89, "For sm >= 80 and < 90, fp8 is only supported with sm == 89");
+  TLLM_CHECK_WITH_INFO(sm_ == 89, "For sm >= 80 and < 90, fp8 and w4afp8 are only supported with sm == 89");
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50a1bb0 and 2302b53.

📒 Files selected for processing (1)
  • csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (3)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_type_conversion.h (6)
  • __nv_fp8_e5m2 (91-93)
  • cutlass (114-116)
  • cutlass (120-122)
  • cutlass (127-129)
  • cutlass (132-134)
  • cutlass (140-142)
csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/moe_cutlass_kernel.h (1)
  • cutlass (40-677)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h (1)
  • multi_processor_count_ (341-341)
🔇 Additional comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

700-712: LGTM: Non-FP8 handling and FP4 guard are correct.

The else branch correctly handles non-FP8 cases for SM 80-89:

  • FP4 types are appropriately blocked on SM < 90 (line 703)
  • Non-FP8 workloads dispatch to SM80 architecture, which is an intentional kernel reuse strategy for SM89 as documented in the kernel implementation

The duplication of the dispatch call (lines 705-706 and 709-710) is necessary for conditional compilation when ENABLE_FP4 is not defined.

@amirkl94 (Contributor, Author)

There are some compilation errors on CI, @amirkl94 would you mind taking a look? https://ci.tlcpack.ai/blue/organizations/jenkins/flashinfer-ci/detail/PR-1973/4/pipeline

Pushed a commit to fix this.
