update trtllm cutlass moe #2020
Conversation
…to feature/cutlass_moe_3xfp4
Note: Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Threads swizzled_input_sf, unpadded_hidden_size, router_scales, permuted_row_to_unpermuted_row, swap_ab, and finalize-fusion flags through the MOE/CUTLASS flows; adds an SM90 scatter epilogue visitor; extends tile/cluster enums and SM100/SM120 candidate generation; renames many kernel namespaces to cutlass_kernels_oss; adds explicit template instantiations and launcher/signature updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App
    participant Runner as CutlassMoeFCRunner
    participant Heuristic
    participant Profiler
    participant Dispatcher
    Note over App,Runner: runMoe(..., swizzled_input_sf, unpadded_hidden_size, router_scales, permuted_row_to_unpermuted_row, swap_ab)
    App->>Runner: runMoe(...)
    Runner->>Heuristic: getTactics(gemm_id, sm, supports_finalize_fusion)
    Heuristic-->>Runner: candidate CutlassGemmConfig (may include FINALIZE, swap_ab, dynamic cluster shapes)
    Runner->>Profiler: profile/select (uses unpadded_hidden_size, stage-specific tactic counts)
    Profiler-->>Runner: selected gemm_config
    Runner->>Dispatcher: dispatch(gemm_config, router_scales, permuted_row_to_unpermuted_row, swizzled_input_sf, swap_ab)
    Dispatcher-->>Runner: launches kernel (TMA warp specialized / finalize fused / scatter epilogue)
    Runner-->>App: results
```
Estimated code review effort: 4 (Complex) | ~60 minutes. Areas to focus during review:
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches

Failed checks (2 warnings, 1 inconclusive)

Finishing touches

Generate unit tests (beta)
Summary of Changes

Hello @nv-yunzheq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request primarily focuses on enhancing the TensorRT-LLM (TRTLLM) CUTLASS Mixture-of-Experts (MoE) implementation, particularly for Hopper and Blackwell architectures. The main objective is to introduce a new

Highlights
/bot run
```cpp
        layout_info2.default_epilogue.ptr_d[expert] = nullptr;
      }
    }
    // ...
    bias2, gemm2_output, router_scales, permuted_row_to_unpermuted_row, expert);
```
There was a bug identified in TRT-LLM: the `asm volatile("griddepcontrol.launch_dependents;");` on line 1302 is incorrect and needs to be moved to the end of the kernel. There should be no observable perf difference from doing this.
Moved to the end of the kernel.
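As a minimal CUDA sketch of the fix pattern (the kernel and names here are illustrative, not the actual flashinfer code): with programmatic dependent launch (PDL), `griddepcontrol.launch_dependents` should be the last thing a kernel executes, after all of its global-memory stores have been issued.

```cuda
#include <cuda_runtime.h>

// Illustrative producer kernel, assuming Hopper (sm_90+) and PDL launches.
// All global stores come first; the dependents signal comes last.
__global__ void producerKernel(float* __restrict__ out,
                               const float* __restrict__ in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i] * 2.0f;  // global-memory store
  }
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
  // Moved here from mid-kernel: signal dependent kernels only after our
  // stores, so they never observe stale data.
  asm volatile("griddepcontrol.launch_dependents;");
#endif
}
```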
```cpp
auto id1 = profile_ids.value()[0];
if (id1 != -1) {
  TVM_FFI_ICHECK(id1 >= 0 && id1 < static_cast<int64_t>(mAllProfiles.size()))
      << "Invalid gemm1 profile id: " << id1;
```
Should this check that the tactic is not in the GEMM 2 tactic range? Not all GEMM 2 tactics are valid for GEMM 1.

I think in general this combined approach is dangerous: I didn't implement MOE with any particular thought that we could get GEMM2 tactics for GEMM1, so this may break or have other subtle failures, such as the profiler picking a worse implementation. Any chance we could separate them in the proper API? (Happy for this to be a later PR though.)
Updated to make sure id1 is smaller than mGemm1TacticCount. Agree to separate them in the future.
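A sketch of the updated validation discussed here, assuming the member names from the snippet above (mAllProfiles plus the new mGemm1TacticCount splitting the combined list into a GEMM1 prefix and a GEMM2 suffix); the second index lookup is hypothetical:

```cpp
// Sketch only: GEMM1 may pick from [0, mGemm1TacticCount),
// GEMM2 from [mGemm1TacticCount, mAllProfiles.size()).
auto id1 = profile_ids.value()[0];
if (id1 != -1) {
  TVM_FFI_ICHECK(id1 >= 0 && id1 < mGemm1TacticCount)
      << "Invalid gemm1 profile id: " << id1 << " (GEMM1 tactics end at "
      << mGemm1TacticCount << ")";
}
auto id2 = profile_ids.value()[1];
if (id2 != -1) {
  TVM_FFI_ICHECK(id2 >= mGemm1TacticCount &&
                 id2 < static_cast<int64_t>(mAllProfiles.size()))
      << "Invalid gemm2 profile id: " << id2;
}
```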
Do we need to update: https://github.com/flashinfer-ai/flashinfer/tree/main/flashinfer/tuning_configs?
Actionable comments posted: 1
Caution
Some comments are outside the diff and can't be posted inline due to platform limitations.

Outside diff range comments (2)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu (1)
401-417: Critical: Non-OSS path still missing parameters and hardcodes enable_alltoall=false. Despite being marked as addressed in commit b09efb0, the non-OSS branch still has critical issues:

- Line 413: Missing unpadded_hidden_size parameter (compare to OSS line 395)
- Line 415: Hardcodes false instead of passing enable_alltoall
- Line 415: Missing use_lora flag before lora_params

This causes functional regressions: all-to-all communication is disabled and LoRA configuration is lost in non-OSS builds.
```diff
- quant_params, num_rows, hidden_size, inter_size, num_experts_total,
+ quant_params, num_rows, hidden_size, unpadded_hidden_size, inter_size, num_experts_total,
  static_cast<int>(experts_per_token), static_cast<char*>(workspace_info.workspace.data_ptr()), output.data_ptr(),
- static_cast<int*>(workspace_info.src_to_dest_map), parallelism_config, false, lora_params,
+ static_cast<int*>(workspace_info.src_to_dest_map), parallelism_config, enable_alltoall,
+ use_lora, lora_params,
  mUseDeepSeekFP8BlockScaling, min_latency_mode, min_latency_params, enable_pdl, stream);
```

csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (1)
351-355: Move griddepcontrol.launch_dependents to the end of three kernel functions to prevent data races. Three instances of griddepcontrol.launch_dependents are called before all memory store operations complete, violating the PDL requirement to ensure stores finish in flight:

- Lines 352-354 in fusedBuildExpertMapsSortFirstTokenKernel: followed by memory writes at lines 356-372 before function end
- Lines 1508-1510: followed by padding writes at lines 1512+
- Lines 2215-2217: followed by padding writes at lines 2219+

Since launch_dependents provides no memory visibility guarantee, dependent kernels may see stale data. Move each call to immediately before the enclosing function's closing brace.
Duplicate comments (5)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu (1)
585-602: Critical: Non-OSS min-latency path hardcodes enable_alltoall=false. Line 599 passes false instead of enable_alltoall, disabling all-to-all communication in non-OSS builds. While this path correctly includes use_lora_ml (unlike the regular runMoe), it still has the critical enable_alltoall bug.

```diff
- static_cast<int*>(workspace_info.src_to_dest_map), parallelism_config, false, use_lora_ml,
+ static_cast<int*>(workspace_info.src_to_dest_map), parallelism_config, enable_alltoall, use_lora_ml,
  lora_params, mUseDeepSeekFP8BlockScaling, min_latency_mode, min_latency_params, enable_pdl, stream);
```

csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (4)
1011-1050: Deduplicate SF layout branch in writeSF; compute layout once and call helper once. Removes duplicate cvt_quant_get_sf_out_offset calls and branches on layout only.

```diff
 if (sf_out) {
   if (input_sf) {
-    if (swizzled_input_sf) {
-      auto const sf_in =
-          cvt_quant_get_sf_out_offset<TmaWarpSpecializedGroupedGemmInput::ElementSF,
-                                      NumThreadsPerSF>(
-              std::nullopt /* batchIdx */, source_token_id, elem_idx,
-              std::nullopt /* numRows */, num_cols / VecSize,
-              const_cast<TmaWarpSpecializedGroupedGemmInput::ElementSF*>(input_sf),
-              QuantizationSFLayout::SWIZZLED_128x4);
-      *sf_out = *sf_in;
-    } else {
-      auto const sf_in =
-          cvt_quant_get_sf_out_offset<TmaWarpSpecializedGroupedGemmInput::ElementSF,
-                                      NumThreadsPerSF>(
-              std::nullopt /* batchIdx */, source_token_id, elem_idx,
-              std::nullopt /* numRows */, num_cols / VecSize,
-              const_cast<TmaWarpSpecializedGroupedGemmInput::ElementSF*>(input_sf),
-              QuantizationSFLayout::LINEAR);
-      *sf_out = *sf_in;
-    }
+    auto const layout = swizzled_input_sf ? QuantizationSFLayout::SWIZZLED_128x4
+                                          : QuantizationSFLayout::LINEAR;
+    auto const sf_in =
+        cvt_quant_get_sf_out_offset<TmaWarpSpecializedGroupedGemmInput::ElementSF,
+                                    NumThreadsPerSF>(
+            std::nullopt /* batchIdx */, source_token_id, elem_idx,
+            std::nullopt /* numRows */, num_cols / VecSize,
+            const_cast<TmaWarpSpecializedGroupedGemmInput::ElementSF*>(input_sf), layout);
+    *sf_out = *sf_in;
   } else {
     *sf_out = 0x00;
   }
 }
```
1684-1686: Finalize kernel alignment checks must use element-width-based stride, not hard-coded 4. Tie the asserts to FINALIZE_ELEM_PER_THREAD for correctness across dtypes.

```diff
-  assert(padded_cols % 4 == 0);
-  assert(unpadded_cols % 4 == 0);
-  assert(unpadded_cols <= padded_cols);
+  // Load 128-bits per thread, according to smallest IO type
   constexpr int64_t FINALIZE_ELEM_PER_THREAD =
       128 / std::min(sizeof_bits<OutputType>::value, sizeof_bits<GemmOutputType>::value);
+  assert(padded_cols % FINALIZE_ELEM_PER_THREAD == 0);
+  assert(unpadded_cols % FINALIZE_ELEM_PER_THREAD == 0);
+  assert(unpadded_cols <= padded_cols);
   int64_t const start_offset = threadIdx.x;
   int64_t const stride = FINALIZE_THREADS_PER_BLOCK;
   int64_t const num_elems_in_padded_col = padded_cols / FINALIZE_ELEM_PER_THREAD;
   int64_t const num_elems_in_orig_col = unpadded_cols / FINALIZE_ELEM_PER_THREAD;
```

Also applies to: 1693-1700
1761-1764: Same alignment issue in the no-filling finalize kernel; mirror the FINALIZE_ELEM_PER_THREAD asserts. Apply element-width-based asserts and derived counts.

```diff
-  assert(padded_cols % 4 == 0);
-  assert(unpadded_cols % 4 == 0);
-  assert(unpadded_cols <= padded_cols);
+  // Alignment checks moved below, after FINALIZE_ELEM_PER_THREAD is known
   ...
   constexpr int64_t FINALIZE_ELEM_PER_THREAD =
       128 / std::min(sizeof_bits<OutputType>::value, sizeof_bits<GemmOutputType>::value);
+  assert(padded_cols % FINALIZE_ELEM_PER_THREAD == 0);
+  assert(unpadded_cols % FINALIZE_ELEM_PER_THREAD == 0);
+  assert(unpadded_cols <= padded_cols);
   int64_t const num_elems_in_padded_col = padded_cols / FINALIZE_ELEM_PER_THREAD;
   int64_t const num_elems_in_orig_col = unpadded_cols / FINALIZE_ELEM_PER_THREAD;
```

Also applies to: 1799-1803
3951-3952: Min-latency path currently throws; restore a functional fallback. Throwing breaks SM90 min-latency flows. Route LL to the non-LL compute until LL is ready.
Option A (minimal): switch setupTmaWarpSpecializedInputs to non-LL compute in LL mode:
```diff
@@
 if (min_latency_mode) {
@@
-  return Self::computeStridesTmaWarpSpecializedLowLatency(
-      gemm1_tma_ws_input, gemm2_tma_ws_input, num_rows, fc1_out_size, hidden_size, hidden_size,
-      inter_size, num_experts_per_node, reinterpret_cast<T const*>(gemm1_input),
-      reinterpret_cast<T const*>(gemm2_input), fc1_expert_weights, fc2_expert_weights,
-      quant_params.fp8.dequant_fc1, quant_params.fp8.dequant_fc2, input_sf, fc2_fp4_act_scale_,
-      quant_params, nullptr, nullptr, reinterpret_cast<UnfusedGemmOutputType*>(gemm1_output),
-      reinterpret_cast<UnfusedGemmOutputType*>(fc2_result_),
-      min_latency_params.num_active_experts_per_node, min_latency_params.active_expert_global_ids,
-      start_expert, enable_pdl, stream);
+  // Temporary fallback: use non-LL path to keep correctness
+  return Self::computeStridesTmaWarpSpecialized(
+      expert_first_token_offset_, gemm1_tma_ws_input, gemm2_tma_ws_input, num_rows,
+      expanded_num_rows, fc1_out_size, hidden_size, hidden_size, inter_size, num_experts_per_node,
+      reinterpret_cast<T const*>(gemm1_input), reinterpret_cast<T const*>(gemm2_input),
+      fc1_expert_weights, fc2_expert_weights, quant_params.fp8.dequant_fc1,
+      quant_params.fp8.dequant_fc2, input_sf, fc2_fp4_act_scale_, quant_params,
+      /*bias1=*/nullptr, /*bias2=*/nullptr,
+      reinterpret_cast<UnfusedGemmOutputType*>(gemm1_output),
+      reinterpret_cast<UnfusedGemmOutputType*>(fc2_result_),
+      /*router_scales=*/permuted_token_final_scales_,
+      /*permuted_row_to_unpermuted_row=*/permuted_row_to_unpermuted_row_,
+      enable_pdl, stream);
 }
```

Alternatively, if you prefer keeping call sites unchanged, implement a similar fallback inside computeStridesTmaWarpSpecializedLowLatency with a warning.

```bash
#!/bin/bash
rg -n "computeStridesTmaWarpSpecializedLowLatency" -C2
```
Nitpick comments (1)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu (1)
226-232: Address djns99's architectural concern about combined GEMM1/GEMM2 tactics. While the current implementation now correctly tracks the split between GEMM1 and GEMM2 tactics, djns99's comment highlights a fundamental issue: "I didn't implement MOE with any particular thought that we could get GEMM2 tactics for GEMM1, this may break or have other subtle failures such as the profiler picking a worse implementation."
Consider separating the GEMM1 and GEMM2 tactic APIs completely rather than concatenating them, which could help prevent cross-contamination and make the API clearer.
Based on past review comments.
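One possible shape for that separation, purely as a sketch (the type and member names are hypothetical, not the current flashinfer API; CutlassGemmConfig is the config type from cutlass_extensions/gemm_configs.h):

```cpp
#include <vector>

// Hypothetical split: keep the two tactic lists in separate containers so
// neither stage can accidentally index into the other's range, and validate
// each profile id against only its own list.
namespace tensorrt_llm::kernels::cutlass_kernels {

struct MoeGemmTactics {
  std::vector<cutlass_extensions::CutlassGemmConfig> gemm1;
  std::vector<cutlass_extensions::CutlassGemmConfig> gemm2;
};

}  // namespace tensorrt_llm::kernels::cutlass_kernels
```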
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (3)

- csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (55 hunks)
- csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu (10 hunks)
- tests/moe/test_trtllm_cutlass_fused_moe.py (1 hunks)
Additional context used

Code graph analysis (1)
csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu (1)
include/flashinfer/trtllm/fused_moe/runner.h (3)
- hidden_size (265-265)
- num_experts (263-263)
- top_k (270-270)
Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
Additional comments (13)
tests/moe/test_trtllm_cutlass_fused_moe.py (1)
1089-1090: LGTM! SM120 support correctly added. The skip condition now includes compute capability 12 (SM120) and the reason string has been updated accordingly, addressing the previous review feedback. This change is consistent with the similar update in test_moe_nvfp4 (lines 366-369) and aligns with the broader PR objective of extending SM120 support.

csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu (7)
384-400: LGTM: OSS path correctly threads new parameters. The OSS branch properly passes swizzled_input_sf, unpadded_hidden_size, and use_lora to the kernel runner.
568-584: LGTM: OSS min-latency path correctly uses new parameters. The parameters are properly threaded through, matching the regular runMoe pattern.

714-717: LGTM: Exposing GEMM tactic counts improves API clarity. These new functions allow callers to understand the GEMM1/GEMM2 tactic split, which is essential for correct tactic selection.

785-786: LGTM: Member variables properly track tactic counts. The zero-initialization and int64_t type are appropriate.
799-804: LGTM: Default GEMM2 profile now correctly selected from GEMM2 subrange. The fallback logic properly uses mGemm1TacticCount to index into the GEMM2 tactics, fixing the previous critical bug where both GEMMs defaulted to the same GEMM1 tactic.

805-813: LGTM: GEMM1 index validation prevents cross-tactic contamination. The range check id1 < mGemm1TacticCount ensures GEMM1 can only select from its own tactics, addressing part of djns99's concern.
380-383: Clarify handling of hardcoded kernel parameters as documented temporary constraints. These three values are marked with HACK/TODO comments indicating they are known limitations:

- use_lora = false aligns with the "TODO: support lora in the future" comment; acceptable for now
- swizzled_input_sf = true assumes input scale factors are always swizzled; verify this matches all actual inputs
- unpadded_hidden_size = hidden_size assumes no padding; this assumption must hold for correctness

The same pattern (hardcoded unpadded_hidden_size = hidden_size) repeats in runMoe variants and profiler code throughout this file. If padding is ever used, this will silently compute incorrect results. Either verify padding is never applied in practice, or expose these as parameters so callers can provide correct values.
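To illustrate the assumption, a minimal sketch of the padded/unpadded column semantics described above (a plain loop standing in for the actual finalize kernel; all names illustrative):

```cpp
#include <cstdint>

// Illustrative only: each output row is physically hidden_size wide (padded),
// but only the first unpadded_hidden_size columns carry real data. Hardcoding
// unpadded_hidden_size == hidden_size is correct only if no padding exists.
void finalizeRowSketch(float* output, const float* accumulator, int64_t row,
                       int64_t hidden_size, int64_t unpadded_hidden_size) {
  for (int64_t col = 0; col < unpadded_hidden_size; ++col) {
    output[row * hidden_size + col] = accumulator[col];
  }
  // Columns in [unpadded_hidden_size, hidden_size) are padding: left untouched.
}
```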
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (5)

1093-1111: swap_ab-aware FP4/MXFPX stride setup looks correct. Good: symmetrical SFA/SFB mapping and explicit transpose handling.
1196-1210: Fused finalize epilogue pointers are wired per-expert correctly. Setting ptr_source_token_index, ptr_router_scales, and optional bias per expert is sound.

3598-3617: NoSmem epilogue + finalize fusion guardrails are correct. Appropriate TLLM_CHECKs preventing unsupported combinations.
Consider adding a unit/profiler case asserting this path errors cleanly.
4018-4040: Finalize fusion gating matches runner config and excludes w4_groupwise/LoRA. Consistent with prepareTmaWsInputs; good.
Ensure mInterface->use_fused_finalize_ is set coherently where GemmProfilerBackend is constructed.
1641-1647: Public macro instantiations: signature/arg ordering consistent. Launchers' new params (input_sf, swizzled_input_sf, padded/unpadded) flow through correctly.
Also applies to: 1912-1921
[SUCCESS] Pipeline #37963389: 13/17 passed
@nv-yunzheq there are some compilation errors in CI: https://ci.tlcpack.ai/blue/organizations/jenkins/flashinfer-ci/detail/PR-2020/22/pipeline/23, can you double check? (I'm not sure if it's because of cutlass versions.)
/bot run
[CANCELING] Pipeline #37982790: canceled
/bot run |
Actionable comments posted: 1
Caution
Some comments are outside the diff and can't be posted inline due to platform limitations.

Outside diff range comments (1)
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (1)
221-236: Fix OOB read: s_local_experts indexed with global expert id. In the min-latency map build, s_local_experts is sized to local experts but is indexed by the global expert, causing OOB when expert ∈ [start_expert, end_expert). Guard and subtract start_expert. Apply:

```diff
-  bool is_valid_expert =
-      smart_routing ? s_local_experts[expert] : (expert >= start_expert && expert < end_expert);
+  bool const expert_in_node = (expert >= start_expert && expert < end_expert);
+  bool is_valid_expert = smart_routing
+                             ? (expert_in_node && s_local_experts[expert - start_expert])
+                             : expert_in_node;
```

Also consider mirroring this guard wherever s_store_experts[expert - start_expert] is used, to avoid underflow when expert_in_node == false.
Duplicate comments (5)
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (5)
1724-1734: Restore defensive check for invalid permutation indices (debug-only OK).
expanded_permuted_row = unpermuted_row_to_permuted_row[...] has no validity guard. If upstream builds ever leave sentinel values, this will read OOB from expanded_permuted_rows.

```diff
   int64_t const expanded_permuted_row = unpermuted_row_to_permuted_row[expanded_original_row];
+#ifndef NDEBUG
+  if (expanded_permuted_row < 0) { continue; }
+#endif
```

Alternatively, add an unconditional if (expanded_permuted_row < 0) continue; if negative is a valid sentinel in production.
1031-1050: De-duplicate swizzled vs linear SF input handling. Simplify by computing the layout once and making a single call to cvt_quant_get_sf_out_offset.

```diff
-  if (swizzled_input_sf) {
-    auto const sf_in =
-        cvt_quant_get_sf_out_offset<TmaWarpSpecializedGroupedGemmInput::ElementSF,
-                                    NumThreadsPerSF>(
-            std::nullopt /* batchIdx */, source_token_id, elem_idx,
-            std::nullopt /* numRows */, num_cols / VecSize,
-            const_cast<TmaWarpSpecializedGroupedGemmInput::ElementSF*>(input_sf),
-            QuantizationSFLayout::SWIZZLED_128x4);
-    *sf_out = *sf_in;
-  } else {
-    auto const sf_in =
-        cvt_quant_get_sf_out_offset<TmaWarpSpecializedGroupedGemmInput::ElementSF,
-                                    NumThreadsPerSF>(
-            std::nullopt /* batchIdx */, source_token_id, elem_idx,
-            std::nullopt /* numRows */, num_cols / VecSize,
-            const_cast<TmaWarpSpecializedGroupedGemmInput::ElementSF*>(input_sf),
-            QuantizationSFLayout::LINEAR);
-    *sf_out = *sf_in;
-  }
+  auto const layout = swizzled_input_sf ? QuantizationSFLayout::SWIZZLED_128x4
+                                        : QuantizationSFLayout::LINEAR;
+  auto const sf_in =
+      cvt_quant_get_sf_out_offset<TmaWarpSpecializedGroupedGemmInput::ElementSF,
+                                  NumThreadsPerSF>(
+          std::nullopt /* batchIdx */, source_token_id, elem_idx,
+          std::nullopt /* numRows */, num_cols / VecSize,
+          const_cast<TmaWarpSpecializedGroupedGemmInput::ElementSF*>(input_sf), layout);
+  *sf_out = *sf_in;
```
3937-3955: Min-latency path currently throws; add a safe fallback or gate dispatch. computeStridesTmaWarpSpecializedLowLatency unconditionally throws, breaking callers (the setupTmaWarpSpecializedInputs min-latency branch). Options:

- Route min-latency to the non-LL computeStridesTmaWarpSpecialized with a temporary expert_first_token_offset built from num_active_experts_per / active_expert_global_ids, or
- Gate all LL dispatch sites behind TLLM_CHECK_WITH_INFO(!min_latency_mode) to avoid calling this until LL is reintroduced.

Do you want a minimal fallback drafted?
1684-1700: Align checks to element width, not the constant 4. Hardcoding % 4 can break for dtypes where FINALIZE_ELEM_PER_THREAD != 4. Use the computed constant.

```diff
-  assert(padded_cols % 4 == 0);
-  assert(unpadded_cols % 4 == 0);
-  assert(unpadded_cols <= padded_cols);
+  assert(unpadded_cols <= padded_cols);
+  constexpr int64_t FINALIZE_ELEM_PER_THREAD =
+      128 / std::min(sizeof_bits<OutputType>::value, sizeof_bits<GemmOutputType>::value);
+  assert(padded_cols % FINALIZE_ELEM_PER_THREAD == 0);
+  assert(unpadded_cols % FINALIZE_ELEM_PER_THREAD == 0);
```

As per earlier feedback.
1761-1764: Same alignment issue, plus simplify the loop bound. Mirror the FINALIZE_ELEM_PER_THREAD-based asserts and iterate to num_elems_in_orig_col to avoid the per-iteration branch.

```diff
-  assert(padded_cols % 4 == 0);
-  assert(unpadded_cols % 4 == 0);
-  assert(unpadded_cols <= padded_cols);
+  assert(unpadded_cols <= padded_cols);
+  constexpr int64_t FINALIZE_ELEM_PER_THREAD =
+      128 / std::min(sizeof_bits<OutputType>::value, sizeof_bits<GemmOutputType>::value);
+  assert(padded_cols % FINALIZE_ELEM_PER_THREAD == 0);
+  assert(unpadded_cols % FINALIZE_ELEM_PER_THREAD == 0);
@@
-  for (int elem_index = start_offset; elem_index < num_elems_in_padded_col;
-       elem_index += stride) {
-    if (elem_index >= num_elems_in_orig_col) continue;  // Skip writing beyond original columns
+  for (int elem_index = start_offset; elem_index < num_elems_in_orig_col;
+       elem_index += stride) {
```

As per earlier feedback.
Also applies to: 1799-1806
Nitpick comments (2)
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (2)
280-286: CUDA dynamic shared memory attr check: allow == max. The guard rejects shared_size >= max_smem_per_block; equality is typically valid. Prefer > to avoid an unnecessary fallback.

```diff
-  if (shared_size >= static_cast<size_t>(max_smem_per_block)) {
+  if (shared_size > static_cast<size_t>(max_smem_per_block)) {
```

Also applies to: 606-620
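For reference, a sketch of the boundary condition using the standard CUDA runtime attribute APIs (the function and variable names are illustrative, not the actual flashinfer code):

```cpp
#include <cuda_runtime.h>

// Requesting exactly the opt-in maximum dynamic shared memory is valid, so
// only a strictly larger request should force a fallback to another config.
bool trySetDynamicSmem(const void* kernel, size_t shared_size, int device) {
  int max_smem_per_block = 0;
  cudaDeviceGetAttribute(&max_smem_per_block,
                         cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
  if (shared_size > static_cast<size_t>(max_smem_per_block)) {
    return false;  // genuinely too large even with opt-in
  }
  return cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                              static_cast<int>(shared_size)) == cudaSuccess;
}
```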
197-217: Smart routing: active_expert_global_ids semantics. In the smart-routing branch, the stored id is i (local expert index), while in the else-branch it is i + start_expert (global id). If consumers expect global ids in both modes (per the comment), convert to global via i + start_expert. Otherwise, please add a comment clarifying that LL mode uses local ids.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (1)

- csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (56 hunks)
Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
Additional comments (3)
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh (3)
1886-1894: Incorrect reference in line citation. The concern about padded_cols and unpadded_cols consistency applies only to lines 1886-1894 and 1901-1909 (the actual kernel launches within finalizeMoeRoutingKernelLauncher). Lines 3850-3858 reference a different function call (Self::gemm2()) and should not be included. All three finalizeMoeRoutingKernelLauncher call sites (2925-2930, 3302-3308, 3309-3315) correctly pass hidden_size and unpadded_hidden_size with consistent semantic mapping to the kernel parameters.

Likely an incorrect or invalid review comment.
270-286: Incorrect line references in review comment. Lines 637-645 and 686-696 are kernel device code (the globalExpertPrefixSumLargeKernel and globalExpertPrefixSumKernel function implementations), not cudaLaunchKernelEx call sites. The actual launches in cutlass_fused_moe_kernels.cuh at lines 281, 617, 738, 743, and 801 all consistently set attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization with .programmaticStreamSerializationAllowed = enable_pdl.

Likely an incorrect or invalid review comment.
4040-4051: Aliasing and memset size are already correctly handled. The code at lines 2811-2812 allocates both fc1_fp4_act_scale_ and fc2_fp4_act_scale_ from the same workspace buffer key ("fp4_act_scale"), ensuring they point to identical memory. The workspace allocation (lines 2606-2612) uses std::max(fc1_fp4_act_scale_size, fc2_fp4_act_scale_size) to reserve space, and the memset at lines 4066-4069 uses the identical std::max(fc1_sf_offset, fc2_sf_offset) logic to compute the fill size. The assertion at line 4049 confirms the aliasing invariant. Since both allocation and memset employ the same max-size calculation, the memset is guaranteed to fit within the workspace allocation.
[CANCELING] Pipeline #37985869: canceled

/bot run
The failed UT on gb300 is not relevant, LGTM on my side.
[FAILED] Pipeline #37989907: 12/17 passed
There are still some remaining cu126 compilation issues, likely because we didn't add a guard on the usage of __nv_fp4_e2m1 (see the FP4 guard comment below). We will retire cu126 at some point, but not now (considering cu126 is still one of the three supported CUDA versions of PyTorch).
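A sketch of the kind of guard that would fix this on pre-12.8 toolkits (macro spelling follows the review suggestion below; the include and alias are illustrative assumptions):

```cpp
// CUDA_VERSION comes from cuda.h; __nv_fp4_e2m1 only exists in CUDA 12.8+.
#include <cuda.h>
#if defined(ENABLE_FP4) && (CUDA_VERSION >= 12080)
#include <cuda_fp4.h>  // provides __nv_fp4_e2m1
#endif
```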
Actionable comments posted: 0
Caution
Some comments are outside the diff and can't be posted inline due to platform limitations.

Outside diff range comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
99-102: FP4 guards insufficient for CUDA 12.6 compatibility. The guard only checks ENABLE_FP4, but __nv_fp4_e2m1 requires CUDA 12.8+. The CI failure on cu126 (reported in the PR objectives) confirms this: the identifier is undefined because CUDA 12.6 doesn't provide it. The same issue exists at lines 249-253, 742-746, and 755-759. Apply guards that also check the CUDA version:

```diff
-#if defined(ENABLE_FP4)
+#if defined(ENABLE_FP4) && CUDA_VERSION >= 12080
   cutlass::platform::is_same<WeightType, __nv_fp4_e2m1>::value ||
 #endif
```

Repeat for all FP4 type references at lines 249-253, 742-746, and 755-759.
Duplicate comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
672-676: Fix zero-argument call to supportsTmaWarpSpecialized. This duplicates a past review concern: isTmaWarpSpecialized calls supportsTmaWarpSpecialized() without arguments on line 675, but the signature at lines 679-688 now requires an int sm parameter. The same issue occurs at line 920 in calcMaxWorkspaceSize. Apply this diff to forward the member's sm_:

```diff
- return supportsTmaWarpSpecialized() && config_is_tma_warp_specialized;
+ return supportsTmaWarpSpecialized(sm_) && config_is_tma_warp_specialized;
```

Also fix line 920:

```diff
- if (!supportsTmaWarpSpecialized()) {
+ if (!supportsTmaWarpSpecialized(sm_)) {
```

Alternatively, add a const wrapper in the class:

```cpp
bool supportsTmaWarpSpecialized() const { return supportsTmaWarpSpecialized(sm_); }
```

Based on learnings
Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
953-956: Consider extending FINALIZE fusion workspace calculation beyond SM90. FINALIZE fusion workspace size is currently only calculated for SM90 (line 954). If other architectures (e.g., SM100+) support finalize fusion, they should also be included in this calculation to avoid underestimating workspace requirements.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (1)

- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (14 hunks)
Additional context used

Code graph analysis (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (6)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (3)
- tensorrt_llm (63-112)
- std (81-95)
- calcMaxWorkspaceSizeTmaWarpSpecialized (490-502)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include/common.h (1)
- tensorrt_llm (19-34)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws_mixed_dtype.h (1)
- tensorrt_llm (60-274)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_type_conversion.h (9)
- tensorrt_llm (33-150)
- kernels (34-149)
- cutlass (114-116)
- cutlass (120-122)
- cutlass (127-129)
- cutlass (132-134)
- cutlass (140-142)
- cutlass_kernels (35-148)
- __nv_fp8_e5m2 (91-93)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)
- get_candidate_configs (638-689)
- get_candidate_configs (638-640)

csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1)
- EpilogueScheduleType (197-433)
Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
Additional comments (4)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (4)
530-544: LGTM: Clean signature updates for finalize fusion support. The addition of the supports_finalize_fusion parameter to both the const member and static getConfigs methods properly threads this capability flag through the config selection pipeline.
624-629: Verify SM103 FP4 config selection strategy. The code explicitly adds SM100 configs when running on SM103 with FP4. Ensure this cross-architecture config reuse is validated and doesn't cause performance regressions or compatibility issues.

631-666: Well-structured finalize fusion and swap_ab config expansion. The logic correctly:
- Duplicates configs and marks them with FINALIZE fusion type when supported (lines 631-640)
- Removes FINALIZE configs that lack epilogue SMEM (lines 642-650)
- Adds swap_ab variants for all configs (lines 653-659) with a defensive check
- Filters to swap_ab=true only for w4_groupwise mode (lines 661-666)
978-1007: Activation type dispatch looks correct. The switch statement appropriately handles the supported activation types (Relu, Gelu, Silu, Identity, Swiglu, Geglu) and throws for invalid types. Note that Relu2 from the ActivationType enum is not handled, which appears intentional per the AI summary noting "Relu2 path removed (no longer supported)".
/bot run
LGTM.
Perhaps just add the additional tests for DSR1 and autotuner we discussed.
```cpp
      cute::make_shape(gemm_n, gemm_k, 1));
}
if (layout_info.stride_c) {
  // TODO Enable 1xN bias matrix as C
```
Does this mean we don't support batch size = 1?
No, it's just that the bias tensor cannot be 1xN.
Description

Related Issues

Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

Pre-commit Checks

- I have installed pre-commit by running pip install pre-commit (or used your preferred method).
- I have installed the hooks with pre-commit install.
- I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

Tests

- Tests have been added or updated as needed.
- All tests are passing (unittest, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Tests