
Conversation

@nzmora-nvidia
Collaborator

@nzmora-nvidia nzmora-nvidia commented Oct 22, 2025

For the Triton fused_moe_kernel, search for a device-specific (SKU) tile-size configuration using the batch size as the key. Each device has its own configuration file in JSON format. If no config file is found, we revert to the default tile-size configuration.
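
A minimal sketch of the selection logic described above, assuming a config directory laid out like triton_fused_moe_configs/ with file names keyed by E, N, and device name; the default values and helper names are illustrative, not the actual TRT-LLM implementation:

import json
from pathlib import Path

import torch

CONFIG_DIR = Path(__file__).parent / "triton_fused_moe_configs"

# Illustrative defaults used when no tuned file exists for the current device.
DEFAULT_TILE_CONFIG = {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 32,
    "GROUP_SIZE_M": 8,
    "num_warps": 4,
    "num_stages": 4,
}

def load_device_configs(E: int, N: int) -> dict[int, dict] | None:
    """Return the tuned per-batch-size entries for this GPU, or None if absent."""
    device = torch.cuda.get_device_name().replace(" ", "_")
    path = CONFIG_DIR / f"E={E},N={N},device_name={device}.json"
    if not path.exists():
        return None  # caller reverts to DEFAULT_TILE_CONFIG
    raw = json.loads(path.read_text())
    # Keys are batch sizes stored as strings; skip any non-numeric metadata keys.
    return {int(k): v for k, v in raw.items() if k.isdigit()}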

Summary by CodeRabbit

  • Chores
    • Added comprehensive performance tuning configurations for fused Mixture-of-Experts operations across multiple NVIDIA GPU architectures (A100, H100, H200, B200, etc.) and data type variants, enabling optimized kernel parameter selection for different workload sizes.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@nzmora-nvidia nzmora-nvidia requested a review from a team as a code owner October 22, 2025 23:32
@nzmora-nvidia nzmora-nvidia requested review from lucaslie, nvchenghaoz and suyoggupta and removed request for lucaslie October 22, 2025 23:32
@coderabbitai
Contributor

coderabbitai bot commented Oct 22, 2025

📝 Walkthrough

Walkthrough

This pull request adds more than 100 JSON configuration files for Triton fused Mixture-of-Experts (MoE) kernel tuning across multiple NVIDIA GPU devices, data types, and model configurations. Each file maps numeric batch-size keys to kernel launch parameters (BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, num_stages). No code logic is altered; this is purely configuration data for the auto-deploy mechanism.

Changes

Cohort / File(s) Summary
Triton Fused MoE Configuration Files
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=*.json
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=*,N=*.json
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=*,N=*,device_name=*.json
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=*,N=*,device_name=*,dtype=*.json
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=*,N=*,device_name=*,dtype=*,block_shape=*.json
Adds 100+ JSON configuration files providing pre-tuned kernel parameters for different combinations of expert count (E), hidden dimension (N), GPU device (NVIDIA A100, H100, H200, B200, GB200, H20, A800, L20, etc.), data type (int8_w8a16, fp8_w8a8, float8), and optional block shape. Each file contains a dictionary mapping configuration keys (numeric strings) to tuning parameters with consistent schema: BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Rationale: While the changes are homogeneous (repetitive JSON configuration files following an identical structure), the large volume (100+ files) requires systematic spot-checking for JSON validity, parameter consistency, and plausibility of tuning values across devices. The review benefits from the consistent pattern but demands verification of coverage, proper naming conventions, and absence of obvious configuration errors or duplicates.

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Description Check ❓ Inconclusive The PR description provides a brief explanation of the core functionality: implementing device-specific configuration lookup for the Triton fused MoE kernel using batch size as a key, with fallback to default configuration if device-specific files are not found. However, the description template is largely incomplete. The "Test Coverage" section is empty with no test cases listed, and the "PR Checklist" section contains mostly unchecked items with minimal completion. While the core description content is present and reasonably specific about what the change does, the overall PR description lacks the comprehensive coverage expected by the template, particularly around testing strategy and pre-submission verification items. To resolve this, the author should complete the PR description template by: (1) providing specific test cases in the "Test Coverage" section that validate the device-specific configuration lookup functionality and fallback behavior, (2) filling out all relevant items in the "PR Checklist" to confirm the change meets coding guidelines, has appropriate test coverage, and includes necessary documentation updates, and (3) ensuring all sections are adequately filled to demonstrate thorough pre-submission review.
✅ Passed checks (2 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
Title Check ✅ Passed The PR title "[TRTLLM-8511][feat] AutoDeploy: optimize fused_mlp_moe_kernel tiles" follows the required format with a JIRA ticket, type descriptor, and concise description. It clearly summarizes the primary change—optimizing tile configurations for the fused MoE kernel in the AutoDeploy system. The title is specific enough that reviewers scanning history would understand this involves configuration optimization rather than code logic changes. While there is a minor naming inconsistency (files reference "fused_moe" not "fused_mlp_moe"), the title accurately captures the main objective of the changeset.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (9)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_H100_80GB_HBM3.json (1)

1-146: Confirm configuration tuning and validity.

Since this PR introduces 100+ JSON config files without code changes, please verify:

  • These configurations were generated through an automated tuning process on H100 hardware.
  • The schema is validated at runtime when loading these configs.
  • Fallback/default behavior is documented if a batch size key is not present in this file.

Consider adding a README or schema validation to document expected parameter ranges and any inter-parameter constraints (e.g., GROUP_SIZE_M must divide BLOCK_SIZE_M).
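
As one possible shape for such a check, a hedged Python sketch; the required field names come from the config files in this PR, while the value-range and divisibility checks are assumptions to confirm against the kernel's actual constraints:

REQUIRED_FIELDS = ("BLOCK_SIZE_M", "BLOCK_SIZE_N", "BLOCK_SIZE_K",
                   "GROUP_SIZE_M", "num_warps", "num_stages")

def validate_entry(batch_key: str, entry: dict) -> list:
    """Return human-readable problems found in one batch-size entry."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = entry.get(name)
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{batch_key}: {name} must be a positive integer, got {value!r}")
    # Inter-parameter constraint suggested above; enable only if the kernel requires it.
    # if entry.get("BLOCK_SIZE_M", 0) % max(entry.get("GROUP_SIZE_M", 1), 1) != 0:
    #     problems.append(f"{batch_key}: GROUP_SIZE_M does not divide BLOCK_SIZE_M")
    return problems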

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json (1)

1-146: Verify batch size lookup strategy and fallback handling in loader code.

This JSON configuration file is syntactically valid and appears well-structured for mapping batch sizes to kernel tuning parameters. However, several aspects require verification in the loading/lookup mechanism:

  1. Batch size coverage: The file defines configurations for specific batch sizes (1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024, 1536, 2048, 3072, 4096). How are requests for batch sizes not in this set handled? (e.g., batch size 5, 10, etc.)

  2. Fallback mechanism: The PR mentions fallback to a default configuration if the device config is not found. Confirm that the fallback also handles batch sizes not defined in this file.

  3. Parameter validation: Ensure the loading code validates that required fields (BLOCK_SIZE_M/N/K, GROUP_SIZE_M, num_warps, num_stages) are present and have reasonable numeric values.

Please verify the configuration loading mechanism by examining the code that reads these JSON files. The verification should confirm:

  • How batch size lookups are performed (exact match vs. closest match vs. fallback to default); one possible strategy is sketched after this list
  • Error handling for missing batch sizes
  • Parameter validation before use
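
For illustration, one possible closest-match lookup with a default fallback (a sketch only; the helper name and signature are not taken from the actual loader):

def select_kernel_params(configs, batch_size: int, default: dict) -> dict:
    """Exact key if present, else the nearest tuned batch size, else the default."""
    if not configs:
        return default  # device-specific file missing or empty
    if batch_size in configs:
        return configs[batch_size]
    nearest = min(configs, key=lambda k: abs(k - batch_size))
    return configs[nearest]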

Consider adding a top-level _metadata object to document the configuration schema and version:

{
    "_metadata": {
        "version": "1.0",
        "device": "NVIDIA_H200",
        "dtype": "fp8_w8a8",
        "E": 128,
        "N": 768,
        "block_shape": [128, 128],
        "description": "Triton MoE kernel parameters indexed by batch size"
    },
    "1": {
        "BLOCK_SIZE_M": 64,
        ...
    }
}

This would improve maintainability and self-documentation without changing the core lookup logic.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128,128].json (1)

2-145: Configuration patterns appear reasonable but lack tuning documentation.

The file shows sensible progression: small batches (1–128) use conservative block sizes (BLOCK_SIZE_M=16), while larger batches (256+) scale up to BLOCK_SIZE_M=64 with increasing GROUP_SIZE_M, which aligns with typical Triton kernel tuning heuristics. However, num_warps (4) and num_stages (3) remain constant across all batch sizes—clarify whether this is device-specific tuning wisdom or a missed opportunity for further optimization.

Additionally, include a brief comment (either in the filename, a README, or the PR description) documenting the tuning methodology: Which benchmark/workload was used? What performance metric was optimized? This context is valuable for future maintenance and validation.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_A100-SXM4-80GB.json (2)

1-218: Consider adding schema metadata for maintainability and validation.

With 100+ configuration files across multiple devices and configurations, adding metadata to the JSON would improve robustness:

{
  "_schema_version": "1.0",
  "_generated_at": "2025-10-22",
  "_device": "NVIDIA_A100-SXM4-80GB",
  "_experts": 16,
  "_hidden_dim": 1792,
  "_notes": "Batch sizes are keys; values are Triton kernel launch parameters.",
  "1": { ... },
  ...
}

This enables:

  • Validation of file structure and compatibility
  • Tracking of config age and maintenance
  • Documentation of filename encoding (what do E and N represent?)
  • Easier debugging of issues across the configuration suite

1-218: Establish validation and testing strategy for the configuration suite.

With 100+ device-specific configuration files, consider:

  1. Schema validation: Add a CI check that validates all JSON files conform to the expected schema (required fields, value ranges); a sketch of such a check follows this list.
  2. Coverage testing: Verify that batch sizes in each config file cover the expected range and that interpolation/fallback logic is tested.
  3. Consistency checks: Flag suspicious patterns (e.g., identical configs across different device types) that might indicate copy-paste errors.
  4. Configuration generation process: Document how these configs were tuned/generated (benchmark tool, hyperparameter search strategy, reproducibility) so future maintainers can regenerate or update them.
  5. Integration tests: Load each config file and verify it produces valid kernel launch parameters when queried.
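
A minimal pytest-style sketch of the schema check from point 1; the directory path matches this PR, everything else is illustrative:

import json
from pathlib import Path

import pytest

CONFIG_DIR = Path(
    "tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs"
)
REQUIRED = {"BLOCK_SIZE_M", "BLOCK_SIZE_N", "BLOCK_SIZE_K",
            "GROUP_SIZE_M", "num_warps", "num_stages"}

@pytest.mark.parametrize("path", sorted(CONFIG_DIR.glob("*.json")), ids=lambda p: p.name)
def test_config_schema(path: Path):
    data = json.loads(path.read_text())  # invalid JSON fails the test here
    batch_entries = {k: v for k, v in data.items() if k.isdigit()}
    assert batch_entries, f"{path.name} has no batch-size entries"
    for key, entry in batch_entries.items():
        missing = REQUIRED - entry.keys()
        assert not missing, f"{path.name}[{key}] is missing fields {sorted(missing)}"
        assert all(isinstance(entry[f], int) and entry[f] > 0 for f in REQUIRED)
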
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=1792,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1)

1-146: Valid configuration file; note indentation inconsistency.

JSON syntax and structure are correct. Extended batch-size coverage (26 keys) is appropriate for larger model scenarios. However, this file uses 2-space indentation while earlier files (files 1–4) use 4-space indentation. This is a minor formatting inconsistency across the configuration set.

Consider normalizing indentation to 4 spaces across all configuration files for consistency.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json (1)

1-146: Valid configuration file; note indentation inconsistency.

JSON syntax and structure are correct. Configuration values are within Triton valid ranges. However, this file uses 2-space indentation while most other files use 4-space indentation, reinforcing the formatting inconsistency observed across the configuration set.

Normalize indentation to 4 spaces across all configuration files for consistency.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3072,device_name=NVIDIA_H200,dtype=int8_w8a16.json (1)

1-146: JSON is valid, but indentation is inconsistent with the companion configuration file.

This file uses 4-space indentation while E=128,N=1024,device_name=NVIDIA_H200,dtype=fp8_w8a8.json uses 2-space indentation. While not a functional issue, normalizing indentation across all configuration files would improve consistency and maintainability.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1)

2-17: Minor: Identical configurations for batch sizes 1 and 2.

Batch sizes "1" and "2" share identical kernel parameters. While this may be intentional (treating both as "small batch" cases), verify this was not an unintended duplication during configuration generation or tuning.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00c2b81 and e8c21ce.

📒 Files selected for processing (107)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=14336,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=14336,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=1792,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=1792,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=1792,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3072,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3072,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3072,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3072,device_name=NVIDIA_H200,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3584,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3584,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3584,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=7168,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=7168,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1024,device_name=NVIDIA_H100,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1024,device_name=NVIDIA_H200,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1024,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=192,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=192,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=192,device_name=NVIDIA_H20-3e.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=192,device_name=NVIDIA_H20.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=192,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=352,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H20,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H20-3e.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H20.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=512,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=704,device_name=NVIDIA_B200,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=704,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H20,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H20.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=96,device_name=NVIDIA_H20.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1024,device_name=NVIDIA_B200,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1024,device_name=NVIDIA_B200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1024,device_name=NVIDIA_H100.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1024,device_name=NVIDIA_H200,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1024,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1344,device_name=NVIDIA_A100-SXM4-40GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1344,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1344,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=14336,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=14336,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=14336,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=2048,device_name=NVIDIA_H200,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=2048,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=2688,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=2688,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3072,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3072,device_name=NVIDIA_H100_80GB_HBM3,dtype=float8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3072,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3072,device_name=NVIDIA_H200,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3200,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3584,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3584,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3584,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=6400,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=7168,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=7168,device_name=NVIDIA_H100_80GB_HBM3,dtype=float8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=7168,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=800,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=192,device_name=NVIDIA_A800-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=192,device_name=NVIDIA_H20-3e.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=320,device_name=NVIDIA_H20-3e.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=640,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=640,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=640,device_name=NVIDIA_H100,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=20,N=2560,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=20,N=2560,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=20,N=2560,device_name=NVIDIA_H100,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=20,N=2560,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_A800-SXM4-80GB,dtype=int8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_A800-SXM4-80GB,dtype=int8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_H20,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_H20,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_H20,dtype=int8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_L20,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=512,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=64,device_name=NVIDIA_A800-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=32,N=1408,device_name=NVIDIA_B200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=32,N=2048,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=32,N=2048,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=384,N=128,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=384,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
⛔ Files not processed due to max files limit (24)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=384,N=128,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=384,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=384,N=256,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=40,N=1536,device_name=NVIDIA_B200,dtype=fp8_w8a8.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=40,N=2560,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=40,N=2560,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=40,N=2560,device_name=NVIDIA_H100,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=128,device_name=NVIDIA_B200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=128,device_name=NVIDIA_H100_80GB_HBM3.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=128,device_name=NVIDIA_H20-3e.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=128,device_name=NVIDIA_H200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_B200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_H100_80GB_HBM3.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_H20-3e.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_H200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=512,device_name=NVIDIA_B200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=512,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=512,device_name=NVIDIA_H100_80GB_HBM3.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=512,device_name=NVIDIA_H20-3e.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=512,device_name=NVIDIA_H200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=64,device_name=NVIDIA_B200.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Comment on lines +1 to +147
{
"triton_version": "3.4.0",
"1": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"2": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"4": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 4
},
"8": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"16": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"24": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 64,
"num_warps": 4,
"num_stages": 3
},
"32": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"48": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"64": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"96": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"128": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"256": {
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"512": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 4
},
"1024": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 4
},
"1536": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"2048": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"3072": {
"BLOCK_SIZE_M": 256,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 4
},
"4096": {
"BLOCK_SIZE_M": 256,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 4
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Validate batch-size key format and configuration file contract.

The JSON uses string keys (e.g., "1", "2", ..., "4096") to represent batch sizes. Ensure that:

  1. The consuming code correctly parses batch-size keys as strings and performs the intended lookup/matching logic
  2. All numeric parameter values (BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, num_stages) are within valid ranges for Triton kernel configuration
  3. The triton_version field matches the version expected by the deployment system

Run a verification to check:

  • Are batch-size keys consistently treated as strings across all similar config files?
  • Are there any out-of-bounds or malformed entries across all device config files that follow this pattern?

🏁 Script executed:

#!/bin/bash
# Description: Validate JSON structure and key format consistency across all triton_fused_moe_configs files

# Check if all config files are valid JSON
echo "Validating JSON syntax..."
find tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs -name "*.json" -exec sh -c '
  jq empty "$1" 2>&1 || echo "Invalid JSON: $1"
' _ {} \;

# Sample check: Verify that all files have triton_version field
echo -e "\nChecking triton_version presence..."
find tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs -name "*.json" -exec sh -c '
  if ! jq -e ".triton_version" "$1" > /dev/null 2>&1; then
    echo "Missing triton_version: $1"
  fi
' _ {} \;

# Check for inconsistent GROUP_SIZE_M values relative to BLOCK_SIZE_M
echo -e "\nChecking GROUP_SIZE_M <= BLOCK_SIZE_M constraint..."
find tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs -name "*.json" -exec sh -c '
  jq -r "to_entries[] | select(.value | type == \"object\" and has(\"GROUP_SIZE_M\") and has(\"BLOCK_SIZE_M\")) | select(.value.GROUP_SIZE_M > .value.BLOCK_SIZE_M) | \"\(.key): GROUP_SIZE_M=\(.value.GROUP_SIZE_M) > BLOCK_SIZE_M=\(.value.BLOCK_SIZE_M)\"" "$1" | while read line; do
    [ -n "$line" ] && echo "$1: $line"
  done
' _ {} \;

Length of output: 100555


Fix systematic configuration constraint violations across triton_fused_moe_configs.

Verification reveals two critical issues affecting this file and nearly all others in the directory:

  1. Missing triton_version field: 243+ config files lack the required triton_version field. The reviewed file correctly includes "triton_version": "3.4.0", but this is exceptional.

  2. GROUP_SIZE_M > BLOCK_SIZE_M constraint violations: Widespread violations detected. In the reviewed file, batch size key "24" violates the constraint: GROUP_SIZE_M=64 > BLOCK_SIZE_M=16. This pattern is repeated across hundreds of entries in all device config files, affecting batch sizes 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024, and higher.

The scale and consistency of these violations across the entire config directory suggest either a broken config generation pipeline or a schema/validation gap. Audit the generation process and ensure all configs satisfy kernel constraints before deployment.

🤖 Prompt for AI Agents
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_H100_80GB_HBM3.json
lines 1-147: this file (and many others) either miss the required triton_version
field or contain entries where GROUP_SIZE_M > BLOCK_SIZE_M (e.g. key "24" has
GROUP_SIZE_M=64 while BLOCK_SIZE_M=16). Fix by ensuring every config has a
"triton_version" key set (add if missing) and enforce GROUP_SIZE_M <=
BLOCK_SIZE_M for every batch-size entry (adjust violating values to be <=
BLOCK_SIZE_M, e.g. change GROUP_SIZE_M from 64 to 16 or recompute to a valid
divisor), and add/restore a generation-time validation step (and CI check) that
rejects config files that fail these schema constraints so the generator cannot
emit invalid configs.

Comment on lines +1 to +122
{
"2": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 64,
"num_warps": 4,
"num_stages": 5
},
"4": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"8": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 32,
"num_warps": 4,
"num_stages": 5
},
"16": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"32": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"64": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"128": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 5
},
"256": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"512": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 3
},
"1024": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 8,
"num_stages": 3
},
"2048": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 256,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 8,
"num_stages": 4
},
"3072": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 256,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 8,
"num_stages": 4
},
"4096": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 256,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 8,
"num_stages": 4
},
"8192": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"16384": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 256,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 32,
"num_warps": 8,
"num_stages": 2
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify JSON structure and batch-size key lookup in the consuming code.

The JSON file is syntactically valid and well-formatted, with consistent parameter structures across all batch sizes. However, without access to the Python code that loads this file, I cannot confirm critical integration details.

Please verify the following using the code that loads these configuration files:

  1. Batch size key type handling: Are the string keys (e.g., "2", "4") correctly parsed? Does the code convert the incoming batch size to a string before lookup, or does it expect integer keys?
  2. Error handling & fallback: If a batch size is not found in the config, does the code gracefully fall back to a default configuration as mentioned in the PR objectives?
  3. File path resolution: Confirm that the filename convention (especially the device_name, E, N, and dtype parameters) is correctly resolved at runtime.

Run the following verification script to locate and inspect the code that consumes these configuration files:


🏁 Script executed:

#!/bin/bash
# Search for code that loads triton_fused_moe_configs JSON files

# Find Python files that reference the config directory or load JSON configs
rg -n "triton_fused_moe_configs|fused_moe.*config" --type=py -A 5 -B 2

# Search for JSON loading logic to understand key type handling
rg -n "json\.load|json\.loads" --type=py tensorrt_llm/_torch/auto_deploy -A 3 -B 1 | head -50

Length of output: 2554



🏁 Script executed:

#!/bin/bash
# Get full function context around the config loading code
sed -n '190,240p' tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py

# Find the function definition and caller context
rg -n "def.*config|batch_size" tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py -A 2 -B 1 | head -100

# Search for error handling and KeyError catches
rg -n "except|KeyError|get\(|fallback|default" tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py -B 2 -A 2

Length of output: 4805


Fix dtype parameter not being passed to get_moe_configs().

The JSON file structure is valid and batch-size key handling is correct: string keys are converted to integers at line 220, and lookup uses the closest matching batch size to avoid KeyError.

However, a critical bug was found in the calling code: _get_kernel_config() at line 234 in tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py hard-codes dtype=None instead of using the dtype parameter passed to the function. Since the config filename includes dtype (e.g., dtype=fp8_w8a8.json), this causes the wrong configuration file to be selected.

Change line 234 from:

configs = get_moe_configs(E, N, dtype=None)

to:

configs = get_moe_configs(E, N, dtype=dtype)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py around
line 234, the call to get_moe_configs(...) hard-codes dtype=None; update the
call to pass the function's dtype parameter (i.e., use dtype=dtype) so the
correct dtype-specific JSON config file is selected.

Comment on lines +1 to +146
{
"1": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"2": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"4": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"8": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"16": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"24": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"32": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"48": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"64": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"96": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"128": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"256": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"512": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"1024": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"1536": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"2048": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"3072": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 32,
"num_warps": 4,
"num_stages": 4
},
"4096": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 32,
"num_warps": 4,
"num_stages": 4
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

JSON structure is valid; parameter ranges appear reasonable for B200 FP8 tuning.

The configuration file is well-formed with strategically spaced batch size entries (sparse at small batch sizes, exponential-like spacing at larger sizes). Parameter transitions are coherent: BLOCK_SIZE_M and GROUP_SIZE_M scale with batch size, and num_stages/BLOCK_SIZE_K vary appropriately. All values fall within expected ranges for Triton kernel tuning (num_warps=4, BLOCK_SIZE_N=128 fixed; BLOCK_SIZE_M∈{16,64}, GROUP_SIZE_M∈{1,16,32}, num_stages∈{3,4,5}).

However, I need to verify the integration with the calling code and tuning methodology.

Run the following script to verify the batch size lookup strategy and confirm that fallback logic handles missing configs gracefully:

Additionally, verify that these configurations were generated/tuned for the B200 device. Please provide or link to:

  • Documentation on how these parameters were auto-tuned or benchmarked
  • Evidence that this config was validated for correctness and performance on B200



🏁 Script executed:

#!/bin/bash
# Description: Inspect the config loading code for batch size lookup and fallback strategy.

# Find Python files that load these JSON configs
fd -e py -x grep -l 'triton_fused_moe_configs' {} \; 2>/dev/null

# Search for the config loader function/class
rg -nP 'def.*(?:load|select).*config|class.*ConfigLoader' --type=py -g 'fused_moe*'

# Verify fallback mechanism for missing batch sizes
rg -nP 'fallback|default.*config|nearest|batch.*size.*lookup' --type=py -g 'fused_moe*' -A 3 -B 1

Length of output: 68


🏁 Script executed:

#!/bin/bash
# Examine the config loading code in triton_moe.py
cat ./tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py

Length of output: 16987


🏁 Script executed:

#!/bin/bash
# Verify how _get_kernel_config is called and whether there are other callers
rg -n '_get_kernel_config' --type=py

Length of output: 226


JSON file is properly formatted, but config loading code has critical bugs preventing this file from being used.

The JSON structure is valid and parameter ranges are reasonable for B200 FP8 tuning. However, the configuration will never be loaded due to:

  1. dtype hardcoded to None (line 234): get_moe_configs(E, N, dtype=None) causes get_config_file_name() to generate filenames without dtype/block_shape. The file named with dtype=fp8_w8a8,block_shape=[128,128] will never be found; the system falls back to generic defaults.

  2. Parameter type mismatch (line 410): _get_kernel_config(M, E, inter_size2, H, top_k) passes H (tensor dimension) and top_k (int) where the function expects dtype: str | None and block_shape: list[int]. While these parameters are currently unused in the function body, this indicates incorrect API usage and prevents dtype/block_shape from being passed through correctly.

Fix the dtype hardcoding in get_moe_configs() call and correct the parameter types in the caller to enable device/dtype-specific configurations.

Comment on lines +1 to +146
{
"1": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"2": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"4": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"8": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 5
},
"16": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 3
},
"24": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 5
},
"32": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"48": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"64": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"96": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"128": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"256": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"512": {
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"1024": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 3
},
"1536": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"2048": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"3072": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"4096": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

JSON is valid; flag device name inconsistency in documentation.

The H100 configuration is well-formed and shows appropriate device-specific tuning divergence from the GB200 variant (e.g., different GROUP_SIZE_M and num_stages for mid-range batch sizes), which reflects distinct GPU architectures. However, the AI summary claims this file is named ...H100_80GB_HBM3... but the actual filename is NVIDIA_H100.

Clarify the device naming convention. Across the ~100+ configuration files, should device identifiers include memory/variant details (e.g., H100_80GB_HBM3) or remain generic (e.g., NVIDIA_H100)? Ensure the naming is consistent with the device lookup key used at runtime. Run the following verification to check naming consistency across the config directory:


🏁 Script executed:

#!/bin/bash
# List all device config files and their naming patterns
find tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs -name "*.json" -type f | \
  sed 's/.*device_name=//; s/,dtype.*//' | sort | uniq -c | sort -rn

Length of output: 598


Rename file: NVIDIA_H100 → NVIDIA_H100_80GB_HBM3 to match the config directory convention.

The script output confirms a critical naming inconsistency. Across ~178 configuration files, 51+ use the detailed device variant NVIDIA_H100_80GB_HBM3, while only 4 (including this file) use the generic NVIDIA_H100. Similar patterns hold for other device families (H200, H20, A100, etc.), where detailed names with memory/variant specifiers are the standard.

The file under review uses the minority naming pattern and should be renamed to NVIDIA_H100_80GB_HBM3 to align with the convention. Misnamed configs may fail device lookup or apply incorrect tuning parameters at runtime.

🤖 Prompt for AI Agents
In
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=640,device_name=NVIDIA_H100,dtype=fp8_w8a8,block_shape=[128,128].json
(lines 1-146) the device name in the filename uses the generic "NVIDIA_H100" but
the repo convention requires the detailed variant "NVIDIA_H100_80GB_HBM3";
rename the file to replace NVIDIA_H100 with NVIDIA_H100_80GB_HBM3 (keeping the
rest of the filename identical), update any references in code/CI/configs that
import or lookup this filename to the new name, and verify there are no
duplicate files after the rename.

@nzmora-nvidia nzmora-nvidia changed the title [TRTLLM-8511][feat] Dynamically select a tile size for fused_mlp_moe_kernel [TRTLLM-8511][feat] AutoDeploy: optimize fused_mlp_moe_kernel tiles Oct 22, 2025
@suyoggupta
Copy link
Collaborator

@nzmora-nvidia : what kind of perf upside are you seeing with the ability to dynamically select the config?

…kernel

For the Triton fused_moe_kernel, search for a device-specific (SKU) tile-size configuration
using the batch size as the key. Each device has its own configuration file in JSON format.
If no config file is found, we revert to the default tile-size configuration.

Signed-off-by: Neta Zmora <[email protected]>
@nzmora-nvidia nzmora-nvidia force-pushed the users/nzmora/auto-select-moe-kernel-config-main branch from fd27001 to 7a4518b Compare October 23, 2025 09:50
@nzmora-nvidia
Copy link
Collaborator Author

/bot run

@@ -0,0 +1,146 @@
{
Collaborator

Is INT8 needed?

Collaborator Author

Thanks for catching this.
