
Conversation

@nzmora-nvidia
Collaborator

@nzmora-nvidia nzmora-nvidia commented Oct 22, 2025

For the Triton fused_moe_kernel, search for a device-specific (SKU) tile-size configuration using the batch size as the key. Each device has its own configuration file in JSON format. If no config file is found, we revert to the default tile-size configuration.
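
A minimal sketch of the selection logic described above, assuming a config directory laid out like triton_fused_moe_configs/ with file names keyed by E, N, and device name; the default values and helper names are illustrative, not the actual TRT-LLM implementation:

import json
from pathlib import Path

import torch

CONFIG_DIR = Path(__file__).parent / "triton_fused_moe_configs"

# Illustrative defaults used when no tuned file exists for the current device.
DEFAULT_TILE_CONFIG = {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 32,
    "GROUP_SIZE_M": 8,
    "num_warps": 4,
    "num_stages": 4,
}

def load_device_configs(E: int, N: int) -> dict[int, dict] | None:
    """Return the tuned per-batch-size entries for this GPU, or None if absent."""
    device = torch.cuda.get_device_name().replace(" ", "_")
    path = CONFIG_DIR / f"E={E},N={N},device_name={device}.json"
    if not path.exists():
        return None  # caller reverts to DEFAULT_TILE_CONFIG
    raw = json.loads(path.read_text())
    # Keys are batch sizes stored as strings; skip any non-numeric metadata keys.
    return {int(k): v for k, v in raw.items() if k.isdigit()}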

Summary by CodeRabbit

  • Chores
    • Added comprehensive performance tuning configurations for fused Mixture-of-Experts operations across multiple NVIDIA GPU architectures (A100, H100, H200, B200, etc.) and data type variants, enabling optimized kernel parameter selection for different workload sizes.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@nzmora-nvidia nzmora-nvidia requested a review from a team as a code owner October 22, 2025 23:32
@nzmora-nvidia nzmora-nvidia requested review from lucaslie, nvchenghaoz and suyoggupta and removed request for lucaslie October 22, 2025 23:32
@coderabbitai
Contributor

coderabbitai bot commented Oct 22, 2025

📝 Walkthrough

Walkthrough

This pull request adds more than 100 JSON configuration files for Triton fused Mixture-of-Experts (MoE) kernel tuning across multiple NVIDIA GPU devices, data types, and model configurations. Each file maps numeric batch-size keys to kernel launch parameters (BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, num_stages). No code logic is altered; this is purely configuration data for the auto-deploy mechanism.

Changes

Cohort / File(s) Summary
Triton Fused MoE Configuration Files
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=*.json
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=*,N=*.json
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=*,N=*,device_name=*.json
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=*,N=*,device_name=*,dtype=*.json
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=*,N=*,device_name=*,dtype=*,block_shape=*.json
Adds 100+ JSON configuration files providing pre-tuned kernel parameters for different combinations of expert count (E), hidden dimension (N), GPU device (NVIDIA A100, H100, H200, B200, GB200, H20, A800, L20, etc.), data type (int8_w8a16, fp8_w8a8, float8), and optional block shape. Each file contains a dictionary mapping configuration keys (numeric strings) to tuning parameters with consistent schema: BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, and num_stages.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Rationale: While the changes are homogeneous (repetitive JSON configuration files following an identical structure), the large volume (100+ files) requires systematic spot-checking for JSON validity, parameter consistency, and plausibility of tuning values across devices. The review benefits from the consistent pattern but demands verification of coverage, proper naming conventions, and absence of obvious configuration errors or duplicates.

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Description Check ❓ Inconclusive The PR description provides a brief explanation of the core functionality: implementing device-specific configuration lookup for the Triton fused MoE kernel using batch size as a key, with fallback to default configuration if device-specific files are not found. However, the description template is largely incomplete. The "Test Coverage" section is empty with no test cases listed, and the "PR Checklist" section contains mostly unchecked items with minimal completion. While the core description content is present and reasonably specific about what the change does, the overall PR description lacks the comprehensive coverage expected by the template, particularly around testing strategy and pre-submission verification items. To resolve this, the author should complete the PR description template by: (1) providing specific test cases in the "Test Coverage" section that validate the device-specific configuration lookup functionality and fallback behavior, (2) filling out all relevant items in the "PR Checklist" to confirm the change meets coding guidelines, has appropriate test coverage, and includes necessary documentation updates, and (3) ensuring all sections are adequately filled to demonstrate thorough pre-submission review.
✅ Passed checks (2 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
Title Check ✅ Passed The PR title "[TRTLLM-8511][feat] AutoDeploy: optimize fused_mlp_moe_kernel tiles" follows the required format with a JIRA ticket, type descriptor, and concise description. It clearly summarizes the primary change—optimizing tile configurations for the fused MoE kernel in the AutoDeploy system. The title is specific enough that reviewers scanning history would understand this involves configuration optimization rather than code logic changes. While there is a minor naming inconsistency (files reference "fused_moe" not "fused_mlp_moe"), the title accurately captures the main objective of the changeset.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (9)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_H100_80GB_HBM3.json (1)

1-146: Confirm configuration tuning and validity.

Since this PR introduces 100+ JSON config files without code changes, please verify:

  • These configurations were generated through an automated tuning process on H100 hardware.
  • The schema is validated at runtime when loading these configs.
  • Fallback/default behavior is documented if a batch size key is not present in this file.

Consider adding a README or schema validation to document expected parameter ranges and any inter-parameter constraints (e.g., GROUP_SIZE_M must divide BLOCK_SIZE_M).
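
As one possible shape for such a check, a hedged Python sketch; the required field names come from the config files in this PR, while the value-range and divisibility checks are assumptions to confirm against the kernel's actual constraints:

REQUIRED_FIELDS = ("BLOCK_SIZE_M", "BLOCK_SIZE_N", "BLOCK_SIZE_K",
                   "GROUP_SIZE_M", "num_warps", "num_stages")

def validate_entry(batch_key: str, entry: dict) -> list:
    """Return human-readable problems found in one batch-size entry."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = entry.get(name)
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{batch_key}: {name} must be a positive integer, got {value!r}")
    # Inter-parameter constraint suggested above; enable only if the kernel requires it.
    # if entry.get("BLOCK_SIZE_M", 0) % max(entry.get("GROUP_SIZE_M", 1), 1) != 0:
    #     problems.append(f"{batch_key}: GROUP_SIZE_M does not divide BLOCK_SIZE_M")
    return problems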

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json (1)

1-146: Verify batch size lookup strategy and fallback handling in loader code.

This JSON configuration file is syntactically valid and appears well-structured for mapping batch sizes to kernel tuning parameters. However, several aspects require verification in the loading/lookup mechanism:

  1. Batch size coverage: The file defines configurations for specific batch sizes (1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024, 1536, 2048, 3072, 4096). How are requests for batch sizes not in this set handled? (e.g., batch size 5, 10, etc.)

  2. Fallback mechanism: The PR mentions fallback to a default configuration if the device config is not found. Confirm that the fallback also handles batch sizes not defined in this file.

  3. Parameter validation: Ensure the loading code validates that required fields (BLOCK_SIZE_M/N/K, GROUP_SIZE_M, num_warps, num_stages) are present and have reasonable numeric values.

Please verify the configuration loading mechanism by examining the code that reads these JSON files. The verification should confirm:

  • How batch size lookups are performed (exact match vs. closest match vs. fallback to default); one possible strategy is sketched after this list
  • Error handling for missing batch sizes
  • Parameter validation before use
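
For illustration, one possible closest-match lookup with a default fallback (a sketch only; the helper name and signature are not taken from the actual loader):

def select_kernel_params(configs, batch_size: int, default: dict) -> dict:
    """Exact key if present, else the nearest tuned batch size, else the default."""
    if not configs:
        return default  # device-specific file missing or empty
    if batch_size in configs:
        return configs[batch_size]
    nearest = min(configs, key=lambda k: abs(k - batch_size))
    return configs[nearest]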

Consider adding a top-level _metadata object to document the configuration schema and version:

{
    "_metadata": {
        "version": "1.0",
        "device": "NVIDIA_H200",
        "dtype": "fp8_w8a8",
        "E": 128,
        "N": 768,
        "block_shape": [128, 128],
        "description": "Triton MoE kernel parameters indexed by batch size"
    },
    "1": {
        "BLOCK_SIZE_M": 64,
        ...
    }
}

This would improve maintainability and self-documentation without changing the core lookup logic.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128,128].json (1)

2-145: Configuration patterns appear reasonable but lack tuning documentation.

The file shows sensible progression: small batches (1–128) use conservative block sizes (BLOCK_SIZE_M=16), while larger batches (256+) scale up to BLOCK_SIZE_M=64 with increasing GROUP_SIZE_M, which aligns with typical Triton kernel tuning heuristics. However, num_warps (4) and num_stages (3) remain constant across all batch sizes—clarify whether this is device-specific tuning wisdom or a missed opportunity for further optimization.

Additionally, include a brief comment (either in the filename, a README, or the PR description) documenting the tuning methodology: Which benchmark/workload was used? What performance metric was optimized? This context is valuable for future maintenance and validation.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_A100-SXM4-80GB.json (2)

1-218: Consider adding schema metadata for maintainability and validation.

With 100+ configuration files across multiple devices and configurations, adding metadata to the JSON would improve robustness:

{
  "_schema_version": "1.0",
  "_generated_at": "2025-10-22",
  "_device": "NVIDIA_A100-SXM4-80GB",
  "_experts": 16,
  "_hidden_dim": 1792,
  "_notes": "Batch sizes are keys; values are Triton kernel launch parameters.",
  "1": { ... },
  ...
}

This enables:

  • Validation of file structure and compatibility
  • Tracking of config age and maintenance
  • Documentation of filename encoding (what do E and N represent?)
  • Easier debugging of issues across the configuration suite

1-218: Establish validation and testing strategy for the configuration suite.

With 100+ device-specific configuration files, consider:

  1. Schema validation: Add a CI check that validates all JSON files conform to the expected schema (required fields, value ranges); a sketch of such a check follows this list.
  2. Coverage testing: Verify that batch sizes in each config file cover the expected range and that interpolation/fallback logic is tested.
  3. Consistency checks: Flag suspicious patterns (e.g., identical configs across different device types) that might indicate copy-paste errors.
  4. Configuration generation process: Document how these configs were tuned/generated (benchmark tool, hyperparameter search strategy, reproducibility) so future maintainers can regenerate or update them.
  5. Integration tests: Load each config file and verify it produces valid kernel launch parameters when queried.
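
A minimal pytest-style sketch of the schema check from point 1; the directory path matches this PR, everything else is illustrative:

import json
from pathlib import Path

import pytest

CONFIG_DIR = Path(
    "tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs"
)
REQUIRED = {"BLOCK_SIZE_M", "BLOCK_SIZE_N", "BLOCK_SIZE_K",
            "GROUP_SIZE_M", "num_warps", "num_stages"}

@pytest.mark.parametrize("path", sorted(CONFIG_DIR.glob("*.json")), ids=lambda p: p.name)
def test_config_schema(path: Path):
    data = json.loads(path.read_text())  # invalid JSON fails the test here
    batch_entries = {k: v for k, v in data.items() if k.isdigit()}
    assert batch_entries, f"{path.name} has no batch-size entries"
    for key, entry in batch_entries.items():
        missing = REQUIRED - entry.keys()
        assert not missing, f"{path.name}[{key}] is missing fields {sorted(missing)}"
        assert all(isinstance(entry[f], int) and entry[f] > 0 for f in REQUIRED)
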
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=1792,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1)

1-146: Valid configuration file; note indentation inconsistency.

JSON syntax and structure are correct. Extended batch-size coverage (26 keys) is appropriate for larger model scenarios. However, this file uses 2-space indentation while earlier files (files 1–4) use 4-space indentation. This is a minor formatting inconsistency across the configuration set.

Consider normalizing indentation to 4 spaces across all configuration files for consistency.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json (1)

1-146: Valid configuration file; note indentation inconsistency.

JSON syntax and structure are correct. Configuration values are within Triton valid ranges. However, this file uses 2-space indentation while most other files use 4-space indentation, reinforcing the formatting inconsistency observed across the configuration set.

Normalize indentation to 4 spaces across all configuration files for consistency.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3072,device_name=NVIDIA_H200,dtype=int8_w8a16.json (1)

1-146: JSON is valid, but indentation is inconsistent with the companion configuration file.

This file uses 4-space indentation while E=128,N=1024,device_name=NVIDIA_H200,dtype=fp8_w8a8.json uses 2-space indentation. While not a functional issue, normalizing indentation across all configuration files would improve consistency and maintainability.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1)

2-17: Minor: Identical configurations for batch sizes 1 and 2.

Batch sizes "1" and "2" share identical kernel parameters. While this may be intentional (treating both as "small batch" cases), verify this was not an unintended duplication during configuration generation or tuning.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00c2b81 and e8c21ce.

📒 Files selected for processing (107)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=14336,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=14336,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=1792,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=1792,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=1792,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3072,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3072,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3072,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3072,device_name=NVIDIA_H200,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3584,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3584,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=3584,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=7168,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=1,N=7168,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1024,device_name=NVIDIA_H100,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1024,device_name=NVIDIA_H200,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1024,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=192,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=192,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=192,device_name=NVIDIA_H20-3e.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=192,device_name=NVIDIA_H20.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=192,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=352,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H20,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H20-3e.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H20.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=384,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=512,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=704,device_name=NVIDIA_B200,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=704,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H20,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H20.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=768,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=96,device_name=NVIDIA_H20.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1024,device_name=NVIDIA_B200,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1024,device_name=NVIDIA_B200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1024,device_name=NVIDIA_H100.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1024,device_name=NVIDIA_H200,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1024,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1344,device_name=NVIDIA_A100-SXM4-40GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1344,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1344,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=14336,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=14336,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=14336,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=1792,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=2048,device_name=NVIDIA_H200,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=2048,device_name=NVIDIA_H200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=2688,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=2688,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3072,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3072,device_name=NVIDIA_H100_80GB_HBM3,dtype=float8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3072,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3072,device_name=NVIDIA_H200,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3200,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3584,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3584,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=3584,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=6400,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=7168,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=7168,device_name=NVIDIA_A100-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=7168,device_name=NVIDIA_H100_80GB_HBM3,dtype=float8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=7168,device_name=NVIDIA_H100_80GB_HBM3,dtype=int8_w8a16.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=16,N=800,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=192,device_name=NVIDIA_A800-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=192,device_name=NVIDIA_H20-3e.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=320,device_name=NVIDIA_H20-3e.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=640,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=640,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=640,device_name=NVIDIA_H100,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=20,N=2560,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=20,N=2560,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=20,N=2560,device_name=NVIDIA_H100,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=20,N=2560,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_A100-SXM4-80GB,dtype=int8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_A800-SXM4-80GB,dtype=int8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_A800-SXM4-80GB,dtype=int8_w8a8.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_H20,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=128,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_H20,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_H20,dtype=int8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_H20-3e,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=256,device_name=NVIDIA_L20,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=512,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=256,N=64,device_name=NVIDIA_A800-SXM4-80GB.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=32,N=1408,device_name=NVIDIA_B200.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=32,N=2048,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=32,N=2048,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=384,N=128,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=384,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json (1 hunks)
⛔ Files not processed due to max files limit (24)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=384,N=128,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=384,N=256,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=384,N=256,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=40,N=1536,device_name=NVIDIA_B200,dtype=fp8_w8a8.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=40,N=2560,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=40,N=2560,device_name=NVIDIA_GB200,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=40,N=2560,device_name=NVIDIA_H100,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=128,device_name=NVIDIA_B200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=128,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=128,device_name=NVIDIA_H100_80GB_HBM3.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=128,device_name=NVIDIA_H20-3e.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=128,device_name=NVIDIA_H200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_B200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_H100_80GB_HBM3.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_H20-3e.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=256,device_name=NVIDIA_H200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=512,device_name=NVIDIA_B200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=512,device_name=NVIDIA_GB200,dtype=fp8_w8a8.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=512,device_name=NVIDIA_H100_80GB_HBM3.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=512,device_name=NVIDIA_H20-3e.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=512,device_name=NVIDIA_H200.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=512,N=64,device_name=NVIDIA_B200.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Comment on lines +1 to +147
{
"triton_version": "3.4.0",
"1": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"2": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"4": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 4
},
"8": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"16": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"24": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 64,
"num_warps": 4,
"num_stages": 3
},
"32": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"48": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"64": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"96": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"128": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 32,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"256": {
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"512": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 4
},
"1024": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 4
},
"1536": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"2048": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 5
},
"3072": {
"BLOCK_SIZE_M": 256,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 4
},
"4096": {
"BLOCK_SIZE_M": 256,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 4
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Validate batch-size key format and configuration file contract.

The JSON uses string keys (e.g., "1", "2", ..., "4096") to represent batch sizes. Ensure that:

  1. The consuming code correctly parses batch-size keys as strings and performs the intended lookup/matching logic
  2. All numeric parameter values (BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M, num_warps, num_stages) are within valid ranges for Triton kernel configuration
  3. The triton_version field matches the version expected by the deployment system

Run a verification to check:

  • Are batch-size keys consistently treated as strings across all similar config files?
  • Are there any out-of-bounds or malformed entries across all device config files that follow this pattern?

🏁 Script executed:

#!/bin/bash
# Description: Validate JSON structure and key format consistency across all triton_fused_moe_configs files

# Check if all config files are valid JSON
echo "Validating JSON syntax..."
find tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs -name "*.json" -exec sh -c '
  jq empty "$1" 2>&1 || echo "Invalid JSON: $1"
' _ {} \;

# Sample check: Verify that all files have triton_version field
echo -e "\nChecking triton_version presence..."
find tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs -name "*.json" -exec sh -c '
  if ! jq -e ".triton_version" "$1" > /dev/null 2>&1; then
    echo "Missing triton_version: $1"
  fi
' _ {} \;

# Check for inconsistent GROUP_SIZE_M values relative to BLOCK_SIZE_M
echo -e "\nChecking GROUP_SIZE_M <= BLOCK_SIZE_M constraint..."
find tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs -name "*.json" -exec sh -c '
  jq -r "to_entries[] | select(.value | type == \"object\" and has(\"GROUP_SIZE_M\") and has(\"BLOCK_SIZE_M\")) | select(.value.GROUP_SIZE_M > .value.BLOCK_SIZE_M) | \"\(.key): GROUP_SIZE_M=\(.value.GROUP_SIZE_M) > BLOCK_SIZE_M=\(.value.BLOCK_SIZE_M)\"" "$1" | while read line; do
    [ -n "$line" ] && echo "$1: $line"
  done
' _ {} \;

Length of output: 100555


Fix systematic configuration constraint violations across triton_fused_moe_configs.

Verification reveals two critical issues affecting this file and nearly all others in the directory:

  1. Missing triton_version field: 243+ config files lack the required triton_version field. The reviewed file correctly includes "triton_version": "3.4.0", but this is exceptional.

  2. GROUP_SIZE_M > BLOCK_SIZE_M constraint violations: Widespread violations detected. In the reviewed file, batch size key "24" violates the constraint: GROUP_SIZE_M=64 > BLOCK_SIZE_M=16. This pattern is repeated across hundreds of entries in all device config files, affecting batch sizes 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024, and higher.

The scale and consistency of these violations across the entire config directory suggest either a broken config generation pipeline or a schema/validation gap. Audit the generation process and ensure all configs satisfy kernel constraints before deployment.

🤖 Prompt for AI Agents
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_H100_80GB_HBM3.json
lines 1-147: this file (and many others) either miss the required triton_version
field or contain entries where GROUP_SIZE_M > BLOCK_SIZE_M (e.g. key "24" has
GROUP_SIZE_M=64 while BLOCK_SIZE_M=16). Fix by ensuring every config has a
"triton_version" key set (add if missing) and enforce GROUP_SIZE_M <=
BLOCK_SIZE_M for every batch-size entry (adjust violating values to be <=
BLOCK_SIZE_M, e.g. change GROUP_SIZE_M from 64 to 16 or recompute to a valid
divisor), and add/restore a generation-time validation step (and CI check) that
rejects config files that fail these schema constraints so the generator cannot
emit invalid configs.

Comment on lines +1 to +122
{
"2": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 64,
"num_warps": 4,
"num_stages": 5
},
"4": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"8": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 32,
"num_warps": 4,
"num_stages": 5
},
"16": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"32": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"64": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"128": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 5
},
"256": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"512": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 8,
"num_stages": 3
},
"1024": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 8,
"num_stages": 3
},
"2048": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 256,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 8,
"num_stages": 4
},
"3072": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 256,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 8,
"num_stages": 4
},
"4096": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 256,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 8,
"num_stages": 4
},
"8192": {
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"16384": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 256,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 32,
"num_warps": 8,
"num_stages": 2
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify JSON structure and batch-size key lookup in the consuming code.

The JSON file is syntactically valid and well-formatted, with consistent parameter structures across all batch sizes. However, without access to the Python code that loads this file, I cannot confirm critical integration details.

Please verify the following using the code that loads these configuration files:

  1. Batch size key type handling: Are the string keys (e.g., "2", "4") correctly parsed? Does the code convert the incoming batch size to a string before lookup, or does it expect integer keys?
  2. Error handling & fallback: If a batch size is not found in the config, does the code gracefully fall back to a default configuration as mentioned in the PR objectives?
  3. File path resolution: Confirm that the filename convention (especially the device_name, E, N, and dtype parameters) is correctly resolved at runtime.

Run the following verification script to locate and inspect the code that consumes these configuration files:


🏁 Script executed:

#!/bin/bash
# Search for code that loads triton_fused_moe_configs JSON files

# Find Python files that reference the config directory or load JSON configs
rg -n "triton_fused_moe_configs|fused_moe.*config" --type=py -A 5 -B 2

# Search for JSON loading logic to understand key type handling
rg -n "json\.load|json\.loads" --type=py tensorrt_llm/_torch/auto_deploy -A 3 -B 1 | head -50

Length of output: 2554



🏁 Script executed:

#!/bin/bash
# Get full function context around the config loading code
sed -n '190,240p' tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py

# Find the function definition and caller context
rg -n "def.*config|batch_size" tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py -A 2 -B 1 | head -100

# Search for error handling and KeyError catches
rg -n "except|KeyError|get\(|fallback|default" tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py -B 2 -A 2

Length of output: 4805


Fix dtype parameter not being passed to get_moe_configs().

The JSON file structure is valid and batch-size key handling is correct: string keys are converted to integers at line 220, and lookup uses the closest matching batch size to avoid KeyError.

However, a critical bug was found in the calling code: _get_kernel_config() at line 234 in tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py hard-codes dtype=None instead of using the dtype parameter passed to the function. Since the config filename includes dtype (e.g., dtype=fp8_w8a8.json), this causes the wrong configuration file to be selected.

Change line 234 from:

configs = get_moe_configs(E, N, dtype=None)

to:

configs = get_moe_configs(E, N, dtype=dtype)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py around
line 234, the call to get_moe_configs(...) hard-codes dtype=None; update the
call to pass the function's dtype parameter (i.e., use dtype=dtype) so the
correct dtype-specific JSON config file is selected.

Comment on lines +1 to +146
{
"1": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"2": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"4": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"8": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"16": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"24": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"32": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"48": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"64": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"96": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"128": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 5
},
"256": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"512": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"1024": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"1536": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"2048": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"3072": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 32,
"num_warps": 4,
"num_stages": 4
},
"4096": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 32,
"num_warps": 4,
"num_stages": 4
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

JSON structure is valid; parameter ranges appear reasonable for B200 FP8 tuning.

The configuration file is well-formed with strategically spaced batch size entries (sparse at small batch sizes, exponential-like spacing at larger sizes). Parameter transitions are coherent: BLOCK_SIZE_M and GROUP_SIZE_M scale with batch size, and num_stages/BLOCK_SIZE_K vary appropriately. All values fall within expected ranges for Triton kernel tuning (num_warps=4, BLOCK_SIZE_N=128 fixed; BLOCK_SIZE_M∈{16,64}, GROUP_SIZE_M∈{1,16,32}, num_stages∈{3,4,5}).

However, I need to verify the integration with the calling code and tuning methodology.

Run the following script to verify the batch size lookup strategy and confirm that fallback logic handles missing configs gracefully:

Additionally, verify that these configurations were generated/tuned for the B200 device. Please provide or link to:

  • Documentation on how these parameters were auto-tuned or benchmarked
  • Evidence that this config was validated for correctness and performance on B200



🏁 Script executed:

#!/bin/bash
# Description: Inspect the config loading code for batch size lookup and fallback strategy.

# Find Python files that load these JSON configs
fd -e py -x grep -l 'triton_fused_moe_configs' {} \; 2>/dev/null

# Search for the config loader function/class
rg -nP 'def.*(?:load|select).*config|class.*ConfigLoader' --type=py -g 'fused_moe*'

# Verify fallback mechanism for missing batch sizes
rg -nP 'fallback|default.*config|nearest|batch.*size.*lookup' --type=py -g 'fused_moe*' -A 3 -B 1

Length of output: 68


🏁 Script executed:

#!/bin/bash
# Examine the config loading code in triton_moe.py
cat ./tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py

Length of output: 16987


🏁 Script executed:

#!/bin/bash
# Verify how _get_kernel_config is called and whether there are other callers
rg -n '_get_kernel_config' --type=py

Length of output: 226


JSON file is properly formatted, but config loading code has critical bugs preventing this file from being used.

The JSON structure is valid and parameter ranges are reasonable for B200 FP8 tuning. However, the configuration will never be loaded due to:

  1. dtype hardcoded to None (line 234): get_moe_configs(E, N, dtype=None) causes get_config_file_name() to generate filenames without dtype/block_shape. The file named with dtype=fp8_w8a8,block_shape=[128,128] will never be found; the system falls back to generic defaults.

  2. Parameter type mismatch (line 410): _get_kernel_config(M, E, inter_size2, H, top_k) passes H (tensor dimension) and top_k (int) where the function expects dtype: str | None and block_shape: list[int]. While these parameters are currently unused in the function body, this indicates incorrect API usage and prevents dtype/block_shape from being passed through correctly.

Fix the dtype hardcoding in get_moe_configs() call and correct the parameter types in the caller to enable device/dtype-specific configurations.

Comment on lines +1 to +146
{
"1": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 4
},
"2": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"4": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"8": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 5
},
"16": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 3
},
"24": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 5
},
"32": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"48": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"64": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"96": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"128": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"256": {
"BLOCK_SIZE_M": 16,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"512": {
"BLOCK_SIZE_M": 32,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 1,
"num_warps": 4,
"num_stages": 3
},
"1024": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 3
},
"1536": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"2048": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"3072": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 256,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
},
"4096": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 128,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 16,
"num_warps": 4,
"num_stages": 4
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

JSON is valid; flag device name inconsistency in documentation.

The H100 configuration is well-formed and shows appropriate device-specific tuning divergence from the GB200 variant (e.g., different GROUP_SIZE_M and num_stages for mid-range batch sizes), which reflects distinct GPU architectures. However, the AI summary claims this file is named ...H100_80GB_HBM3... but the actual filename is NVIDIA_H100.

Clarify the device naming convention. Across the ~100+ configuration files, should device identifiers include memory/variant details (e.g., H100_80GB_HBM3) or remain generic (e.g., NVIDIA_H100)? Ensure the naming is consistent with the device lookup key used at runtime. Run the following verification to check naming consistency across the config directory:


🏁 Script executed:

#!/bin/bash
# List all device config files and their naming patterns
find tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs -name "*.json" -type f | \
  sed 's/.*device_name=//; s/,dtype.*//' | sort | uniq -c | sort -rn

Length of output: 598


Rename file: NVIDIA_H100 → NVIDIA_H100_80GB_HBM3 to match the config directory convention.

The script output confirms a critical naming inconsistency. Across ~178 configuration files, 51+ use the detailed device variant NVIDIA_H100_80GB_HBM3, while only 4 (including this file) use the generic NVIDIA_H100. Similar patterns hold for other device families (H200, H20, A100, etc.), where detailed names with memory/variant specifiers are the standard.

The file under review uses the minority naming pattern and should be renamed to NVIDIA_H100_80GB_HBM3 to align with the convention. Misnamed configs may fail device lookup or apply incorrect tuning parameters at runtime.

🤖 Prompt for AI Agents
In
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=160,N=640,device_name=NVIDIA_H100,dtype=fp8_w8a8,block_shape=[128,128].json
(lines 1-146) the device name in the filename uses the generic "NVIDIA_H100" but
the repo convention requires the detailed variant "NVIDIA_H100_80GB_HBM3";
rename the file to replace NVIDIA_H100 with NVIDIA_H100_80GB_HBM3 (keeping the
rest of the filename identical), update any references in code/CI/configs that
import or lookup this filename to the new name, and verify there are no
duplicate files after the rename.

@nzmora-nvidia nzmora-nvidia changed the title [TRTLLM-8511][feat] Dynamically select a tile size for fused_mlp_moe_kernel [TRTLLM-8511][feat] AutoDeploy: optimize fused_mlp_moe_kernel tiles Oct 22, 2025
@suyoggupta
Copy link
Collaborator

@nzmora-nvidia : what kind of perf upside are you seeing with the ability to dynamically select the config?

…kernel

For the Triton fused_moe_kernel, search for a device-specific (SKU) tile-size configuration
using the batch size as the key. Each device has its own configuration file in JSON format.
If no config file is found, we revert to the default tile-size configuration.

Signed-off-by: Neta Zmora <[email protected]>
@nzmora-nvidia nzmora-nvidia force-pushed the users/nzmora/auto-select-moe-kernel-config-main branch from fd27001 to 7a4518b Compare October 23, 2025 09:50
@nzmora-nvidia
Copy link
Collaborator Author

/bot run

@@ -0,0 +1,146 @@
{
Collaborator

Is INT8 needed?

Collaborator Author

Thanks for catching this.
