Conversation

@hyukn (Collaborator) commented Oct 21, 2025

Summary by CodeRabbit

Release Notes

  • New Features
    • Added AUTOTUNE strategy for AllReduce operations to automatically select and apply the best reduction tactic based on tensor characteristics.
    • Enhanced distributed synchronization with improved MPI coordination across multiple ranks for reliable profiling and operation execution.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@hyukn hyukn force-pushed the feat/allreduce_autotune branch from 8c2fe39 to cf7acdd on October 22, 2025 04:50
@hyukn (Collaborator, Author) commented Oct 22, 2025

/bot run --disable-fail-fast --only-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #22122 [ run ] triggered by Bot. Commit: cf7acdd

@hyukn hyukn changed the title [TRTLLM-8129][feat] Apply AutoTuner to AllReduce Op for strategy tuning. [TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. Oct 22, 2025
@tensorrt-cicd (Collaborator)

PR_Github #22122 [ run ] completed with state SUCCESS. Commit: cf7acdd
/LLM/main/L0_MergeRequest_PR pipeline #16681 (Partly Tested) completed with status: 'FAILURE'

@hyukn (Collaborator, Author) commented Oct 22, 2025

/bot run --disable-fail-fast --only-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #22158 [ run ] triggered by Bot. Commit: cf7acdd

@tensorrt-cicd (Collaborator)

PR_Github #22158 [ run ] completed with state FAILURE. Commit: cf7acdd
/LLM/main/L0_MergeRequest_PR pipeline #16707 (Partly Tested) completed with status: 'FAILURE'

@hyukn hyukn force-pushed the feat/allreduce_autotune branch from 6535fa5 to e2c1926 on October 23, 2025 08:53
@hyukn (Collaborator, Author) commented Oct 23, 2025

/bot run --disable-fail-fast --only-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #22286 [ run ] triggered by Bot. Commit: e2c1926

@tensorrt-cicd (Collaborator)

PR_Github #22286 [ run ] completed with state DISABLED
L0 testing is limited to prioritized users. User hyukn is not in the prioritized list. L0 testing cannot be triggered.

@hyukn (Collaborator, Author) commented Oct 24, 2025

/bot run --disable-fail-fast --only-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #22337 [ run ] triggered by Bot. Commit: 8729ac4

@tensorrt-cicd (Collaborator)

PR_Github #22337 [ run ] completed with state SUCCESS. Commit: 8729ac4
/LLM/main/L0_MergeRequest_PR pipeline #16840 (Partly Tested) completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@hyukn hyukn marked this pull request as ready for review October 24, 2025 05:40
@hyukn hyukn requested review from a team as code owners October 24, 2025 05:40
@hyukn hyukn requested review from byshiue and syuoni October 24, 2025 05:40
@coderabbitai (Contributor) bot commented Oct 24, 2025

📝 Walkthrough

This change introduces an autotuning framework for AllReduce operations in TensorRT-LLM. It adds MPI-coordinated profiling in the AutoTuner, implements a new AllReduceRunner class with a tunable_allreduce custom operator, extends the AllReduceStrategy enum with an AUTOTUNE option, threads the strategy through model configurations, and includes comprehensive test coverage for the new capability.

Changes

Cohort / File(s) Summary
MPI Synchronization & Profiling
tensorrt_llm/_torch/autotuner.py
Adds MPI utilities (mpi_barrier, mpi_broadcast) to coordinate profiling across ranks, introduces _is_sync_op helper to identify synchronization-required operations, and removes legacy profiling_debug flag.
AllReduce Autotuning Operator
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
Introduces AllReduceRunner class for planning and tactic selection, adds tunable_allreduce custom op that uses AutoTuner to pick optimal tactics (NCCL, ONESHOT, TWOSHOT), and registers fake implementation for fallback paths.
AllReduce Strategy Integration
tensorrt_llm/_torch/distributed/ops.py
Adds conditional logic in AllReduce.forward to route AUTOTUNE strategy through tunable_allreduce while preserving existing paths for other strategies.
Strategy Enumeration & Configuration
tensorrt_llm/functional.py, tensorrt_llm/_torch/model_config.py, tensorrt_llm/llmapi/llm_args.py
Adds AUTOTUNE = 9 to AllReduceStrategy enum, maps "AUTOTUNE" string to enum in config resolution, and expands allreduce_strategy literal type to include 'AUTOTUNE'.
Model Implementation
tensorrt_llm/_torch/models/modeling_llama.py
Updates Llama4DecoderLayer to pass strategy parameter from model_config to AllReduce constructor.
Test Coverage
tests/unittest/_torch/multi_gpu/test_allreduce.py
Adds is_autotune parameter to test harness functions, integrates autotune decorator for profiling, replaces manual fusion paths with calc_fused_allreduce, adds test_tunable_allreduce_fusion_patterns for multi-config autotuning validation.
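
Taken together, these pieces suggest a usage pattern along the lines of the sketch below. This is a minimal sketch rather than code from the diff: the AllReduce constructor arguments, the mapping object, the bare autotune() call, and the helper names are assumptions based on the summaries above and on the test changes.

import torch

from tensorrt_llm._torch.autotuner import autotune
from tensorrt_llm._torch.distributed.ops import AllReduce
from tensorrt_llm.functional import AllReduceStrategy


def build_autotuned_allreduce(mapping):
    # Route this layer's allreduce through the new tunable path.
    return AllReduce(mapping=mapping, strategy=AllReduceStrategy.AUTOTUNE)


def run(allreduce_op, hidden_states: torch.Tensor) -> torch.Tensor:
    # Profiling only happens inside the autotune() context; the chosen tactic
    # (NCCL / ONESHOT / TWOSHOT) is cached and reused for matching shapes.
    with autotune():
        allreduce_op(hidden_states)  # warm-up / profiling pass
    return allreduce_op(hidden_states)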

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant AR as AllReduce
    participant TAR as tunable_allreduce
    participant ATuner as AutoTuner
    participant Runner as AllReduceRunner
    participant NCCL as NCCL/Ops

    App->>AR: forward(strategy=AUTOTUNE)
    AR->>TAR: tunable_allreduce(...)
    
    rect rgb(200, 220, 255)
    Note over TAR,ATuner: Profiling Phase
    TAR->>ATuner: profile tactics (NCCL, ONESHOT, TWOSHOT)
    ATuner->>Runner: time each tactic
    ATuner->>ATuner: MPI barrier & broadcast results
    ATuner->>TAR: best_tactic
    end
    
    rect rgb(220, 255, 220)
    Note over TAR,NCCL: Execution Phase
    TAR->>Runner: execute(best_tactic)
    Runner->>NCCL: allreduce_op with selected tactic
    NCCL->>Runner: result
    Runner->>TAR: output
    end
    
    TAR->>AR: result
    AR->>App: output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: The changes span multiple architectural layers with new public APIs (AllReduceRunner, tunable_allreduce, AUTOTUNE strategy), introduce MPI synchronization coordination logic requiring careful verification, integrate the autotuning path into distributed ops with conditional branching, and include substantial test augmentation with new test patterns. While changes follow a cohesive theme, each component (profiling coordination, operator implementation, strategy routing, config threading) requires separate verification for correctness and integration points.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check (⚠️ Warning): The PR description is largely incomplete. The author has provided only the repository's PR template with HTML comments intact and has not filled in the critical "Description" or "Test Coverage" sections that should explain the issue, the solution, and relevant test coverage. While the checklist box is marked as checked, this appears to be superficial as the substantive content sections remain blank. The description offers no actual explanation of the changes, their motivation, or testing strategy beyond the template structure. Resolution: The author should provide a meaningful PR description by filling in the "Description" section with an explanation of the issue being addressed and how the AutoTuner integration solves it, and complete the "Test Coverage" section by listing the relevant tests (such as test_tunable_allreduce_fusion_patterns mentioned in the raw summary) that validate these changes. The PR checklist items should also be reviewed and confirmed to ensure all requirements are actually met.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 12.90%, which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check (✅ Passed): The PR title "[TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning" is well-structured and directly reflects the main changes in the pull request. The changes across multiple files clearly support this title: new AutoTuner integration with AllReduce operations, introduction of a tunable_allreduce function, addition of an AUTOTUNE strategy, and MPI synchronization updates. The title is concise, specific, and follows the required format with ticket ID and type indicator, making it clear to reviewers what the primary objective is.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

806-818: NameError in fake: use act_fp4 (not act_fp8).

The fake for w4a8_mxfp4_fp8_gemm references undefined act_fp8. Use the parameter act_fp4:

-    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
-                             dtype=output_dtype)
+    return act_fp4.new_empty((act_fp4.size(0), weight.size(0)),
+                             dtype=output_dtype)
tests/unittest/_torch/multi_gpu/test_allreduce.py (1)

1-2: Update SPDX year to current (2025).

Guideline asks for current year in headers. Please bump 2024 → 2025.

Apply this diff:

-# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🧹 Nitpick comments (7)
tensorrt_llm/_torch/distributed/ops.py (1)

611-611: Minor: 'strategy' arg is unused in tunable_allreduce.

You pass strategy=allreduce_strategy but tunable_allreduce ignores it. Either use it (e.g., to constrain tactics) or drop the parameter to reduce confusion.
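
If the parameter is kept, one option (not part of the PR, purely illustrative) is to let it constrain which tactics are profiled:

from typing import List

from tensorrt_llm.functional import AllReduceStrategy


def tactics_for(strategy_hint: int) -> List[int]:
    # Hypothetical helper: AUTO/AUTOTUNE sweep all candidate tactics, while an
    # explicit strategy pins profiling to that single tactic.
    sweep = [
        AllReduceStrategy.NCCL.value,
        AllReduceStrategy.ONESHOT.value,
        AllReduceStrategy.TWOSHOT.value,
    ]
    if strategy_hint in (AllReduceStrategy.AUTO.value,
                         AllReduceStrategy.AUTOTUNE.value):
        return sweep
    return [strategy_hint] if strategy_hint in sweep else sweep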

tensorrt_llm/_torch/autotuner.py (1)

719-725: Rank-0 gating for sync ops to avoid N× profiling; small scoping nit.

Docstring claims “only rank 0 profiles,” but the loop still profiles on every rank. For AllReduceRunner, consider:

  • If is_sync_op and rank != 0: skip timing loops and set (best_runner_id, best_tactic, min_time) to sentinel; rely on the broadcasted results.
  • Keep the existing barriers to maintain determinism.

Also, the post-loop check uses the last loop’s 'runner' binding. Use any(...) over runners or compute a flag before loops.

Happy to draft a patch if you want to gate profiling to rank 0.

Also applies to: 760-779
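
For reference, a rough sketch of that gating, assuming every rank must still execute each collective candidate (so the op doesn't hang) while only rank 0 records timings and broadcasts the winner. profile_candidate and the tactic iteration are illustrative stand-ins for the autotuner's internals; mpi_rank is assumed to be available alongside mpi_barrier/mpi_broadcast in tensorrt_llm._utils:

from tensorrt_llm._utils import mpi_barrier, mpi_broadcast, mpi_rank


def choose_sync_tactic(runners, profile_candidate):
    best = (-1, -1, float("inf"))  # (runner_id, tactic, time) sentinel
    for runner_id, runner in enumerate(runners):
        for tactic in runner.get_valid_tactics():
            mpi_barrier()  # keep ranks lock-stepped before each measurement
            elapsed = profile_candidate(runner, tactic)  # all ranks participate
            if mpi_rank() == 0 and elapsed < best[2]:
                best = (runner_id, tactic, elapsed)
    # Every rank adopts rank 0's decision so the chosen tactic stays consistent.
    return mpi_broadcast(best, root=0)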

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

1143-1156: Optional: annotate class attributes to satisfy Ruff (RUF012).

Mark all_support_ops and tuning_config as ClassVar to silence lints.

-from typing import List, Mapping, Optional, Tuple
+from typing import List, Mapping, Optional, Tuple, ClassVar
@@
-    all_support_ops = {
+    all_support_ops: ClassVar[set[int]] = {
@@
-    tuning_config = TuningConfig(
+    tuning_config: ClassVar[TuningConfig] = TuningConfig(
tests/unittest/_torch/multi_gpu/test_allreduce.py (4)

249-252: Optional: persist autotune cache to speed retries.

Passing a cache_path reduces repeated profiling across ranks/reruns. Safe: the autotuner writes per-rank files.

Apply this diff and add the import noted below:

-    with autotune(tune_mode=is_autotune):
+    with autotune(
+        tune_mode=is_autotune,
+        cache_path=os.path.join(tempfile.gettempdir(), "trtllm_allreduce_autotune_cache.json"),
+    ):
         calc_output = calc_func(xs[tensor_parallel_rank], residual)

Add at top-level imports:

import tempfile  # for autotune cache path

255-260: Guard against silent truncation in zip and keep Py3.8 compatibility.

Add an explicit length check and annotate the zip to avoid Ruff B905 without using strict (not available on Py3.8).

Apply this diff:

-    for calc_output_tensor, ref_output_tensor in zip(calc_output, ref_output):
+    assert len(calc_output) == len(ref_output), "Mismatched outputs between calc and ref"
+    for calc_output_tensor, ref_output_tensor in zip(calc_output, ref_output):  # noqa: B905 - PY38 compatibility
         check_accuracy(calc_output_tensor,
                        ref_output_tensor,
                        atol=0.05,
                        rtol=0.15,
                        percent=0.99)

296-301: Silence Ruff B905 on the zip(*[...]*tp) transpose (Py3.8 target).

Annotate the call; structure is safe here.

Apply this diff:

-        *zip(*[(tensor_parallel_size, run_allreduce_op, x, residual,
+        *zip(*[(tensor_parallel_size, run_allreduce_op, x, residual,  # noqa: B905 - PY38 compatibility
                 [linear_weight], hidden_size, dtype, fusion_op)] *
              tensor_parallel_size),

338-342: Repeat of the zip(*[...]*tp) pattern: add the same Py3.8-safe annotation.

Keep consistency with the earlier test.

Apply this diff:

-        *zip(*[(tensor_parallel_size, run_allreduce_op, x, residual,
+        *zip(*[(tensor_parallel_size, run_allreduce_op, x, residual,  # noqa: B905 - PY38 compatibility
                 [linear_weight], hidden_size, dtype, fusion_op, True)] *
              tensor_parallel_size),
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a5845e and 8729ac4.

📒 Files selected for processing (8)
  • tensorrt_llm/_torch/autotuner.py (6 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/distributed/ops.py (2 hunks)
  • tensorrt_llm/_torch/model_config.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/functional.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tests/unittest/_torch/multi_gpu/test_allreduce.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/distributed/ops.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/functional.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/unittest/_torch/multi_gpu/test_allreduce.py
🧬 Code graph analysis (6)
tensorrt_llm/_torch/model_config.py (1)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3876-3886)
tensorrt_llm/_torch/models/modeling_llama.py (1)
tensorrt_llm/_torch/distributed/ops.py (1)
  • AllReduce (455-634)
tensorrt_llm/_torch/autotuner.py (1)
tensorrt_llm/_utils.py (2)
  • mpi_barrier (569-571)
  • mpi_broadcast (579-580)
tensorrt_llm/_torch/distributed/ops.py (3)
cpp/tensorrt_llm/thop/allreduceOp.cpp (18)
  • strategy (828-859)
  • strategy (828-828)
  • input (269-301)
  • input (269-272)
  • input (321-399)
  • input (321-323)
  • input (401-432)
  • input (401-403)
  • input (434-476)
  • input (434-436)
  • input (478-543)
  • input (478-480)
  • input (545-727)
  • input (545-549)
  • input (729-795)
  • input (729-732)
  • op (1164-1164)
  • op (1203-1203)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3876-3886)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
  • tunable_allreduce (1213-1248)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)
tensorrt_llm/functional.py (7)
  • AllReduceFusionOp (3889-3898)
  • AllReduceStrategy (3876-3886)
  • Tensor (107-602)
  • shape (270-274)
  • shape (277-282)
  • shape (2056-2096)
  • allreduce (4047-4140)
tensorrt_llm/_torch/autotuner.py (9)
  • TunableRunner (154-210)
  • TuningConfig (54-102)
  • DynamicTensorSpec (24-36)
  • ConstraintSpec (40-50)
  • get_valid_tactics (157-175)
  • forward (181-207)
  • AutoTuner (515-982)
  • get (539-542)
  • choose_one (544-651)
tensorrt_llm/_torch/utils.py (3)
  • last_positive_power_of_2 (218-223)
  • shape (105-106)
  • _ (192-198)
cpp/tensorrt_llm/thop/allreduceOp.cpp (10)
  • op (1164-1164)
  • op (1203-1203)
  • input (269-301)
  • input (269-272)
  • input (321-399)
  • input (321-323)
  • input (401-432)
  • input (401-403)
  • strategy (828-859)
  • strategy (828-828)
tensorrt_llm/quantization/utils/fp4_utils.py (1)
  • get_fp4_shape (31-42)
tests/unittest/_torch/multi_gpu/test_allreduce.py (4)
tensorrt_llm/_torch/autotuner.py (1)
  • autotune (214-246)
tensorrt_llm/_torch/distributed/ops.py (1)
  • AllReduce (455-634)
tensorrt_llm/functional.py (8)
  • AllReduceFusionOp (3889-3898)
  • AllReduceParams (3901-3940)
  • AllReduceStrategy (3876-3886)
  • dtype (255-259)
  • dtype (262-267)
  • rank (492-496)
  • Tensor (107-602)
  • allreduce (4047-4140)
tests/unittest/conftest.py (1)
  • mpi_pool_executor (243-251)
🪛 Ruff (0.14.1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

1144-1147: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


1177-1177: Unused method argument: profile

(ARG002)


1178-1178: Unused method argument: kwargs

(ARG002)


1221-1221: Unused function argument: strategy

(ARG001)


1254-1254: Unused function argument: residual

(ARG001)


1255-1255: Unused function argument: norm_weight

(ARG001)


1256-1256: Unused function argument: scale

(ARG001)


1257-1257: Unused function argument: bias

(ARG001)


1258-1258: Unused function argument: workspace

(ARG001)


1259-1259: Unused function argument: group

(ARG001)


1260-1260: Unused function argument: strategy

(ARG001)


1262-1262: Unused function argument: eps

(ARG001)


1263-1263: Unused function argument: trigger_completion_at_end

(ARG001)

tests/unittest/_torch/multi_gpu/test_allreduce.py

255-255: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


339-341: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🔇 Additional comments (11)
tensorrt_llm/_torch/model_config.py (1)

194-195: LGTM! Clean addition of AUTOTUNE strategy mapping.

The new mapping entry correctly extends the get_all_reduce_strategy function to support the AUTOTUNE strategy. The formatting is consistent with existing entries, and the trailing comma on line 194 follows Python best practices.

tensorrt_llm/functional.py (1)

3886-3887: Enum extension fits strategy list.

Adding AUTOTUNE preserves the sequential IntEnum contract and aligns the strategy surface with the new autotuning pathway. Nice and tidy.

tensorrt_llm/llmapi/llm_args.py (1)

2495-2500: AUTOTUNE value added to allreduce_strategy looks good; confirm enum conversion at call sites.

Please verify that model_config.allreduce_strategy is converted to tensorrt_llm.functional.AllReduceStrategy (not a raw str) before being passed into AllReduce(..., strategy=...). This avoids mismatches in distributed paths. If already handled in ModelConfig, all good.
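
For reference, the conversion in question would look roughly like this; a hedged sketch mirroring the mapping added in model_config, with an illustrative helper name:

from tensorrt_llm.functional import AllReduceStrategy


def resolve_allreduce_strategy(name: str) -> AllReduceStrategy:
    # Map the llm_args string literal (e.g. "AUTOTUNE") to the enum before it
    # reaches AllReduce(..., strategy=...); fall back to AUTO on unknown names.
    try:
        return AllReduceStrategy[name.upper()]
    except KeyError:
        return AllReduceStrategy.AUTO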

tensorrt_llm/_torch/models/modeling_llama.py (1)

653-655: Ensure strategy is an Enum, not a str.

AllReduce(strategy=...) expects AllReduceStrategy. Please confirm model_config.allreduce_strategy is the Enum, not a string Literal, to avoid runtime issues and incorrect branch selection.

tests/unittest/_torch/multi_gpu/test_allreduce.py (7)

24-30: Imports for autotune and accuracy utilities look good.

Adopts public APIs and shared test utilities appropriately.


103-111: Extending run_allreduce_op with is_autotune is consistent with the call chain.

Signature matches the updated caller; no issues spotted.


143-145: Explicit strategy selection is correct (NCCL baseline vs AUTOTUNE).

This guards the baseline path from AUTO/MNNVL variability; good for deterministic testing.


165-165: Normalize return shape for fusion_op == NONE.

Wrapping the tensor in a list makes downstream zipping uniform. Good.


228-242: Fusion dispatch table now covers NONE via the common path.

Mapping aligns with helper functions; consistent with the new return-shape contract.


306-329: New autotune-enabled test matrix is appropriate.

Param choices exercise small and large seq lens and guard NVFP4 with skip_pre_blackwell. Looks good.


66-81: Code change is correct—backward compatibility verified.

Found two call sites in test_allreduce.py:

  • Line 297: passes 8 arguments, is_autotune uses default False
  • Line 338: passes 9 arguments with explicit True for is_autotune

The function signature at line 66 with is_autotune: bool = False ensures existing callers work unchanged while new tests can opt-in by passing True. No positional argument mismatches.

Comment on lines +1143 to +1173
class AllReduceRunner(TunableRunner):
    all_support_ops = {
        AllReduceFusionOp.NONE.value,
        AllReduceFusionOp.RESIDUAL_RMS_NORM.value,
    }

    tuning_config = TuningConfig(
        dynamic_tensor_specs=(DynamicTensorSpec(
            0, 0,
            (8192, 4096, 2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1),
            last_positive_power_of_2), ),
        constraint_specs=(ConstraintSpec(1, 0, lambda shapes: shapes[0][0]), ),
    )

    def __init__(
        self,
        tp_size: int,
        group: List[int],
        op: int,
        eps: float,
        trigger_completion_at_end: bool,
    ):
        self.tp_size = tp_size
        self.op = op
        self._group = group
        self._eps = eps
        self._trigger_completion_at_end = trigger_completion_at_end

    def __hash__(self):
        return hash((self.tp_size, self.op))


⚠️ Potential issue | 🔴 Critical

Fix: Support MPI-disabled PG path in AllReduceRunner.

AllReduceRunner.forward always uses torch.ops.trtllm.allreduce. When MPI is disabled you must call allreduce_pg and pass rank/pg. Suggested patch:

@@
-from tensorrt_llm.functional import AllReduceFusionOp, AllReduceStrategy
+from tensorrt_llm.functional import AllReduceFusionOp, AllReduceStrategy
+from tensorrt_llm._utils import mpi_disabled
+from typing import Optional, Any, ClassVar
@@
-class AllReduceRunner(TunableRunner):
-    all_support_ops = {
+class AllReduceRunner(TunableRunner):
+    all_support_ops: ClassVar[set[int]] = {
         AllReduceFusionOp.NONE.value,
         AllReduceFusionOp.RESIDUAL_RMS_NORM.value,
     }
@@
-    def __init__(
+    def __init__(
         self,
         tp_size: int,
         group: List[int],
         op: int,
         eps: float,
         trigger_completion_at_end: bool,
+        *,
+        rank: Optional[int] = None,
+        pg: Optional[Any] = None,
     ):
         self.tp_size = tp_size
         self.op = op
         self._group = group
         self._eps = eps
         self._trigger_completion_at_end = trigger_completion_at_end
+        self._rank = rank
+        self._pg = pg
@@
-        return torch.ops.trtllm.allreduce(
-            input,
-            residual,
-            norm_weight,
-            scale,
-            bias,
-            workspace,
-            self._group,
-            tactic,
-            self.op,
-            self._eps,
-            self._trigger_completion_at_end,
-        )
+        if mpi_disabled() and self._pg is not None and self._rank is not None:
+            return torch.ops.trtllm.allreduce_pg(
+                input,
+                residual,
+                norm_weight,
+                scale,
+                bias,
+                workspace,
+                self._group,
+                self._pg,
+                tactic,
+                self.op,
+                self._eps,
+                self._trigger_completion_at_end,
+            )
+        else:
+            return torch.ops.trtllm.allreduce(
+                input,
+                residual,
+                norm_weight,
+                scale,
+                bias,
+                workspace,
+                self._group,
+                tactic,
+                self.op,
+                self._eps,
+                self._trigger_completion_at_end,
+            )

Also applies to: 1188-1210

🧰 Tools
🪛 Ruff (0.14.1)

1144-1147: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)
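
Side note on the tuning_config in the quoted AllReduceRunner: dynamic dimension 0 is bucketed via last_positive_power_of_2 (from tensorrt_llm._torch.utils, per the code-graph section), so profiled tactics are reused across nearby token counts. A small illustration, assuming the helper rounds down to the nearest power of two:

from tensorrt_llm._torch.utils import last_positive_power_of_2

for num_tokens in (1, 3, 9, 100, 5000):
    print(num_tokens, "->", last_positive_power_of_2(num_tokens))
# Expected: 1 -> 1, 3 -> 2, 9 -> 8, 100 -> 64, 5000 -> 4096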

Comment on lines +1212 to +1249
@torch.library.custom_op("trtllm::tunable_allreduce", mutates_args=())
def tunable_allreduce(
    input: torch.Tensor,
    residual: Optional[torch.Tensor],
    norm_weight: Optional[torch.Tensor],
    scale: Optional[torch.Tensor],
    bias: Optional[torch.Tensor],
    workspace: Optional[torch.Tensor],
    group: List[int],
    strategy: int,
    op: int,
    eps: float,
    tp_size: int,
    trigger_completion_at_end: bool,
) -> List[torch.Tensor]:

    tuner = AutoTuner.get()

    allreduce_runner = AllReduceRunner(
        tp_size,
        group,
        op,
        eps,
        trigger_completion_at_end,
    )

    _, best_tactic = tuner.choose_one(
        "trtllm::tunable_allreduce::allreduce",
        [allreduce_runner],
        AllReduceRunner.tuning_config,
        [input, residual, norm_weight, scale, bias, workspace],
    )

    return allreduce_runner(
        [input, residual, norm_weight, scale, bias, workspace],
        tactic=best_tactic,
    )


⚠️ Potential issue | 🔴 Critical

Fix: Wire PG args through tunable_allreduce and correct fake signature.

  • Missing tp_size in @register_fake signature causes schema mismatch.
  • Need to accept optional rank/pg and pass to runner for MPI-disabled case.

Patch:

@@
-@torch.library.custom_op("trtllm::tunable_allreduce", mutates_args=())
+@torch.library.custom_op("trtllm::tunable_allreduce", mutates_args=())
 def tunable_allreduce(
     input: torch.Tensor,
     residual: Optional[torch.Tensor],
     norm_weight: Optional[torch.Tensor],
     scale: Optional[torch.Tensor],
     bias: Optional[torch.Tensor],
     workspace: Optional[torch.Tensor],
     group: List[int],
     strategy: int,
     op: int,
     eps: float,
     tp_size: int,
     trigger_completion_at_end: bool,
+    rank: Optional[int] = None,
+    pg: Optional[object] = None,
 ) -> List[torch.Tensor]:
@@
-    allreduce_runner = AllReduceRunner(
+    allreduce_runner = AllReduceRunner(
         tp_size,
         group,
         op,
         eps,
         trigger_completion_at_end,
+        rank=rank,
+        pg=pg,
     )
@@
-@tunable_allreduce.register_fake
+@tunable_allreduce.register_fake
 def _(
     input: torch.Tensor,
     residual: Optional[torch.Tensor],
     norm_weight: Optional[torch.Tensor],
     scale: Optional[torch.Tensor],
     bias: Optional[torch.Tensor],
     workspace: Optional[torch.Tensor],
     group: List[int],
     strategy: int,
     op: int,
     eps: float,
+    tp_size: int,
     trigger_completion_at_end: bool,
+    rank: Optional[int] = None,
+    pg: Optional[object] = None,
 ) -> torch.Tensor:

This aligns the fake schema with the real op and unblocks torch.compile/meta.

Also applies to: 1251-1295

🧰 Tools
🪛 Ruff (0.14.1)

1221-1221: Unused function argument: strategy

(ARG001)

Comment on lines +601 to +617
        if self.strategy == AllReduceStrategy.AUTOTUNE:
            output = torch.ops.trtllm.tunable_allreduce(
                input=input,
                residual=all_reduce_params.residual,
                norm_weight=all_reduce_params.norm_weight,
                scale=all_reduce_params.scale,
                bias=all_reduce_params.bias,
                workspace=self.workspace,
                group=self.mapping.tp_group,
                strategy=allreduce_strategy,
                op=all_reduce_params.fusion_op,
                eps=all_reduce_params.eps,
                tp_size=self.mapping.tp_size,
                trigger_completion_at_end=all_reduce_params.
                trigger_completion_at_end,
            )
        else:

⚠️ Potential issue | 🔴 Critical

MPI-disabled bug: AUTOTUNE path ignores ProcessGroup and rank; uses non-PG op.

In the AUTOTUNE branch you call trtllm.tunable_allreduce without passing rank/pg, and AllReduceRunner always dispatches to allreduce (non-PG). When mpi_disabled(), this will fail or hang.

Pass rank and pg through and make the runner use allreduce_pg when MPI is disabled. Suggested patch (ops side):

@@
-        if self.strategy == AllReduceStrategy.AUTOTUNE:
-            output = torch.ops.trtllm.tunable_allreduce(
+        if self.strategy == AllReduceStrategy.AUTOTUNE:
+            output = torch.ops.trtllm.tunable_allreduce(
                 input=input,
                 residual=all_reduce_params.residual,
                 norm_weight=all_reduce_params.norm_weight,
                 scale=all_reduce_params.scale,
                 bias=all_reduce_params.bias,
                 workspace=self.workspace,
                 group=self.mapping.tp_group,
                 strategy=allreduce_strategy,
                 op=all_reduce_params.fusion_op,
                 eps=all_reduce_params.eps,
                 tp_size=self.mapping.tp_size,
-                trigger_completion_at_end=all_reduce_params.
-                trigger_completion_at_end,
+                trigger_completion_at_end=all_reduce_params.trigger_completion_at_end,
+                # Wire PG when MPI is disabled
+                **({"rank": torch.distributed.get_rank(),
+                    "pg": self.mapping.tp_group_pg.boxed()} if self._disable_mpi else {}),
             )

Apply the complementary changes in tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (see my other comment) to accept rank/pg and call allreduce_pg accordingly.

Committable suggestion skipped: line range outside the PR's diff.

Signed-off-by: Yukun He <[email protected]>
@hyukn hyukn force-pushed the feat/allreduce_autotune branch from 8729ac4 to eb789c7 on October 28, 2025 02:00