[TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. #8531
base: main
Conversation
Force-pushed 8c2fe39 to cf7acdd
/bot run --disable-fail-fast --only-multi-gpu-test

PR_Github #22122 [ run ] triggered by Bot. Commit:

PR_Github #22122 [ run ] completed with state

/bot run --disable-fail-fast --only-multi-gpu-test

PR_Github #22158 [ run ] triggered by Bot. Commit:

PR_Github #22158 [ run ] completed with state
Force-pushed 6535fa5 to e2c1926
/bot run --disable-fail-fast --only-multi-gpu-test

PR_Github #22286 [ run ] triggered by Bot. Commit:

PR_Github #22286 [ run ] completed with state

/bot run --disable-fail-fast --only-multi-gpu-test

PR_Github #22337 [ run ] triggered by Bot. Commit:
Signed-off-by: Yukun He <[email protected]>

PR_Github #22337 [ run ] completed with state
📝 Walkthrough

This change introduces an autotuning framework for AllReduce operations in TensorRT-LLM. It adds MPI-coordinated profiling in the AutoTuner, implements a new AllReduceRunner class with a tunable_allreduce custom operator, extends the AllReduceStrategy enum with an AUTOTUNE option, threads the strategy through model configurations, and includes comprehensive test coverage for the new capability.

Changes
Sequence Diagram(s)

sequenceDiagram
participant App as Application
participant AR as AllReduce
participant TAR as tunable_allreduce
participant ATuner as AutoTuner
participant Runner as AllReduceRunner
participant NCCL as NCCL/Ops
App->>AR: forward(strategy=AUTOTUNE)
AR->>TAR: tunable_allreduce(...)
rect rgb(200, 220, 255)
Note over TAR,ATuner: Profiling Phase
TAR->>ATuner: profile tactics (NCCL, ONESHOT, TWOSHOT)
ATuner->>Runner: time each tactic
ATuner->>ATuner: MPI barrier & broadcast results
ATuner->>TAR: best_tactic
end
rect rgb(220, 255, 220)
Note over TAR,NCCL: Execution Phase
TAR->>Runner: execute(best_tactic)
Runner->>NCCL: allreduce_op with selected tactic
NCCL->>Runner: result
Runner->>TAR: output
end
TAR->>AR: result
AR->>App: output
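
To make the flow shown above concrete, here is a minimal usage sketch of the new strategy from a caller's point of view. It is assembled from pieces this PR touches (AllReduce, AllReduceStrategy.AUTOTUNE, the autotune() context manager); the Mapping construction, the exact AllReduce constructor arguments, and the tensor shapes are illustrative assumptions, not code from the PR.

import torch
from tensorrt_llm._torch.autotuner import autotune
from tensorrt_llm._torch.distributed.ops import AllReduce
from tensorrt_llm.functional import AllReduceStrategy
from tensorrt_llm.mapping import Mapping  # assumed import path

# Illustrative per-rank setup; real code derives the mapping from the runtime config.
mapping = Mapping(world_size=2, tp_size=2, rank=0)
allreduce = AllReduce(mapping=mapping, strategy=AllReduceStrategy.AUTOTUNE)

x = torch.randn(32, 4096, dtype=torch.bfloat16, device="cuda")

# Profiling phase: inside autotune(), choose_one() times the candidate tactics
# (NCCL / ONESHOT / TWOSHOT) and MPI-broadcasts the winner to all ranks.
with autotune(tune_mode=True):
    _ = allreduce(x)

# Execution phase: later calls replay the cached best tactic without re-profiling.
y = allreduce(x)

Every rank must execute the same sequence, since the profiled tactics are collectives.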
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: The changes span multiple architectural layers with new public APIs (AllReduceRunner, tunable_allreduce, AUTOTUNE strategy), introduce MPI synchronization coordination logic requiring careful verification, integrate the autotuning path into distributed ops with conditional branching, and include substantial test augmentation with new test patterns. While changes follow a cohesive theme, each component (profiling coordination, operator implementation, strategy routing, config threading) requires separate verification for correctness and integration points.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
806-818: NameError in fake: use act_fp4 (not act_fp8).

The fake for w4a8_mxfp4_fp8_gemm references undefined act_fp8. Use the parameter act_fp4:

-    return act_fp8.new_empty((act_fp8.size(0), weight.size(0)),
-                             dtype=output_dtype)
+    return act_fp4.new_empty((act_fp4.size(0), weight.size(0)),
+                             dtype=output_dtype)

tests/unittest/_torch/multi_gpu/test_allreduce.py (1)
1-2: Update SPDX year to current (2025).

Guideline asks for current year in headers. Please bump 2024 → 2025.
Apply this diff:
-# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🧹 Nitpick comments (7)
tensorrt_llm/_torch/distributed/ops.py (1)
611-611: Minor: 'strategy' arg is unused in tunable_allreduce.

You pass strategy=allreduce_strategy but tunable_allreduce ignores it. Either use it (e.g., to constrain tactics) or drop the parameter to reduce confusion.
tensorrt_llm/_torch/autotuner.py (1)
719-725: Rank-0 gating for sync ops to avoid N× profiling; small scoping nit.

Docstring claims “only rank 0 profiles,” but the loop still profiles on every rank. For AllReduceRunner, consider:
- If is_sync_op and rank != 0: skip timing loops and set (best_runner_id, best_tactic, min_time) to sentinel; rely on the broadcasted results.
- Keep the existing barriers to maintain determinism.
Also, the post-loop check uses the last loop’s 'runner' binding. Use any(...) over runners or compute a flag before loops.
Happy to draft a patch if you want to gate profiling to rank 0; a rough sketch of the idea follows below.
Also applies to: 760-779
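
For reference, a sketch of the gating suggested above, under stated assumptions: the helpers candidate_tactics and profile_fn are hypothetical stand-ins, and the real choose_one() loop carries more state than this. The mpi_* utilities are the ones this PR already imports from tensorrt_llm._utils.

from tensorrt_llm._utils import mpi_barrier, mpi_broadcast, mpi_rank

def pick_best(runners, candidate_tactics, profile_fn, is_sync_op):
    # Sentinel values; non-zero ranks keep them and wait for the broadcast.
    best_runner_id, best_tactic, min_time = -1, -1, float("inf")

    if not is_sync_op or mpi_rank() == 0:
        for runner_id, runner in enumerate(runners):
            for tactic in candidate_tactics(runner):
                elapsed = profile_fn(runner, tactic)  # time this tactic
                if elapsed < min_time:
                    best_runner_id, best_tactic, min_time = runner_id, tactic, elapsed

    if is_sync_op:
        mpi_barrier()  # keep ranks aligned before sharing the decision
        best_runner_id, best_tactic, min_time = mpi_broadcast(
            (best_runner_id, best_tactic, min_time), root=0)

    return best_runner_id, best_tactic, min_time

Note that a collective still has to be entered by every rank while rank 0 times it, so the skip would apply to the selection bookkeeping rather than to participation in the op itself.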
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
1143-1156: Optional: annotate class attributes to satisfy Ruff (RUF012).

Mark all_support_ops and tuning_config as ClassVar to silence lints.

-from typing import List, Mapping, Optional, Tuple
+from typing import List, Mapping, Optional, Tuple, ClassVar
@@
-    all_support_ops = {
+    all_support_ops: ClassVar[set[int]] = {
@@
-    tuning_config = TuningConfig(
+    tuning_config: ClassVar[TuningConfig] = TuningConfig(

tests/unittest/_torch/multi_gpu/test_allreduce.py (4)
249-252: Optional: persist autotune cache to speed retries.

Passing a cache_path reduces repeated profiling across ranks/reruns. Safe: the autotuner writes per-rank files.

Apply this diff and add the import noted below:

-        with autotune(tune_mode=is_autotune):
+        with autotune(
+                tune_mode=is_autotune,
+                cache_path=os.path.join(tempfile.gettempdir(), "trtllm_allreduce_autotune_cache.json"),
+        ):
             calc_output = calc_func(xs[tensor_parallel_rank], residual)

Add at top-level imports:
import tempfile # for autotune cache path
255-260: Guard against silent truncation in zip and keep Py3.8 compatibility.

Add an explicit length check and annotate the zip to avoid Ruff B905 without using strict (not available on Py3.8).

Apply this diff:

-    for calc_output_tensor, ref_output_tensor in zip(calc_output, ref_output):
+    assert len(calc_output) == len(ref_output), "Mismatched outputs between calc and ref"
+    for calc_output_tensor, ref_output_tensor in zip(calc_output, ref_output):  # noqa: B905 - PY38 compatibility
         check_accuracy(calc_output_tensor,
                        ref_output_tensor,
                        atol=0.05,
                        rtol=0.15,
                        percent=0.99)
296-301: Silence Ruff B905 on the zip(*[...]*tp) transpose (Py3.8 target).

Annotate the call; structure is safe here.
Apply this diff:
-        *zip(*[(tensor_parallel_size, run_allreduce_op, x, residual,
+        *zip(*[(tensor_parallel_size, run_allreduce_op, x, residual,  # noqa: B905 - PY38 compatibility
                 [linear_weight], hidden_size, dtype, fusion_op)] *
               tensor_parallel_size),
338-342: Repeat of the zip(*[...]*tp) pattern: add the same Py3.8-safe annotation.

Keep consistency with the earlier test.
Apply this diff:
-        *zip(*[(tensor_parallel_size, run_allreduce_op, x, residual,
+        *zip(*[(tensor_parallel_size, run_allreduce_op, x, residual,  # noqa: B905 - PY38 compatibility
                 [linear_weight], hidden_size, dtype, fusion_op, True)] *
               tensor_parallel_size),
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
tensorrt_llm/_torch/autotuner.py (6 hunks)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
tensorrt_llm/_torch/distributed/ops.py (2 hunks)
tensorrt_llm/_torch/model_config.py (1 hunks)
tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
tensorrt_llm/functional.py (1 hunks)
tensorrt_llm/llmapi/llm_args.py (1 hunks)
tests/unittest/_torch/multi_gpu/test_allreduce.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/model_config.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/autotuner.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/functional.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/multi_gpu/test_allreduce.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/model_config.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/autotuner.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/functional.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/multi_gpu/test_allreduce.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/model_config.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/autotuner.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
tensorrt_llm/functional.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/multi_gpu/test_allreduce.py
🧬 Code graph analysis (6)
tensorrt_llm/_torch/model_config.py (1)
tensorrt_llm/functional.py (1)
AllReduceStrategy(3876-3886)
tensorrt_llm/_torch/models/modeling_llama.py (1)
tensorrt_llm/_torch/distributed/ops.py (1)
AllReduce(455-634)
tensorrt_llm/_torch/autotuner.py (1)
tensorrt_llm/_utils.py (2)
mpi_barrier (569-571), mpi_broadcast (579-580)
tensorrt_llm/_torch/distributed/ops.py (3)
cpp/tensorrt_llm/thop/allreduceOp.cpp (18)
strategy (828-859), strategy (828-828), input (269-301), input (269-272), input (321-399), input (321-323), input (401-432), input (401-403), input (434-476), input (434-436), input (478-543), input (478-480), input (545-727), input (545-549), input (729-795), input (729-732), op (1164-1164), op (1203-1203)

tensorrt_llm/functional.py (1)

AllReduceStrategy (3876-3886)

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
tunable_allreduce(1213-1248)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)
tensorrt_llm/functional.py (7)
AllReduceFusionOp (3889-3898), AllReduceStrategy (3876-3886), Tensor (107-602), shape (270-274), shape (277-282), shape (2056-2096), allreduce (4047-4140)

tensorrt_llm/_torch/autotuner.py (9)

TunableRunner (154-210), TuningConfig (54-102), DynamicTensorSpec (24-36), ConstraintSpec (40-50), get_valid_tactics (157-175), forward (181-207), AutoTuner (515-982), get (539-542), choose_one (544-651)

tensorrt_llm/_torch/utils.py (3)

last_positive_power_of_2 (218-223), shape (105-106), _ (192-198)

cpp/tensorrt_llm/thop/allreduceOp.cpp (10)

op (1164-1164), op (1203-1203), input (269-301), input (269-272), input (321-399), input (321-323), input (401-432), input (401-403), strategy (828-859), strategy (828-828)

tensorrt_llm/quantization/utils/fp4_utils.py (1)
get_fp4_shape(31-42)
tests/unittest/_torch/multi_gpu/test_allreduce.py (4)
tensorrt_llm/_torch/autotuner.py (1)
autotune (214-246)

tensorrt_llm/_torch/distributed/ops.py (1)

AllReduce (455-634)

tensorrt_llm/functional.py (8)

AllReduceFusionOp (3889-3898), AllReduceParams (3901-3940), AllReduceStrategy (3876-3886), dtype (255-259), dtype (262-267), rank (492-496), Tensor (107-602), allreduce (4047-4140)

tests/unittest/conftest.py (1)
mpi_pool_executor(243-251)
🪛 Ruff (0.14.1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
1144-1147: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
1177-1177: Unused method argument: profile
(ARG002)
1178-1178: Unused method argument: kwargs
(ARG002)
1221-1221: Unused function argument: strategy
(ARG001)
1254-1254: Unused function argument: residual
(ARG001)
1255-1255: Unused function argument: norm_weight
(ARG001)
1256-1256: Unused function argument: scale
(ARG001)
1257-1257: Unused function argument: bias
(ARG001)
1258-1258: Unused function argument: workspace
(ARG001)
1259-1259: Unused function argument: group
(ARG001)
1260-1260: Unused function argument: strategy
(ARG001)
1262-1262: Unused function argument: eps
(ARG001)
1263-1263: Unused function argument: trigger_completion_at_end
(ARG001)
tests/unittest/_torch/multi_gpu/test_allreduce.py
255-255: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
339-341: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🔇 Additional comments (11)
tensorrt_llm/_torch/model_config.py (1)
194-195: LGTM! Clean addition of AUTOTUNE strategy mapping.

The new mapping entry correctly extends the get_all_reduce_strategy function to support the AUTOTUNE strategy. The formatting is consistent with existing entries, and the trailing comma on line 194 follows Python best practices.

tensorrt_llm/functional.py (1)
3886-3887: Enum extension fits strategy list.

Adding AUTOTUNE preserves the sequential IntEnum contract and aligns the strategy surface with the new autotuning pathway. Nice and tidy.

tensorrt_llm/llmapi/llm_args.py (1)
2495-2500: AUTOTUNE value added to allreduce_strategy looks good; confirm enum conversion at call sites.

Please verify that model_config.allreduce_strategy is converted to tensorrt_llm.functional.AllReduceStrategy (not a raw str) before being passed into AllReduce(..., strategy=...). This avoids mismatches in distributed paths. If already handled in ModelConfig, all good.
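
To illustrate the check being requested, a small hedged sketch of the kind of normalization that would guarantee an enum reaches AllReduce (the helper name and the call site shown are hypothetical; ModelConfig may already handle this internally):

from tensorrt_llm.functional import AllReduceStrategy

def normalize_allreduce_strategy(value):
    # Accept either the enum member or a string literal such as "AUTOTUNE"
    # and always return an AllReduceStrategy member.
    if isinstance(value, AllReduceStrategy):
        return value
    if isinstance(value, str):
        return AllReduceStrategy[value.upper()]
    raise TypeError(f"Unsupported allreduce_strategy: {value!r}")

# Hypothetical call site:
# strategy = normalize_allreduce_strategy(model_config.allreduce_strategy)
# allreduce = AllReduce(mapping=mapping, strategy=strategy)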
tensorrt_llm/_torch/models/modeling_llama.py (1)
653-655: Ensure strategy is an Enum, not a str.

AllReduce(strategy=...) expects AllReduceStrategy. Please confirm model_config.allreduce_strategy is the Enum, not a string Literal, to avoid runtime issues and incorrect branch selection.
tests/unittest/_torch/multi_gpu/test_allreduce.py (7)
24-30: Imports for autotune and accuracy utilities look good.

Adopts public APIs and shared test utilities appropriately.
103-111: Extending run_allreduce_op with is_autotune is consistent with the call chain.

Signature matches the updated caller; no issues spotted.
143-145: Explicit strategy selection is correct (NCCL baseline vs AUTOTUNE).

This guards the baseline path from AUTO/MNNVL variability; good for deterministic testing.
165-165: Normalize return shape for fusion_op == NONE.

Wrapping the tensor in a list makes downstream zipping uniform. Good.
228-242: Fusion dispatch table now covers NONE via the common path.

Mapping aligns with helper functions; consistent with the new return-shape contract.
306-329: New autotune-enabled test matrix is appropriate.

Param choices exercise small and large seq lens and guard NVFP4 with skip_pre_blackwell. Looks good.
66-81: Code change is correct; backward compatibility verified.

Found two call sites in test_allreduce.py:

- Line 297: passes 8 arguments, is_autotune uses default False ✓
- Line 338: passes 9 arguments with explicit True for is_autotune ✓

The function signature at line 66 with is_autotune: bool = False ensures existing callers work unchanged while new tests can opt-in by passing True. No positional argument mismatches.
class AllReduceRunner(TunableRunner):
    all_support_ops = {
        AllReduceFusionOp.NONE.value,
        AllReduceFusionOp.RESIDUAL_RMS_NORM.value,
    }

    tuning_config = TuningConfig(
        dynamic_tensor_specs=(DynamicTensorSpec(
            0, 0,
            (8192, 4096, 2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1),
            last_positive_power_of_2), ),
        constraint_specs=(ConstraintSpec(1, 0, lambda shapes: shapes[0][0]), ),
    )

    def __init__(
        self,
        tp_size: int,
        group: List[int],
        op: int,
        eps: float,
        trigger_completion_at_end: bool,
    ):
        self.tp_size = tp_size
        self.op = op
        self._group = group
        self._eps = eps
        self._trigger_completion_at_end = trigger_completion_at_end

    def __hash__(self):
        return hash((self.tp_size, self.op))
Fix: Support MPI-disabled PG path in AllReduceRunner.
AllReduceRunner.forward always uses torch.ops.trtllm.allreduce. When MPI is disabled you must call allreduce_pg and pass rank/pg. Suggested patch:
@@
-from tensorrt_llm.functional import AllReduceFusionOp, AllReduceStrategy
+from tensorrt_llm.functional import AllReduceFusionOp, AllReduceStrategy
+from tensorrt_llm._utils import mpi_disabled
+from typing import Optional, Any, ClassVar
@@
-class AllReduceRunner(TunableRunner):
- all_support_ops = {
+class AllReduceRunner(TunableRunner):
+ all_support_ops: ClassVar[set[int]] = {
AllReduceFusionOp.NONE.value,
AllReduceFusionOp.RESIDUAL_RMS_NORM.value,
}
@@
- def __init__(
+ def __init__(
self,
tp_size: int,
group: List[int],
op: int,
eps: float,
trigger_completion_at_end: bool,
+ *,
+ rank: Optional[int] = None,
+ pg: Optional[Any] = None,
):
self.tp_size = tp_size
self.op = op
self._group = group
self._eps = eps
self._trigger_completion_at_end = trigger_completion_at_end
+ self._rank = rank
+ self._pg = pg
@@
- return torch.ops.trtllm.allreduce(
- input,
- residual,
- norm_weight,
- scale,
- bias,
- workspace,
- self._group,
- tactic,
- self.op,
- self._eps,
- self._trigger_completion_at_end,
- )
+ if mpi_disabled() and self._pg is not None and self._rank is not None:
+ return torch.ops.trtllm.allreduce_pg(
+ input,
+ residual,
+ norm_weight,
+ scale,
+ bias,
+ workspace,
+ self._group,
+ self._pg,
+ tactic,
+ self.op,
+ self._eps,
+ self._trigger_completion_at_end,
+ )
+ else:
+ return torch.ops.trtllm.allreduce(
+ input,
+ residual,
+ norm_weight,
+ scale,
+ bias,
+ workspace,
+ self._group,
+ tactic,
+ self.op,
+ self._eps,
+ self._trigger_completion_at_end,
+            )

Also applies to: 1188-1210
🧰 Tools
🪛 Ruff (0.14.1)
1144-1147: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
@torch.library.custom_op("trtllm::tunable_allreduce", mutates_args=())
def tunable_allreduce(
    input: torch.Tensor,
    residual: Optional[torch.Tensor],
    norm_weight: Optional[torch.Tensor],
    scale: Optional[torch.Tensor],
    bias: Optional[torch.Tensor],
    workspace: Optional[torch.Tensor],
    group: List[int],
    strategy: int,
    op: int,
    eps: float,
    tp_size: int,
    trigger_completion_at_end: bool,
) -> List[torch.Tensor]:

    tuner = AutoTuner.get()

    allreduce_runner = AllReduceRunner(
        tp_size,
        group,
        op,
        eps,
        trigger_completion_at_end,
    )

    _, best_tactic = tuner.choose_one(
        "trtllm::tunable_allreduce::allreduce",
        [allreduce_runner],
        AllReduceRunner.tuning_config,
        [input, residual, norm_weight, scale, bias, workspace],
    )

    return allreduce_runner(
        [input, residual, norm_weight, scale, bias, workspace],
        tactic=best_tactic,
    )
Fix: Wire PG args through tunable_allreduce and correct fake signature.
- Missing tp_size in @register_fake signature causes schema mismatch.
- Need to accept optional rank/pg and pass to runner for MPI-disabled case.
Patch:
@@
-@torch.library.custom_op("trtllm::tunable_allreduce", mutates_args=())
+@torch.library.custom_op("trtllm::tunable_allreduce", mutates_args=())
def tunable_allreduce(
input: torch.Tensor,
residual: Optional[torch.Tensor],
norm_weight: Optional[torch.Tensor],
scale: Optional[torch.Tensor],
bias: Optional[torch.Tensor],
workspace: Optional[torch.Tensor],
group: List[int],
strategy: int,
op: int,
eps: float,
tp_size: int,
trigger_completion_at_end: bool,
+ rank: Optional[int] = None,
+ pg: Optional[object] = None,
) -> List[torch.Tensor]:
@@
- allreduce_runner = AllReduceRunner(
+ allreduce_runner = AllReduceRunner(
tp_size,
group,
op,
eps,
trigger_completion_at_end,
+ rank=rank,
+ pg=pg,
)
@@
-@tunable_allreduce.register_fake
+@tunable_allreduce.register_fake
def _(
input: torch.Tensor,
residual: Optional[torch.Tensor],
norm_weight: Optional[torch.Tensor],
scale: Optional[torch.Tensor],
bias: Optional[torch.Tensor],
workspace: Optional[torch.Tensor],
group: List[int],
strategy: int,
op: int,
eps: float,
+ tp_size: int,
trigger_completion_at_end: bool,
+ rank: Optional[int] = None,
+ pg: Optional[object] = None,
 ) -> torch.Tensor:

This aligns the fake schema with the real op and unblocks torch.compile/meta.
Also applies to: 1251-1295
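
As background on why the schema alignment matters, here is a tiny, self-contained sketch of the rule with a toy operator (unrelated to this PR): the fake registered for a custom op must declare exactly the same parameters as the real op, and it may only allocate metadata, never touch real data.

import torch

@torch.library.custom_op("demo::scaled_sum", mutates_args=())
def scaled_sum(x: torch.Tensor, factor: float, repeats: int) -> torch.Tensor:
    return x.sum() * factor * repeats

@scaled_sum.register_fake
def _(x: torch.Tensor, factor: float, repeats: int) -> torch.Tensor:
    # Same signature as the real op; returns an empty tensor of the right shape.
    return x.new_empty(())

Dropping or reordering a parameter in the fake (as with the missing tp_size noted above) leads to a schema mismatch when the op is traced.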
🧰 Tools
🪛 Ruff (0.14.1)
1221-1221: Unused function argument: strategy
(ARG001)
        if self.strategy == AllReduceStrategy.AUTOTUNE:
            output = torch.ops.trtllm.tunable_allreduce(
                input=input,
                residual=all_reduce_params.residual,
                norm_weight=all_reduce_params.norm_weight,
                scale=all_reduce_params.scale,
                bias=all_reduce_params.bias,
                workspace=self.workspace,
                group=self.mapping.tp_group,
                strategy=allreduce_strategy,
                op=all_reduce_params.fusion_op,
                eps=all_reduce_params.eps,
                tp_size=self.mapping.tp_size,
                trigger_completion_at_end=all_reduce_params.
                trigger_completion_at_end,
            )
        else:
MPI-disabled bug: AUTOTUNE path ignores ProcessGroup and rank; uses non-PG op.
In the AUTOTUNE branch you call trtllm.tunable_allreduce without passing rank/pg, and AllReduceRunner always dispatches to allreduce (non-PG). When mpi_disabled(), this will fail or hang.
Pass rank and pg through and make the runner use allreduce_pg when MPI is disabled. Suggested patch (ops side):
@@
- if self.strategy == AllReduceStrategy.AUTOTUNE:
- output = torch.ops.trtllm.tunable_allreduce(
+ if self.strategy == AllReduceStrategy.AUTOTUNE:
+ output = torch.ops.trtllm.tunable_allreduce(
input=input,
residual=all_reduce_params.residual,
norm_weight=all_reduce_params.norm_weight,
scale=all_reduce_params.scale,
bias=all_reduce_params.bias,
workspace=self.workspace,
group=self.mapping.tp_group,
strategy=allreduce_strategy,
op=all_reduce_params.fusion_op,
eps=all_reduce_params.eps,
tp_size=self.mapping.tp_size,
- trigger_completion_at_end=all_reduce_params.
- trigger_completion_at_end,
+ trigger_completion_at_end=all_reduce_params.trigger_completion_at_end,
+ # Wire PG when MPI is disabled
+ **({"rank": torch.distributed.get_rank(),
+ "pg": self.mapping.tp_group_pg.boxed()} if self._disable_mpi else {}),
)Apply the complementary changes in tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (see my other comment) to accept rank/pg and call allreduce_pg accordingly.
Committable suggestion skipped: line range outside the PR's diff.
Signed-off-by: Yukun He <[email protected]>
Signed-off-by: Yukun He <[email protected]>
Force-pushed 8729ac4 to eb789c7
Signed-off-by: Yukun He <[email protected]>
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill

Kill all running builds associated with pull request.
skip
skip --comment COMMENT

Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.