Torch-TensorRT v2.7.0 #3509

narendasan · 2025-05-07T23:44:30Z

narendasan
May 7, 2025
Collaborator

PyTorch 2.7, CUDA 12.8, TensorRT 10.9, Python 3.13

Torch-TensorRT 2.7.0 targets PyTorch 2.7, TensorRT 10.9, and CUDA 12.8, (builds for CUDA 11.8/12.4 are available via the PyTorch package index - https://download.pytorch.org/whl/cu118 https://download.pytorch.org/whl/cu124). Python versions from 3.9-3.13 are supported. We no longer provide builds for the pre-cxx11-abi, all wheels and tarballs will use the cxx11 ABI.

Known Issues

Engine refitting is disabled in Python 3.13.

Using Self Defined Kernels in TensorRT Engines using Automatic Plugin Generation

Users may develop their own custom kernels using DSLs such as OpenAI Triton. Through the use of PyTorch Custom Ops and Torch-TensorRT Automatic Plugin Generation, these kernels can be called within the TensorRT engine with minimal extra code required.

@triton.jit
def elementwise_scale_mul_kernel(X, Y, Z, a, b, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    # Compute the range of elements that this thread block will work on
    block_start = pid * BLOCK_SIZE
    # Range of indices this thread will handle
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Load elements from the X and Y tensors
    x_vals = tl.load(X + offsets)
    y_vals = tl.load(Y + offsets)
    # Perform the element-wise multiplication
    z_vals = x_vals * y_vals * a + b
    # Store the result in Z
    tl.store(Z + offsets, z_vals)


@torch.library.custom_op("torchtrt_ex::elementwise_scale_mul", mutates_args=())  # type: ignore[misc]
def elementwise_scale_mul(
    X: torch.Tensor, Y: torch.Tensor, b: float = 0.2, a: int = 2
) -> torch.Tensor:
    # Ensure the tensors are on the GPU
    assert X.is_cuda and Y.is_cuda, "Tensors must be on CUDA device."
    assert X.shape == Y.shape, "Tensors must have the same shape."

    # Create output tensor
    Z = torch.empty_like(X)

    # Define block size
    BLOCK_SIZE = 1024

    # Grid of programs
    grid = lambda meta: (X.numel() // meta["BLOCK_SIZE"],)

    # Launch the kernel with parameters a and b
    elementwise_scale_mul_kernel[grid](X, Y, Z, a, b, BLOCK_SIZE=BLOCK_SIZE)

    return Z

@torch.library.register_fake("torchtrt_ex::elementwise_scale_mul")
def _(x: torch.Tensor, y: torch.Tensor, b: float = 0.2, a: int = 2) -> torch.Tensor:
    return x

torch_tensorrt.dynamo.conversion.plugins.custom_op("torchtrt_ex::elementwise_scale_mul", supports_dynamic_shapes=True, requires_output_allocator=False)

trt_mod_w_kernel = torch_tensorrt.compile(module, ...)

torch_tensorrt.dynamo.conversion.plugins.custom_op will generate a TensorRT plugin using the Quick Deploy Plugin system and using PyTorch's FakeTensor mode by reusing information required to register a Torch custom op to use with TorchDynamo. It will also generate the Torch-TensorRT converter to insert the plugin to the TensorRT engine.

QDP Plugins for Torch Custom Ops and Converters for QDP Plugins can be generated individually using

torch_tensorrt.dynamo.conversion.plugins.generate_plugin(
    "torchtrt_ex::elementwise_scale_mul"
)
torch_tensorrt.dynamo.conversion.plugins.generate_plugin_converter(
    "torchtrt_ex::elementwise_scale_mul",
    supports_dynamic_shapes=True,
    requires_output_allocator=False,
)

MutableTorchTensorRTModule improvements

MutableTorchTensorRTModule automatically recompiles if the engine becomes invalid. Previously, engines would assume static shape which means that if a user provides a different sized input, the graph would recompile or pull from engine cache. Now developers are able to provide shape hints to the MutableTorchTensorRTModule which will allow the module to handle a broader range of inputs without recompiling. For example:

pipe.unet = torch_tensorrrt.MutableTorchTensorRTModule(pipe.unet, **settings)
BATCH = torch.export.Dim("BATCH", min=2, max=24)
_HEIGHT = torch.export.Dim("_HEIGHT", min=16, max=32)
_WIDTH = torch.export.Dim("_WIDTH", min=16, max=32)
HEIGHT = 4 * _HEIGHT
WIDTH = 4 * _WIDTH
args_dynamic_shapes = ({0: BATCH, 2: HEIGHT, 3: WIDTH}, {})
kwargs_dynamic_shapes = {
    "encoder_hidden_states": {0: BATCH},
    "added_cond_kwargs": {
        "text_embeds": {0: BATCH},
        "time_ids": {0: BATCH},
     },
     "return_dict": None,
}
pipe.unet.set_expected_dynamic_shape_range(
    args_dynamic_shapes, kwargs_dynamic_shapes
)

Data Dependent Shape support

For networks that produce outputs whose shapes are dependent on the shape of the input, the output buffer must be allocated at runtime. To support this use case we have added a new runtime mode Dynamic Output Allocation Mode to support Data Dependent Shape (DDS) operations, such as NonZero op. (#3388)

Note:

Dynamic output allocation mode cannot be used in conjunction with CUDA Graphs nor pre-allocated outputs feature.
Without dynamic output allocation, the output buffer is allocated based on the inferred output shape based on input size.

There are two scenarios in which dynamic output allocation is enabled:

The model has been identified at compile time to require dynamic output allocation for at least one TensorRT subgraph. These models will engage the runtime mode automatically (with logging) and are incompatible with other runtime modes such as CUDA Graphs. Converters can declare that subgraphs that they produce will require the output allocator using requires_output_allocator=True there by forcing any model which utilizes the converter to automatically use the output allocator runtime mode. e.g.,

    @dynamo_tensorrt_converter(
        torch.ops.aten.nonzero.default,
        supports_dynamic_shapes=True,
        requires_output_allocator=True,
    )
    def aten_ops_nonzero(
        ctx: ConversionContext,
        target: Target,
        args: Tuple[Argument, ...],
        kwargs: Dict[str, Argument],
        name: str,
    ) -> Union[TRTTensor, Sequence[TRTTensor]]:
        ...

Users may manually enable dynamic output allocation mode via the torch_tensorrt.runtime.enable_output_allocator context manager.

    # Enables Dynamic Output Allocation Mode, then resets the mode to its prior setting
    with torch_tensorrt.runtime.enable_output_allocator(trt_module):
        ...

Tiling Optimization support

Tiling optimization enables cross-kernel tiled inference. This technique leverages on-chip caching for continuous kernels in addition to kernel-level tiling. It can significantly enhance performance on platforms constrained by memory bandwidth. (#3444)

We currently support four tiling strategies "none", "fast", "moderate", "full". A higher level allows TensorRT to spend more time searching for better tiling strategy. Here's an example to call tiling optimization:

    compiled_model = torch_tensorrt.compile(
        model,
        ir="dynamo",
        inputs=inputs,
        tiling_optimization_level="full",
        l2_limit_for_tiling=10,
    )

Model Zoo additions

Added support for compiling the FLUX.1-dev 12B model in our model zoo. An example is available here. Quantized variants of FLUX are under development as part of future work.

General Improvements

Improved BF16 support in model compilation by fixing bugs and adding new tests to cover both full-graph and graph-break scenarios.
Significantly accelerated model compilation time (Accelerate network interpretation by 15x; fixed redundant code in TRT Interpreter #3396)

Python 3.13 support

We added support for Python 3.13 (#3455). However, due to the Python object reference issue in PyTorch 2.7, we disabled the refitting related features for Python 3.13 in this release. This issue should be fixed in the next release.

What's Changed

Fix usage example by @ohadravid in Fix usage example #3337
Bump TRT version to 10.7 by @zewenli98 in Bump TRT version to 10.7 #3313
using nccl ops from TRT-LLM namespace by @apbose in using nccl ops from TRT-LLM namespace #3250
feat: Trigger Actions to run multiple TRT versions weekly by @zewenli98 in feat: Trigger Actions to run multiple TRT versions weekly #3346
fix: torch 2.7 bump bug on the main branch by @zewenli98 in fix: torch 2.7 bump bug on the main branch #3353
fix: remove legacy conv converter by @chohk88 in fix: remove legacy conv converter #3343
chore: flip use_cxx11_abi naming by @zewenli98 in chore: flip use_cxx11_abi naming #3361
chore: address flaky test failures related to global partitioning by @peri044 in chore: address flaky test failures related to global partitioning #3369
fix(aten::instance_norm): Handle optional inputs in instance norm con… by @narendasan in fix(aten::instance_norm): Handle optional inputs in instance norm con… #3367
chore: moving away from tensorrt_bindings by @narendasan in chore: moving away from tensorrt_bindings #3365
Use IUnsqueezeLayer in unsqueeze impl by @HolyWu in Use IUnsqueezeLayer in unsqueeze impl #3366
Deprecate torchscript frontend by @narendasan in Deprecate torchscript frontend #3373
feat: Add FLUX-1.dev model to the model zoo by @peri044 in feat: Add FLUX-1.dev model to the model zoo #3382
Accelerate network interpretation by 15x; fixed redundant code in TRT Interpreter by @cehongwang in Accelerate network interpretation by 15x; fixed redundant code in TRT Interpreter #3396
chore(deps): bump transformers from 4.40.2 to 4.48.0 in /tests/modules by @dependabot in chore(deps): bump transformers from 4.40.2 to 4.48.0 in /tests/modules #3389
chore(deps): bump transformers from 4.44.2 to 4.48.0 in /examples/dynamo by @dependabot in chore(deps): bump transformers from 4.44.2 to 4.48.0 in /examples/dynamo #3404
chore(deps): bump @octokit/request-error and @actions/github in /.github/actions/assigner by @dependabot in chore(deps): bump @octokit/request-error and @actions/github in /.github/actions/assigner #3399
[oncall] Fix vulnerability in the transformers dependency for examples/dynamo by @jingsh in [oncall] Fix vulnerability in the transformers dependency for examples/dynamo #3403
feat: add --use_python_runtime and --enable_cuda_graph args to the perf run script by @zewenli98 in feat: add --use_python_runtime and --enable_cuda_graph args to the perf run script #3397
fix: Split addmm nodes to not cast bias for FP32 accumulation and flux example fixes. by @peri044 in fix: Split addmm nodes to not cast bias for FP32 accumulation and flux example fixes. #3395
chore(deps): bump @octokit/plugin-paginate-rest from 9.2.1 to 9.2.2 in /.github/actions/assigner by @dependabot in chore(deps): bump @octokit/plugin-paginate-rest from 9.2.1 to 9.2.2 in /.github/actions/assigner #3413
chore(deps): bump @octokit/request from 8.4.0 to 8.4.1 in /.github/actions/assigner by @dependabot in chore(deps): bump @octokit/request from 8.4.0 to 8.4.1 in /.github/actions/assigner #3414
feat: Automatically generate QDP plugins by @bowang007 in feat: Automatically generate QDP plugins #3370
[reland] SDPA decomposition by @HolyWu in [reland] SDPA decomposition #3336
Fix: Fix the plugin test failure in CI by @bowang007 in Fix: Fix the plugin test failure in CI #3417
fix: Fix CI issues due to unintended fake tensor creation in torch.compile tests by @peri044 in fix: Fix CI issues due to unintended fake tensor creation in torch.compile tests #3416
chore: Bump to CUDA 12.8 and TRT 10.8 for Blackwell support by @zewenli98 in chore: Bump to CUDA 12.8 and TRT 10.8 for Blackwell support #3405
Mutable module improvement by @cehongwang in Mutable module improvement #3394
docs: Updated cuda graphs doc by @keehyuna in docs: Updated cuda graphs doc #3357
fix: nightly build python versions by @zewenli98 in fix: nightly build python versions #3426
fix: typing issue in Python 3.9 and earlier by @zewenli98 in fix: typing issue in Python 3.9 and earlier #3431
chore: bump to TRT 10.9 by @zewenli98 in chore: bump to TRT 10.9 #3436
fix: remove TRT version restriction by @zewenli98 in fix: remove TRT version restriction #3435
removing the fuse distributed ops lowering pass for tegra platforms by @apbose in removing the fuse distributed ops lowering pass for tegra platforms #3411
feat: second attempt to support DDS and NonZero op by @zewenli98 in feat: second attempt to support DDS and NonZero op #3388
fix: conv parameter check failure by @chohk88 in fix: conv parameter check failure #3428
Support broadcast index put by @chohk88 in Support broadcast index put #3421
DLFW changes 25.01 onwards by @apbose in DLFW changes 25.01 onwards #3356
feat: support Output Allocator runtime mode for plugin autogen by @zewenli98 in feat: support Output Allocator runtime mode for plugin autogen #3442
Fix assertions in slice_scatter decomposition by @kkimmk in Fix assertions in slice_scatter decomposition #3420
fix: Fix PTQ export by @peri044 in fix: Fix PTQ export #3447
feat: support tiling optimization as of TRT 10.8 by @zewenli98 in feat: support tiling optimization as of TRT 10.8 #3444
fix: structured inputs for CudaGraphsTorchTensorRTModule by @keehyuna in fix: structured inputs for CudaGraphsTorchTensorRTModule #3407
Cherrypick Nccl ops correction changes #3387 for release/2.7 by @zewenli98 in Cherrypick #3387 for release/2.7 #3451
Cherrypick fix: Destory cuda graphs before setting weight streaming #3461 for release/2.7 by @zewenli98 in Cherrypick #3461 for release/2.7 #3464
Cherrypick chore: reenable py313 #3455 for release/2.7 by @zewenli98 in Cherrypick #3455 for release/2.7 #3480
Cherrypick of PR 3472 by @apbose in Cherrypick of PR 3472 #3482
Cherrypick feat: rmsnorm lowering #3440 for release/2.7 by @zewenli98 in Cherrypick #3440 for release/2.7 #3485
Cherrypick feat: Support flashinfer.rmsnorm #3424 for release/2.7 by @zewenli98 in Cherrypick #3424 for release/2.7 #3484
fix: l2_limit_for_tiling (fix: l2_limit_for_tiling #3479) cherry pick to 2.7 by @peri044 in fix: l2_limit_for_tiling (#3479) cherry pick to 2.7 #3493
Cherrypick Enabled refit on Python 3.13 #3481 and fix: Change the translational layer from numpy to torch during conversion to handle additional data types #3445 by @zewenli98 in Cherrypick #3481 and #3445 #3498
Cherry pick to release 2.7-3499-add python3.13 into the final release artifact by @lanluo-nvidia in Cherry pick to release 2.7-3499-add python3.13 into the final release artifact #3500
fix: cherry pick PR of 3445 by @peri044 in fix: cherry pick PR of 3445 #3457
Cherrypick of PR 3462 by @apbose in Cherrypick of PR 3462 #3483

New Contributors

@ohadravid made their first contribution in Fix usage example #3337
@jingsh made their first contribution in [oncall] Fix vulnerability in the transformers dependency for examples/dynamo #3403
@kkimmk made their first contribution in Fix assertions in slice_scatter decomposition #3420

Full Changelog: v2.6.0...v2.7.0

This discussion was created from the release Torch-TensorRT v2.7.0.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Torch-TensorRT v2.7.0 #3509

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Torch-TensorRT v2.7.0 #3509

Uh oh!

narendasan May 7, 2025 Collaborator

PyTorch 2.7, CUDA 12.8, TensorRT 10.9, Python 3.13

Known Issues

Using Self Defined Kernels in TensorRT Engines using Automatic Plugin Generation

MutableTorchTensorRTModule improvements

Data Dependent Shape support

Note:

Tiling Optimization support

Model Zoo additions

General Improvements

Python 3.13 support

What's Changed

New Contributors

Replies: 0 comments

narendasan
May 7, 2025
Collaborator