[TRTLLM-8658][infra] upgrade to DLFW 25.10 and pytorch 2.9.0 / triton 3.5.0 #8621
base: release/1.1
Conversation
Signed-off-by: ZhanruiSunCh <[email protected]>
…fp8_mpi[DeepSeek-V3-Lite-fp8] Signed-off-by: ZhanruiSunCh <[email protected]>
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

PR_Github #22363 [ run ] triggered by Bot. Commit:

PR_Github #22363 [ run ] completed with state
Signed-off-by: ZhanruiSunCh <[email protected]>
/bot run --skip-test
📝 Walkthrough

The changes update TensorRT-LLM dependencies and Docker configuration to support CUDA 13.0-13.2, PyTorch 2.9.0, and TensorRT 10.13.3.9. CUDA 12.9 build configurations are removed from CI/CD pipelines. Docker images switch to GitLab-based staging builds, protobuf installation is removed, and MPI rank detection via SLURM/OpenMPI is introduced for distributed workloads.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Script as install_mpi4py.sh
    participant System as System/Environment
    participant CUDA as CUDA Runtime
    participant MPI as MPI Library

    Script->>Script: Check if TRTLLM_USE_MPI_KVCACHE == 1
    alt Conditional Block Active
        Script->>Script: Import CUDA bindings (try cuda.bindings.runtime)
        alt cudart import fails
            Script->>Script: Fallback to cuda.cudart
        end
        Script->>System: Check SLURM_PROCID
        alt SLURM_PROCID exists
            Script->>Script: Set rank from SLURM_PROCID
        else
            Script->>System: Check OMPI_COMM_WORLD_RANK
            alt OMPI_COMM_WORLD_RANK exists
                Script->>Script: Set rank from OMPI_COMM_WORLD_RANK
            else
                Script->>Script: Error: No rank source found
            end
        end
        rect rgba(100, 200, 150, 0.2)
            note right of Script: Rank Detection & CUDA Setup
            Script->>CUDA: Compute effective rank (rank % device_count)
            Script->>CUDA: Set device to effective rank
            Script->>Script: Print selected rank and device
        end
        Script->>MPI: Continue with MPI comm wrapping
        Script->>MPI: Proceed with installation steps
    else Conditional Block Inactive
        Script->>Script: Skip MPI-specific logic
    end
```
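For orientation, the rank-to-device selection described in the diagram could look roughly like the following. This is a minimal sketch of the diagram's flow, not the actual patch; the function name is illustrative and it assumes the cuda-python package is installed.

```python
# Sketch of the rank detection + CUDA device selection flow from the diagram above.
# Assumes the cuda-python package; error codes are ignored for brevity.
import os

try:
    from cuda.bindings import runtime as cudart  # newer cuda-python layout
except ImportError:
    from cuda import cudart                      # fallback to the older layout

def select_device_from_env():
    """Pick a CUDA device from SLURM/OpenMPI rank environment variables."""
    if os.getenv("SLURM_PROCID") is not None:
        rank = int(os.environ["SLURM_PROCID"])
    elif os.getenv("OMPI_COMM_WORLD_RANK") is not None:
        rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    else:
        raise RuntimeError("No rank source found (SLURM_PROCID / OMPI_COMM_WORLD_RANK)")

    _, device_count = cudart.cudaGetDeviceCount()
    effective_rank = rank % device_count   # map global rank onto local devices
    cudart.cudaSetDevice(effective_rank)
    print(f"rank={rank} -> cuda device {effective_rank}")
    return effective_rank
```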
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

The review demands attention across multiple categories: (1) heterogeneous version updates requiring validation of compatibility chains (CUDA 13.0/13.2, PyTorch 2.9.0, TensorRT 10.13.3.9); (2) substantial restructuring of CI/CD pipelines with CUDA 12.9 removal across two large Groovy files, including method signature changes affecting call sites; (3) new MPI rank detection logic introducing fallback patterns and environment-variable-based control flow; (4) image configuration migration to staging/tritondevel lineage with semantic impact on build sourcing. While individual changes are mostly straightforward, the breadth and interconnected nature of version/configuration updates necessitate cross-file verification to ensure consistent CUDA/PyTorch targeting.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docker/common/install_mpi4py.sh (1)
30-71: Fix critical git dependency and rank detection logic; os import not needed.

The patched code uses `git apply` without installing git first—this will fail in containers that lack git. The rank detection uses `elif`, which prevents detecting when both SLURM_PROCID and OMPI_COMM_WORLD_RANK are simultaneously set; use two separate `if` statements instead. The `os` module concern is resolved—it is already imported at line 10 of the upstream mpi4py 3.1.5 source file.

Critical fixes required:

- Install git before running `git apply` in docker/common/install_mpi4py.sh (after line 18, before the `cd` and `git apply`):

```diff
+if ! command -v git >/dev/null 2>&1; then
+  if grep -qi 'ubuntu\|debian' /etc/os-release; then
+    apt-get update && apt-get install -y git && apt-get clean && rm -rf /var/lib/apt/lists/*
+  else
+    dnf makecache --refresh && dnf install -y git && dnf clean all
+  fi
+fi
 cd "$TMP_DIR/mpi4py-${MPI4PY_VERSION}"
 git apply <<EOF
```

- In the patch itself, replace the `elif` with a second `if` to detect rank conflicts:

```diff
-    if(os.getenv("SLURM_PROCID")):
+    if os.getenv("SLURM_PROCID") is not None:
         slurm_rank = int(os.environ["SLURM_PROCID"])
         has_slurm_rank=True
-    elif(os.getenv("OMPI_COMM_WORLD_RANK")):
+    if os.getenv("OMPI_COMM_WORLD_RANK") is not None:
         ompi_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
         has_ompi_rank=True
```
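For clarity, the conflict-aware variant the diff above is driving at could be sketched as follows. This is illustrative only; the variable names mirror the diff, and the consistency check is an assumption about the desired behavior rather than code from the PR.

```python
# Sketch of conflict-aware rank detection: both env vars are checked independently,
# and a mismatch between the two rank sources is reported instead of silently ignored.
import os

has_slurm_rank = has_ompi_rank = False
slurm_rank = ompi_rank = None

if os.getenv("SLURM_PROCID") is not None:
    slurm_rank = int(os.environ["SLURM_PROCID"])
    has_slurm_rank = True
if os.getenv("OMPI_COMM_WORLD_RANK") is not None:
    ompi_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    has_ompi_rank = True

if has_slurm_rank and has_ompi_rank and slurm_rank != ompi_rank:
    raise RuntimeError(f"Conflicting ranks: SLURM={slurm_rank}, OMPI={ompi_rank}")

rank = slurm_rank if has_slurm_rank else ompi_rank
if rank is None:
    raise RuntimeError("No rank source found (SLURM_PROCID / OMPI_COMM_WORLD_RANK)")
```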
🧹 Nitpick comments (5)
docker/common/install_tensorrt.sh (1)
5-9: Version bumps look coherent; add a guard to avoid partial installs if repo lag occurs.

Consider pre-checking availability of the pinned libcudnn/libcublas/nvrtc/NCCL versions before the purge step to avoid leaving images in a broken state when mirrors lag.

Example guard:

```diff
@@ install_ubuntu_requirements() {
-    apt-get update
+    apt-get update
+    # Verify packages exist before removing current libs
+    apt-cache policy libcudnn9-cuda-13=${CUDNN_VER} | grep Candidate: || { echo "Pinned cuDNN ${CUDNN_VER} not found"; exit 1; }
+    apt-cache policy libcublas-$(echo $CUDA_VER | sed 's/\./-/g')=${CUBLAS_VER} | grep Candidate: || { echo "Pinned cuBLAS ${CUBLAS_VER} not found"; exit 1; }
```

The same idea can be applied to Rocky with repoquery or simple HEAD checks. This reduces CI flakiness.
Also applies to: 11-16, 19-21
requirements.txt (1)
1-1: Align Torch/TorchVision pins and sanity-check constraints.
- Torch range allows 2.9.0a0; prefer stable only unless alpha is required.
- torchvision is unpinned; pin to the compatible series for Torch 2.9 to avoid resolver mismatches in CI.
- Confirm constraints.txt exists if referenced; remove the “-c constraints.txt” include otherwise.
Suggested edits:
```diff
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-10.html#rel-25-10 uses 2.9.0a0.
-torch>=2.9.0a0,<=2.9.0
-torchvision
+torch==2.9.0
+torchvision~=0.20.0
```

If you need alpha compatibility for some targets, consider environment markers instead of a broad a0 allowance.
Also applies to: 23-27, 36-36, 67-67
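As a lightweight complement to the pinning above, a post-install sanity check along these lines could catch resolver mismatches early in CI. This is a sketch, not an existing script in the repo.

```python
# Sketch: verify the resolver produced a stable torch 2.9.x and that torchvision imports alongside it.
import torch
import torchvision

torch_version = torch.__version__.split("+")[0]  # strip local build suffix, e.g. "+cu130"
assert torch_version.startswith("2.9."), f"expected stable torch 2.9.x, got {torch.__version__}"
print(f"torch {torch.__version__} / torchvision {torchvision.__version__}")
```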
jenkins/L0_Test.groovy (2)
42-43: Staging DLFW image is hard-pinned; add override and fallback.

Hard-coding a staging digest can break external forks or later reruns. Allow override via env and default to the public NGC tag with a fallback to staging.

```diff
-// DLFW_IMAGE = "urm.nvidia.com/docker/nvidia/pytorch:25.10-py3"
-DLFW_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/devel:pytorch_25.10-py3.36764868-devel"
+DLFW_IMAGE = env.DLFW_IMAGE ?: "urm.nvidia.com/docker/nvidia/pytorch:25.10-py3"
+// Optional fallback to staging if needed downstream:
+DLFW_IMAGE = env.USE_STAGING_DLFW?.toBoolean() ? "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/devel:pytorch_25.10-py3.36764868-devel" : DLFW_IMAGE
```
2349-2364: DLFW pip constraint cleanup and CUDA toolkit install are OS-aware; add guard for repeated installs.

On DLFW images, the constraint file is cleared; on Ubuntu images, cuda-toolkit-13-0 is installed. Add an idempotency check to skip re-install if nvcc already reports 13.0, to save minutes in CI.

```diff
-apt-get -y install cuda-toolkit-13-0
+if ! nvcc --version 2>/dev/null | grep -q "release 13.0"; then
+  apt-get -y install cuda-toolkit-13-0
+fi
```

jenkins/Build.groovy (1)
457-458: Hardcoded Triton tag—ensure sync and document versioning strategy.

Line 457 hardcodes `tritonShortTag = "r25.10"`, which is then applied to all four Triton repository tags in the CMake configuration. This matches the `TRITON_BASE_TAG` in docker/Dockerfile.multi (25.10-py3.36928345-devel).

Verify:

- Is there a mechanism to keep this tag synchronized with docker/Dockerfile.multi's `TRITON_BASE_TAG`, or will manual updates be required?
- Should this value be extracted from a shared configuration file or environment variable to avoid future desync?
- Add a comment linking this to the Dockerfile.multi logic so maintainers understand the coupling.
Alternatively, if dynamic extraction is no longer needed due to the DLFW 25.10 stability, document this decision in the PR description.
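If dynamic extraction is ever reinstated, a small helper along these lines could derive the short tag from the Dockerfile instead of hardcoding it. This is a sketch: it assumes TRITON_BASE_TAG is declared as a Dockerfile ARG in docker/Dockerfile.multi, and the helper name is illustrative.

```python
# Sketch: derive tritonShortTag (e.g. "r25.10") from TRITON_BASE_TAG
# (e.g. "25.10-py3.36928345-devel") declared in docker/Dockerfile.multi.
import re
from pathlib import Path

def triton_short_tag(dockerfile: str = "docker/Dockerfile.multi") -> str:
    text = Path(dockerfile).read_text()
    match = re.search(r"^ARG\s+TRITON_BASE_TAG=(\d+\.\d+)", text, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"TRITON_BASE_TAG not found in {dockerfile}")
    return "r" + match.group(1)
```

The Jenkins pipeline could shell out to such a helper (or use an equivalent Groovy regex) so the two values cannot drift apart.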
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- constraints.txt (0 hunks)
- docker/Dockerfile.multi (2 hunks)
- docker/Makefile (1 hunks)
- docker/common/install.sh (0 hunks)
- docker/common/install_cuda_toolkit.sh (1 hunks)
- docker/common/install_mpi4py.sh (1 hunks)
- docker/common/install_pytorch.sh (1 hunks)
- docker/common/install_tensorrt.sh (1 hunks)
- jenkins/Build.groovy (1 hunks)
- jenkins/L0_Test.groovy (10 hunks)
- jenkins/current_image_tags.properties (1 hunks)
- requirements.txt (3 hunks)
💤 Files with no reviewable changes (2)
- docker/common/install.sh
- constraints.txt
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Applied to files:
jenkins/current_image_tags.properties
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
jenkins/L0_Test.groovy (2)
2323-2369: Good: sanity uses cu130 PyTorch wheels; mirror DLFW install_pytorch.sh to avoid drift.

This path installs torch 2.9.0 from cu130, matching requirements. Ensure docker/common/install_pytorch.sh is aligned (it still uses cu128). After fixing that script, CI environments and runtime will be consistent.

1812-1872: These are separate `runLLMBuild` function definitions in different files, not a shared function. Build.groovy defines its own `runLLMBuild(pipeline, buildFlags, tarName, is_linux_x86_64)` at line 402, and all its call sites use this signature. L0_Test.groovy defines its own `runLLMBuild(pipeline, cpu_arch, reinstall_dependencies=false, wheel_path="", cpver="cp312")` at line 1812, and all its call sites are consistent with this signature (lines 2323 and 2639 both match the parameter definitions). Additionally, `is_cu12` does not exist anywhere in the codebase—no parameter passes it, and no variable references it. There is no evidence of cross-file function breakage or incompatible call sites.

Likely an incorrect or invalid review comment.
docker/common/install_cuda_toolkit.sh (1)
8-9: Package pins verified—all CUDA 13.0.2 and driver 580.95.05 artifacts resolve across target OSes.

All required packages confirmed in official NVIDIA repositories:
- Ubuntu 22.04 & 24.04: cuda-toolkit-13-0 v13.0.2-1 ✓
- RHEL8 x86_64: cuda-toolkit-13-0-13.0.2 & cuda-compat-13-0-580.95.05 ✓
- RHEL8 aarch64: cuda-toolkit-13-0-13.0.2 & cuda-compat-13-0-580.95.05 ✓
No lag or unavailability detected. Code ready to merge.
docker/Makefile (1)
195-195: Straightforward CUDA base image version bump.

The patch-level updates from CUDA 13.0.0 to 13.0.1 are consistent across all three configurations. Verify that these NGC CUDA images are published and remain available for the release.
Also applies to: 199-199, 204-204
jenkins/current_image_tags.properties (1)
16-19: Staging image tags—verify whether these are PR-specific or persistent.

All four image variables now reference internal staging images with PR-specific branch and commit identifiers in the tag. This is expected for a feature branch but should revert to NGC releases or stable staging before merge.
Confirm:
- Is this file auto-generated in CI, or manually committed for this PR?
- Are these staging images available in the referenced registry, and will they remain available during PR review/testing?
- What is the plan for updating this file before merging to release/1.1?
docker/Dockerfile.multi (2)
74-74: Incorrect review comment—protobuf installation doesn't exist in this codebase.

The review assumes a system protobuf installation block was removed, but examination of the current code and git history shows:

- No protobuf installation in install_base.sh, install_tensorrt.sh, install_pytorch.sh, install_opencv.sh, or any other install scripts
- No `--protobuf` flag exists in install.sh
- No protobuf-specific install script in the docker directory
- Dockerfile.multi line 74 is unchanged in git history (always `--opencv` only)
- TensorRT, PyTorch, and OpenCV are all installed from pre-built binaries/wheels, not compiled from source
- Protobuf appears only in example model requirements.txt files, not the Docker build process

There is no downstream impact to verify because protobuf was never part of the Docker image build pipeline.
2-5: Verified: the infrastructure shift is real, but its intent and integration status are unclear.

The change from NGC public images to internal GitLab staging registries is confirmed:
- Tags now include internal build IDs (36764868, 36928345)
- Protobuf installation is not present in install_base.sh or install_tensorrt.sh (removed entirely, not partial)
- Internal GitLab references appear only in docker/Dockerfile.multi lines 2-3, not in CI workflows
However, gaps remain:
- docker/README.md documents NGC Integration as the standard approach but is now outdated
- No CI/workflow integration found, suggesting this may be staging-only or incomplete
- Unclear if this is temporary or a permanent distribution strategy shift
Recommend clarifying:
- Is this a temporary staging setup or long-term direction?
- Are internal GitLab images accessible to all developers and CI systems?
- Should docker/README.md be updated to reflect the new approach?
The lines under review in docker/common/install_pytorch.sh:

```bash
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-10.html#rel-25-10
TORCH_VERSION="2.9.0"
```
cu128 index is inconsistent with the CUDA 13.0 stack; switch to cu130 to get correct wheels.
requirements.txt and the DLFW 25.10 stack target CUDA 13.0; keeping cu128 here risks CPU wheels or ABI mismatches.
Apply this diff:
```diff
@@
 install_from_pypi() {
@@
-  pip3 uninstall -y torch torchvision torchaudio
-  pip3 install torch==${TORCH_VERSION} torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+  pip3 uninstall -y torch torchvision torchaudio
+  # Use cu130 to match CUDA 13.x toolchain
+  pip3 install torch==${TORCH_VERSION} torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
 }
```

Optionally, consider pinning torchvision to the compatible series for 2.9 (e.g., 0.20.x) to avoid resolver backtracking during image builds.
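After switching the index, a quick check such as the following confirms the installed wheel is actually a CUDA 13.x build rather than a CPU or cu128 wheel. This is an illustrative snippet, not part of the install script.

```python
# Sketch: confirm the installed torch wheel targets CUDA 13.x (cu130 index).
import torch

cuda_build = torch.version.cuda  # None for CPU-only wheels
assert cuda_build is not None and cuda_build.startswith("13."), \
    f"expected a CUDA 13.x build of torch, got {cuda_build}"
print(f"torch {torch.__version__} built against CUDA {cuda_build}")
```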
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In docker/common/install_pytorch.sh around lines 7 to 8, the CUDA tag is set to
cu128 which is inconsistent with the project’s CUDA 13.0 stack; change the wheel
tag from cu128 to cu130 so PyTorch wheels match the DLFW 25.10 / CUDA 13.0
environment, and optionally pin torchvision to a 0.20.x series compatible with
Torch 2.9 to avoid dependency resolver backtracking during image builds.
PR_Github #22428 [ run ] triggered by Bot. Commit:

PR_Github #22428 [ run ] completed with state
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
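For example, an invocation combining a couple of the flags documented above might look like this (illustrative only; the stage name is taken from the examples above):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
```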
docs/source/reference/ci-overview.mdand the
scripts/test_to_stage_mapping.pyhelper.kill
killKill all running builds associated with pull request.
skip
skip --comment COMMENTSkip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipelineReuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.