
Conversation

@ZhanruiSunCh (Collaborator) commented Oct 23, 2025

Summary by CodeRabbit

  • Chores
    • Updated CUDA toolkit to version 13.0.2
    • Updated PyTorch to version 2.9.0
    • Updated TensorRT to version 10.13.3.9
    • Updated supporting NVIDIA libraries and tooling versions
    • Removed CUDA 12.9 build support
    • Removed protobuf dependency
    • Updated base container images to development versions

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
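
For example, a typical comment (a hypothetical invocation, reusing the stage name from the flag examples above) would be:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"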

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

ZhanruiSunCh and others added 18 commits October 21, 2025 03:21
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
…fp8_mpi[DeepSeek-V3-Lite-fp8]

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
@ZhanruiSunCh requested a review from a team as a code owner October 23, 2025 08:44

@ZhanruiSunCh (Collaborator, Author) commented:

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

@tensorrt-cicd (Collaborator) commented:

PR_Github #22363 [ run ] triggered by Bot. Commit: 5f0dac6

@tensorrt-cicd (Collaborator) commented:

PR_Github #22363 [ run ] completed with state SUCCESS. Commit: 5f0dac6
/LLM/release-1.1/L0_MergeRequest_PR pipeline #249 (Partly Tested) completed with status: 'SUCCESS'

Signed-off-by: ZhanruiSunCh <[email protected]>
@ZhanruiSunCh (Collaborator, Author) commented:

/bot run --skip-test

@coderabbitai bot (Contributor) commented Oct 24, 2025

📝 Walkthrough

The changes update TensorRT-LLM dependencies and Docker configuration to support CUDA 13.0.2, PyTorch 2.9.0, and TensorRT 10.13.3.9. CUDA 12.9 build configurations are removed from CI/CD pipelines. Docker images switch to GitLab-based staging builds, protobuf installation is removed, and MPI rank detection via SLURM/OpenMPI is introduced for distributed workloads.

Changes

Cohort / File(s) Summary
Dependency Constraint Removals
constraints.txt, docker/common/install.sh
Removed protobuf dependency constraint (>=4.25.8) and associated WAR advisory comment; protobuf installation feature entirely eliminated from install.sh.
Docker Build Configuration Updates
docker/Dockerfile.multi, docker/Makefile
Updated Dockerfile.multi to use GitLab-based image references, simplified post-installation steps by removing protobuf and Triton symlink handling. Incremented BASE_TAG from 13.0.0 to 13.0.1 across Rocky Linux and Ubuntu 22.04 targets in Makefile.
Docker Installation Script Version Updates
docker/common/install_cuda_toolkit.sh, docker/common/install_pytorch.sh, docker/common/install_tensorrt.sh
Updated CUDA toolkit from 13.0.0_580.65.06 to 13.0.2_580.95.05; PyTorch from 2.8.0 to 2.9.0; TensorRT from 10.13.2.6 to 10.13.3.9 with corresponding updates to cuDNN, cuBLAS, NVRTC, CUDA Runtime, and driver versions.
Docker MPI Support
docker/common/install_mpi4py.sh
Introduced conditional MPI rank detection block for TRTLLM_USE_MPI_KVCACHE, with CUDA runtime import fallback logic (cuda.bindings.runtime → cuda.cudart) and rank detection via SLURM_PROCID or OMPI_COMM_WORLD_RANK environment variables.
Jenkins CI/CD Configuration Cleanup
jenkins/Build.groovy, jenkins/L0_Test.groovy
Removed CUDA 12.9 docker image constants (LLM_DOCKER_IMAGE_12_9, LLM_SBSA_DOCKER_IMAGE_12_9) and CU12 build configurations (CONFIG_LINUX_X86_64_VANILLA_CU12, CONFIG_LINUX_AARCH64_CU12). Simplified runLLMBuild method signature by removing is_cu12 parameter and eliminated all CUDA 12.9-specific test branches and image references.
Jenkins Image Tag Updates
jenkins/current_image_tags.properties
Replaced versioned image variables with unified staging/tritondevel image definitions (LLM_DOCKER_IMAGE, LLM_SBSA_DOCKER_IMAGE, LLM_ROCKYLINUX8_PY310_DOCKER_IMAGE, LLM_ROCKYLINUX8_PY312_DOCKER_IMAGE), removing CUDA 12.9 variant suffixes.
Python Dependencies
requirements.txt
Updated CUDA index from cu128 to cu130; bumped PyTorch to 2.9.0a0 with ecosystem alignment; updated triton from 3.3.1 to 3.5.0; removed pinned pillow==10.3.0; added datasets==3.1.0; adjusted NVIDIA toolchain packages for CUDA 13 compatibility.

Sequence Diagram

sequenceDiagram
    participant Script as install_mpi4py.sh
    participant System as System/Environment
    participant CUDA as CUDA Runtime
    participant MPI as MPI Library

    Script->>Script: Check if TRTLLM_USE_MPI_KVCACHE == 1
    alt Conditional Block Active
        Script->>Script: Import CUDA bindings (try cuda.bindings.runtime)
        alt cudart import fails
            Script->>Script: Fallback to cuda.cudart
        end
        
        Script->>System: Check SLURM_PROCID
        alt SLURM_PROCID exists
            Script->>Script: Set rank from SLURM_PROCID
        else
            Script->>System: Check OMPI_COMM_WORLD_RANK
            alt OMPI_COMM_WORLD_RANK exists
                Script->>Script: Set rank from OMPI_COMM_WORLD_RANK
            else
                Script->>Script: Error: No rank source found
            end
        end
        
        rect rgba(100, 200, 150, 0.2)
            note right of Script: Rank Detection & CUDA Setup
            Script->>CUDA: Compute effective rank (rank % device_count)
            Script->>CUDA: Set device to effective rank
            Script->>Script: Print selected rank and device
        end
        
        Script->>MPI: Continue with MPI comm wrapping
        Script->>MPI: Proceed with installation steps
    else Conditional Block Inactive
        Script->>Script: Skip MPI-specific logic
    end
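
For readers unfamiliar with the cuda-python bindings, the following is a minimal standalone sketch of the rank-detection and device-selection flow shown in the diagram. It assumes the cuda-python package is available and is illustrative only; the actual change is a patch applied to mpi4py inside docker/common/install_mpi4py.sh, discussed in the review comment below.

# Minimal sketch of the rank-detection flow above (illustrative only; the real
# change is a patch applied to mpi4py in docker/common/install_mpi4py.sh).
import os

try:
    # Newer cuda-python layout
    from cuda.bindings import runtime as cudart
except ImportError:
    # Legacy layout
    from cuda import cudart

def select_device_from_rank():
    # Rank can come from SLURM or OpenMPI; fail loudly if neither is set.
    if os.getenv("SLURM_PROCID") is not None:
        rank = int(os.environ["SLURM_PROCID"])
    elif os.getenv("OMPI_COMM_WORLD_RANK") is not None:
        # The review below suggests a second independent `if` here so that
        # conflicting rank sources can be detected.
        rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    else:
        raise RuntimeError("No rank source found (SLURM_PROCID / OMPI_COMM_WORLD_RANK)")

    err, device_count = cudart.cudaGetDeviceCount()
    if err != cudart.cudaError_t.cudaSuccess or device_count == 0:
        raise RuntimeError("Unable to query CUDA devices")

    device = rank % device_count  # map the global rank onto a local GPU index
    (err,) = cudart.cudaSetDevice(device)
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaSetDevice({device}) failed with {err}")
    print(f"rank {rank} -> CUDA device {device}")
    return rank, device

if __name__ == "__main__":
    select_device_from_rank()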

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

The review demands attention across multiple categories: (1) heterogeneous version updates requiring validation of compatibility chains (CUDA 13.0.2, PyTorch 2.9.0, TensorRT 10.13.3.9); (2) substantial restructuring of CI/CD pipelines with CUDA 12.9 removal across two large Groovy files, including method signature changes affecting call sites; (3) new MPI rank detection logic introducing fallback patterns and environment-variable-based control flow; (4) image configuration migration to staging/tritondevel lineage with semantic impact on build sourcing. While individual changes are mostly straightforward, the breadth and interconnected nature of version/configuration updates necessitate cross-file verification to ensure consistent CUDA/PyTorch targeting.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check: ⚠️ Warning. The PR description contains only the template structure: the Description section gives no explanation of the issue and solution, the Test Coverage section lists no tests, and the PR Checklist is unchecked. Since this PR makes significant infrastructure changes (CUDA version upgrades, removal of CUDA 12.9 support, and updates to multiple build configurations and dependencies), the lack of explanatory content makes it impossible to understand the rationale or scope. Resolution: fill in the Description with a brief explanation of why this infrastructure upgrade is being performed and what it provides, document the validating tests under Test Coverage, review and check the PR Checklist items, and at minimum call out the breaking changes (such as the removal of CUDA 12.9 support) and any required migration steps.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)
  • Title Check: ✅ Passed. The title "[TRTLLM-8658][infra] upgrade to DLFW 25.10 and pytorch 2.9.0 / triton 3.5.0" reflects the primary changes (PyTorch updated from 2.8.0 to 2.9.0, Triton updated to 3.5.0, and Docker configuration updated with GitLab-based images and development tags corresponding to DLFW 25.10), follows the repository's required format with the JIRA ticket identifier and [infra] type designation, and is concise enough for teammates to quickly understand the objective of this infrastructure upgrade PR.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docker/common/install_mpi4py.sh (1)

30-71: Fix critical git dependency and rank detection logic; os import not needed.

The patched code uses git apply without installing git first—this will fail in containers that lack git. The rank detection uses elif which prevents detecting when both SLURM_PROCID and OMPI_COMM_WORLD_RANK are simultaneously set; use two separate if statements instead.

The os module concern is resolved—it is already imported at line 10 of the upstream mpi4py 3.1.5 source file.

Critical fixes required:

  1. Install git before running git apply in docker/common/install_mpi4py.sh (after line 18, before the cd and git apply):
+if ! command -v git >/dev/null 2>&1; then
+    if grep -qi 'ubuntu\|debian' /etc/os-release; then
+        apt-get update && apt-get install -y git && apt-get clean && rm -rf /var/lib/apt/lists/*
+    else
+        dnf makecache --refresh && dnf install -y git && dnf clean all
+    fi
+fi
 cd "$TMP_DIR/mpi4py-${MPI4PY_VERSION}"
 git apply <<EOF
  2. In the patch itself, replace the elif with a second if to detect rank conflicts:
-        if(os.getenv("SLURM_PROCID")):
+        if os.getenv("SLURM_PROCID") is not None:
             slurm_rank = int(os.environ["SLURM_PROCID"])
             has_slurm_rank=True
-        elif(os.getenv("OMPI_COMM_WORLD_RANK")):
+        if os.getenv("OMPI_COMM_WORLD_RANK") is not None:
             ompi_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
             has_ompi_rank=True
🧹 Nitpick comments (5)
docker/common/install_tensorrt.sh (1)

5-9: Version bumps look coherent; add a guard to avoid partial installs if repo lag occurs.

Consider pre-checking availability of pinned libcudnn/libcublas/nvrtc/NCCL versions before purge to avoid leaving images in a broken state when mirrors lag.

Example guard:

@@ install_ubuntu_requirements() {
-    apt-get update
+    apt-get update
+    # Verify packages exist before removing current libs
+    apt-cache policy libcudnn9-cuda-13=${CUDNN_VER} | grep Candidate: || { echo "Pinned cuDNN ${CUDNN_VER} not found"; exit 1; }
+    apt-cache policy libcublas-$(echo $CUDA_VER | sed 's/\./-/g')=${CUBLAS_VER} | grep Candidate: || { echo "Pinned cuBLAS ${CUBLAS_VER} not found"; exit 1; }

Same idea can be applied to Rocky with repoquery or simple HEAD checks. This reduces CI flakiness.

Also applies to: 11-16, 19-21

requirements.txt (1)

1-1: Align Torch/TorchVision pins and sanity-check constraints.

  • Torch range allows 2.9.0a0; prefer stable only unless alpha is required.
  • torchvision is unpinned; pin to the compatible series for Torch 2.9 to avoid resolver mismatches in CI.
  • Confirm constraints.txt exists if referenced; remove the “-c constraints.txt” include otherwise.

Suggested edits:

-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-10.html#rel-25-10 uses 2.9.0a0.
-torch>=2.9.0a0,<=2.9.0
-torchvision
+torch==2.9.0
+torchvision~=0.20.0

If you need alpha compatibility for some targets, consider environment markers instead of a broad a0 allowance.

Also applies to: 23-27, 36-36, 67-67
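
As a quick, illustrative post-install check (not part of this PR), something like the following could be run inside the built image to confirm the wheels actually target CUDA 13.x and that torch/torchvision import together cleanly:

# Hypothetical sanity check for the built image: confirm the installed torch
# wheel was built against CUDA 13.x and that torchvision imports alongside it.
import torch
import torchvision

print("torch", torch.__version__, "CUDA", torch.version.cuda)
print("torchvision", torchvision.__version__)

assert torch.version.cuda is not None and torch.version.cuda.startswith("13."), \
    "expected a cu130 build of torch"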

jenkins/L0_Test.groovy (2)

42-43: Staging DLFW image is hard-pinned; add override and fallback.

Hard-coding a staging digest can break external forks or later reruns. Allow override via env and default to the public NGC tag with a fallback to staging.

-// DLFW_IMAGE = "urm.nvidia.com/docker/nvidia/pytorch:25.10-py3"
-DLFW_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/devel:pytorch_25.10-py3.36764868-devel"
+DLFW_IMAGE = env.DLFW_IMAGE ?: "urm.nvidia.com/docker/nvidia/pytorch:25.10-py3"
+// Optional fallback to staging if needed downstream:
+DLFW_IMAGE = env.USE_STAGING_DLFW?.toBoolean() ? "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/devel:pytorch_25.10-py3.36764868-devel" : DLFW_IMAGE

2349-2364: DLFW pip constraint cleanup and CUDA toolkit install are OS-aware; add guard for repeated installs.

On DLFW images, constraint file is cleared; on Ubuntu images, cuda-toolkit-13-0 is installed. Add an idempotency check to skip re-install if nvcc already reports 13.0 to save minutes in CI.

-apt-get -y install cuda-toolkit-13-0
+if ! nvcc --version 2>/dev/null | grep -q "release 13.0"; then
+  apt-get -y install cuda-toolkit-13-0
+fi
jenkins/Build.groovy (1)

457-458: Hardcoded Triton tag—ensure sync and document versioning strategy.

Line 457 hardcodes tritonShortTag = "r25.10", which is then applied to all four Triton repository tags in the CMake configuration. This matches the TRITON_BASE_TAG in docker/Dockerfile.multi (25.10-py3.36928345-devel).

Verify:

  • Is there a mechanism to keep this tag synchronized with docker/Dockerfile.multi's TRITON_BASE_TAG, or will manual updates be required?
  • Should this value be extracted from a shared configuration file or environment variable to avoid future desync?
  • Add a comment linking this to the Dockerfile.multi logic so maintainers understand the coupling.

Alternatively, if dynamic extraction is no longer needed due to the DLFW 25.10 stability, document this decision in the PR description.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d27034 and bfeb7b0.

📒 Files selected for processing (12)
  • constraints.txt (0 hunks)
  • docker/Dockerfile.multi (2 hunks)
  • docker/Makefile (1 hunks)
  • docker/common/install.sh (0 hunks)
  • docker/common/install_cuda_toolkit.sh (1 hunks)
  • docker/common/install_mpi4py.sh (1 hunks)
  • docker/common/install_pytorch.sh (1 hunks)
  • docker/common/install_tensorrt.sh (1 hunks)
  • jenkins/Build.groovy (1 hunks)
  • jenkins/L0_Test.groovy (10 hunks)
  • jenkins/current_image_tags.properties (1 hunks)
  • requirements.txt (3 hunks)
💤 Files with no reviewable changes (2)
  • docker/common/install.sh
  • constraints.txt
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • jenkins/current_image_tags.properties
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
jenkins/L0_Test.groovy (2)

2323-2369: Good: sanity uses cu130 PyTorch wheels; mirror DLFW install_pytorch.sh to avoid drift.

This path installs torch 2.9.0 from cu130, matching requirements. Ensure docker/common/install_pytorch.sh is aligned (it still uses cu128). After fixing that script, CI environments and runtime will be consistent.


1812-1872: runLLMBuild signatures verified; no issue found.

These are separate runLLMBuild function definitions in different files, not a shared function. Build.groovy defines its own runLLMBuild(pipeline, buildFlags, tarName, is_linux_x86_64) at line 402, and all its call sites use this signature. L0_Test.groovy defines its own runLLMBuild(pipeline, cpu_arch, reinstall_dependencies=false, wheel_path="", cpver="cp312") at line 1812, and all its call sites are consistent with this signature (lines 2323 and 2639 both match the parameter definitions). Additionally, is_cu12 does not exist anywhere in the codebase—no parameter passes it, and no variable references it. There is no evidence of cross-file function breakage or incompatible call sites.

Likely an incorrect or invalid review comment.

docker/common/install_cuda_toolkit.sh (1)

8-9: Package pins verified—all CUDA 13.0.2 and driver 580.95.05 artifacts resolve across target OSes.

All required packages confirmed in official NVIDIA repositories:

  • Ubuntu 22.04 & 24.04: cuda-toolkit-13-0 v13.0.2-1 ✓
  • RHEL8 x86_64: cuda-toolkit-13-0-13.0.2 & cuda-compat-13-0-580.95.05 ✓
  • RHEL8 aarch64: cuda-toolkit-13-0-13.0.2 & cuda-compat-13-0-580.95.05 ✓

No lag or unavailability detected. Code ready to merge.

docker/Makefile (1)

195-195: Straightforward CUDA base image version bump.

The patch-level updates from CUDA 13.0.0 to 13.0.1 are consistent across all three configurations. Verify that these NGC CUDA images are published and remain available for the release.

Also applies to: 199-199, 204-204

jenkins/current_image_tags.properties (1)

16-19: Staging image tags—verify intended for PR or persistent.

All four image variables now reference internal staging images with PR-specific branch and commit identifiers in the tag. This is expected for a feature branch but should revert to NGC releases or stable staging before merge.

Confirm:

  • Is this file auto-generated in CI, or manually committed for this PR?
  • Are these staging images available in the referenced registry, and will they remain available during PR review/testing?
  • What is the plan for updating this file before merging to release/1.1?
docker/Dockerfile.multi (2)

74-74: Incorrect review comment—protobuf installation doesn't exist in this codebase.

The review assumes a system protobuf installation block was removed, but examination of the current code and git history shows:

  • No protobuf installation in install_base.sh, install_tensorrt.sh, install_pytorch.sh, install_opencv.sh, or any other install scripts
  • No --protobuf flag exists in install.sh
  • No protobuf-specific install script in the docker directory
  • Dockerfile.multi line 74 is unchanged in git history (always --opencv only)
  • TensorRT, PyTorch, and OpenCV are all installed from pre-built binaries/wheels, not compiled from source
  • Protobuf appears only in example model requirements.txt files, not the Docker build process

There is no downstream impact to verify because protobuf was never part of the Docker image build pipeline.

Likely an incorrect or invalid review comment.


2-5: Verified: Infrastructure shift is real, but intent and integration status unclear.

The change from NGC public images to internal GitLab staging registries is confirmed:

  • Tags now include internal build IDs (36764868, 36928345)
  • Protobuf installation is not present in install_base.sh or install_tensorrt.sh (removed entirely, not partial)
  • Internal GitLab references appear only in docker/Dockerfile.multi lines 2-3, not in CI workflows

However, gaps remain:

  • docker/README.md documents NGC Integration as the standard approach but is now outdated
  • No CI/workflow integration found, suggesting this may be staging-only or incomplete
  • Unclear if this is temporary or a permanent distribution strategy shift

Recommend clarifying:

  1. Is this a temporary staging setup or long-term direction?
  2. Are internal GitLab images accessible to all developers and CI systems?
  3. Should docker/README.md be updated to reflect the new approach?

Comment on lines +7 to +8
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-10.html#rel-25-10
TORCH_VERSION="2.9.0"

⚠️ Potential issue | 🟠 Major

cu128 index is inconsistent with the CUDA 13.0 stack; switch to cu130 to get correct wheels.

requirements.txt and the DLFW 25.10 stack target CUDA 13.0; keeping cu128 here risks CPU wheels or ABI mismatches.

Apply this diff:

@@
 install_from_pypi() {
@@
-    pip3 uninstall -y torch torchvision torchaudio
-    pip3 install torch==${TORCH_VERSION} torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+    pip3 uninstall -y torch torchvision torchaudio
+    # Use cu130 to match CUDA 13.x toolchain
+    pip3 install torch==${TORCH_VERSION} torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
 }

Optionally, consider pinning torchvision to the compatible series for 2.9 (e.g., 0.20.x) to avoid resolver backtracking during image builds.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In docker/common/install_pytorch.sh around lines 7 to 8, the CUDA tag is set to
cu128 which is inconsistent with the project’s CUDA 13.0 stack; change the wheel
tag from cu128 to cu130 so PyTorch wheels match the DLFW 25.10 / CUDA 13.0
environment, and optionally pin torchvision to a 0.20.x series compatible with
Torch 2.9 to avoid dependency resolver backtracking during image builds.

@tensorrt-cicd (Collaborator) commented:

PR_Github #22428 [ run ] triggered by Bot. Commit: bfeb7b0

@tensorrt-cicd (Collaborator) commented:

PR_Github #22428 [ run ] completed with state SUCCESS. Commit: bfeb7b0
/LLM/release-1.1/L0_MergeRequest_PR pipeline #256 (Partly Tested) completed with status: 'FAILURE'
