[TRTLLM-8658][infra] upgrade to DLFW 25.10 and pytorch 2.9.0 / triton 3.5.0 #8621
base: release/1.1
Conversation
Signed-off-by: ZhanruiSunCh <[email protected]>
…fp8_mpi[DeepSeek-V3-Lite-fp8] Signed-off-by: ZhanruiSunCh <[email protected]>
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1"

PR_Github #22363 [ run ] triggered by Bot. Commit:

PR_Github #22363 [ run ] completed with state
Signed-off-by: ZhanruiSunCh <[email protected]>
/bot run --skip-test
📝 Walkthrough

The changes update TensorRT-LLM dependencies and Docker configuration to support CUDA 13.0-13.2, PyTorch 2.9.0, and TensorRT 10.13.3.9. CUDA 12.9 build configurations are removed from CI/CD pipelines. Docker images switch to GitLab-based staging builds, protobuf installation is removed, and MPI rank detection via SLURM/OpenMPI is introduced for distributed workloads.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Script as install_mpi4py.sh
    participant System as System/Environment
    participant CUDA as CUDA Runtime
    participant MPI as MPI Library

    Script->>Script: Check if TRTLLM_USE_MPI_KVCACHE == 1
    alt Conditional Block Active
        Script->>Script: Import CUDA bindings (try cuda.bindings.runtime)
        alt cudart import fails
            Script->>Script: Fallback to cuda.cudart
        end
        Script->>System: Check SLURM_PROCID
        alt SLURM_PROCID exists
            Script->>Script: Set rank from SLURM_PROCID
        else
            Script->>System: Check OMPI_COMM_WORLD_RANK
            alt OMPI_COMM_WORLD_RANK exists
                Script->>Script: Set rank from OMPI_COMM_WORLD_RANK
            else
                Script->>Script: Error: No rank source found
            end
        end
        rect rgba(100, 200, 150, 0.2)
            note right of Script: Rank Detection & CUDA Setup
            Script->>CUDA: Compute effective rank (rank % device_count)
            Script->>CUDA: Set device to effective rank
            Script->>Script: Print selected rank and device
        end
        Script->>MPI: Continue with MPI comm wrapping
        Script->>MPI: Proceed with installation steps
    else Conditional Block Inactive
        Script->>Script: Skip MPI-specific logic
    end
```
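For orientation, the rank-to-device selection described in the diagram could look roughly like the following. This is a minimal sketch of the diagram's flow, not the actual patch; the function name is illustrative and it assumes the cuda-python package is installed.

```python
# Sketch of the rank detection + CUDA device selection flow from the diagram above.
# Assumes the cuda-python package; error codes are ignored for brevity.
import os

try:
    from cuda.bindings import runtime as cudart  # newer cuda-python layout
except ImportError:
    from cuda import cudart                      # fallback to the older layout

def select_device_from_env():
    """Pick a CUDA device from SLURM/OpenMPI rank environment variables."""
    if os.getenv("SLURM_PROCID") is not None:
        rank = int(os.environ["SLURM_PROCID"])
    elif os.getenv("OMPI_COMM_WORLD_RANK") is not None:
        rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    else:
        raise RuntimeError("No rank source found (SLURM_PROCID / OMPI_COMM_WORLD_RANK)")

    _, device_count = cudart.cudaGetDeviceCount()
    effective_rank = rank % device_count   # map global rank onto local devices
    cudart.cudaSetDevice(effective_rank)
    print(f"rank={rank} -> cuda device {effective_rank}")
    return effective_rank
```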
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

The review demands attention across multiple categories: (1) heterogeneous version updates requiring validation of compatibility chains (CUDA 13.0/13.2, PyTorch 2.9.0, TensorRT 10.13.3.9); (2) substantial restructuring of CI/CD pipelines with CUDA 12.9 removal across two large Groovy files, including method signature changes affecting call sites; (3) new MPI rank detection logic introducing fallback patterns and environment-variable-based control flow; (4) image configuration migration to staging/tritondevel lineage with semantic impact on build sourcing. While individual changes are mostly straightforward, the breadth and interconnected nature of version/configuration updates necessitate cross-file verification to ensure consistent CUDA/PyTorch targeting.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docker/common/install_mpi4py.sh (1)
30-71: Fix critical git dependency and rank detection logic; os import not needed.

The patched code uses `git apply` without installing git first—this will fail in containers that lack git. The rank detection uses `elif`, which prevents detecting when both SLURM_PROCID and OMPI_COMM_WORLD_RANK are simultaneously set; use two separate `if` statements instead. The `os` module concern is resolved—it is already imported at line 10 of the upstream mpi4py 3.1.5 source file.

Critical fixes required:

- Install git before running `git apply` in docker/common/install_mpi4py.sh (after line 18, before the `cd` and `git apply`):

```diff
+if ! command -v git >/dev/null 2>&1; then
+  if grep -qi 'ubuntu\|debian' /etc/os-release; then
+    apt-get update && apt-get install -y git && apt-get clean && rm -rf /var/lib/apt/lists/*
+  else
+    dnf makecache --refresh && dnf install -y git && dnf clean all
+  fi
+fi
 cd "$TMP_DIR/mpi4py-${MPI4PY_VERSION}"
 git apply <<EOF
```

- In the patch itself, replace the `elif` with a second `if` to detect rank conflicts:

```diff
-    if(os.getenv("SLURM_PROCID")):
+    if os.getenv("SLURM_PROCID") is not None:
         slurm_rank = int(os.environ["SLURM_PROCID"])
         has_slurm_rank=True
-    elif(os.getenv("OMPI_COMM_WORLD_RANK")):
+    if os.getenv("OMPI_COMM_WORLD_RANK") is not None:
         ompi_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
         has_ompi_rank=True
```
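For clarity, the conflict-aware variant the diff above is driving at could be sketched as follows. This is illustrative only; the variable names mirror the diff, and the consistency check is an assumption about the desired behavior rather than code from the PR.

```python
# Sketch of conflict-aware rank detection: both env vars are checked independently,
# and a mismatch between the two rank sources is reported instead of silently ignored.
import os

has_slurm_rank = has_ompi_rank = False
slurm_rank = ompi_rank = None

if os.getenv("SLURM_PROCID") is not None:
    slurm_rank = int(os.environ["SLURM_PROCID"])
    has_slurm_rank = True
if os.getenv("OMPI_COMM_WORLD_RANK") is not None:
    ompi_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    has_ompi_rank = True

if has_slurm_rank and has_ompi_rank and slurm_rank != ompi_rank:
    raise RuntimeError(f"Conflicting ranks: SLURM={slurm_rank}, OMPI={ompi_rank}")

rank = slurm_rank if has_slurm_rank else ompi_rank
if rank is None:
    raise RuntimeError("No rank source found (SLURM_PROCID / OMPI_COMM_WORLD_RANK)")
```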
🧹 Nitpick comments (5)
docker/common/install_tensorrt.sh (1)
5-9: Version bumps look coherent; add a guard to avoid partial installs if repo lag occurs.

Consider pre-checking availability of the pinned libcudnn/libcublas/nvrtc/NCCL versions before the purge step to avoid leaving images in a broken state when mirrors lag.

Example guard:

```diff
@@ install_ubuntu_requirements() {
-    apt-get update
+    apt-get update
+    # Verify packages exist before removing current libs
+    apt-cache policy libcudnn9-cuda-13=${CUDNN_VER} | grep Candidate: || { echo "Pinned cuDNN ${CUDNN_VER} not found"; exit 1; }
+    apt-cache policy libcublas-$(echo $CUDA_VER | sed 's/\./-/g')=${CUBLAS_VER} | grep Candidate: || { echo "Pinned cuBLAS ${CUBLAS_VER} not found"; exit 1; }
```

The same idea can be applied to Rocky with repoquery or simple HEAD checks. This reduces CI flakiness.
Also applies to: 11-16, 19-21
requirements.txt (1)
1-1: Align Torch/TorchVision pins and sanity-check constraints.
- Torch range allows 2.9.0a0; prefer stable only unless alpha is required.
- torchvision is unpinned; pin to the compatible series for Torch 2.9 to avoid resolver mismatches in CI.
- Confirm constraints.txt exists if referenced; remove the “-c constraints.txt” include otherwise.
Suggested edits:
```diff
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-10.html#rel-25-10 uses 2.9.0a0.
-torch>=2.9.0a0,<=2.9.0
-torchvision
+torch==2.9.0
+torchvision~=0.20.0
```

If you need alpha compatibility for some targets, consider environment markers instead of a broad a0 allowance.
Also applies to: 23-27, 36-36, 67-67
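As a lightweight complement to the pinning above, a post-install sanity check along these lines could catch resolver mismatches early in CI. This is a sketch, not an existing script in the repo.

```python
# Sketch: verify the resolver produced a stable torch 2.9.x and that torchvision imports alongside it.
import torch
import torchvision

torch_version = torch.__version__.split("+")[0]  # strip local build suffix, e.g. "+cu130"
assert torch_version.startswith("2.9."), f"expected stable torch 2.9.x, got {torch.__version__}"
print(f"torch {torch.__version__} / torchvision {torchvision.__version__}")
```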
jenkins/L0_Test.groovy (2)
42-43: Staging DLFW image is hard-pinned; add override and fallback.

Hard-coding a staging digest can break external forks or later reruns. Allow override via env and default to the public NGC tag with a fallback to staging.

```diff
-// DLFW_IMAGE = "urm.nvidia.com/docker/nvidia/pytorch:25.10-py3"
-DLFW_IMAGE = "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/devel:pytorch_25.10-py3.36764868-devel"
+DLFW_IMAGE = env.DLFW_IMAGE ?: "urm.nvidia.com/docker/nvidia/pytorch:25.10-py3"
+// Optional fallback to staging if needed downstream:
+DLFW_IMAGE = env.USE_STAGING_DLFW?.toBoolean() ? "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm-staging/devel:pytorch_25.10-py3.36764868-devel" : DLFW_IMAGE
```
2349-2364: DLFW pip constraint cleanup and CUDA toolkit install are OS-aware; add guard for repeated installs.

On DLFW images, the constraint file is cleared; on Ubuntu images, cuda-toolkit-13-0 is installed. Add an idempotency check to skip re-install if nvcc already reports 13.0, to save minutes in CI.

```diff
-apt-get -y install cuda-toolkit-13-0
+if ! nvcc --version 2>/dev/null | grep -q "release 13.0"; then
+  apt-get -y install cuda-toolkit-13-0
+fi
```

jenkins/Build.groovy (1)
457-458: Hardcoded Triton tag—ensure sync and document versioning strategy.

Line 457 hardcodes `tritonShortTag = "r25.10"`, which is then applied to all four Triton repository tags in the CMake configuration. This matches the `TRITON_BASE_TAG` in docker/Dockerfile.multi (25.10-py3.36928345-devel).

Verify:

- Is there a mechanism to keep this tag synchronized with docker/Dockerfile.multi's `TRITON_BASE_TAG`, or will manual updates be required?
- Should this value be extracted from a shared configuration file or environment variable to avoid future desync?
- Add a comment linking this to the Dockerfile.multi logic so maintainers understand the coupling.
Alternatively, if dynamic extraction is no longer needed due to the DLFW 25.10 stability, document this decision in the PR description.
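If dynamic extraction is ever reinstated, a small helper along these lines could derive the short tag from the Dockerfile instead of hardcoding it. This is a sketch: it assumes TRITON_BASE_TAG is declared as a Dockerfile ARG in docker/Dockerfile.multi, and the helper name is illustrative.

```python
# Sketch: derive tritonShortTag (e.g. "r25.10") from TRITON_BASE_TAG
# (e.g. "25.10-py3.36928345-devel") declared in docker/Dockerfile.multi.
import re
from pathlib import Path

def triton_short_tag(dockerfile: str = "docker/Dockerfile.multi") -> str:
    text = Path(dockerfile).read_text()
    match = re.search(r"^ARG\s+TRITON_BASE_TAG=(\d+\.\d+)", text, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"TRITON_BASE_TAG not found in {dockerfile}")
    return "r" + match.group(1)
```

The Jenkins pipeline could shell out to such a helper (or use an equivalent Groovy regex) so the two values cannot drift apart.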
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- constraints.txt (0 hunks)
- docker/Dockerfile.multi (2 hunks)
- docker/Makefile (1 hunks)
- docker/common/install.sh (0 hunks)
- docker/common/install_cuda_toolkit.sh (1 hunks)
- docker/common/install_mpi4py.sh (1 hunks)
- docker/common/install_pytorch.sh (1 hunks)
- docker/common/install_tensorrt.sh (1 hunks)
- jenkins/Build.groovy (1 hunks)
- jenkins/L0_Test.groovy (10 hunks)
- jenkins/current_image_tags.properties (1 hunks)
- requirements.txt (3 hunks)
💤 Files with no reviewable changes (2)
- docker/common/install.sh
- constraints.txt
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: nv-lschneider
PR: NVIDIA/TensorRT-LLM#7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Applied to files:
jenkins/current_image_tags.properties
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
jenkins/L0_Test.groovy (2)
2323-2369: Good: sanity uses cu130 PyTorch wheels; mirror DLFW install_pytorch.sh to avoid drift.

This path installs torch 2.9.0 from cu130, matching requirements. Ensure docker/common/install_pytorch.sh is aligned (it still uses cu128). After fixing that script, CI environments and runtime will be consistent.

1812-1872: These are separate `runLLMBuild` function definitions in different files, not a shared function. Build.groovy defines its own `runLLMBuild(pipeline, buildFlags, tarName, is_linux_x86_64)` at line 402, and all its call sites use this signature. L0_Test.groovy defines its own `runLLMBuild(pipeline, cpu_arch, reinstall_dependencies=false, wheel_path="", cpver="cp312")` at line 1812, and all its call sites are consistent with this signature (lines 2323 and 2639 both match the parameter definitions). Additionally, `is_cu12` does not exist anywhere in the codebase—no parameter passes it, and no variable references it. There is no evidence of cross-file function breakage or incompatible call sites.

Likely an incorrect or invalid review comment.
docker/common/install_cuda_toolkit.sh (1)
8-9: Package pins verified—all CUDA 13.0.2 and driver 580.95.05 artifacts resolve across target OSes.

All required packages confirmed in official NVIDIA repositories:
- Ubuntu 22.04 & 24.04: cuda-toolkit-13-0 v13.0.2-1 ✓
- RHEL8 x86_64: cuda-toolkit-13-0-13.0.2 & cuda-compat-13-0-580.95.05 ✓
- RHEL8 aarch64: cuda-toolkit-13-0-13.0.2 & cuda-compat-13-0-580.95.05 ✓
No lag or unavailability detected. Code ready to merge.
docker/Makefile (1)
195-195: Straightforward CUDA base image version bump.

The patch-level updates from CUDA 13.0.0 to 13.0.1 are consistent across all three configurations. Verify that these NGC CUDA images are published and remain available for the release.
Also applies to: 199-199, 204-204
jenkins/current_image_tags.properties (1)
16-19: Staging image tags—verify whether these are PR-specific or persistent.

All four image variables now reference internal staging images with PR-specific branch and commit identifiers in the tag. This is expected for a feature branch but should revert to NGC releases or stable staging before merge.
Confirm:
- Is this file auto-generated in CI, or manually committed for this PR?
- Are these staging images available in the referenced registry, and will they remain available during PR review/testing?
- What is the plan for updating this file before merging to release/1.1?
docker/Dockerfile.multi (2)
74-74: Incorrect review comment—protobuf installation doesn't exist in this codebase.

The review assumes a system protobuf installation block was removed, but examination of the current code and git history shows:

- No protobuf installation in install_base.sh, install_tensorrt.sh, install_pytorch.sh, install_opencv.sh, or any other install scripts
- No `--protobuf` flag exists in install.sh
- No protobuf-specific install script in the docker directory
- Dockerfile.multi line 74 is unchanged in git history (always `--opencv` only)
- TensorRT, PyTorch, and OpenCV are all installed from pre-built binaries/wheels, not compiled from source
- Protobuf appears only in example model requirements.txt files, not the Docker build process

There is no downstream impact to verify because protobuf was never part of the Docker image build pipeline.
2-5: Verified: the infrastructure shift is real, but its intent and integration status are unclear.

The change from NGC public images to internal GitLab staging registries is confirmed:
- Tags now include internal build IDs (36764868, 36928345)
- Protobuf installation is not present in install_base.sh or install_tensorrt.sh (removed entirely, not partial)
- Internal GitLab references appear only in docker/Dockerfile.multi lines 2-3, not in CI workflows
However, gaps remain:
- docker/README.md documents NGC Integration as the standard approach but is now outdated
- No CI/workflow integration found, suggesting this may be staging-only or incomplete
- Unclear if this is temporary or a permanent distribution strategy shift
Recommend clarifying:
- Is this a temporary staging setup or long-term direction?
- Are internal GitLab images accessible to all developers and CI systems?
- Should docker/README.md be updated to reflect the new approach?
The lines under review in docker/common/install_pytorch.sh:

```bash
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-10.html#rel-25-10
TORCH_VERSION="2.9.0"
```
cu128 index is inconsistent with the CUDA 13.0 stack; switch to cu130 to get correct wheels.
requirements.txt and the DLFW 25.10 stack target CUDA 13.0; keeping cu128 here risks CPU wheels or ABI mismatches.
Apply this diff:
```diff
@@
 install_from_pypi() {
@@
-  pip3 uninstall -y torch torchvision torchaudio
-  pip3 install torch==${TORCH_VERSION} torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+  pip3 uninstall -y torch torchvision torchaudio
+  # Use cu130 to match CUDA 13.x toolchain
+  pip3 install torch==${TORCH_VERSION} torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
 }
```

Optionally, consider pinning torchvision to the compatible series for 2.9 (e.g., 0.20.x) to avoid resolver backtracking during image builds.
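After switching the index, a quick check such as the following confirms the installed wheel is actually a CUDA 13.x build rather than a CPU or cu128 wheel. This is an illustrative snippet, not part of the install script.

```python
# Sketch: confirm the installed torch wheel targets CUDA 13.x (cu130 index).
import torch

cuda_build = torch.version.cuda  # None for CPU-only wheels
assert cuda_build is not None and cuda_build.startswith("13."), \
    f"expected a CUDA 13.x build of torch, got {cuda_build}"
print(f"torch {torch.__version__} built against CUDA {cuda_build}")
```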
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In docker/common/install_pytorch.sh around lines 7 to 8, the CUDA tag is set to
cu128 which is inconsistent with the project’s CUDA 13.0 stack; change the wheel
tag from cu128 to cu130 so PyTorch wheels match the DLFW 25.10 / CUDA 13.0
environment, and optionally pin torchvision to a 0.20.x series compatible with
Torch 2.9 to avoid dependency resolver backtracking during image builds.
PR_Github #22428 [ run ] triggered by Bot. Commit:

PR_Github #22428 [ run ] completed with state
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
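For example, an invocation combining a couple of the flags documented above might look like this (illustrative only; the stage name is taken from the examples above):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
```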
docs/source/reference/ci-overview.mdand the
scripts/test_to_stage_mapping.pyhelper.kill
killKill all running builds associated with pull request.
skip
skip --comment COMMENTSkip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipelineReuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.