TensorRT-LLM: open-source library for optimized LLM inference on NVIDIA GPUs. Python and C++ codebase supporting TensorRT engine-based and PyTorch-based execution paths.
If a `CLAUDE.local.md` file exists alongside this file, read and respect it — it contains developer-specific overrides that supplement this shared guidance.
CRITICAL (YOU MUST):
- Read and follow `CODING_GUIDELINES.md` for ALL code changes (C++ and Python)
- NVIDIA copyright header on ALL new files (update year on modified files)
- `git commit -s` (DCO sign-off required). Never attribute AI tools in the sign-off line. Always rely on `git` to do the sign-off instead of directly adding a sign-off in the commit message.
- Do not add co-authors to the git commit message unless explicitly instructed to do so by the user.
- `pre-commit` hooks run on commit — if files are modified by hooks, re-stage and commit again
- PR title format: `[JIRA/NVBUG/None][type] description` (e.g., `[TRTLLM-5516][perf] optimize cuda graph padding`)
- Set `LLM_MODELS_ROOT` env var when running tests that need model weights
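The PR title convention above is mechanical enough to check with a quick script. The helper below is a hypothetical sketch (not part of the repo); its regex encodes only the pattern shown in the example, so adjust it if your ticket prefixes differ:

```python
import re

# Matches titles like "[TRTLLM-5516][perf] optimize cuda graph padding".
# The first bracket holds a JIRA key, an NVBug number, or "None";
# the second holds a lowercase change type such as "fix" or "perf".
TITLE_RE = re.compile(r"^\[(?:[A-Z]+-\d+|\d+|None)\]\[[a-z]+\] \S.*$")

def is_valid_pr_title(title: str) -> bool:
    """Return True if the title follows the [ticket][type] description format."""
    return bool(TITLE_RE.match(title))

print(is_valid_pr_title("[TRTLLM-5516][perf] optimize cuda graph padding"))  # True
print(is_valid_pr_title("optimize cuda graph padding"))                      # False
```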
| Task | Command |
|---|---|
| Unit tests | `pytest tests/unittest/` |
| Specific test | `pytest tests/unittest/llmapi/test_llm_args.py` |
| Pattern match | `pytest tests/unittest -k "test_llm_args"` |
| Integration tests | `LLM_MODELS_ROOT=/path/to/models pytest tests/integration/defs/...` |
| Serve model | `trtllm-serve <hf_model> --port 8000` |
| Serve with config | `trtllm-serve <hf_model> --config config.yaml` |
| Benchmark | `trtllm-bench --model <hf_model> throughput --dataset <path>` |
| Find CI stage for test | `python scripts/test_to_stage_mapping.py --tests "test_name"` |
Building TensorRT-LLM requires Docker and may involve compiling C++ components. See build from source for full instructions, or pip install for pre-built wheels. For container images, see NGC containers.
`examples/configs/database/` contains 170+ Pareto-optimized serving configurations across multiple models, GPUs, ISL/OSL combinations, and concurrency levels. Use these as starting points for deployment and benchmarking rather than hand-tuning parameters.
See deployment guides for model-specific walkthroughs.
See architecture diagram for the full Mermaid diagram.
| Backend | Status | Entry Point | Key Path |
|---|---|---|---|
| PyTorch | Default | `LLM(backend="pytorch")` | `_torch/pyexecutor/` → PyExecutor → PyTorch Engine |
| AutoDeploy | Beta | `LLM(backend="_autodeploy")` | `_torch/auto_deploy/` → ADExecutor → graph transforms + `torch.export` |
| TensorRT | Legacy | `LLM(backend="tensorrt")` | `builder.py` → `trtllm.Executor` → TensorRT Engine |
Both PyTorch and TensorRT backends share these C++ components:
- Scheduling pipeline: Scheduler → BatchManager (in-flight batching) → KV Cache Manager
- Decoding pipeline: Decoder (token generation orchestration) → Sampling
```
HuggingFace Model → LLM API → Executor (PyTorch/AutoDeploy/TensorRT)
  → Scheduler → Model Forward → Decoder → Sampling → Generated Tokens
```
- `trtllm-serve`: OpenAI-compatible REST + gRPC server; supports all backends
- Disaggregated serving: separates prefill (context) and decode (generation) across GPUs
  - KV cache exchange via NIXL (default), UCX, or MPI
| File | Role |
|---|---|
| `tensorrt_llm/llmapi/llm.py` | Main API entry point |
| `tensorrt_llm/llmapi/llm_args.py` | Complete configuration schema (Pydantic) |
| `tensorrt_llm/llmapi/llm_utils.py` | Model loading, model-specific default overrides |
| `tensorrt_llm/models/modeling_utils.py` | Base classes for all models (`PretrainedConfig`, `PretrainedModel`) |
| `tensorrt_llm/executor/executor.py` | Execution abstraction (`GenerationExecutor`) |
| `tensorrt_llm/models/automodel.py` | Auto-discovery and model registry |
| Pattern | Key Points |
|---|---|
| Config hierarchy | `LlmArgs` → `TrtLlmArgs` / `TorchLlmArgs`; model-specific defaults override generics; Pydantic validation |
| Model architecture | Each model: Config (inherits `PretrainedConfig`) + ForCausalLM (inherits `PretrainedModel`) |
| Model defaults | Architecture-specific overrides in `llm_utils.py` (attention kernels, quant, spec decoding, cache) |
| Distributed execution | Tensor/pipeline parallelism via `Mapping` class; multiple backends (MPI, Ray, RPC) |
| Auto-discovery | Models self-register via `automodel.py`, resolved by the HF config `architectures` field |
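The auto-discovery row describes a common self-registration pattern. The sketch below is a generic illustration of the idea, not the actual `automodel.py` code — every name in it is invented for the example:

```python
# Generic sketch of a model registry keyed by HF "architectures" names.
# All class/function names here are hypothetical; see
# tensorrt_llm/models/automodel.py for the real implementation.
MODEL_REGISTRY: dict[str, type] = {}

def register_model(*architectures: str):
    """Class decorator: map one or more HF architecture names to a model class."""
    def wrap(cls):
        for arch in architectures:
            MODEL_REGISTRY[arch] = cls
        return cls
    return wrap

@register_model("LlamaForCausalLM")
class LlamaModel:
    pass

def resolve(hf_config: dict) -> type:
    # HF configs carry a list under "architectures"; use the first registered match.
    for arch in hf_config.get("architectures", []):
        if arch in MODEL_REGISTRY:
            return MODEL_REGISTRY[arch]
    raise ValueError("no registered model for this config")

print(resolve({"architectures": ["LlamaForCausalLM"]}).__name__)  # LlamaModel
```

The decorator runs at import time, so merely importing the module that defines a model class is enough to make it resolvable from an HF config.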
- Pre-commit modifies files in-place — if hooks fail, files are already modified. Re-stage (`git add`) and commit again.
- Protected APIs exist — changes to LLM API signatures will fail `tests/api_stability` tests. Get code owner review.
- Integration tests need GPUs + models — always set `LLM_MODELS_ROOT` and ensure GPU access. Unit tests don't.
- Copyright year — update to current year when modifying existing files; add full header to new files.
- Avoid broad exception handling — catch specific exceptions, not bare `except:` (see `CODING_GUIDELINES.md`).
- One concern per PR — avoid scope creep. If a PR touches unrelated areas, split it.
- User-facing configuration classes — when editing or defining any user-facing configuration classes (particularly `LlmArgs` or any class used in its fields), you MUST follow the Pydantic guidelines in `CODING_GUIDELINES.md`.
- Set up build environment (see installation docs)
- Make changes following `CODING_GUIDELINES.md`
- Test locally with `pytest`
- The main repository (`upstream`) is located at https://github.com/NVIDIA/TensorRT-LLM/
- Branches should always be pushed to the user-specified fork (usually `origin`)
- If pushing fails due to pre-push pre-commit hooks getting updated, just re-push immediately
- PRs should be opened on the main repository
- PR title format: `[JIRA/NVBUG/None][type] description` (e.g., `[TRTLLM-5516][perf] optimize cuda graph padding`)
- Target `main` unless fixing a release branch bug
- See `CONTRIBUTING.md` for full PR policies
See CI overview for full details.
| Layer | Location | Notes |
|---|---|---|
| Unit tests | `tests/unittest/` | Run in pre-merge CI; some tests require GPU |
| API stability | `tests/api_stability/` | Protects committed API signatures |
| Integration tests | `tests/integration/defs/` | Requires GPU + `LLM_MODELS_ROOT` |
| Test lists | `tests/integration/test_lists/test-db/` | Per-GPU YAML files (`l0_a10.yml`, `l0_h100.yml`, etc.) |
| Test waives | `tests/integration/test_lists/waives.txt` | Skip known-failing tests with NVBug links |
| Performance | See benchmarking guide | `trtllm-bench` and `trtllm-serve` benchmarks |
CI is triggered by posting comments on the PR. Basic commands:
- `/bot run` — trigger the standard CI pipeline
- `/bot run --disable-fail-fast` — run all stages even if earlier ones fail (only add when explicitly needed)
- `/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"` — include AutoDeploy CI stages (use for AutoDeploy-related PRs)
For a full list of up-to-date bot commands, post `/bot help` as a PR comment and check the bot's reply.
CI tests run on internal NVIDIA Jenkins infrastructure (blossom-ci). To retrieve failed test cases from a PR:
Step 1: Get the Jenkins build number from PR comments
The CI bot (`tensorrt-cicd`) posts comments with links to the Jenkins build. Extract the `L0_MergeRequest_PR` build number:
```bash
PR_NUM=<pr_number>
BUILD_NUM=$(gh api "repos/NVIDIA/TensorRT-LLM/issues/${PR_NUM}/comments" --jq \
  '[.[] | select(.user.login == "tensorrt-cicd") | select(.body | test("L0_MergeRequest_PR"))] | last | .body' \
  | grep -oP 'L0_MergeRequest_PR/\K\d+')
```

Step 2: Query the Jenkins testReport API for failures
Resolve the Jenkins base URL dynamically from the internal shortcut (requires corporate network):
```bash
JENKINS_BASE="$(curl -skI 'https://nv/trt-llm-cicd' 2>/dev/null | grep -i '^location:' | sed 's/^[Ll]ocation: *//;s/[[:space:]]*$//')job/main/job/L0_MergeRequest_PR"
curl -s "${JENKINS_BASE}/${BUILD_NUM}/testReport/api/json" | python3 -c "
import json, sys
data = json.load(sys.stdin)
print(f'Summary: {data[\"passCount\"]} passed, {data[\"failCount\"]} failed, {data[\"skipCount\"]} skipped')
failed = []
for suite in data.get('suites', []):
    for case in suite.get('cases', []):
        if case.get('status') in ('FAILED', 'REGRESSION'):
            failed.append(case)
if not failed:
    print('No test failures!')
else:
    print(f'Failed tests ({len(failed)}):')
    for f in failed:
        print(f'  - {f[\"className\"]}.{f[\"name\"]}')
        err = (f.get('errorDetails') or '')[:200]
        if err:
            print(f'    Error: {err}')
"
```

Step 3 (if needed): Get full stdout/stderr for a specific failure
The `errorStackTrace` can be incomplete when errors originate from subprocesses. In that case, fetch `stdout` and `stderr` for the specific test case to find the real error:
```bash
curl -s "${JENKINS_BASE}/${BUILD_NUM}/testReport/api/json" | python3 -c "
import json, sys
data = json.load(sys.stdin)
for suite in data.get('suites', []):
    for case in suite.get('cases', []):
        if case.get('status') in ('FAILED', 'REGRESSION'):
            name = f'{case[\"className\"]}.{case[\"name\"]}'
            if '<search_term>' in name:
                print(f'=== {name} ===')
                print('--- Error ---')
                print(case.get('errorDetails', ''))
                print('--- Stack Trace ---')
                print(case.get('errorStackTrace', ''))
                print('--- Stdout (last 3000 chars) ---')
                print((case.get('stdout') or '')[-3000:])
                print('--- Stderr (last 3000 chars) ---')
                print((case.get('stderr') or '')[-3000:])
                break
"
```

Available fields per failed test case (from the Jenkins testReport API):
- `className`, `name`: test identifier
- `status`: `FAILED` or `REGRESSION`
- `errorDetails`: error message
- `errorStackTrace`: full stack trace (may be incomplete for subprocess errors)
- `stdout`, `stderr`: full test output (can be large; check these when the stack trace is insufficient)
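When iterating on a saved testReport payload offline (e.g., a JSON file downloaded once instead of re-querying Jenkins), a small helper built on these fields can be handy. This is a hypothetical sketch — the function name and the sample payload are invented; only the field names come from the API description above:

```python
def failed_cases(report: dict) -> list[dict]:
    """Collect FAILED/REGRESSION cases from a Jenkins testReport payload."""
    out = []
    for suite in report.get("suites", []):
        for case in suite.get("cases", []):
            if case.get("status") in ("FAILED", "REGRESSION"):
                out.append(case)
    return out

# Tiny sample payload mimicking the testReport shape described above.
sample = {
    "suites": [
        {"cases": [
            {"className": "tests.unittest.test_x", "name": "test_ok",
             "status": "PASSED"},
            {"className": "tests.unittest.test_x", "name": "test_bad",
             "status": "FAILED", "errorDetails": "AssertionError: ..."},
        ]}
    ]
}

for c in failed_cases(sample):
    print(f'{c["className"]}.{c["name"]}: {c.get("errorDetails", "")}')
```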
| Topic | Path |
|---|---|
| Architecture overview | `docs/source/developer-guide/overview.md` |
| PyTorch backend | `docs/source/torch/arch_overview.md` |
| Adding a new model | `docs/source/torch/adding_new_model.md` |
| AutoDeploy | `docs/source/features/auto_deploy/auto-deploy.md` |
| Disaggregated serving | `docs/source/features/disagg-serving.md` |
| Speculative decoding | `docs/source/features/speculative-decoding.md` |
| Quantization | `docs/source/features/quantization.md` |
| Parallelism strategies | `docs/source/features/parallel-strategy.md` |
| KV cache | `docs/source/features/kvcache.md` |
| API change guidelines | `docs/source/developer-guide/api-change.md` |
| Feature compatibility matrix | `docs/source/features/feature-combination-matrix.md` |
| Supported models | `docs/source/models/supported-models.md` |
| Deployment guides | `docs/source/deployment-guide/` |
| Examples & customization | `docs/source/examples/` |
| Performance analysis | `docs/source/developer-guide/perf-analysis.md` |