AGENTS.md

TensorRT-LLM: open-source library for optimized LLM inference on NVIDIA GPUs. Python and C++ codebase supporting TensorRT engine-based and PyTorch-based execution paths.

If a CLAUDE.local.md file exists alongside this file, read and respect it — it contains developer-specific overrides that supplement this shared guidance.

Rules (Read First)

CRITICAL (YOU MUST):

  • Read and follow CODING_GUIDELINES.md for ALL code changes (C++ and Python)
  • NVIDIA copyright header on ALL new files (update year on modified files)
  • git commit -s (DCO sign-off required). Never attribute AI tools in the sign-off line. Always let git add the sign-off via -s rather than writing the Signed-off-by line into the commit message by hand.
  • Do not add co-authors to the git commit message unless explicitly instructed to do so by the user.
  • pre-commit hooks run on commit — if files are modified by hooks, re-stage and commit again
  • PR title format: [JIRA/NVBUG/None][type] description (e.g., [TRTLLM-5516][perf] optimize cuda graph padding)
  • Set LLM_MODELS_ROOT env var when running tests that need model weights
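
The sign-off rule above can be verified in a throwaway repository — `git commit -s` appends the `Signed-off-by:` trailer automatically from your git identity (the name/email below are placeholder values, not project requirements):

```shell
# Demonstrate that `git commit -s` adds the DCO trailer automatically.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Dev Example"          # placeholder identity
git config user.email "dev@example.com"
echo demo > file.txt
git add file.txt
git commit -q -s -m "[None][chore] demo commit"
# The trailer appears in the commit body without being typed manually:
git log -1 --format=%B | grep "Signed-off-by:"
```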

Common Commands

| Task | Command |
|---|---|
| Unit tests | `pytest tests/unittest/` |
| Specific test | `pytest tests/unittest/llmapi/test_llm_args.py` |
| Pattern match | `pytest tests/unittest -k "test_llm_args"` |
| Integration tests | `LLM_MODELS_ROOT=/path/to/models pytest tests/integration/defs/...` |
| Serve model | `trtllm-serve <hf_model> --port 8000` |
| Serve with config | `trtllm-serve <hf_model> --config config.yaml` |
| Benchmark | `trtllm-bench --model <hf_model> throughput --dataset <path>` |
| Find CI stage for test | `python scripts/test_to_stage_mapping.py --tests "test_name"` |

Installation & Build

Building TensorRT-LLM requires Docker and may involve compiling C++ components. See the build-from-source docs for full instructions, or install pre-built wheels with pip. For container images, see NGC containers.

Reference Configs

examples/configs/database/ contains 170+ Pareto-optimized serving configurations across multiple models, GPUs, ISL/OSL combinations, and concurrency levels. Use these as starting points for deployment and benchmarking rather than hand-tuning parameters. See deployment guides for model-specific walkthroughs.

Architecture

See architecture diagram for the full Mermaid diagram.

Backends

| Backend | Status | Entry Point | Key Path |
|---|---|---|---|
| PyTorch | Default | `LLM(backend="pytorch")` | `_torch/pyexecutor/`: `PyExecutor` → PyTorch Engine |
| AutoDeploy | Beta | `LLM(backend="_autodeploy")` | `_torch/auto_deploy/`: `ADExecutor` → graph transforms + `torch.export` |
| TensorRT | Legacy | `LLM(backend="tensorrt")` | `builder.py`: `trtllm.Executor` → TensorRT Engine |

Shared C++ Core (via Nanobind)

Both PyTorch and TensorRT backends share these C++ components:

  • Scheduling pipeline: Scheduler → BatchManager (in-flight batching) → KV Cache Manager
  • Decoding pipeline: Decoder (token generation orchestration) → Sampling

Request Flow

```
HuggingFace Model → LLM API → Executor (PyTorch/AutoDeploy/TensorRT)
    → Scheduler → Model Forward → Decoder → Sampling → Generated Tokens
```

Serving

  • trtllm-serve: OpenAI-compatible REST + gRPC server, supports all backends
  • Disaggregated serving: separates prefill (context) and decode (generation) across GPUs
    • KV cache exchange via NIXL (default), UCX, or MPI

Key Files

| File | Role |
|---|---|
| `tensorrt_llm/llmapi/llm.py` | Main API entry point |
| `tensorrt_llm/llmapi/llm_args.py` | Complete configuration schema (Pydantic) |
| `tensorrt_llm/llmapi/llm_utils.py` | Model loading, model-specific default overrides |
| `tensorrt_llm/models/modeling_utils.py` | Base classes for all models (`PretrainedConfig`, `PretrainedModel`) |
| `tensorrt_llm/executor/executor.py` | Execution abstraction (`GenerationExecutor`) |
| `tensorrt_llm/models/automodel.py` | Auto-discovery and model registry |

Design Patterns

| Pattern | Key Points |
|---|---|
| Config hierarchy | `LlmArgs` → `TrtLlmArgs` / `TorchLlmArgs`; model-specific defaults override generics; Pydantic validation |
| Model architecture | Each model: `Config` (inherits `PretrainedConfig`) + `ForCausalLM` (inherits `PretrainedModel`) |
| Model defaults | Architecture-specific overrides in `llm_utils.py` (attention kernels, quant, spec decoding, cache) |
| Distributed execution | Tensor/pipeline parallelism via `Mapping` class; multiple backends (MPI, Ray, RPC) |
| Auto-discovery | Models self-register via `automodel.py`, resolved by the HF config `architectures` field |
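
The auto-discovery row above can be illustrated with a minimal stdlib-only sketch of the self-registration pattern: model classes register under their class name, and a factory resolves the class from the HF config's `architectures` field. The names here (`MODEL_REGISTRY`, `register`, `from_hf_config`) are hypothetical illustrations, not the actual `automodel.py` API.

```python
# Sketch of a self-registering model registry keyed by class name.
MODEL_REGISTRY = {}

def register(cls):
    """Register a model class under its own name (the HF architecture name)."""
    MODEL_REGISTRY[cls.__name__] = cls
    return cls

@register
class LlamaForCausalLM:
    def __init__(self, config):
        self.config = config

def from_hf_config(hf_config):
    """Resolve the model class from the HF config's 'architectures' field."""
    arch = hf_config["architectures"][0]
    return MODEL_REGISTRY[arch](hf_config)

model = from_hf_config({"architectures": ["LlamaForCausalLM"]})
print(type(model).__name__)  # LlamaForCausalLM
```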

Anti-Patterns / Gotchas

  • Pre-commit modifies files in-place — if hooks fail, files are already modified. Re-stage (git add) and commit again.
  • Protected APIs exist — changes to LLM API signatures will fail tests/api_stability tests. Get code owner review.
  • Integration tests need GPUs + models — always set LLM_MODELS_ROOT and ensure GPU access. Unit tests don't.
  • Copyright year — update to current year when modifying existing files; add full header to new files.
  • Avoid broad exception handling — catch specific exceptions, not bare except: (see CODING_GUIDELINES.md).
  • One concern per PR — avoid scope creep. If a PR touches unrelated areas, split it.
  • User-facing configuration classes — when editing or defining any user-facing configuration classes (particularly LlmArgs or any class used in its fields), you MUST follow the Pydantic guidelines in CODING_GUIDELINES.md.
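
To illustrate the exception-handling bullet above: a bare `except:` swallows everything (including `KeyboardInterrupt` and `SystemExit`), while catching only the error you expect keeps unrelated failures visible. A minimal sketch (the `load_config` helper is hypothetical):

```python
def load_config(path):
    """Read a config file, returning None only when the file is missing."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:  # specific: only the failure we expect here
        return None

print(load_config("/nonexistent/config.yaml"))  # None
```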

Development Workflow

  1. Set up build environment (see installation docs)
  2. Make changes following CODING_GUIDELINES.md
  3. Test locally with pytest

Branching policy and PRs

  • The main repository (upstream) is located at https://github.com/NVIDIA/TensorRT-LLM/
  • Branches should always be pushed to the user-specified fork (usually origin)
  • If the push fails because pre-push pre-commit hooks were updated, simply re-push immediately
  • PRs should be opened on the main repository
    • PR title format: [JIRA/NVBUG/None][type] description (e.g., [TRTLLM-5516][perf] optimize cuda graph padding)
    • Target main unless fixing a release branch bug
    • See CONTRIBUTING.md for full PR policies
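
The PR title convention above can be checked mechanically. The regex below is an illustrative sketch inferred from the single example given; the full set of allowed ticket formats and type keywords is an assumption, not an official list.

```python
import re

# Hypothetical validator for "[JIRA/NVBUG/None][type] description".
PR_TITLE = re.compile(r"^\[(TRTLLM-\d+|None|[A-Za-z]+-?\d+)\]\[[a-z]+\] \S.*$")

print(bool(PR_TITLE.match("[TRTLLM-5516][perf] optimize cuda graph padding")))  # True
print(bool(PR_TITLE.match("optimize cuda graph padding")))                      # False
```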

CI / Testing

See CI overview for full details.

| Layer | Location | Notes |
|---|---|---|
| Unit tests | `tests/unittest/` | Run in pre-merge CI; some tests require GPU |
| API stability | `tests/api_stability/` | Protects committed API signatures |
| Integration tests | `tests/integration/defs/` | Requires GPU + `LLM_MODELS_ROOT` |
| Test lists | `tests/integration/test_lists/test-db/` | Per-GPU YAML files (`l0_a10.yml`, `l0_h100.yml`, etc.) |
| Test waives | `tests/integration/test_lists/waives.txt` | Skip known-failing tests with NVBug links |
| Performance | See benchmarking guide | `trtllm-bench` and `trtllm-serve` benchmarks |

Triggering CI

CI is triggered by posting comments on the PR. Basic commands:

  • /bot run — trigger the standard CI pipeline
  • /bot run --disable-fail-fast — run all stages even if earlier ones fail (only add when explicitly needed)
  • /bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" — include AutoDeploy CI stages (use for AutoDeploy-related PRs)

For a full list of up-to-date bot commands, post /bot help as a PR comment and check the bot's reply.

Retrieving CI Test Failures from a PR

CI tests run on internal NVIDIA Jenkins infrastructure (blossom-ci). To retrieve failed test cases from a PR:

Step 1: Get the Jenkins build number from PR comments

The CI bot (tensorrt-cicd) posts comments with links to the Jenkins build. Extract the L0_MergeRequest_PR build number:

```shell
PR_NUM=<pr_number>
BUILD_NUM=$(gh api "repos/NVIDIA/TensorRT-LLM/issues/${PR_NUM}/comments" --jq \
  '[.[] | select(.user.login == "tensorrt-cicd") | select(.body | test("L0_MergeRequest_PR"))] | last | .body' \
  | grep -oP 'L0_MergeRequest_PR/\K\d+')
```
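
The extraction in the pipeline above can be reproduced offline in Python; the comment body below is a fabricated sample for demonstration only:

```python
import re

# Fabricated sample of a CI bot comment body containing a build link.
body = "Pipeline started: https://jenkins.example/job/L0_MergeRequest_PR/12345/"

# Same pattern the grep above uses: the digits after "L0_MergeRequest_PR/".
m = re.search(r"L0_MergeRequest_PR/(\d+)", body)
print(m.group(1))  # 12345
```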

Step 2: Query the Jenkins testReport API for failures

Resolve the Jenkins base URL dynamically from the internal shortcut (requires corporate network):

```shell
JENKINS_BASE="$(curl -skI 'https://nv/trt-llm-cicd' 2>/dev/null | grep -i '^location:' | sed 's/^[Ll]ocation: *//;s/[[:space:]]*$//')job/main/job/L0_MergeRequest_PR"
curl -s "${JENKINS_BASE}/${BUILD_NUM}/testReport/api/json" | python3 -c "
import json, sys
data = json.load(sys.stdin)
print(f'Summary: {data[\"passCount\"]} passed, {data[\"failCount\"]} failed, {data[\"skipCount\"]} skipped')
failed = []
for suite in data.get('suites', []):
    for case in suite.get('cases', []):
        if case.get('status') in ('FAILED', 'REGRESSION'):
            failed.append(case)
if not failed:
    print('No test failures!')
else:
    print(f'Failed tests ({len(failed)}):')
    for f in failed:
        print(f'  - {f[\"className\"]}.{f[\"name\"]}')
        err = (f.get('errorDetails') or '')[:200]
        if err:
            print(f'    Error: {err}')
"
```

Step 3 (if needed): Get full stdout/stderr for a specific failure

The errorStackTrace can be incomplete when errors originate from subprocesses. In that case, fetch stdout and stderr for the specific test case to find the real error:

```shell
curl -s "${JENKINS_BASE}/${BUILD_NUM}/testReport/api/json" | python3 -c "
import json, sys
data = json.load(sys.stdin)
for suite in data.get('suites', []):
    for case in suite.get('cases', []):
        if case.get('status') in ('FAILED', 'REGRESSION'):
            name = f'{case[\"className\"]}.{case[\"name\"]}'
            if '<search_term>' in name:
                print(f'=== {name} ===')
                print('--- Error ---')
                print(case.get('errorDetails', ''))
                print('--- Stack Trace ---')
                print(case.get('errorStackTrace', ''))
                print('--- Stdout (last 3000 chars) ---')
                print((case.get('stdout') or '')[-3000:])
                print('--- Stderr (last 3000 chars) ---')
                print((case.get('stderr') or '')[-3000:])
                break
"
```

Available fields per failed test case (from Jenkins testReport API):

  • className, name: test identifier
  • status: FAILED or REGRESSION
  • errorDetails: error message
  • errorStackTrace: full stack trace (may be incomplete for subprocess errors)
  • stdout, stderr: full test output (can be large, check these when stack trace is insufficient)

Key Documentation

| Topic | Path |
|---|---|
| Architecture overview | `docs/source/developer-guide/overview.md` |
| PyTorch backend | `docs/source/torch/arch_overview.md` |
| Adding a new model | `docs/source/torch/adding_new_model.md` |
| AutoDeploy | `docs/source/features/auto_deploy/auto-deploy.md` |
| Disaggregated serving | `docs/source/features/disagg-serving.md` |
| Speculative decoding | `docs/source/features/speculative-decoding.md` |
| Quantization | `docs/source/features/quantization.md` |
| Parallelism strategies | `docs/source/features/parallel-strategy.md` |
| KV cache | `docs/source/features/kvcache.md` |
| API change guidelines | `docs/source/developer-guide/api-change.md` |
| Feature compatibility matrix | `docs/source/features/feature-combination-matrix.md` |
| Supported models | `docs/source/models/supported-models.md` |
| Deployment guides | `docs/source/deployment-guide/` |
| Examples & customization | `docs/source/examples/` |
| Performance analysis | `docs/source/developer-guide/perf-analysis.md` |