Description
System Info
- CPU architecture: x86_64
- GPU: NVIDIA H100 80GB
- TensorRT-LLM backend tag: v0.17.0
- Container used: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
- OS: Debian GNU/Linux 11 (bullseye)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Build the model:
Start the container:
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v </path/to/git/tensorrtllm_backend>:/tensorrtllm_backend \
-v </path/to/engines>:/model/engine \
-v </path/to/hf-checkpoint>:/model/src \
nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
Quantize the model:
cd /tensorrtllm_backend/tensorrt_llm/examples/quantization;
python quantize.py \
--model_dir /model/src \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir /model/build
Build:
trtllm-build \
--checkpoint_dir /model/build \
--output_dir /model/engine \
--gpt_attention_plugin auto \
--gemm_plugin fp8 \
--gemm_swiglu_plugin fp8 \
--low_latency_gemm_swiglu_plugin fp8 \
--remove_input_padding enable \
--context_fmha enable \
--max_beam_width 1 \
--max_num_tokens 1000 \
--max_seq_len 250 \
--max_input_len 200 \
--max_batch_size 4 \
--use_fused_mlp enable \
--use_fp8_context_fmha enable \
--use_paged_context_fmha enable \
--speculative_decoding_mode lookahead_decoding \
--max_draft_len 15
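Optionally, the config.json that trtllm-build writes next to the engine can be inspected to confirm the speculative decoding mode was recorded. This is only a sanity-check sketch; the exact key names may differ between TensorRT-LLM versions:
import json

# Inspect the build configuration recorded by trtllm-build. Key names are not
# guaranteed across versions; this just prints anything that looks related to
# speculative/lookahead decoding.
with open("/model/engine/config.json") as f:
    cfg = json.load(f)

for key, value in cfg.get("build_config", {}).items():
    if "speculative" in key or "draft" in key or "lookahead" in str(value).lower():
        print(key, "=", value)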
Adapt model repo:
Add the following to config.pbtxt:
parameters: {
key: "decoding_mode"
value: {
string_value: "lookahead"
}
}
Run with Tritonserver:
Start the container:
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v <path/to/model>:/models \
nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
Start tritonserver:
tritonserver --model-repository=/models
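Before sending inference requests, the standard Triton HTTP/REST (KServe v2) readiness endpoints can be used to confirm the server and the model (named tensorrt_llm_2beam in the script below) are up:
import requests

# Standard Triton readiness checks: server-level and model-level.
base = "http://localhost:8000"
print(requests.get(f"{base}/v2/health/ready").status_code)                     # 200 when the server is ready
print(requests.get(f"{base}/v2/models/tensorrt_llm_2beam/ready").status_code)  # 200 when the model is ready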
Run Inference
Use the following Python script:
import requests

response = requests.post(
    "http://localhost:8000/v2/models/tensorrt_llm_2beam/infer",
    json={
        "inputs": [
            {
                "name": "input_ids",
                "shape": [1, 4],
                "datatype": "INT32",
                "data": [[750, 23811, 31792, 4555]],  # "def hello_world():"
            },
            {
                "name": "input_lengths",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [[4]],
            },
            {
                "name": "request_output_len",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [[20]],
            },
        ]
    },
)

try:
    response.raise_for_status()
    print(response.json())
except requests.exceptions.RequestException:
    print(response.json()["error"])
Expected behavior
Successfully infer and print the generated tokens ("output_ids" in response.json()).
actual behavior
response.raise_for_status() raises an HTTPError (a RequestException subclass) with status 500, and the response body contains the following error:
Executor failed process requestId 15 due to the following error: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: requests[bi].lookaheadRuntimeConfig (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtime/gptDecoder.cpp:218)
1 0x7f0e8b6bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95
2 0x7f0e8ba92022 tensorrt_llm::runtime::GptDecoder<__half>::setup(tensorrt_llm::runtime::SamplingConfig const&, unsigned long, std::shared_ptr<tensorrt_llm::runtime::ITensor const> const&, std::optional<tensorrt_llm::runtime::DecodingOutput> const&, std::optional<std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const> const&) + 3074
3 0x7f0e8baa5d0e tensorrt_llm::runtime::GptDecoderBatched::newRequests(std::vector<int, std::allocator<int> > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&, tensorrt_llm::runtime::ModelConfig const&) + 590
4 0x7f0e8c4f87f5 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::shared_ptr<tensorrt_llm::batch_manager::RuntimeBuffers>&) + 1717
5 0x7f0e8c4fbf40 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1792
6 0x7f0e8c594189 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 457
7 0x7f0e8c5a07df tensorrt_llm::executor::Executor::Impl::executionLoop() + 1247
8 0x7f1130391db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f1130391db4]
9 0x7f113012fa94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7f113012fa94]
10 0x7f11301bca34 __clone + 68
additional notes
It seems Triton expects a lookaheadRuntimeConfig for each request, which I assume corresponds to the lookahead parameters window_size, ngram_size, and verification_set_size in some form.
However, I could not find a reference for how to pass them at inference time,
nor how to declare them as inputs in config.pbtxt in the model repository.
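For reference, the three parameters appear to be exposed as an explicit config object in the TensorRT-LLM Python LLM API. The sketch below is my assumption about that surface (class and argument names unverified for v0.17); what is missing is the equivalent way to supply them per request through the Triton backend:
# Sketch only (unverified assumption about the v0.17 llmapi surface): the three
# lookahead parameters expressed as a config object outside of Triton.
from tensorrt_llm.llmapi import LookaheadDecodingConfig

lookahead_config = LookaheadDecodingConfig(
    max_window_size=4,            # window_size (W)
    max_ngram_size=3,             # ngram_size (N)
    max_verification_set_size=4,  # verification_set_size (G)
)
print(lookahead_config)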