Description
System Info
- CPU architecture: x86_64
- GPU: NVIDIA H100 80GB
- TensorRT-LLM backend tag: v0.17.0
- Container used: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
- OS: Debian GNU/Linux 11 (bullseye)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Build the model:
Start the container:
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v </path/to/git/tensorrtllm_backend>:/tensorrtllm_backend \
-v </path/to/engines>:/model/engine \
-v <path/to/hf-checkpoint>:/model/src \
nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
Quantize the model:
cd /tensorrtllm_backend/tensorrt_llm/examples/quantization;
python quantize.py \
--model_dir /model/src \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir /model/build
Build:
trtllm-build \
--checkpoint_dir /model/build \
--output_dir /model/engine \
--gpt_attention_plugin auto \
--gemm_plugin fp8 \
--gemm_swiglu_plugin fp8 \
--low_latency_gemm_swiglu_plugin fp8 \
--remove_input_padding enable \
--context_fmha enable \
--max_beam_width 1 \
--max_num_tokens 1000 \
--max_seq_len 250 \
--max_input_len 200 \
--max_batch_size 4 \
--use_fused_mlp enable \
--use_fp8_context_fmha enable \
--use_paged_context_fmha enable \
--speculative_decoding_mode lookahead_decoding \
--max_draft_len 39
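For reference, the TensorRT-LLM speculative-decoding docs derive --max_draft_len for lookahead decoding from the (windows_size, ngram_size, verification_set_size) tuple. A minimal sketch of that formula; the (5, 5, 5) tuple below is only an assumption, picked because it is one configuration that yields 39:

# Formula from the TensorRT-LLM speculative-decoding docs: the draft
# length implied by a lookahead (W, N, G) configuration.
def max_draft_len(windows_size, ngram_size, verification_set_size):
    return (0 if ngram_size == 1 else ngram_size - 2) \
        + (windows_size - 1 + verification_set_size) * (ngram_size - 1)

# Illustrative only: (W, N, G) = (5, 5, 5) is one tuple that yields 39,
# matching the --max_draft_len passed to trtllm-build above.
assert max_draft_len(5, 5, 5) == 39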
Adapt model repo:
Add the following to the model's config.pbtxt:
parameters: {
  key: "decoding_mode"
  value: {
    string_value: "lookahead"
  }
}
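As a sanity check that the engine and the Triton config agree, the build parameters recorded next to the engine can be inspected. A minimal sketch, assuming trtllm-build wrote a config.json into /model/engine and that it uses these key names:

# Hedged sketch: the "build_config", "max_draft_len" and
# "speculative_decoding_mode" key names are assumptions about the
# config.json that trtllm-build writes alongside the engine.
import json

with open("/model/engine/config.json") as f:
    build_config = json.load(f)["build_config"]

print("max_draft_len:", build_config.get("max_draft_len"))
print("speculative_decoding_mode:", build_config.get("speculative_decoding_mode"))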
Run with Tritonserver:
Start the container:
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v <path/to/model>:/models \
nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
Start Tritonserver:
tritonserver --model-repository=/models
Expected behavior
Tritonserver should start successfully, and model inference should be available.
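Once the server is up, readiness can be confirmed with the tritonclient Python package (a minimal sketch; the model name tensorrt_llm_2beam is taken from the log below, and the default HTTP port 8000 is assumed):

# Probe Triton's health endpoints; assumes the default HTTP port 8000
# and the model name that appears in the error log below.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("tensorrt_llm_2beam"))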
Actual behavior
Tritonserver fails to start with the following assertion error:
E0218 20:57:33.147956 130 model_lifecycle.cc:654] "failed to load 'tensorrt_llm_2beam' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: 16 != 40 (/workspace/tensorrt_llm/cpp/tensorrt_llm/layers/lookaheadDecodingLayer.cpp:56)\n1 0x7ff34f6bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 95\n2 0x7ff34f9d890c tensorrt_llm::layers::LookaheadDecodingLayer<__half>::CpuAlgorithmResources::CpuAlgorithmResources(tensorrt_llm::layers::DecoderDomain const&) + 4396\n3 0x7ff34f9d90c1 tensorrt_llm::layers::LookaheadDecodingLayer<__half>::LookaheadDecodingLayer(tensorrt_llm::layers::DecoderDomain const&, std::shared_ptr<tensorrt_llm::runtime::BufferManager>) + 241\n4 0x7ff34f97e862 tensorrt_llm::layers::DecodingLayer<__half>::DecodingLayer(tensorrt_llm::executor::DecodingMode const&, tensorrt_llm::layers::DecoderDomain const&, std::shared_ptr<tensorrt_llm::runtime::BufferManager>) + 978\n5 0x7ff34f994c88 tensorrt_llm::layers::DynamicDecodeLayer<__half>::initializeLayers() + 872\n6 0x7ff34f995bf9 tensorrt_llm::layers::DynamicDecodeLayer<__half>::initialize() + 1321\n7 0x7ff34f995dfa tensorrt_llm::layers::DynamicDecodeLayer<__half>::DynamicDecodeLayer(tensorrt_llm::executor::DecodingMode const&, tensorrt_llm::layers::DecoderDomain const&, std::shared_ptr<tensorrt_llm::runtime::BufferManager>) + 202\n8 0x7ff34fa8da0b tensorrt_llm::runtime::GptDecoder<__half>::GptDecoder(tensorrt_llm::executor::DecodingMode const&, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, std::shared_ptr<tensorrt_llm::runtime::CudaStream> const&, std::shared_ptr<tensorrt_llm::runtime::SpeculativeDecodingModule const>) + 603\n9 0x7ff34fa9a1bc tensorrt_llm::runtime::GptDecoderBatched::setup(tensorrt_llm::executor::DecodingMode const&, int, int, int, int, int, int, nvinfer1::DataType, tensorrt_llm::runtime::ModelConfig const&) + 3372\n10 0x7ff3504e7a99 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createDecoder(std::optional<tensorrt_llm::executor::DecodingMode> const&) + 825\n11 0x7ff3504fdec0 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptrnvinfer1::ILogger, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3168\n12 0x7ff350476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489\n13 0x7ff350597369 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n14 0x7ff3505979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optionalstd::filesystem::__cxx11::path const&, std::optional<std::basic_string_view<unsigned char, std::char_traits > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits, 
std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n15 0x7ff350598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optionalstd::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n16 0x7ff35057e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n17 0x7ff5e803588e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7ff5e803588e]\n18 0x7ff5e8032049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n19 0x7ff5e8032592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n20 0x7ff5e801f929 TRITONBACKEND_ModelInstanceInitialize + 153\n21 0x7ff5f6bd7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7ff5f6bd7649]\n22 0x7ff5f6bd80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7ff5f6bd80d2]\n23 0x7ff5f6bbdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7ff5f6bbdcf3]\n24 0x7ff5f6bbe0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7ff5f6bbe0a4]\n25 0x7ff5f6bc768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7ff5f6bc768d]\n26 0x7ff5f6134ec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7ff5f6134ec3]\n27 0x7ff5f6bb4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7ff5f6bb4f02]\n28 0x7ff5f6bc2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7ff5f6bc2ddc]\n29 0x7ff5f6bc6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7ff5f6bc6e12]\n30 0x7ff5f6cc78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7ff5f6cc78e1]\n31 0x7ff5f6ccac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7ff5f6ccac3c]\n32 0x7ff5f6e27305 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f1305) [0x7ff5f6e27305]\n33 0x7ff5f6391db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7ff5f6391db4]\n34 0x7ff5f612fa94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7ff5f612fa94]\n35 0x7ff5f61bca34 __clone + 68"
I0218 20:57:33.148431 130 model_lifecycle.cc:789] "failed to load 'tensorrt_llm_2beam'"
I0218 20:57:33.148569 130 server.cc:604]
Additional notes
Changing --max_draft_len to 15 allows Tritonserver to start, but this prevents selecting the desired max_draft_len value.
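A plausible reading of the failing assertion, under the assumption that both sides of "16 != 40" are max-decoding-token counts (max_draft_len + 1): the engine was built for 39 draft tokens, while the runtime appears to fall back to a default lookahead configuration worth 15 draft tokens when none is supplied. A worked check of that arithmetic:

# Assumption: both sides of the assertion are max_draft_len + 1.
engine_tokens  = 39 + 1   # --max_draft_len 39 passed to trtllm-build
runtime_tokens = 15 + 1   # draft length that makes the server start
assert (runtime_tokens, engine_tokens) == (16, 40)

# Note: with the docs formula shown after the build step, a lookahead
# configuration of (W, N, G) = (4, 3, 4) yields exactly 15, so the
# runtime default may correspond to such a tuple (an assumption).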