Description
System Info
- CPU architecture: x86_64
- GPU: NVIDIA H100 80GB
- TensorRT-LLM backend tag: v0.17.0
- Container used: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
- OS: Debian GNU/Linux 11 (bullseye)
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Build the model:
Start the container:
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v </path/to/git/tensorrtllm_backend>:/tensorrtllm_backend \
-v </path/to/engines>:/model/engine \
-v </path/to/hf-checkpoint>:/model/src \
nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
Quantize the model:
cd /tensorrtllm_backend/tensorrt_llm/examples/quantization;
python quantize.py \
--model_dir /model/src \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir /model/build
Build:
trtllm-build \
--checkpoint_dir /model/build \
--output_dir /model/engine \
--gpt_attention_plugin auto \
--gemm_plugin fp8 \
--gemm_swiglu_plugin fp8 \
--low_latency_gemm_swiglu_plugin fp8 \
--remove_input_padding enable \
--context_fmha enable \
--max_beam_width 1 \
--max_num_tokens 1000 \
--max_seq_len 250 \
--max_input_len 200 \
--max_batch_size 4 \
--use_fused_mlp enable \
--use_fp8_context_fmha enable \
--use_paged_context_fmha enable \
--speculative_decoding_mode lookahead_decoding \
--max_draft_len 15
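Optionally, the config.json that trtllm-build writes next to the engine can be inspected to confirm the speculative decoding mode was recorded. This is only a sanity-check sketch; the exact key names may differ between TensorRT-LLM versions:
import json

# Inspect the build configuration recorded by trtllm-build. Key names are not
# guaranteed across versions; this just prints anything that looks related to
# speculative/lookahead decoding.
with open("/model/engine/config.json") as f:
    cfg = json.load(f)

for key, value in cfg.get("build_config", {}).items():
    if "speculative" in key or "draft" in key or "lookahead" in str(value).lower():
        print(key, "=", value)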
Adapt model repo:
Add the following to config.pbtxt:
parameters: {
key: "decoding_mode"
value: {
string_value: "lookahead"
}
}
Run with Tritonserver:
Start the container:
docker run --rm -it --net host --shm-size=2g \
--ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v <path/to/model>:/models \
nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
Start tritonserver:
tritonserver --model-repository=/models
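Before sending inference requests, the standard Triton HTTP/REST (KServe v2) readiness endpoints can be used to confirm the server and the model (named tensorrt_llm_2beam in the script below) are up:
import requests

# Standard Triton readiness checks: server-level and model-level.
base = "http://localhost:8000"
print(requests.get(f"{base}/v2/health/ready").status_code)                     # 200 when the server is ready
print(requests.get(f"{base}/v2/models/tensorrt_llm_2beam/ready").status_code)  # 200 when the model is ready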
Run Inference
Use the following Python script:
import requests

response = requests.post(
    "http://localhost:8000/v2/models/tensorrt_llm_2beam/infer",
    json={
        "inputs": [
            {
                "name": "input_ids",
                "shape": [1, 4],
                "datatype": "INT32",
                "data": [[750, 23811, 31792, 4555]],  # "def hello_world():"
            },
            {
                "name": "input_lengths",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [[4]],
            },
            {
                "name": "request_output_len",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [[20]],
            },
        ]
    },
)

try:
    response.raise_for_status()
    print(response.json())
except requests.exceptions.RequestException:
    print(response.json()["error"])
Expected behavior
Successfully infer and print the generated tokens ("output_ids" in response.json()).
actual behavior
response.raise_for_status() raises an HTTPError (a RequestException subclass) with status 500, and the response body contains the following error:
Executor failed process requestId 15 due to the following error: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: requests[bi].lookaheadRuntimeConfig (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtime/gptDecoder.cpp:218)
1 0x7f0e8b6bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95
2 0x7f0e8ba92022 tensorrt_llm::runtime::GptDecoder<__half>::setup(tensorrt_llm::runtime::SamplingConfig const&, unsigned long, std::shared_ptr<tensorrt_llm::runtime::ITensor const> const&, std::optional<tensorrt_llm::runtime::DecodingOutput> const&, std::optional<std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const> const&) + 3074
3 0x7f0e8baa5d0e tensorrt_llm::runtime::GptDecoderBatched::newRequests(std::vector<int, std::allocator<int> > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&, tensorrt_llm::runtime::ModelConfig const&) + 590
4 0x7f0e8c4f87f5 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::shared_ptr<tensorrt_llm::batch_manager::RuntimeBuffers>&) + 1717
5 0x7f0e8c4fbf40 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1792
6 0x7f0e8c594189 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 457
7 0x7f0e8c5a07df tensorrt_llm::executor::Executor::Impl::executionLoop() + 1247
8 0x7f1130391db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f1130391db4]
9 0x7f113012fa94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7f113012fa94]
10 0x7f11301bca34 __clone + 68
additional notes
It seems Triton expects a lookaheadRuntimeConfig for each request, which I assume corresponds to the lookahead parameters window_size, ngram_size, and verification_set_size in some form.
However, I could not find a reference for how to pass them at inference time,
nor how to declare them as inputs in config.pbtxt in the model repository.
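For reference, the three parameters appear to be exposed as an explicit config object in the TensorRT-LLM Python LLM API. The sketch below is my assumption about that surface (class and argument names unverified for v0.17); what is missing is the equivalent way to supply them per request through the Triton backend:
# Sketch only (unverified assumption about the v0.17 llmapi surface): the three
# lookahead parameters expressed as a config object outside of Triton.
from tensorrt_llm.llmapi import LookaheadDecodingConfig

lookahead_config = LookaheadDecodingConfig(
    max_window_size=4,            # window_size (W)
    max_ngram_size=3,             # ngram_size (N)
    max_verification_set_size=4,  # verification_set_size (G)
)
print(lookahead_config)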