Missing lookAheadRuntimeConfig in Triton Server with TensorRT-LLM backend HTTP Request #711

@shaylapid

Description

System Info

  • CPU architecture: x86_64
  • GPU NVIDIA H100 80GB
  • TensorRT-LLM backend tag: v0.17.0
  • Container used: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
  • OS Debian GNU/Linux 11 (bullseye)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Build the model:

Start the container:

docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v </path/to/git/tensorrtllm_backend>:/tensorrtllm_backend \
    -v </path/to/engines>:/model/engine \
    -v </path/to/hf-checkpoint>:/model/src \
    nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3

Quantize the model:

cd /tensorrtllm_backend/tensorrt_llm/examples/quantization;
python quantize.py \
    --model_dir /model/src  \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir /model/build

Build the engine:

trtllm-build \
    --checkpoint_dir /model/build \
    --output_dir /model/engine \
    --gpt_attention_plugin auto \
    --gemm_plugin fp8 \
    --gemm_swiglu_plugin fp8 \
    --low_latency_gemm_swiglu_plugin fp8 \
    --remove_input_padding enable \
    --context_fmha enable \
    --max_beam_width 1 \
    --max_num_tokens 1000 \
    --max_seq_len 250 \
    --max_input_len 200 \
    --max_batch_size 4 \
    --use_fused_mlp enable \
    --use_fp8_context_fmha enable \
    --use_paged_context_fmha enable \
    --speculative_decoding_mode lookahead_decoding \
    --max_draft_len 15
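
As a sanity check, the engine directory can be inspected to confirm that lookahead decoding was actually baked into the build. The sketch below is an assumption, not an official tool: it loads the config.json that trtllm-build writes next to the engine and prints every key that mentions speculative, lookahead, or draft, since the exact key names vary between releases.

import json

# Sketch: grep the engine's config.json for anything related to
# speculative/lookahead decoding. Key names are not guaranteed and may
# differ between TensorRT-LLM releases, hence the generic search.
with open("/model/engine/config.json") as f:
    cfg = json.load(f)

def find_keys(obj, needles, path=""):
    # Recursively collect (path, value) pairs whose key mentions a needle.
    hits = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            p = f"{path}.{key}" if path else key
            if any(n in key.lower() for n in needles):
                hits.append((p, value))
            hits.extend(find_keys(value, needles, p))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            hits.extend(find_keys(value, needles, f"{path}[{i}]"))
    return hits

for path, value in find_keys(cfg, ("speculative", "lookahead", "draft")):
    print(path, "=", value)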

Adapt the model repo:

Add the following to config.pbtxt:

parameters: {
  key: "decoding_mode"
  value: {
    string_value: "lookahead"
  }
}
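
For reference, a minimal way to apply that edit from a script (the path and the model name tensorrt_llm_2beam are assumptions taken from this repro; adjust to your model repo layout):

# Sketch: append the decoding_mode parameter block to the model's
# config.pbtxt. Path and model name are assumptions for this repro.
PBTXT = "/models/tensorrt_llm_2beam/config.pbtxt"

BLOCK = """
parameters: {
  key: "decoding_mode"
  value: {
    string_value: "lookahead"
  }
}
"""

with open(PBTXT, "a") as f:
    f.write(BLOCK)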

Run with Triton Server:

Start the container:

docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v <path/to/model>:/models \
    nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3

Start Triton Server:

tritonserver --model-repository=/models
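
Before sending inference requests, the standard KServe v2 HTTP endpoints can be used to confirm that the server is up and the model is loaded (the model name tensorrt_llm_2beam matches the inference script below):

import requests

BASE = "http://localhost:8000"

# Server liveness and readiness (KServe v2 endpoints exposed by Triton).
print(requests.get(f"{BASE}/v2/health/live").status_code)   # expect 200
print(requests.get(f"{BASE}/v2/health/ready").status_code)  # expect 200

# Readiness of the specific model used below.
print(requests.get(f"{BASE}/v2/models/tensorrt_llm_2beam/ready").status_code)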

Run inference:

Using the following Python script:

import requests

response = requests.post(
    "http://localhost:8000/v2/models/tensorrt_llm_2beam/infer",
    json={
        "inputs": [
            {
                "name": "input_ids",
                "shape": [1, 4],
                "datatype": "INT32",
                "data": [[750, 23811, 31792, 4555]],  # "def hello_world():"
            },
            {
                "name": "input_lengths",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [[4]],
            },
            {
                "name": "request_output_len",
                "shape": [1, 1],
                "datatype": "INT32",
                "data": [[20]],
            },
        ]
    },
)
try:
    response.raise_for_status()
    print(response.json())
except requests.exceptions.RequestException as e:
    print(response.json()["error"])

Expected behavior

The request succeeds and the script prints the generated tokens ("output_ids" in response.json()).
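
Continuing the script above, pulling the tokens out of a successful response would look roughly like this (assuming the standard KServe v2 JSON response layout, where each output carries a name, shape, and flattened data array):

result = response.json()

# Pick the output tensor named "output_ids" out of the v2 response.
output_ids = next(o for o in result["outputs"] if o["name"] == "output_ids")
print(output_ids["shape"], output_ids["data"])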

Actual behavior

response.raise_for_status() raises a RequestException (HTTP status 500), and the server reports the following error:

Executor failed process requestId 15 due to the following error: Encountered an error in forwardAsync function: [TensorRT-LLM][ERROR] Assertion failed: requests[bi].lookaheadRuntimeConfig (/workspace/tensorrt_llm/cpp/tensorrt_llm/runtime/gptDecoder.cpp:218)
1 0x7f0e8b6bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95
2 0x7f0e8ba92022 tensorrt_llm::runtime::GptDecoder<__half>::setup(tensorrt_llm::runtime::SamplingConfig const&, unsigned long, std::shared_ptr<tensorrt_llm::runtime::ITensor const> const&, std::optional<tensorrt_llm::runtime::DecodingOutput> const&, std::optional<std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const> const&) + 3074
3 0x7f0e8baa5d0e tensorrt_llm::runtime::GptDecoderBatched::newRequests(std::vector<int, std::allocator<int> > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator<tensorrt_llm::runtime::decoder_batch::Request> > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator<tensorrt_llm::runtime::SamplingConfig> > const&, tensorrt_llm::runtime::ModelConfig const&) + 590
4 0x7f0e8c4f87f5 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::shared_ptr<tensorrt_llm::batch_manager::RuntimeBuffers>&) + 1717
5 0x7f0e8c4fbf40 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1792
6 0x7f0e8c594189 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 457
7 0x7f0e8c5a07df tensorrt_llm::executor::Executor::Impl::executionLoop() + 1247
8 0x7f1130391db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f1130391db4]
9 0x7f113012fa94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7f113012fa94]
10 0x7f11301bca34 __clone + 68

Additional notes

It seems Triton expects to receive a lookaheadRuntimeConfig, which I assume corresponds to the parameters window_size, ngram_size, and verification_set_size in some form. However, I could not find a reference for how to pass them at inference time, nor how to declare them as inputs in the model repo's config.pbtxt.
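
For the record, this is the kind of request shape I was expecting to work. The input names below are purely a guess modeled on the other tensorrt_llm inputs; they are not confirmed against the backend's config.pbtxt or any documentation:

# Hypothetical extra inputs -- the names are a guess, not taken from any docs.
lookahead_inputs = [
    {"name": "lookahead_window_size", "shape": [1, 1], "datatype": "INT32", "data": [[4]]},
    {"name": "lookahead_ngram_size", "shape": [1, 1], "datatype": "INT32", "data": [[4]]},
    {"name": "lookahead_verification_set_size", "shape": [1, 1], "datatype": "INT32", "data": [[4]]},
]

# These would be appended to the "inputs" list of the JSON body in the
# inference request above.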
