System Info
- GPU: A10G (g5.48xlarge)
- Container: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
- tensorrtllm_backend version: v0.16.0
- Model: mistral7b
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
1. convert the mistral7b checkpoint
python3 convert_checkpoint.py <model_directory> \
--output_dir <checkpoint_directory> \
--dtype float16 \
--tp_size 8
2. build the engines
trtllm-build --checkpoint_dir <checkpoint_directory> \
--output_dir <engine_dir> \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_num_tokens 4096 \
--context_fmha enable \
--kv_cache_type paged \
--max_batch_size 32 \
--max_beam_width 10 \
--max_seq_len 128 \
--workers 8
3. create the model repo
ENGINE_DIR=<engine_dir>
TOKENIZER_DIR=<mistral7b_tokenizer_dir>
MODEL_FOLDER=<model_repo_dir>
TRITON_MAX_BATCH_SIZE=8
INSTANCE_COUNT=8
MAX_QUEUE_DELAY_MS=0
MAX_QUEUE_SIZE=0
FILL_TEMPLATE_SCRIPT=fill_template.py
DECOUPLED_MODE=false
MAX_BEAM_WIDTH=10
EXCLUDE_INPUT_IN_OUTPUT=true
BATCHING_STRATEGY=inflight_fused_batching
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:${BATCHING_STRATEGY},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16,max_beam_width:${MAX_BEAM_WIDTH}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
4. launch the Triton server
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=8 --model_repo=<model_repo_dir> --log --log-file=/tmp/triton_debug.log
Expected behavior
I am doing a prefix completion task. Given a prefix, my model is fine-tuned to complete it with 10 suggestions, and the suggestions should be diverse. Example
request:
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{
"text_input": "### Instruction: Provide suggestion starting with prefix ### Prefix: ipho ### Suggestion:",
"max_tokens": 20,
"beam_width": 10,
"pad_id": 2,
"end_id": 774}'
response:
{
"model_name": "tensorrt_llm_bls",
"model_version": "1",
"text_output": [
"iphone charger",
"iphone 15 plus case",
"iphone 13 case",
"iphone 15 case",
"iphone 14 case",
"iphone charger fast charging",
"iphone 15 pro case",
"iphone 14 plus case",
"iphone cable",
"iphone 13 pro case"
]
}
actual behavior
The first request works as expected. Starting from the second request, I get identical suggestions; there seems to be no diversity in beam search. For example:
{
"model_name": "tensorrt_llm_bls",
"model_version": "1",
"text_output": [
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case"
]
}
Further testing shows a pattern: for an engine built with max_batch_size=N, only requests numbered N*k + 1 return diverse beam search results. All other requests return identical suggestions.
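For reference, a loop along these lines shows the pattern (a hypothetical sketch, not the exact script I used; it assumes the server from step 4 is up on localhost:8000 and that python3 is available to count distinct outputs):
for i in $(seq 1 16); do
  # same request body as above; print how many distinct suggestions come back
  curl -s -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{
    "text_input": "### Instruction: Provide suggestion starting with prefix ### Prefix: ipho ### Suggestion:",
    "max_tokens": 20,
    "beam_width": 10,
    "pad_id": 2,
    "end_id": 774}' \
  | python3 -c "import sys, json; out = json.load(sys.stdin)['text_output']; print('request $i:', len(set(out)), 'distinct of', len(out))"
done
On my setup this prints 10 distinct suggestions only for the N*k + 1 requests and a single distinct suggestion for all the others.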
additional notes
Switching back to V1 batching makes everything work fine (see the sketch below), so the issue appears to be related to the interaction between in-flight batching and beam search.
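For completeness, the V1 workaround was just re-filling the tensorrt_llm config and relaunching the server. A sketch of the change, reusing the variables from step 3 (the exact value string accepted for batching_strategy may differ between backend versions):
# rerun the tensorrt_llm fill_template step with V1 batching instead of inflight_fused_batching, then relaunch Triton
BATCHING_STRATEGY=V1
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:${BATCHING_STRATEGY},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16,max_beam_width:${MAX_BEAM_WIDTH}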