
Beam search diversity lost with in-flight batching #682

Open
@Grace-YingHuang

Description

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Convert the Mistral-7B checkpoint:
python3 convert_checkpoint.py <model_directory> \
                             --output_dir <checkpoint_directory> \
                             --dtype float16 \
                             --tp_size 8
  2. Build the engines:
trtllm-build --checkpoint_dir <checkpoint_directory> \
             --output_dir <engine_dir> \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16 \
             --max_num_tokens 4096 \
             --context_fmha enable \
             --kv_cache_type paged \
             --max_batch_size 32 \
             --max_beam_width 10 \
             --max_seq_len 128 \
             --workers 8
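
To double-check the limits actually baked into the engine, one can inspect the config.json that trtllm-build writes into the engine directory. This is a sketch; the key layout below matches recent TensorRT-LLM releases and may differ in other versions:

# Print the build limits recorded in the engine's config.json
python3 -c "import json; c = json.load(open('<engine_dir>/config.json'))['build_config']; print('max_batch_size:', c['max_batch_size'], 'max_beam_width:', c['max_beam_width'])"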

  3. Create the model repository:

ENGINE_DIR=<engine_dir>
TOKENIZER_DIR=<mistral7b_tokenizer_dir>
MODEL_FOLDER=<model_repo_dir>
TRITON_MAX_BATCH_SIZE=8
INSTANCE_COUNT=8
MAX_QUEUE_DELAY_MS=0
MAX_QUEUE_SIZE=0
FILL_TEMPLATE_SCRIPT=fill_template.py
DECOUPLED_MODE=false
MAX_BEAM_WIDTH=10
EXCLUDE_INPUT_IN_OUTPUT=true
BATCHING_STRATEGY=inflight_fused_batching

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:${BATCHING_STRATEGY},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16,max_beam_width:${MAX_BEAM_WIDTH}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}

  4. Launch the Triton server:
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=8 --model_repo=<model_repo_dir> --log --log-file=/tmp/triton_debug.log 
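
Before sending requests, a quick readiness probe confirms all ranks have loaded (this assumes Triton's default HTTP port 8000):

# Returns HTTP 200 once every model in the repository is ready
curl -sf localhost:8000/v2/health/ready && echo "server ready"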

Expected behavior

I am working on a prefix-completion task. Given a prefix, my model is fine-tuned to complete it with 10 suggestions, and those suggestions should be diverse. Example:

request:

 curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{
  "text_input": "### Instruction: Provide suggestion starting with prefix ### Prefix: ipho ### Suggestion:",
  "max_tokens": 20,
  "beam_width": 10,
  "pad_id": 2,
  "end_id": 774}'

response:

{
  "model_name": "tensorrt_llm_bls",
  "model_version": "1",
  "text_output": [
    "iphone charger",
    "iphone 15 plus case",
    "iphone 13 case",
    "iphone 15 case",
    "iphone 14 case",
    "iphone charger fast charging",
    "iphone 15 pro case",
    "iphone 14 plus case",
    "iphone cable",
    "iphone 13 pro case"
  ]
}

Actual behavior

The first request works as expected. Starting from the second request, I get identical suggestions, as if beam search had lost all diversity. For example:

{
  "model_name": "tensorrt_llm_bls",
  "model_version": "1",
  "text_output": [
    "iphone 15 pro max case",
    "iphone 15 pro max case",
    "iphone 15 pro max case",
    "iphone 15 pro max case",
    "iphone 15 pro max case",
    "iphone 15 pro max case",
    "iphone 15 pro max case",
    "iphone 15 pro max case",
    "iphone 15 pro max case",
    "iphone 15 pro max case"
  ]
}

Further testing shows a pattern: for an engine built with max_batch_size=N, only requests numbered N*k + 1 (the 1st, (N+1)-th, (2N+1)-th, ...) return diverse beam search results; all other requests return identical suggestions.
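
A minimal way to see the pattern from the shell, assuming the server from step 4 is on localhost:8000 and jq is installed (the request body is the same as above):

# Send the same request repeatedly and count distinct suggestions per
# response; with max_batch_size=32, only requests 1, 33, 65, ... come
# back with 10 distinct beams, the rest collapse to a single one.
for i in $(seq 1 5); do
  n=$(curl -s -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{
    "text_input": "### Instruction: Provide suggestion starting with prefix ### Prefix: ipho ### Suggestion:",
    "max_tokens": 20,
    "beam_width": 10,
    "pad_id": 2,
    "end_id": 774}' | jq '.text_output | unique | length')
  echo "request $i: $n distinct suggestions"
done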

Additional notes

Switching back to V1 batching makes everything work as expected, so the issue appears to be related to the interaction between in-flight batching and beam search. The configuration change used for that fallback is shown below for reference.
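
Falling back to V1 batching only changes the batching_strategy value in step 3 (V1 is the value the backend documents for disabling in-flight batching); the tensorrt_llm config is then regenerated with the same command:

# Same fill_template.py invocation as step 3, with V1 batching instead
BATCHING_STRATEGY=V1
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:${BATCHING_STRATEGY},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16,max_beam_width:${MAX_BEAM_WIDTH}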
