System Info
- GPU: A10G (g5.48xlarge)
- Container: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
- tensorrtllm_backend version: v0.16.0
- Model: mistral7b
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
1. convert the mistral7b checkpoint
python3 convert_checkpoint.py <model_directory> \
--output_dir <checkpoint_directory> \
--dtype float16 \
--tp_size 8
2. build the engines
trtllm-build --checkpoint_dir <checkpoint_directory> \
--output_dir <engine_dir> \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_num_tokens 4096 \
--context_fmha enable \
--kv_cache_type paged \
--max_batch_size 32 \
--max_beam_width 10 \
--max_seq_len 128 \
--workers 8
3. create the model repo
ENGINE_DIR=<engine_dir>
TOKENIZER_DIR=<mistral7b_tokenizer_dir>
MODEL_FOLDER=<model_repo_dir>
TRITON_MAX_BATCH_SIZE=8
INSTANCE_COUNT=8
MAX_QUEUE_DELAY_MS=0
MAX_QUEUE_SIZE=0
FILL_TEMPLATE_SCRIPT=fill_template.py
DECOUPLED_MODE=false
MAX_BEAM_WIDTH=10
EXCLUDE_INPUT_IN_OUTPUT=true
BATCHING_STRATEGY=inflight_fused_batching
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:${BATCHING_STRATEGY},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16,max_beam_width:${MAX_BEAM_WIDTH}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
4. launch the Triton server
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=8 --model_repo=<model_repo_dir> --log --log-file=/tmp/triton_debug.log
Expected behavior
I am doing a prefix completion task. Given a prefix, my model is fine-tuned to complete it with 10 suggestions, and the suggestions should be diverse. Example
request:
curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{
"text_input": "### Instruction: Provide suggestion starting with prefix ### Prefix: ipho ### Suggestion:",
"max_tokens": 20,
"beam_width": 10,
"pad_id": 2,
"end_id": 774}'
response:
{
"model_name": "tensorrt_llm_bls",
"model_version": "1",
"text_output": [
"iphone charger",
"iphone 15 plus case",
"iphone 13 case",
"iphone 15 case",
"iphone 14 case",
"iphone charger fast charging",
"iphone 15 pro case",
"iphone 14 plus case",
"iphone cable",
"iphone 13 pro case"
]
}
actual behavior
The first request works as expected. Starting from the second request, I get identical suggestions; there seems to be no diversity in beam search. For example:
{
"model_name": "tensorrt_llm_bls",
"model_version": "1",
"text_output": [
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case",
"iphone 15 pro max case"
]
}
Further testing shows a pattern: for an engine built with max_batch_size=N, only requests numbered N*k + 1 return diverse beam search results. All other requests return identical suggestions.
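For reference, a loop along these lines shows the pattern (a hypothetical sketch, not the exact script I used; it assumes the server from step 4 is up on localhost:8000 and that python3 is available to count distinct outputs):
for i in $(seq 1 16); do
  # same request body as above; print how many distinct suggestions come back
  curl -s -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{
    "text_input": "### Instruction: Provide suggestion starting with prefix ### Prefix: ipho ### Suggestion:",
    "max_tokens": 20,
    "beam_width": 10,
    "pad_id": 2,
    "end_id": 774}' \
  | python3 -c "import sys, json; out = json.load(sys.stdin)['text_output']; print('request $i:', len(set(out)), 'distinct of', len(out))"
done
On my setup this prints 10 distinct suggestions only for the N*k + 1 requests and a single distinct suggestion for all the others.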
additional notes
Switching back to V1 batching makes everything work fine (see the sketch below), so the issue appears to be related to the interaction between in-flight batching and beam search.
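For completeness, the V1 workaround was just re-filling the tensorrt_llm config and relaunching the server. A sketch of the change, reusing the variables from step 3 (the exact value string accepted for batching_strategy may differ between backend versions):
# rerun the tensorrt_llm fill_template step with V1 batching instead of inflight_fused_batching, then relaunch Triton
BATCHING_STRATEGY=V1
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:${BATCHING_STRATEGY},exclude_input_in_output:${EXCLUDE_INPUT_IN_OUTPUT},max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16,max_beam_width:${MAX_BEAM_WIDTH}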