
Why doesn't the tensorrt_llm_bls backend support speculative decoding with streaming or batch size > 1? #676


Description

@meowcoder22

```
mpirun -n 1 --allow-run-as-root python3 /app/TensorRT-LLM/examples/run.py \
    --tokenizer_dir ./llama33_70b \
    --draft_engine_dir ./draft-engine \
    --engine_dir /app/all_models/inflight_batcher_llm/tensorrt_llm/1 \
    --draft_target_model_config "[10,[0],[0], False]" \
    --kv_cache_free_gpu_memory_fraction=0.35 \
    --run_profiling \
    --max_output_len=1024 \
    --kv_cache_enable_block_reuse \
    --input_text="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nA 3-digit integer contains one of each of the digits 1,3 and 5. What is the probability that the integer is divisible by 5.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n" \
    --streaming
```

This example works fine, both with and without streaming. However, when I try to implement the Triton example described here:

https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/speculative-decoding.md#Draft-Target-Model

  1. Batch size > 1 does not work, even though the documentation claims that "With the fast logits enabled and following optimization tips in model configuration, speculative decoding with draft logits achieves 2.x throughput in BS1, 1.x throughput in BS16 comparing to auto-regressive decoding using Llama 3.2 1B draft and Llama 3.1 70B target." How is batch size 16 supposed to work?

  2. Streaming does not work; it fails with an error saying that streaming is not supported with speculative decoding.

The main culprit appears to be the tensorrt_llm_bls module.
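
For concreteness, here is a minimal sketch of the kind of request I am sending to the tensorrt_llm_bls model. The tensor names (text_input, max_tokens, stream) and the speculative-decoding knobs (num_draft_tokens, use_draft_logits) follow the inflight_batcher_llm example configs and are assumptions about my deployment rather than an exact reproduction; adjust them to whatever your config.pbtxt exposes:

```python
# Sketch only: tensor names below are taken from the inflight_batcher_llm
# example configs and may not match every deployment.
import queue

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


def make_input(name, data):
    """Wrap a numpy array in an InferInput with the matching Triton dtype."""
    tensor = grpcclient.InferInput(name, list(data.shape),
                                   np_to_triton_dtype(data.dtype))
    tensor.set_data_from_numpy(data)
    return tensor


def build_request(prompt):
    return [
        make_input("text_input", np.array([[prompt]], dtype=object)),
        make_input("max_tokens", np.array([[256]], dtype=np.int32)),
        make_input("stream", np.array([[True]], dtype=bool)),
        # Draft-target speculative decoding knobs (assumed names):
        make_input("num_draft_tokens", np.array([[10]], dtype=np.int32)),
        make_input("use_draft_logits", np.array([[False]], dtype=bool)),
    ]


responses = queue.Queue()


def on_response(result, error):
    # This callback is where the "streaming is not supported with
    # speculative decoding" error surfaces.
    responses.put(error if error is not None else result)


client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=on_response)
# Two requests in flight at once: this is what I mean by batch size > 1.
client.async_stream_infer("tensorrt_llm_bls", inputs=build_request("Prompt A"))
client.async_stream_infer("tensorrt_llm_bls", inputs=build_request("Prompt B"))
client.stop_stream()  # blocks until all responses (or errors) have arrived

while not responses.empty():
    print(responses.get())
```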

@byshiue @Shixiaowei02 @kaiyux @rmccorm4
