
Why doesn't the tensorrt_llm_bls backend support speculative decoding with streaming or batch size > 1? #676


Description

@meowcoder22

```
mpirun -n 1 --allow-run-as-root python3 /app/TensorRT-LLM/examples/run.py \
    --tokenizer_dir ./llama33_70b \
    --draft_engine_dir ./draft-engine \
    --engine_dir /app/all_models/inflight_batcher_llm/tensorrt_llm/1 \
    --draft_target_model_config "[10,[0],[0], False]" \
    --kv_cache_free_gpu_memory_fraction=0.35 \
    --run_profiling \
    --max_output_len=1024 \
    --kv_cache_enable_block_reuse \
    --input_text="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nA 3-digit integer contains one of each of the digits 1,3 and 5. What is the probability that the integer is divisible by 5.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n" \
    --streaming
```

This example works fine, both with and without streaming. However, when I try to implement the Triton example described here:

https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/speculative-decoding.md#Draft-Target-Model

  1. Batch size > 1 does not work, even though the documentation claims that "With the fast logits enabled and following optimization tips in model configuration, speculative decoding with draft logits achieves 2.x throughput in BS1, 1.x throughput in BS16 comparing to auto-regressive decoding using Llama 3.2 1B draft and Llama 3.1 70B target." How is batch size 16 supposed to work?

  2. Streaming does not work; it fails with an error saying that streaming is not supported with speculative decoding.

The main culprit appears to be the tensorrt_llm_bls module.
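
For concreteness, here is a minimal sketch of the kind of request I am sending to the tensorrt_llm_bls model. The tensor names (text_input, max_tokens, stream) and the speculative-decoding knobs (num_draft_tokens, use_draft_logits) follow the inflight_batcher_llm example configs and are assumptions about my deployment rather than an exact reproduction; adjust them to whatever your config.pbtxt exposes:

```python
# Sketch only: tensor names below are taken from the inflight_batcher_llm
# example configs and may not match every deployment.
import queue

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


def make_input(name, data):
    """Wrap a numpy array in an InferInput with the matching Triton dtype."""
    tensor = grpcclient.InferInput(name, list(data.shape),
                                   np_to_triton_dtype(data.dtype))
    tensor.set_data_from_numpy(data)
    return tensor


def build_request(prompt):
    return [
        make_input("text_input", np.array([[prompt]], dtype=object)),
        make_input("max_tokens", np.array([[256]], dtype=np.int32)),
        make_input("stream", np.array([[True]], dtype=bool)),
        # Draft-target speculative decoding knobs (assumed names):
        make_input("num_draft_tokens", np.array([[10]], dtype=np.int32)),
        make_input("use_draft_logits", np.array([[False]], dtype=bool)),
    ]


responses = queue.Queue()


def on_response(result, error):
    # This callback is where the "streaming is not supported with
    # speculative decoding" error surfaces.
    responses.put(error if error is not None else result)


client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=on_response)
# Two requests in flight at once: this is what I mean by batch size > 1.
client.async_stream_infer("tensorrt_llm_bls", inputs=build_request("Prompt A"))
client.async_stream_infer("tensorrt_llm_bls", inputs=build_request("Prompt B"))
client.stop_stream()  # blocks until all responses (or errors) have arrived

while not responses.empty():
    print(responses.get())
```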

@byshiue @Shixiaowei02 @kaiyux @rmccorm4
