
Why doesn't the tensorrt_llm_bls backend support speculative decoding with streaming or batch size > 1? #676

Open
meowcoder22 opened this issue Jan 9, 2025 · 0 comments


```bash
mpirun -n 1 --allow-run-as-root python3 /app/TensorRT-LLM/examples/run.py \
    --tokenizer_dir ./llama33_70b \
    --draft_engine_dir ./draft-engine \
    --engine_dir /app/all_models/inflight_batcher_llm/tensorrt_llm/1 \
    --draft_target_model_config "[10,[0],[0], False]" \
    --kv_cache_free_gpu_memory_fraction=0.35 \
    --run_profiling \
    --max_output_len=1024 \
    --kv_cache_enable_block_reuse \
    --input_text="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nA 3-digit integer contains one of each of the digits 1,3 and 5. What is the probability that the integer is divisible by 5.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n" \
    --streaming
```
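For reference, the `--draft_target_model_config` value above is a Python-style list. The TensorRT-LLM speculative decoding docs describe its fields as draft length, draft-model device list, target-model device list, and whether draft logits (rather than tokens) are used for acceptance. The helper below is a hypothetical sketch that decodes the string under that assumed field order; `DraftTargetConfig` and `parse_draft_target_config` are illustrative names, not part of TensorRT-LLM.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class DraftTargetConfig:
    # Assumed field order: [draft_len, draft_devices, target_devices, use_logits]
    draft_len: int             # tokens the draft model proposes per step
    draft_devices: List[int]   # GPU ids for the draft engine
    target_devices: List[int]  # GPU ids for the target engine
    use_logits: bool           # True: accept via draft logits; False: accept via token matching


def parse_draft_target_config(text: str) -> DraftTargetConfig:
    """Hypothetical helper: decode a string like "[10,[0],[0], False]"."""
    # literal_eval safely parses Python literals, including True/False
    value = ast.literal_eval(text)
    if not (isinstance(value, list) and len(value) == 4):
        raise ValueError(f"expected a 4-element list, got {value!r}")
    draft_len, draft_devices, target_devices, use_logits = value
    return DraftTargetConfig(int(draft_len), list(draft_devices),
                             list(target_devices), bool(use_logits))


cfg = parse_draft_target_config("[10,[0],[0], False]")
print(cfg)
```

Read this way, the command drafts 10 tokens per step, places both engines on GPU 0, and accepts draft tokens by matching them against the target model's tokens rather than by comparing logits.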

This example works fine, both with and without streaming. However, I run into two problems when I follow the example here:

https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/speculative-decoding.md#Draft-Target-Model

  1. Batch size > 1 does not work, even though the docs state: "With the fast logits enabled and following optimization tips in model configuration, speculative decoding with draft logits achieves 2.x throughput in BS1, 1.x throughput in BS16 comparing to auto-regressive decoding using Llama 3.2 1B draft and Llama 3.1 70B target." That implies batch size 16 is possible, so how?

  2. Streaming does not work; it fails with an error saying that streaming is not supported with speculative decoding.
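To make both problems concrete, here is a generic sketch of greedy (token-matching) draft-target verification, the acceptance mode that `use_logits = False` corresponds to. It is an illustration of the technique under that assumption, not the code inside tensorrt_llm_bls or TensorRT-LLM, and the token ids are made up. It shows that one speculative step emits anywhere from 1 to draft_len + 1 tokens per request, and that requests in the same batch advance by different amounts. A BLS loop that streams and batches has to track this per request, which may be part of why those paths are restricted, though that is a guess rather than a confirmed diagnosis.

```python
from typing import List, Sequence


def greedy_verify(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> List[int]:
    """Greedy acceptance for one request (generic illustration).

    draft_tokens:  the k tokens proposed by the draft model.
    target_tokens: the target model's greedy token at each of the k + 1 positions
                   (one per draft token plus one bonus position), from one forward pass.
    Returns the tokens emitted this step: the matching draft prefix plus one
    target token (a correction at the first mismatch, or the bonus token).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score k + 1 positions")
    emitted: List[int] = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            emitted.append(t)  # first mismatch: take the target's token and stop
            return emitted
        emitted.append(d)      # draft token agrees with the target: accept it
    emitted.append(target_tokens[-1])  # every draft accepted: keep the bonus token
    return emitted


# Hypothetical batch of two requests with draft_len = 4
batch_drafts = [[5, 8, 2, 9], [7, 7, 1, 3]]
batch_targets = [[5, 8, 2, 9, 4], [7, 6, 0, 0, 0]]
for i, (d, t) in enumerate(zip(batch_drafts, batch_targets)):
    chunk = greedy_verify(d, t)
    print(f"request {i}: streamed {len(chunk)} token(s) -> {chunk}")
```

Here request 0 accepts all four draft tokens plus the bonus token (5 tokens), while request 1 accepts one draft token plus a correction (2 tokens), so a streaming client would receive chunks of different sizes from the same step.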

The main culprit seems to be the tensorrt_llm_bls module.

@byshiue @Shixiaowei02 @kaiyux @rmccorm4
