mpirun -n 1 --allow-run-as-root python3 /app/TensorRT-LLM/examples/run.py \
    --tokenizer_dir ./llama33_70b \
    --draft_engine_dir ./draft-engine \
    --engine_dir /app/all_models/inflight_batcher_llm/tensorrt_llm/1 \
    --draft_target_model_config "[10,[0],[0], False]" \
    --kv_cache_free_gpu_memory_fraction=0.35 \
    --run_profiling \
    --max_output_len=1024 \
    --kv_cache_enable_block_reuse \
    --input_text="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nA 3-digit integer contains one of each of the digits 1,3 and 5. What is the probability that the integer is divisible by 5.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n" \
    --streaming
This example works fine, both with and without streaming. However, when I try to implement the Draft-Target-Model example described here:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/speculative-decoding.md#Draft-Target-Model
First, batch size > 1 does not work, even though the documentation claims: "With the fast logits enabled and following optimization tips in model configuration, speculative decoding with draft logits achieves 2.x throughput in BS1, 1.x throughput in BS16 comparing to auto-regressive decoding using Llama 3.2 1B draft and Llama 3.1 70B target." If batch size 16 is achievable, how is it done? (See the sketch below for what I would expect a batched invocation to look like.)
Second, streaming does not work; it fails with an error saying that streaming is not supported with speculative decoding.
The main culprit appears to be the tensorrt_llm_bls module.
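For reference, here is roughly what I would expect a batched (batch size 2) invocation of run.py to look like. This is only a sketch: it assumes that --input_text accepts several space-separated prompts (nargs='+') and that each prompt then becomes one sequence in the batch; the two prompt strings are placeholders, and the remaining flags are the same as in the working single-prompt command above.

# Hypothetical BS=2 invocation of run.py (assumption: --input_text takes
# multiple prompts and batches them); prompt strings are placeholders.
mpirun -n 1 --allow-run-as-root python3 /app/TensorRT-LLM/examples/run.py \
    --tokenizer_dir ./llama33_70b \
    --draft_engine_dir ./draft-engine \
    --engine_dir /app/all_models/inflight_batcher_llm/tensorrt_llm/1 \
    --draft_target_model_config "[10,[0],[0], False]" \
    --kv_cache_free_gpu_memory_fraction=0.35 \
    --max_output_len=1024 \
    --kv_cache_enable_block_reuse \
    --input_text "<first prompt>" "<second prompt>"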
@byshiue @Shixiaowei02 @kaiyux @rmccorm4