Description
```shell
# My understanding of --draft_target_model_config (based on the
# draft-target-model example; please correct me if this is wrong):
#   [draft_len, draft_model_device_list, target_model_device_list, use_logits]
mpirun -n 1 --allow-run-as-root python3 /app/TensorRT-LLM/examples/run.py \
    --tokenizer_dir ./llama33_70b \
    --draft_engine_dir ./draft-engine \
    --engine_dir /app/all_models/inflight_batcher_llm/tensorrt_llm/1 \
    --draft_target_model_config "[10,[0],[0], False]" \
    --kv_cache_free_gpu_memory_fraction=0.35 \
    --run_profiling \
    --max_output_len=1024 \
    --kv_cache_enable_block_reuse \
    --input_text="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nA 3-digit integer contains one of each of the digits 1,3 and 5. What is the probability that the integer is divisible by 5.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n" \
    --streaming
```
This example works fine, with both streaming and non-streaming. However, when I try to implement the same setup through the example here, I run into two problems:
- Batch size > 1 does not work, even though the documentation claims: "With the fast logits enabled and following optimization tips in model configuration, speculative decoding with draft logits achieves 2.x throughput in BS1, 1.x throughput in BS16 comparing to auto-regressive decoding using Llama 3.2 1B draft and Llama 3.1 70B target." How is batch size 16 achieved? A sketch of the batched request I am sending is right after this list.
- Streaming does not work; it fails with an error saying that streaming is not supported with speculative decoding. The streaming call that triggers this is sketched at the end of this issue.
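
For reference, here is a minimal sketch of the kind of batched request that fails for me. The input/output names (`text_input`, `max_tokens`, `text_output`) follow the standard `inflight_batcher_llm` model repository; the URL, prompts, and token count are placeholders:

```python
# Minimal sketch of a batched (BS=2) request to the tensorrt_llm_bls model.
# Assumes the standard inflight_batcher_llm input/output names; adjust to
# your config.pbtxt if it differs.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

prompts = ["What is 2 + 2?", "Name a prime number greater than 10."]
text = np.array([[p.encode()] for p in prompts], dtype=np.object_)  # shape [BS, 1]
max_tokens = np.full((len(prompts), 1), 128, dtype=np.int32)

inputs = [
    grpcclient.InferInput("text_input", list(text.shape), "BYTES"),
    grpcclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

# With the speculative-decoding BLS, this succeeds for a single prompt
# but fails as soon as len(prompts) > 1.
result = client.infer(model_name="tensorrt_llm_bls", inputs=inputs)
print(result.as_numpy("text_output"))
```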
In both cases, the main culprit is the `tensorrt_llm_bls` module.
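
And this is roughly how I issue the streaming request that produces the "streaming is not supported with speculative decoding" error (decoupled mode over the gRPC stream API; again, input names assume the standard model repository):

```python
# Sketch of the streaming (decoupled) request that triggers the error.
# Assumes the standard inflight_batcher_llm input names.
import queue
import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Stream results (or the error) arrive asynchronously here.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=callback)

text = np.array([["What is 2 + 2?".encode()]], dtype=np.object_)
inputs = [
    grpcclient.InferInput("text_input", list(text.shape), "BYTES"),
    grpcclient.InferInput("max_tokens", [1, 1], "INT32"),
    grpcclient.InferInput("stream", [1, 1], "BOOL"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(np.array([[128]], dtype=np.int32))
inputs[2].set_data_from_numpy(np.array([[True]], dtype=np.bool_))

client.async_stream_infer(model_name="tensorrt_llm_bls", inputs=inputs)
# The "streaming is not supported with speculative decoding" error
# comes back through the callback.
print(responses.get())
client.stop_stream()
```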