Description
System Info
- CPU architecture: x86_64
- CPU memory size: ~1TB
- GPU properties:
  - GPU name: NVIDIA H100
  - GPU memory size: 80GB
- Libraries:
  - Using docker image `nvcr.io/nvidia/tritonserver:25.04-pyt-python-py3` with the following libraries installed afterwards:
    - tensorrt-llm==0.20.0rc3
    - torch==2.7.0
    - transformers==4.51.3
- NVIDIA driver version: 535.161.08
- CUDA version: 12.9
- OS: Ubuntu 24.04.2 LTS
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am currently trying to estimate the potential throughput increase from implementing EAGLE-3, and noticed that the quantized model actually performs worse than the non-quantized one when speculative decoding is not used, while on the other hand the quantized model performs better with EAGLE-3 enabled.
Steps to reproduce the behaviour:
1. Obtain Llama 3.1 8B Instruct and its FP8-quantized version from the corresponding repos (meta-llama/Llama-3.1-8B-Instruct and nvidia/Llama-3.1-8B-Instruct-FP8). Also obtain the pretrained EAGLE-3 weights for Llama 3.1 from https://huggingface.co/lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B
2. Obtain the dataset targeted at generation-speed measurement from https://pastebin.com/42mig2um
   2.1. Input tokens are built using the `model.tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)` call
3. Launch a concurrent benchmark with 16 and 128 connections for approximately 10 minutes for each scenario (a hedged sketch of such a client follows below)
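For reference, here is a minimal sketch of what such a benchmark client looks like: it builds input tokens exactly as in step 2.1 and keeps N connections busy against an OpenAI-compatible endpoint. The endpoint URL, payload shape, prompt content, and `max_tokens` value are illustrative assumptions, not my exact script.

```python
# Hypothetical benchmark client sketch (not the exact script used for the
# numbers below). Assumes an OpenAI-compatible /v1/chat/completions endpoint.
import asyncio
import time

import httpx
from transformers import AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # or nvidia/Llama-3.1-8B-Instruct-FP8
ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed server URL
CONNECTIONS = 16                             # 16 or 128, as in the scenarios below
DURATION_S = 600                             # ~10 minutes per scenario

tokenizer = AutoTokenizer.from_pretrained(MODEL)


def build_input_ids(messages):
    # Step 2.1: how input tokens are built for each dataset entry.
    return tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True
    )


async def worker(client, messages, latencies):
    # Each worker represents one persistent connection issuing requests back to back.
    deadline = time.monotonic() + DURATION_S
    while time.monotonic() < deadline:
        start = time.monotonic()
        resp = await client.post(
            ENDPOINT,
            json={"model": MODEL, "messages": messages, "max_tokens": 1024},
            timeout=120.0,
        )
        resp.raise_for_status()
        latencies.append(time.monotonic() - start)


async def main():
    # In the real run the messages come from the dataset linked in step 2.
    messages = [{"role": "user", "content": "Explain speculative decoding."}]
    _ = build_input_ids(messages)  # prompt-token inspection only in this sketch
    latencies = []
    async with httpx.AsyncClient() as client:
        await asyncio.gather(
            *(worker(client, messages, latencies) for _ in range(CONNECTIONS))
        )
    avg = sum(latencies) / max(len(latencies), 1)
    print(f"{len(latencies)} requests, avg total latency {avg:.2f}s")


asyncio.run(main())
```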
My results for 16 connections
| Model | First token P90 (ms) | First token P95 (ms) | First token P99 (ms) | Avg total latency (s) | Requests/s | Output tokens/s |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 29 | 29 | 30 | 10.34 | 1.53 | 1149.48 |
| Llama-3.1-8B-Instruct-FP8 | 32 | 32 | 37 | 11.25 | 1.41 | 1056.64 |
| Llama-3.1-8B-Instruct w/ EAGLE-3 | 40 | 41 | 46 | 4.81 | 3.32 | 2538.31 |
| Llama-3.1-8B-Instruct-FP8 w/ EAGLE-3 | 44 | 45 | 52 | 5.15 | 3.11 | 2303.76 |
My results for 128 connections
| Model | First token P90 (ms) | First token P95 (ms) | First token P99 (ms) | Avg total latency (s) | Requests/s | Output tokens/s |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 52 | 65 | 144 | 12.49 | 10.01 | 7628.94 |
| Llama-3.1-8B-Instruct-FP8 | 56 | 59 | 139 | 13.18 | 9.49 | 6915.62 |
| Llama-3.1-8B-Instruct w/ EAGLE-3 | 98 | 99 | 130 | 11.53 | 11.06 | 8493.62 |
| Llama-3.1-8B-Instruct-FP8 w/ EAGLE-3 | 87 | 90 | 127 | 10.03 | 12.67 | 9324.91 |
Could you please clarify possible reasons for this behaviour? Can it be considered a bug in the inference of quantized models on the PyTorch backend?
Also, do you have any idea why my GPU seems to be underutilized when I use the PyTorch backend? For every setup listed above, GPU utilization did not go beyond 70%, whereas with the Triton backend (https://github.com/triton-inference-server/backend) the GPU is fully utilized.
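For reference, a minimal sketch of one way to sample GPU utilization during a run is shown below; this is an illustration only, assuming the `pynvml` package (not part of the environment listed above) and a single GPU at index 0.

```python
# Hypothetical GPU utilization sampler (illustration only); assumes the
# pynvml package is installed and a single GPU at index 0.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # SM utilization in %
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"SM util: {util.gpu:3d}%  memory used: {mem.used / 2**30:.1f} GiB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```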
Expected behavior
Full GPU utilization when using the PyTorch backend, and the FP8 model performing better than the FP16 model (both with and without EAGLE-3).
Actual behavior
The GPU is underutilized (below ~70%), and the FP8 model performs worse than the FP16 model in all scenarios except the one with the largest load (128 concurrent connections + EAGLE-3).