Description
System Info
- CPU architecture: x86_64
- CPU memory size: ~1TB
- GPU properties:
  - GPU name: NVIDIA H100
  - GPU memory size: 80GB
- Libraries:
  - Using docker image `nvcr.io/nvidia/tritonserver:25.04-pyt-python-py3` with the following libraries installed afterwards:
    - tensorrt-llm==0.20.0rc3
    - torch==2.7.0
    - transformers==4.51.3
- NVIDIA driver version: 535.161.08
- CUDA version: 12.9
- OS: Ubuntu 24.04.2 LTS
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am currently trying to estimate the potential throughput increase from implementing EAGLE-3, and noticed that the quantized model actually performs worse than the non-quantized one when speculative decoding is not used, while on the other hand the quantized model performs better with EAGLE-3 enabled.
Steps to reproduce the behaviour:
1. Obtain Llama 3.1 8B Instruct and its FP8-quantized version from the corresponding repos (meta-llama/Llama-3.1-8B-Instruct and nvidia/Llama-3.1-8B-Instruct-FP8). Also obtain the pretrained EAGLE-3 weights for Llama 3.1 from https://huggingface.co/lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B
2. Obtain the dataset targeted at generation-speed measurement from https://pastebin.com/42mig2um
   2.1. Input tokens are built using the `model.tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)` call
3. Launch a concurrent benchmark with 16 and 128 connections for approximately 10 minutes for each scenario (a hedged sketch of such a client follows below)
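For reference, here is a minimal sketch of what such a benchmark client looks like: it builds input tokens exactly as in step 2.1 and keeps N connections busy against an OpenAI-compatible endpoint. The endpoint URL, payload shape, prompt content, and `max_tokens` value are illustrative assumptions, not my exact script.

```python
# Hypothetical benchmark client sketch (not the exact script used for the
# numbers below). Assumes an OpenAI-compatible /v1/chat/completions endpoint.
import asyncio
import time

import httpx
from transformers import AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # or nvidia/Llama-3.1-8B-Instruct-FP8
ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed server URL
CONNECTIONS = 16                             # 16 or 128, as in the scenarios below
DURATION_S = 600                             # ~10 minutes per scenario

tokenizer = AutoTokenizer.from_pretrained(MODEL)


def build_input_ids(messages):
    # Step 2.1: how input tokens are built for each dataset entry.
    return tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True
    )


async def worker(client, messages, latencies):
    # Each worker represents one persistent connection issuing requests back to back.
    deadline = time.monotonic() + DURATION_S
    while time.monotonic() < deadline:
        start = time.monotonic()
        resp = await client.post(
            ENDPOINT,
            json={"model": MODEL, "messages": messages, "max_tokens": 1024},
            timeout=120.0,
        )
        resp.raise_for_status()
        latencies.append(time.monotonic() - start)


async def main():
    # In the real run the messages come from the dataset linked in step 2.
    messages = [{"role": "user", "content": "Explain speculative decoding."}]
    _ = build_input_ids(messages)  # prompt-token inspection only in this sketch
    latencies = []
    async with httpx.AsyncClient() as client:
        await asyncio.gather(
            *(worker(client, messages, latencies) for _ in range(CONNECTIONS))
        )
    avg = sum(latencies) / max(len(latencies), 1)
    print(f"{len(latencies)} requests, avg total latency {avg:.2f}s")


asyncio.run(main())
```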
My results for 16 connections
| Model | First token P90 (ms) | First token P95 (ms) | First token P99 (ms) | Avg total latency (s) | Requests/s | Output tokens/s |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 29 | 29 | 30 | 10.34 | 1.53 | 1149.48 |
| Llama-3.1-8B-Instruct-FP8 | 32 | 32 | 37 | 11.25 | 1.41 | 1056.64 |
| Llama-3.1-8B-Instruct w/ EAGLE-3 | 40 | 41 | 46 | 4.81 | 3.32 | 2538.31 |
| Llama-3.1-8B-Instruct-FP8 w/ EAGLE-3 | 44 | 45 | 52 | 5.15 | 3.11 | 2303.76 |
My results for 128 connections
| Model | First token P90 (ms) | First token P95 (ms) | First token P99 (ms) | Avg total latency (s) | Requests/s | Output tokens/s |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 52 | 65 | 144 | 12.49 | 10.01 | 7628.94 |
| Llama-3.1-8B-Instruct-FP8 | 56 | 59 | 139 | 13.18 | 9.49 | 6915.62 |
| Llama-3.1-8B-Instruct w/ EAGLE-3 | 98 | 99 | 130 | 11.53 | 11.06 | 8493.62 |
| Llama-3.1-8B-Instruct-FP8 w/ EAGLE-3 | 87 | 90 | 127 | 10.03 | 12.67 | 9324.91 |
Could you please clarify possible reasons for this behaviour? Can it be considered a bug in the inference of quantized models on the PyTorch backend?
Also, do you have any idea why my GPU seems to be underutilized when I use the PyTorch backend? For every setup listed above, GPU utilization did not go beyond 70%, whereas with the Triton backend (https://github.com/triton-inference-server/backend) the GPU is fully utilized.
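For reference, a minimal sketch of one way to sample GPU utilization during a run is shown below; this is an illustration only, assuming the `pynvml` package (not part of the environment listed above) and a single GPU at index 0.

```python
# Hypothetical GPU utilization sampler (illustration only); assumes the
# pynvml package is installed and a single GPU at index 0.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # SM utilization in %
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"SM util: {util.gpu:3d}%  memory used: {mem.used / 2**30:.1f} GiB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```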
Expected behavior
Full GPU utilization when using the PyTorch backend, and the FP8 model performing better than the FP16 model (both with and without EAGLE-3).
Actual behavior
The GPU is underutilized (below ~70%), and the FP8 model performs worse than the FP16 model in all scenarios except the one with the largest load (128 concurrent connections + EAGLE-3).