
Poor performance after FP8 Quantization for Llama 3.1 on PyTorch backend #5370

Open
@geaned

Description

System Info

  • CPU architecture: x86_64
  • CPU memory size: ~1TB
  • GPU properties:
    • GPU name: NVIDIA H100
    • GPU memory size: 80GB
  • Libraries:
    • Using docker image nvcr.io/nvidia/tritonserver:25.04-pyt-python-py3 with the following libraries installed afterwards:
      • tensorrt-llm==0.20.0rc3
      • torch==2.7.0
      • transformers==4.51.3
  • NVIDIA driver version: 535.161.08
  • CUDA version: 12.9
  • OS: Ubuntu 24.04.2 LTS

Who can help?

@Tracin @kaiyux

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am currently trying to estimate the potential throughput gain from implementing EAGLE-3 and noticed that the quantized model actually performs worse than the non-quantized one when speculative decoding is not used, while with EAGLE-3 enabled (under the highest load) the quantized model comes out ahead.
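
For context, this is roughly how each checkpoint is exercised through the TensorRT-LLM LLM API. This is only a minimal sketch: the EAGLE-3 and serving configuration used for the actual benchmark are omitted, and the prompt is a placeholder.

```python
# Minimal sanity-check sketch; not the full benchmark setup.
from tensorrt_llm import LLM, SamplingParams

# Swap in "nvidia/Llama-3.1-8B-Instruct-FP8" for the quantized run.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],  # placeholder prompt
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```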

Steps to reproduce the behaviour:

  1. Obtain Llama 3.1 8B Instruct and its FP8-quantized version from the corresponding repos (meta-llama/Llama-3.1-8B-Instruct and nvidia/Llama-3.1-8B-Instruct-FP8). Also obtain the pretrained EAGLE-3 weights for Llama 3.1 from https://huggingface.co/lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B
  2. Obtain the dataset targeted at generation-speed measurement from https://pastebin.com/42mig2um
    2.1. Input tokens are built with the model.tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True) call (see the sketch after this list)
  3. Launch a concurrent benchmark with 16 and 128 connections for approximately 10 minutes per scenario (a minimal harness sketch also follows this list)
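
To make step 2.1 concrete, here is a sketch of how the chat-templated inputs can be built with the Hugging Face tokenizer. The local file name prompts.json and its layout are placeholders for the pastebin dataset above.

```python
# Sketch of step 2.1: build chat-templated input token ids with the HF tokenizer.
# "prompts.json" and its structure are placeholders for the pastebin dataset.
import json
from transformers import AutoTokenizer

# Gated repo: requires Hugging Face access to meta-llama/Llama-3.1-8B-Instruct.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

with open("prompts.json") as f:
    samples = json.load(f)  # assumed: a list of chat message lists

inputs = [
    tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
    for messages in samples
]
print(len(inputs), "prompts,", sum(len(ids) for ids in inputs), "input tokens in total")
```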

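Step 3 in a nutshell: a fixed number of concurrent connections each loop over streaming requests and record first-token latency, total latency, and output token count. Below is a minimal harness sketch, assuming the model is exposed through an OpenAI-compatible streaming endpoint; the base URL, model name, request payload, and the chats placeholder are assumptions, not my exact setup.

```python
# Minimal concurrent benchmark sketch against an assumed OpenAI-compatible endpoint.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")  # assumed endpoint
MODEL = "Llama-3.1-8B-Instruct-FP8"                                         # assumed model name
chats = [[{"role": "user", "content": "Write a short story about GPUs."}]]  # placeholder dataset

results = []  # (first_token_latency_s, total_latency_s, output_token_count)

async def worker(duration_s: float) -> None:
    deadline = time.perf_counter() + duration_s
    i = 0
    while time.perf_counter() < deadline:
        messages = chats[i % len(chats)]
        i += 1
        start = time.perf_counter()
        first, tokens = None, 0
        stream = await client.chat.completions.create(
            model=MODEL, messages=messages, max_tokens=1024, stream=True
        )
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first is None:
                    first = time.perf_counter() - start
                tokens += 1  # roughly one token per streamed chunk; good enough for a sketch
        results.append((first, time.perf_counter() - start, tokens))

async def main(connections: int = 16, duration_s: float = 600.0) -> None:
    t0 = time.perf_counter()
    await asyncio.gather(*(worker(duration_s) for _ in range(connections)))
    elapsed = time.perf_counter() - t0
    ftl = sorted(r[0] for r in results if r[0] is not None)
    print(f"First Token Latency P90 =    {ftl[int(0.9 * len(ftl))] * 1e3:.0f}ms")
    print(f"Total Response Latency AVG = {sum(r[1] for r in results) / len(results):.2f}s")
    print(f"Requests per second =        {len(results) / elapsed:.2f}")
    print(f"Output tokens per second =   {sum(r[2] for r in results) / elapsed:.2f}")

asyncio.run(main())
```
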
My results for 16 connections

Llama-3.1-8B-Instruct
====================================
First Token Latency P90 =    29ms
First Token Latency P95 =    29ms
First Token Latency P99 =    30ms
Total Response Latency AVG = 10.34s
Requests per second =        1.53
Output tokens per second =   1149.48

Llama-3.1-8B-Instruct-FP8
====================================
First Token Latency P90 =    32ms
First Token Latency P95 =    32ms
First Token Latency P99 =    37ms
Total Response Latency AVG = 11.25s
Requests per second =        1.41
Output tokens per second =   1056.64

Llama-3.1-8B-Instruct w EAGLE3
====================================
First Token Latency P90 =    40ms
First Token Latency P95 =    41ms
First Token Latency P99 =    46ms
Total Response Latency AVG = 4.81s
Requests per second =        3.32
Output tokens per second =   2538.31

Llama-3.1-8B-Instruct-FP8 w EAGLE3
====================================
First Token Latency P90 =    44ms
First Token Latency P95 =    45ms
First Token Latency P99 =    52ms
Total Response Latency AVG = 5.15s
Requests per second =        3.11
Output tokens per second =   2303.76

My results for 128 connections

Llama-3.1-8B-Instruct
====================================
First Token Latency P90 =    52ms
First Token Latency P95 =    65ms
First Token Latency P99 =    144ms
Total Response Latency AVG = 12.49s
Requests per second =        10.01
Output tokens per second =   7628.94

Llama-3.1-8B-Instruct-FP8
====================================
First Token Latency P90 =    56ms
First Token Latency P95 =    59ms
First Token Latency P99 =    139ms
Total Response Latency AVG = 13.18s
Requests per second =        9.49
Output tokens per second =   6915.62

Llama-3.1-8B-Instruct w EAGLE3
====================================
First Token Latency P90 =    98ms
First Token Latency P95 =    99ms
First Token Latency P99 =    130ms
Total Response Latency AVG = 11.53s
Requests per second =        11.06
Output tokens per second =   8493.62

Llama-3.1-8B-Instruct-FP8 w EAGLE3
====================================
First Token Latency P90 =    87ms
First Token Latency P95 =    90ms
First Token Latency P99 =    127ms
Total Response Latency AVG = 10.03s
Requests per second =        12.67
Output tokens per second =   9324.91

Could you please clarify the possible reasons for such behaviour? Can this be considered a bug in the inference of quantized models on the PyTorch backend?

Moreover, do you have any idea why my GPU seems to be underutilized when I use the PyTorch backend? In every setup listed above, GPU utilization never went beyond 70%, whereas with the Triton backend (https://github.com/triton-inference-server/backend) the GPU is fully utilized.
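
For reference, GPU utilization can be tracked during a run with a simple nvidia-smi polling loop like the sketch below; the 1-second interval and GPU index 0 are arbitrary choices.

```python
# Polls GPU utilization once per second and reports the peak (stop with Ctrl+C).
# GPU index 0 and the 1 s interval are arbitrary; nvidia-smi must be on PATH.
import subprocess
import time

peak = 0
try:
    while True:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits", "-i", "0"],
            capture_output=True, text=True, check=True,
        )
        util = int(out.stdout.strip())
        peak = max(peak, util)
        print(f"util={util}% peak={peak}%", end="\r")
        time.sleep(1)
except KeyboardInterrupt:
    print(f"\npeak GPU utilization: {peak}%")
```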

Expected behavior

Full GPU utilization when using the PyTorch backend, and the FP8 model performing better than the FP16 model (with and without EAGLE-3, respectively)

Actual behavior

GPU underutilized (<70%), and the FP8 model performs worse than FP16 in all scenarios except the one with the largest load (128 concurrent connections + EAGLE-3)

Additional notes

Labels

Investigating, Performance (TRTLLM model inference speed, throughput, efficiency: latency, benchmarks, regressions, opts), bug (Something isn't working), triaged (Issue has been triaged by maintainers)
