Description
Environment
- Hardware: NVIDIA H20-141G
- Models: DeepSeek-R1 (FP8) vs DeepSeek-R1-W4AFP8
- TensorRT-LLM: 0.21.0.rc0 and latest (ddfe4fc)
Performance Anomaly (Decoding Phase with Short Inputs)
Intuitively, in memory-bound scenarios (small batch sizes), W4AFP8 should outperform FP8 due to reduced memory bandwidth requirements. However, our tests show:
- Small batches (≤32): W4AFP8 has higher latency than FP8
- Large batches (>32): W4AFP8 shows significantly better scaling than FP8
| Batch Size | W4AFP8 ITL (ms) | FP8 ITL (ms) |
|---|---|---|
| 1 | 20 | 19 |
| 8 | 25 | 22 |
| 16 | 29 | 26 |
| 32 | 34 | 33 |
| 64 | 38 | 53 |
| 128 | 44 | 95 |
| 256 | 55 | 105 |
This contradicts the fundamental expectation for weight-quantized models in memory-bound regimes; a rough bandwidth estimate of that expectation is sketched below.
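A minimal sketch of the memory-bound reasoning, assuming latency for a decode-step GEMM is dominated by weight traffic. The layer shape and bandwidth figure below are illustrative placeholders, not the actual DeepSeek-R1 dimensions or measured H20 numbers:

```python
# Rough bandwidth estimate for one decode-step GEMM in a purely memory-bound
# regime, where latency ~ weight bytes moved / HBM bandwidth.
HBM_BW_GBPS = 4000           # assumed effective HBM bandwidth (GB/s), placeholder
K, N = 7168, 18432           # hypothetical GEMM weight shape (in_features, out_features)

def weight_bytes(bits_per_weight: float) -> float:
    """Bytes of weight traffic for one decode-step GEMM (activations are negligible)."""
    return K * N * bits_per_weight / 8

for name, bits in [("FP8", 8), ("W4AFP8", 4)]:
    gb = weight_bytes(bits) / 1e9
    # In a memory-bound model, per-layer latency is roughly traffic / bandwidth,
    # largely independent of batch size while the batch stays small.
    print(f"{name}: {gb * 1e3:.1f} MB weights -> ~{gb / HBM_BW_GBPS * 1e6:.1f} us per layer")
```

Under this model W4AFP8 moves half the weight bytes and should therefore be roughly 2x faster per GEMM at small batch sizes, which is the opposite of what the table shows for batches ≤32.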
Additional Context
These numbers reflect pure decoding performance with an average input sequence length of ≤100 tokens, so attention ops have minimal impact; the observed bottleneck is GEMM. A minimal sketch of how per-step latency can be sampled for a decode-style GEMM follows.
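This is only a sketch for isolating the GEMM cost per decode step, not the benchmark that produced the table above; it assumes a synthetic FP16 weight as a stand-in, since the FP8 and W4AFP8 kernels are not exposed through a plain PyTorch matmul, and the layer shape and batch sizes are placeholders:

```python
import time
import torch

device = "cuda"
K, N = 7168, 18432                         # hypothetical layer shape
weight = torch.randn(N, K, device=device, dtype=torch.float16)

def time_decode_gemm(batch_size: int, iters: int = 50) -> float:
    """Mean latency (ms) of one decode-step GEMM for a given batch size."""
    x = torch.randn(batch_size, K, device=device, dtype=torch.float16)
    for _ in range(5):                     # warmup
        _ = x @ weight.t()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = x @ weight.t()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

for bs in (1, 8, 32, 128):
    print(f"batch {bs}: {time_decode_gemm(bs):.3f} ms per step")
```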
@kaiyux Could you please help investigate this performance anomaly? The reversed scaling behavior suggests potential optimization opportunities.