
Abnormal Performance Scaling of W4AFP8 vs FP8 on H20-141G with DeepSeek-R1 Models #5127

@Nekofish-L

Description


Environment

  • Hardware: NVIDIA H20-141G
  • Model: DeepSeek-R1 (W4AFP8 and FP8 quantized variants)
  • Framework: TensorRT-LLM

Performance Anomaly (Decoding Phase with Short Inputs)
Intuitively, in memory-bound scenarios (small batch sizes), W4AFP8 should outperform FP8 due to reduced memory bandwidth requirements. However, our tests show:

  • Small batches (≤32): W4AFP8 has higher latency than FP8
  • Large batches (>32): W4AFP8 shows significantly better scaling than FP8
| Batch Size | W4AFP8 ITL (ms) | FP8 ITL (ms) |
|-----------:|----------------:|-------------:|
| 1          | 20              | 19           |
| 8          | 25              | 22           |
| 16         | 29              | 26           |
| 32         | 34              | 33           |
| 64         | 38              | 53           |
| 128        | 44              | 95           |
| 256        | 55              | 105          |

This contradicts the fundamental expectation for weight-quantized models in memory-bound regimes; a back-of-envelope estimate of that expectation is sketched below.
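
As a rough reference, here is a minimal roofline-style sketch of why weight-only 4-bit GEMMs are usually expected to win at small batch sizes. The GEMM shape, bandwidth, and peak-FLOPs constants are placeholder assumptions (not measured H20-141G or DeepSeek-R1 values), and `gemm_time_estimate` is a hypothetical helper, not a TensorRT-LLM API.

```python
# Back-of-envelope roofline estimate for one decode-phase GEMM (batch M, weights K x N).
# All constants are illustrative assumptions, not measured H20-141G / DeepSeek-R1 values.

MEM_BW = 4.0e12      # assumed HBM bandwidth, bytes/s
PEAK_FLOPS = 3.0e14  # assumed dense FP8 compute ceiling, FLOP/s

def gemm_time_estimate(m, k, n, weight_bytes_per_elt, act_bytes_per_elt=1.0):
    """Return an estimated GEMM time (s) as max(memory-bound, compute-bound)."""
    # Bytes moved: weights dominate at small M; activations/outputs are comparatively small.
    bytes_moved = k * n * weight_bytes_per_elt + (m * k + m * n) * act_bytes_per_elt
    flops = 2.0 * m * k * n
    return max(bytes_moved / MEM_BW, flops / PEAK_FLOPS)

K, N = 7168, 18432  # illustrative layer shape only
for m in (1, 8, 32, 64, 128, 256):
    t_fp8 = gemm_time_estimate(m, K, N, weight_bytes_per_elt=1.0)  # FP8 weights: 1 byte/elt
    t_w4 = gemm_time_estimate(m, K, N, weight_bytes_per_elt=0.5)   # INT4 weights: 0.5 byte/elt
    print(f"batch={m:4d}  FP8 ~{t_fp8 * 1e6:6.1f} us   W4AFP8 ~{t_w4 * 1e6:6.1f} us")
```

Under this idealized model, W4AFP8 should be close to 2x faster per GEMM while memory bound and converge to FP8 once compute bound (ignoring dequantization overhead), which is roughly the opposite of the trend measured above.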

Additional Context
The measurements cover the pure decoding phase with an average input sequence length of ≤100 tokens, so attention ops have minimal impact; the observed bottleneck is GEMM.
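
For clarity on the metric: ITL in the table above is the average gap between consecutive generated tokens during streaming decode. A minimal sketch of how it could be computed from per-token timestamps is shown below; `token_times` is hypothetical data, not the output of any TensorRT-LLM API.

```python
# Minimal sketch: compute mean inter-token latency (ITL) from streamed token timestamps.
# `token_times` is illustrative data, not produced by any real benchmarking tool here.
import statistics

def mean_itl_ms(token_times):
    """Average gap between consecutive generated tokens, in milliseconds."""
    gaps = [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]
    return 1e3 * statistics.mean(gaps)

token_times = [0.000, 0.020, 0.041, 0.060, 0.081]  # seconds since request start
print(f"mean ITL ~ {mean_itl_ms(token_times):.1f} ms")
```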

@kaiyux Could you please help investigate this performance anomaly? The reversed scaling behavior suggests potential optimization opportunities.

Labels

Investigating · Performance (TRTLLM model inference speed, throughput, efficiency: latency, benchmarks, regressions, opts) · triaged (issue has been triaged by maintainers)
