Description
Environment
- Hardware: NVIDIA H20-141G
- Models: DeepSeek-R1 (FP8) vs DeepSeek-R1-W4AFP8
- TensorRT-LLM: 0.21.0.rc0 and latest (ddfe4fc)
Performance Anomaly (Decoding Phase with Short Inputs)
Intuitively, in memory-bound scenarios (small batch sizes), W4AFP8 should outperform FP8 due to reduced memory bandwidth requirements. However, our tests show:
- Small batches (≤32): W4AFP8 has higher latency than FP8
- Large batches (>32): W4AFP8 shows significantly better scaling than FP8
| Batch Size | W4AFP8 ITL (ms) | FP8 ITL (ms) |
|---|---|---|
| 1 | 20 | 19 |
| 8 | 25 | 22 |
| 16 | 29 | 26 |
| 32 | 34 | 33 |
| 64 | 38 | 53 |
| 128 | 44 | 95 |
| 256 | 55 | 105 |
This contradicts the fundamental expectation for weight-quantized models in memory-bound regimes; a rough bandwidth estimate of that expectation is sketched below.
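A minimal sketch of the memory-bound reasoning, assuming latency for a decode-step GEMM is dominated by weight traffic. The layer shape and bandwidth figure below are illustrative placeholders, not the actual DeepSeek-R1 dimensions or measured H20 numbers:

```python
# Rough bandwidth estimate for one decode-step GEMM in a purely memory-bound
# regime, where latency ~ weight bytes moved / HBM bandwidth.
HBM_BW_GBPS = 4000           # assumed effective HBM bandwidth (GB/s), placeholder
K, N = 7168, 18432           # hypothetical GEMM weight shape (in_features, out_features)

def weight_bytes(bits_per_weight: float) -> float:
    """Bytes of weight traffic for one decode-step GEMM (activations are negligible)."""
    return K * N * bits_per_weight / 8

for name, bits in [("FP8", 8), ("W4AFP8", 4)]:
    gb = weight_bytes(bits) / 1e9
    # In a memory-bound model, per-layer latency is roughly traffic / bandwidth,
    # largely independent of batch size while the batch stays small.
    print(f"{name}: {gb * 1e3:.1f} MB weights -> ~{gb / HBM_BW_GBPS * 1e6:.1f} us per layer")
```

Under this model W4AFP8 moves half the weight bytes and should therefore be roughly 2x faster per GEMM at small batch sizes, which is the opposite of what the table shows for batches ≤32.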
Additional Context
These numbers reflect pure decoding performance with an average input sequence length of ≤100 tokens, so attention ops have minimal impact; the observed bottleneck is GEMM. A minimal sketch of how per-step latency can be sampled for a decode-style GEMM follows.
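This is only a sketch for isolating the GEMM cost per decode step, not the benchmark that produced the table above; it assumes a synthetic FP16 weight as a stand-in, since the FP8 and W4AFP8 kernels are not exposed through a plain PyTorch matmul, and the layer shape and batch sizes are placeholders:

```python
import time
import torch

device = "cuda"
K, N = 7168, 18432                         # hypothetical layer shape
weight = torch.randn(N, K, device=device, dtype=torch.float16)

def time_decode_gemm(batch_size: int, iters: int = 50) -> float:
    """Mean latency (ms) of one decode-step GEMM for a given batch size."""
    x = torch.randn(batch_size, K, device=device, dtype=torch.float16)
    for _ in range(5):                     # warmup
        _ = x @ weight.t()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = x @ weight.t()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

for bs in (1, 8, 32, 128):
    print(f"batch {bs}: {time_decode_gemm(bs):.3f} ms per step")
```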
@kaiyux Could you please help investigate this performance anomaly? The reversed scaling behavior suggests potential optimization opportunities.