fp8 support #54
base: main
Conversation
Force-pushed from 9c088fe to 6199b0b.
Awesome! Would you mind adding a compile flag to save compile time when FP8 is not needed? Thanks.
Of course. Already done.
Great work! However, I can't merge this PR at the moment because, based on our tests, per-sequence kvcache scaling significantly reduces accuracy for MLA.
What about PerPageBlock granularity? I can easily adapt it.
We think PerPageBlock is not sufficient either. kv_rope (64) needs to stay in bf16.
Got it!
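For readers following the exchange above, here is a hypothetical sketch (not code from this PR) contrasting the scale granularities under discussion. The 512/64 nope/rope split follows MLA's kvcache layout; all tensor names and shapes are illustrative.

```python
import torch

# Illustrative paged kvcache: [num_blocks, page_block_size, head_dim].
num_blocks, page_block_size, head_dim = 8, 64, 576
kv = torch.randn(num_blocks, page_block_size, head_dim)

fp8_max = torch.finfo(torch.float8_e4m3fn).max

# Per-tensor: one scale for the whole cache (coarsest granularity).
scale_tensor = kv.abs().amax() / fp8_max              # scalar

# Per-page-block: one scale per kvcache page (the granularity proposed above).
scale_block = kv.abs().amax(dim=(1, 2)) / fp8_max     # [num_blocks]

# The reviewers note that even per-page-block is not enough for MLA:
# the 64 rope dims (kv_rope) must stay in bf16 rather than being quantized.
kv_nope, kv_rope = kv.split([512, 64], dim=-1)
kv_rope = kv_rope.to(torch.bfloat16)                  # keep rope part in bf16
```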
Functionality
Supports FP8 WGMMA based on the async pipeline design of FlashMLA. The TransV part draws on the SmemTranspose64x64 implementation in FA3.
Currently, Q/K/V support only symmetric PerTensor quantization. Since the maximum value of P does not exceed 1, f32tofp8_cast is applied directly, without an extra scale, to quantize P.
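As a rough illustration of the scheme described above, here is a minimal PyTorch sketch of symmetric per-tensor FP8 (e4m3) quantization. The helper name is made up, and the plain cast at the end only stands in for the kernel's f32tofp8_cast.

```python
import torch

def quant_fp8_per_tensor(x: torch.Tensor):
    """Symmetric per-tensor quantization: one scale shared by the whole tensor."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max    # 448 for e4m3
    scale = x.abs().amax().clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_fp8, scale                               # dequantize: x_fp8.float() * scale

# P contains attention probabilities in [0, 1], so no scale is needed and a
# plain f32 -> fp8 cast suffices (the role f32tofp8_cast plays in the kernel).
p = torch.rand(16, 16)
p_fp8 = p.to(torch.float8_e4m3fn)
```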
Performance
On the H20, MLA kernels typically exhibit high arithmetic intensity relative to the available compute, so Model FLOPs Utilization (MFU) is used as the performance metric.

On the H800, MLA is typically memory-bound, so the Memory Bandwidth Utilization (MBU) metric is used to evaluate the kernel. There is still considerable room for optimization on the H800; we look forward to working on it together.
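For reference, both metrics reduce to achieved-over-peak ratios. A minimal sketch follows, with the peak numbers left as parameters since the PR does not state them; the example values are illustrative, not measurements from this PR.

```python
def mfu(flops: float, elapsed_s: float, peak_tflops: float) -> float:
    """Achieved FLOP/s as a fraction of peak (metric used on the compute-bound H20)."""
    return flops / (elapsed_s * peak_tflops * 1e12)

def mbu(bytes_moved: float, elapsed_s: float, peak_tb_per_s: float) -> float:
    """Achieved bandwidth as a fraction of peak (metric used on the memory-bound H800)."""
    return bytes_moved / (elapsed_s * peak_tb_per_s * 1e12)

# Illustrative: 2 TB moved in 1 s against an assumed ~3.35 TB/s peak -> ~0.60.
print(mbu(bytes_moved=2.0e12, elapsed_s=1.0, peak_tb_per_s=3.35))
```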

Reproduction