
fp8 support #54

Open
endurehero wants to merge 31 commits into main
Conversation


@endurehero endurehero commented Feb 28, 2025

Functionality

Adds FP8 WGMMA support based on the async pipeline design of FlashMLA. The TransV part draws on the SmemTranspose64x64 implementation in FA3.
Currently, Q/K/V only support symmetric PerTensor quantization. Since the maximum value of P never exceeds 1, a direct f32-to-fp8 cast (f32tofp8_cast) is used to quantize it.
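
For reference, a minimal sketch of the quantization scheme described above, assuming PyTorch and the e4m3 format; the helper name, the scale formula, and the tensor shapes are illustrative, not the kernel's actual implementation:

```python
import torch

FP8_E4M3_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def per_tensor_quantize(x: torch.Tensor):
    """Symmetric per-tensor quantization to FP8 e4m3.

    Returns the quantized tensor and the scale needed to dequantize
    (x ~= x_fp8.float() * scale).
    """
    scale = x.abs().max().float().clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x.float() / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Q/K/V: symmetric per-tensor quantization with an explicit scale.
q_fp8, q_scale = per_tensor_quantize(torch.randn(16, 128, dtype=torch.bfloat16))

# P (softmax output): values lie in [0, 1], so a direct f32 -> fp8 cast
# is used without a scale factor.
p = torch.rand(16, 64)
p_fp8 = p.to(torch.float8_e4m3fn)
```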

Performance

CUDA driver version: 535.183.06
NVCC version: 12.8
PyTorch version: 2.6

On the H20, MLA kernels typically exhibit high arithmetic intensity and are compute-bound. Consequently, Model FLOPs Utilization (MFU) is used as the performance metric.
[image: MFU results on H20]

On the H800, MLA kernels are typically memory-bound. Consequently, the Memory Bandwidth Utilization (MBU) metric is used to evaluate kernel performance. There is still plenty of room for optimization on the H800; we look forward to working on it together.
[image: MBU results on H800]
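
For clarity, both metrics are simple ratios of attained to peak throughput; the sketch below is illustrative, and the numbers in it are placeholders rather than the values behind the results above:

```python
def mfu(attained_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: fraction of peak compute throughput achieved."""
    return attained_flops_per_s / peak_flops_per_s

def mbu(attained_bytes_per_s: float, peak_bytes_per_s: float) -> float:
    """Memory Bandwidth Utilization: fraction of peak HBM bandwidth achieved."""
    return attained_bytes_per_s / peak_bytes_per_s

# attained values come from kernel_flops / elapsed_s and kernel_bytes / elapsed_s;
# the peak values below are placeholders, not actual H20/H800 specs.
print(mfu(attained_flops_per_s=1.0e14, peak_flops_per_s=1.5e14))  # ~0.67
print(mbu(attained_bytes_per_s=2.4e12, peak_bytes_per_s=3.0e12))  # ~0.80
```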

Reproduction

python3 ./tests/test_flash_mla.py --dtype e4m3

@endurehero endurehero closed this Feb 28, 2025
@endurehero endurehero changed the title support fp8 fp8 support Feb 28, 2025
@endurehero endurehero reopened this Feb 28, 2025
@endurehero endurehero mentioned this pull request Feb 28, 2025
@sijiac
Contributor

sijiac commented Mar 1, 2025

Awesome! Would you mind adding a compile flag to save compile time when FP8 is not needed? Thanks.

@endurehero
Author

endurehero commented Mar 1, 2025

Awesome! Would you mind adding a compile flag to save compile time when FP8 is not needed? Thanks.

Of course. Already done.

@beginlner
Collaborator

beginlner commented Mar 1, 2025

Great work! However, I can’t merge this PR at the moment because, based on our tests, per-sequence kvcache scaling significantly reduces accuracy for MLA.

@endurehero
Author

Great work! However, I can’t merge this PR at the moment because, based on our tests, per-sequence kvcache scaling significantly reduces accuracy for MLA.

What about PerPageBlock granularity? I can easily adapt to it.

@beginlner
Collaborator

beginlner commented Mar 1, 2025

What about PerPageBlock granularity? I can easily adapt to it.

We think PerPageBlock is not sufficient either; kv_rope (the 64-dim RoPE part) needs to be in bf16.

@endurehero
Author

What about PerPageBlock granularity? I can easily adapt to it.

We think PerPageBlock is not sufficient either; kv_rope (the 64-dim RoPE part) needs to be in bf16.

Got it!
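
For illustration of the layout discussed above (a hedged sketch, not part of this PR): assuming a DeepSeek-style MLA cache with a 512-dim compressed part plus the 64-dim kv_rope part, per-page-block scales could be applied to the compressed part while kv_rope stays in bf16. The dimensions and helper below are assumptions for the example only.

```python
import torch

FP8_E4M3_MAX = 448.0
NOPE_DIM, ROPE_DIM = 512, 64  # assumed MLA latent split: compressed part + RoPE part

def quantize_kv_page(kv_page: torch.Tensor):
    """Quantize one KV cache page of shape [block_size, NOPE_DIM + ROPE_DIM].

    The 512-dim compressed part gets a single per-page fp8 scale;
    the 64-dim kv_rope part is kept in bf16.
    """
    nope, rope = kv_page[:, :NOPE_DIM], kv_page[:, NOPE_DIM:]
    scale = nope.abs().max().float().clamp(min=1e-12) / FP8_E4M3_MAX
    nope_fp8 = (nope.float() / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return nope_fp8, scale, rope.to(torch.bfloat16)

page = torch.randn(64, NOPE_DIM + ROPE_DIM, dtype=torch.bfloat16)  # block_size = 64
nope_fp8, page_scale, rope_bf16 = quantize_kv_page(page)
```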
