on H20, Could you help me improve the TFLOPS and GB/s? #30

Open
robotzheng opened this issue Feb 24, 2025 · 6 comments

Comments

@robotzheng

The paper reports 580 TFLOPS and 3000 GB/s on H800. This is what I get on H20:

python tests/test_flash_mla.py
b=128, s_q=1, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
0.601 ms, 30 TFLOPS, 1012 GB/s
b=128, s_q=1, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
0.627 ms, 29 TFLOPS, 975 GB/s
b=128, s_q=2, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
0.606 ms, 60 TFLOPS, 1012 GB/s
b=128, s_q=2, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
0.629 ms, 60 TFLOPS, 1009 GB/s
b=128, s_q=3, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
0.609 ms, 90 TFLOPS, 1013 GB/s
b=128, s_q=3, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
0.644 ms, 89 TFLOPS, 1007 GB/s
b=128, s_q=4, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
0.614 ms, 119 TFLOPS, 1012 GB/s
b=128, s_q=4, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
0.617 ms, 118 TFLOPS, 1003 GB/s
b=128, s_q=5, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.168 ms, 78 TFLOPS, 536 GB/s
b=128, s_q=5, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.194 ms, 75 TFLOPS, 514 GB/s
b=128, s_q=6, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.172 ms, 93 TFLOPS, 538 GB/s
b=128, s_q=6, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.246 ms, 92 TFLOPS, 527 GB/s
b=128, s_q=7, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.175 ms, 109 TFLOPS, 541 GB/s
b=128, s_q=7, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.220 ms, 106 TFLOPS, 526 GB/s
b=128, s_q=8, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.176 ms, 124 TFLOPS, 544 GB/s
b=128, s_q=8, mean_sk=4096, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.242 ms, 121 TFLOPS, 530 GB/s
b=128, s_q=1, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
0.603 ms, 61 TFLOPS, 1017 GB/s
b=128, s_q=1, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
0.630 ms, 58 TFLOPS, 973 GB/s
b=128, s_q=2, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
0.613 ms, 119 TFLOPS, 1014 GB/s
b=128, s_q=2, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
0.659 ms, 119 TFLOPS, 1008 GB/s
b=128, s_q=3, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.171 ms, 94 TFLOPS, 539 GB/s
b=128, s_q=3, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.161 ms, 90 TFLOPS, 519 GB/s
b=128, s_q=4, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.177 ms, 124 TFLOPS, 543 GB/s
b=128, s_q=4, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.157 ms, 120 TFLOPS, 529 GB/s
b=128, s_q=5, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.753 ms, 104 TFLOPS, 370 GB/s
b=128, s_q=5, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.722 ms, 102 TFLOPS, 362 GB/s
b=128, s_q=6, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.755 ms, 125 TFLOPS, 375 GB/s
b=128, s_q=6, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.846 ms, 122 TFLOPS, 366 GB/s
b=128, s_q=7, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.370 ms, 108 TFLOPS, 281 GB/s
b=128, s_q=7, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.600 ms, 105 TFLOPS, 272 GB/s
b=128, s_q=8, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.376 ms, 123 TFLOPS, 284 GB/s
b=128, s_q=8, mean_sk=4096, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.375 ms, 121 TFLOPS, 279 GB/s
b=128, s_q=1, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
0.611 ms, 119 TFLOPS, 1017 GB/s
b=128, s_q=1, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
0.629 ms, 115 TFLOPS, 976 GB/s
b=128, s_q=2, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.177 ms, 124 TFLOPS, 544 GB/s
b=128, s_q=2, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.288 ms, 122 TFLOPS, 531 GB/s
b=128, s_q=3, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.749 ms, 125 TFLOPS, 376 GB/s
b=128, s_q=3, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.937 ms, 123 TFLOPS, 367 GB/s
b=128, s_q=4, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.380 ms, 123 TFLOPS, 284 GB/s
b=128, s_q=4, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.487 ms, 120 TFLOPS, 276 GB/s
b=128, s_q=5, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
3.013 ms, 121 TFLOPS, 230 GB/s
b=128, s_q=5, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
3.032 ms, 119 TFLOPS, 227 GB/s
b=128, s_q=6, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
3.473 ms, 126 TFLOPS, 205 GB/s
b=128, s_q=6, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
3.367 ms, 124 TFLOPS, 202 GB/s
b=128, s_q=7, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
4.074 ms, 125 TFLOPS, 179 GB/s
b=128, s_q=7, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
3.988 ms, 122 TFLOPS, 175 GB/s
b=128, s_q=8, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
4.969 ms, 118 TFLOPS, 150 GB/s
b=128, s_q=8, mean_sk=4096, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
5.007 ms, 115 TFLOPS, 148 GB/s
b=128, s_q=1, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.174 ms, 124 TFLOPS, 545 GB/s
b=128, s_q=1, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.184 ms, 121 TFLOPS, 530 GB/s
b=128, s_q=2, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.379 ms, 123 TFLOPS, 284 GB/s
b=128, s_q=2, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.494 ms, 120 TFLOPS, 278 GB/s
b=128, s_q=3, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
3.460 ms, 127 TFLOPS, 205 GB/s
b=128, s_q=3, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
3.627 ms, 124 TFLOPS, 200 GB/s
b=128, s_q=4, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
4.993 ms, 117 TFLOPS, 150 GB/s
b=128, s_q=4, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
4.649 ms, 115 TFLOPS, 149 GB/s
b=128, s_q=5, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
6.366 ms, 115 TFLOPS, 123 GB/s
b=128, s_q=5, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
6.800 ms, 114 TFLOPS, 121 GB/s
b=128, s_q=6, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
7.381 ms, 119 TFLOPS, 111 GB/s
b=128, s_q=6, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
7.548 ms, 117 TFLOPS, 109 GB/s
b=128, s_q=7, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
8.840 ms, 116 TFLOPS, 97 GB/s
b=128, s_q=7, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
9.163 ms, 114 TFLOPS, 95 GB/s
b=128, s_q=8, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
11.029 ms, 106 TFLOPS, 81 GB/s
b=128, s_q=8, mean_sk=4096, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
11.711 ms, 104 TFLOPS, 78 GB/s
b=128, s_q=1, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.148 ms, 32 TFLOPS, 1056 GB/s
b=128, s_q=1, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.217 ms, 31 TFLOPS, 1036 GB/s
b=128, s_q=2, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.154 ms, 63 TFLOPS, 1055 GB/s
b=128, s_q=2, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.126 ms, 62 TFLOPS, 1030 GB/s
b=128, s_q=3, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.161 ms, 94 TFLOPS, 1052 GB/s
b=128, s_q=3, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.174 ms, 93 TFLOPS, 1034 GB/s
b=128, s_q=4, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.166 ms, 125 TFLOPS, 1051 GB/s
b=128, s_q=4, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.226 ms, 123 TFLOPS, 1034 GB/s
b=128, s_q=5, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.263 ms, 81 TFLOPS, 544 GB/s
b=128, s_q=5, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.248 ms, 80 TFLOPS, 536 GB/s
b=128, s_q=6, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.262 ms, 97 TFLOPS, 546 GB/s
b=128, s_q=6, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.318 ms, 96 TFLOPS, 538 GB/s
b=128, s_q=7, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.267 ms, 113 TFLOPS, 547 GB/s
b=128, s_q=7, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.315 ms, 111 TFLOPS, 540 GB/s
b=128, s_q=8, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.270 ms, 129 TFLOPS, 548 GB/s
b=128, s_q=8, mean_sk=8192, h_q=16, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.130 ms, 126 TFLOPS, 539 GB/s
b=128, s_q=1, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.154 ms, 63 TFLOPS, 1055 GB/s
b=128, s_q=1, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.203 ms, 62 TFLOPS, 1037 GB/s
b=128, s_q=2, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.166 ms, 125 TFLOPS, 1051 GB/s
b=128, s_q=2, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.206 ms, 123 TFLOPS, 1032 GB/s
b=128, s_q=3, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.262 ms, 97 TFLOPS, 546 GB/s
b=128, s_q=3, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.252 ms, 95 TFLOPS, 537 GB/s
b=128, s_q=4, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.276 ms, 128 TFLOPS, 546 GB/s
b=128, s_q=4, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.330 ms, 127 TFLOPS, 539 GB/s
b=128, s_q=5, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
3.393 ms, 108 TFLOPS, 369 GB/s
b=128, s_q=5, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
3.596 ms, 106 TFLOPS, 365 GB/s
b=128, s_q=6, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
3.399 ms, 129 TFLOPS, 371 GB/s
b=128, s_q=6, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
3.495 ms, 127 TFLOPS, 364 GB/s
b=128, s_q=7, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
4.634 ms, 110 TFLOPS, 274 GB/s
b=128, s_q=7, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
4.707 ms, 109 TFLOPS, 271 GB/s
b=128, s_q=8, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=False
4.628 ms, 126 TFLOPS, 276 GB/s
b=128, s_q=8, mean_sk=8192, h_q=32, h_kv=1, d=576, dv=512, causal=True, varlen=True
4.695 ms, 125 TFLOPS, 274 GB/s
b=128, s_q=1, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
1.163 ms, 126 TFLOPS, 1054 GB/s
b=128, s_q=1, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
1.219 ms, 123 TFLOPS, 1036 GB/s
b=128, s_q=2, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.272 ms, 129 TFLOPS, 547 GB/s
b=128, s_q=2, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.464 ms, 127 TFLOPS, 540 GB/s
b=128, s_q=3, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
3.407 ms, 129 TFLOPS, 370 GB/s
b=128, s_q=3, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
3.433 ms, 128 TFLOPS, 367 GB/s
b=128, s_q=4, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
4.641 ms, 126 TFLOPS, 276 GB/s
b=128, s_q=4, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
4.766 ms, 125 TFLOPS, 272 GB/s
b=128, s_q=5, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
5.852 ms, 125 TFLOPS, 222 GB/s
b=128, s_q=5, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
6.502 ms, 124 TFLOPS, 218 GB/s
b=128, s_q=6, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
6.748 ms, 130 TFLOPS, 195 GB/s
b=128, s_q=6, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
7.136 ms, 129 TFLOPS, 192 GB/s
b=128, s_q=7, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
7.965 ms, 128 TFLOPS, 167 GB/s
b=128, s_q=7, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
8.291 ms, 127 TFLOPS, 165 GB/s
b=128, s_q=8, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=False
9.711 ms, 120 TFLOPS, 139 GB/s
b=128, s_q=8, mean_sk=8192, h_q=64, h_kv=1, d=576, dv=512, causal=True, varlen=True
10.579 ms, 119 TFLOPS, 137 GB/s
b=128, s_q=1, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
2.279 ms, 128 TFLOPS, 546 GB/s
b=128, s_q=1, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
2.314 ms, 126 TFLOPS, 537 GB/s
b=128, s_q=2, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
4.665 ms, 125 TFLOPS, 274 GB/s
b=128, s_q=2, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
4.627 ms, 124 TFLOPS, 271 GB/s
b=128, s_q=3, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
6.804 ms, 129 TFLOPS, 193 GB/s
b=128, s_q=3, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
7.320 ms, 127 TFLOPS, 190 GB/s
b=128, s_q=4, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
9.788 ms, 119 TFLOPS, 138 GB/s
b=128, s_q=4, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
9.200 ms, 118 TFLOPS, 138 GB/s
b=128, s_q=5, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
12.573 ms, 116 TFLOPS, 110 GB/s
b=128, s_q=5, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
11.811 ms, 115 TFLOPS, 110 GB/s
b=128, s_q=6, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
14.703 ms, 119 TFLOPS, 97 GB/s
b=128, s_q=6, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
15.041 ms, 118 TFLOPS, 96 GB/s
b=128, s_q=7, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
17.604 ms, 116 TFLOPS, 83 GB/s
b=128, s_q=7, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
17.742 ms, 115 TFLOPS, 82 GB/s
b=128, s_q=8, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=False
21.947 ms, 106 TFLOPS, 68 GB/s
b=128, s_q=8, mean_sk=8192, h_q=128, h_kv=1, d=576, dv=512, causal=True, varlen=True
21.731 ms, 105 TFLOPS, 68 GB/s
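
For anyone reading the log, the TFLOPS and GB/s columns can be reproduced from the shape parameters and the measured time. The sketch below is a reconstruction that matches the logged numbers, not a copy of the test script. The function name `mla_decode_metrics` is made up for illustration, and 2 bytes per element (bf16/fp16) is an assumption.

```python
def mla_decode_metrics(b, s_q, mean_sk, h_q, h_kv, d, dv, time_ms, bytes_per_elem=2):
    """Return (TFLOPS, GB/s) for one benchmark row.

    Assumes every sequence has length mean_sk and 2-byte elements.
    """
    total_sk = b * mean_sk  # total cached tokens across the batch
    # Q @ K^T costs 2*d FLOPs per (query row, key) pair, P @ V costs 2*dv.
    flops = 2 * s_q * total_sk * h_q * (d + dv)
    # Elements moved: the KV cache once, Q in, and the output out.
    elems = total_sk * h_kv * d + b * s_q * h_q * (d + dv)
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12, elems * bytes_per_elem / seconds / 1e9


# First row of the log: b=128, s_q=1, mean_sk=4096, h_q=16, 0.601 ms
tflops, gbps = mla_decode_metrics(128, 1, 4096, 16, 1, 576, 512, 0.601)
print(f"{tflops:.0f} TFLOPS, {gbps:.0f} GB/s")  # prints: 30 TFLOPS, 1012 GB/s
```

The last row (s_q=8, h_q=128, mean_sk=4096, 11.029 ms) comes out at 106 TFLOPS and 81 GB/s with the same formula, so the two columns are two views of one measured time rather than independent measurements.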

@mmdbhs

mmdbhs commented Feb 24, 2025

The H20 peaks at 148 TFLOPS.
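
To put the log next to that peak: the latencies above appear to step up each time `s_q * h_q` crosses a multiple of 64 (about 0.6 ms up to 64 rows, about 1.17 ms up to 128, and so on). One possible reading, and it is only an assumption inferred from this log, is that the kernel works on query rows in tiles of 64 and pays for a full tile even when it is partly empty. Under that assumption the rows reporting 30 TFLOPS are already doing roughly 120 TFLOPS of padded work. The helper `padded_tflops` and the constant `ASSUMED_TILE_ROWS` are hypothetical names for this sketch.

```python
import math

H20_PEAK_TFLOPS = 148   # peak quoted in this thread
ASSUMED_TILE_ROWS = 64  # assumption inferred from the latency steps in the log


def padded_tflops(reported_tflops, s_q, h_q, tile_rows=ASSUMED_TILE_ROWS):
    """Scale reported TFLOPS up to the work done if rows are padded to full tiles."""
    rows = s_q * h_q
    padded_rows = math.ceil(rows / tile_rows) * tile_rows
    return reported_tflops * padded_rows / rows


# (s_q, h_q, reported TFLOPS) taken from rows of the log above
for s_q, h_q, reported in [(1, 16, 30), (5, 16, 78), (1, 64, 119), (6, 64, 130)]:
    effective = padded_tflops(reported, s_q, h_q)
    share = effective / H20_PEAK_TFLOPS
    print(f"s_q={s_q}, h_q={h_q}: {effective:.1f} padded TFLOPS, {share:.0%} of peak")
```

With these four rows the padded figure lands between about 119 and 130 TFLOPS, which is roughly 80 to 88 percent of 148. If the tiling assumption holds, the benchmark is close to compute-bound on H20 in every configuration shown.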

@robotzheng
Author

Maybe with FP8 the peak would be 296 TFLOPS? And could the bandwidth reach 4 TB/s, given that the H20's memory bandwidth is 4.0 TB/s?
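
A simple roofline estimate shows when each limit would apply. This is a minimal sketch that uses only the two peaks quoted in this thread (148 TFLOPS and 4.0 TB/s) and assumes 2-byte elements. It is an idealized upper bound that ignores any tiling or padding inside the kernel, and the function names are made up for illustration.

```python
def arithmetic_intensity(b, s_q, mean_sk, h_q, h_kv=1, d=576, dv=512, bytes_per_elem=2):
    """FLOPs per byte moved for one benchmark shape (assumes 2-byte elements)."""
    total_sk = b * mean_sk
    flops = 2 * s_q * total_sk * h_q * (d + dv)
    nbytes = (total_sk * h_kv * d + b * s_q * h_q * (d + dv)) * bytes_per_elem
    return flops / nbytes


def roofline_tflops(intensity, peak_tflops=148.0, peak_tb_per_s=4.0):
    """Attainable TFLOPS: the lower of the compute roof and the bandwidth roof."""
    # FLOP/byte times TB/s gives TFLOPS directly.
    return min(peak_tflops, intensity * peak_tb_per_s)


ridge = 148.0 / 4.0  # 37 FLOP/byte: below this a kernel is bandwidth-limited

for s_q, h_q in [(1, 16), (1, 64), (1, 128)]:
    ai = arithmetic_intensity(128, s_q, 4096, h_q)
    bound = roofline_tflops(ai)
    kind = "memory-bound" if ai < ridge else "compute-bound"
    print(f"s_q={s_q}, h_q={h_q}: {ai:.0f} FLOP/byte, {kind}, at most {bound:.0f} TFLOPS")
```

In this model only the smallest shape (s_q=1, h_q=16, about 30 FLOP/byte) sits below the 37 FLOP/byte ridge, so it is the only one where bandwidth near 4 TB/s is even possible in principle. The larger shapes hit the 148 TFLOPS roof first, and their GB/s figure has to fall as a result.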

@JoeyYoung

Maybe with FP8 the peak would be 296 TFLOPS? And could the bandwidth reach 4 TB/s, given that the H20's memory bandwidth is 4.0 TB/s?

I thought FlashMLA doesn't support FP8 yet?

@beginlner
Collaborator

No plan to optimize on H20 yet :(

@bmkor

bmkor commented Feb 26, 2025

No plan to optimize on H20 yet :(

Any chance? The H20 is the only version we can still buy 😭. Sanctions are coming... ☠️ I am eager to learn and help if I am able to.

@shenRugu

+1

No plan to optimize on H20 yet :(

Any chance? The H20 is the only version we can still buy 😭. Sanctions are coming... ☠️ I am eager to learn and help if I am able to.
