Are the test performance results of tests/test_flash_mla.py accurate? #59

pipul · 2025-03-03T03:10:53Z

    def flash_mla():
        torch.cuda.synchronize()  # the sync I added for this experiment
        tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

I added a `torch.cuda.synchronize()` call at the top of `flash_mla()` and the measured performance got much worse: about 360us with the sync, versus only 50us without it.

This makes it look as though the reported time is only the CPU-side time. Is the kernel submitted asynchronously, so that it has not finished executing when the measurement stops?
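To make the question concrete, here is a minimal sketch of a host-side timing helper that cannot be fooled by asynchronous launches: it drains pending work before starting the clock and again before stopping it. The helper name `time_call` and its parameters are hypothetical and are not taken from tests/test_flash_mla.py; the `sync` argument is assumed to be whatever blocks until queued work has finished.

```python
import time


def time_call(fn, sync=lambda: None, warmup=3, iters=20):
    """Return the mean wall-clock seconds per call of fn.

    sync is a callable that blocks until all previously launched
    asynchronous work has completed (a no-op by default).
    """
    for _ in range(warmup):
        fn()
    sync()  # drain pending work so it is not charged to the timed region
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for the launched work to finish before stopping the clock
    return (time.perf_counter() - start) / iters
```

With PyTorch this would presumably be called as `time_call(flash_mla, sync=torch.cuda.synchronize)`. Because the sync runs once outside the loop rather than inside every call, the result includes kernel execution time without adding a per-iteration host round trip. Comparing that number against the 50us and 360us figures should show which of the two is closer to the real kernel time.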

pipul changed the title ~~python tests/test_flash_mla.py 测试性能结果准确吗？~~ Are the test performance results of tests/test_flash_mla.py accurate? Mar 3, 2025
