perf(store): Cache output stride parameters in registers to reduce global loads #6

garethpaul · 2025-02-24T02:47:22Z

Summary

This PR optimizes the store() function in the FlashMLA kernel by caching the frequently used output stride parameters (o_batch_stride, o_row_stride, and o_head_stride) in local registers. Each thread reads these values once with the __ldg intrinsic, which avoids repeated global memory accesses and may reduce memory latency. The change is intended to improve kernel performance without affecting functionality.

Key Changes

Adds local caching for params.o_batch_stride, params.o_row_stride, and params.o_head_stride in the store() function.
Uses the __ldg intrinsic to hint to the compiler that these values are read-only.
Leaves the remainder of the function unchanged, so its behavior is preserved.
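
The caching pattern described above can be illustrated with a minimal host-side C++ sketch. It is not the code from this PR: the `OutputParams` struct, the `store_rows` function, and the flat `std::vector<float>` output are hypothetical stand-ins, and a plain load takes the place of `__ldg`, which is only available in device code. The sketch only shows the idea of reading each stride once into a local variable and reusing it for every offset computation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the kernel parameter struct; only the three
// stride fields named in this PR are modelled.
struct OutputParams {
    std::int64_t o_batch_stride;
    std::int64_t o_row_stride;
    std::int64_t o_head_stride;
};

// Writes `value` to one element per row for a given (batch, head) pair.
// Hypothetical helper, not the real store() function.
inline void store_rows(std::vector<float>& out, const OutputParams& params,
                       int batch, int head, int num_rows, float value) {
    // Read each stride once into a local copy. In device code the PR
    // performs this read with __ldg; a plain load is used here instead.
    const std::int64_t batch_stride = params.o_batch_stride;
    const std::int64_t row_stride = params.o_row_stride;
    const std::int64_t head_stride = params.o_head_stride;

    // The (batch, head) part of the offset does not depend on the row,
    // so it is computed once outside the loop.
    const std::int64_t base = batch * batch_stride + head * head_stride;
    for (int row = 0; row < num_rows; ++row) {
        out[static_cast<std::size_t>(base + row * row_stride)] = value;
    }
}
```

Whether this manual caching changes the generated code depends on the compiler, which may already keep such values in registers on its own.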

Impact

This change is a targeted performance optimization: it reduces redundant loads from global memory and improves efficiency without altering correctness.

beginlner · 2025-03-01T10:05:58Z

I don't think these changes make any difference to performance. If they did, we should handle the other attributes of params in the same way. Could you please add a benchmark comparison?
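
One way such a comparison could be produced is sketched below, as a generic host-side timing helper rather than an existing FlashMLA benchmark. The name `median_time_us` is hypothetical. For a GPU kernel, the callable would have to launch the kernel and wait for it to finish (for example with `cudaDeviceSynchronize`), otherwise only the launch overhead is measured.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `fn` `iterations` times and returns the median wall-clock time of a
// single call in microseconds. Returns 0.0 when `iterations` is not positive.
// Hypothetical helper for comparing two variants of the same routine.
template <typename Fn>
double median_time_us(Fn&& fn, int iterations) {
    if (iterations <= 0) {
        return 0.0;
    }
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    // The median is less sensitive to occasional slow runs than the mean.
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

Calling the helper once with the baseline and once with the patched variant, on the same inputs and after a few warm-up runs, would give the two numbers to compare.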

garethpaul added 5 commits February 23, 2025 18:23

Stage accumulator fragment to shared memory using tiled copy

9f361aa

Stage accumulator fragment to shared memory using tiled copy

5fb94d6

Cache output stride parameters in registers to reduce global loads

ccb208b

Cache output stride parameters in registers to reduce global loads

46bafd9

implement the index

33e110b

sijiac approved these changes Feb 24, 2025

mhsbz approved these changes Feb 24, 2025

sanggusti approved these changes Feb 25, 2025
