Good question. We don't currently use that because that instruction requires that the source data be in shared memory, but the current stream-K and split-K implementations in CUTLASS do not stage partial accumulations in shared memory before reducing them in global memory.
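To make the distinction above concrete, here is a minimal host-side Python sketch of the split-K pattern being discussed (not CUTLASS code): the K dimension of C = A x B is partitioned across splits, each split produces a partial accumulator, and the partials are then reduced in a separate step. In the GPU implementations referenced above, that staging and reduction goes through global memory, which is why an instruction requiring a shared-memory source does not directly apply. The function name and structure here are illustrative only.

```python
def matmul_split_k(A, B, splits):
    """Compute C = A @ B by splitting the K dimension into `splits`
    partial accumulations, then reducing the partials elementwise.
    Pure-Python sketch of the split-K idea; A, B are lists of lists."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    # Each split owns a contiguous slice [k0, k1) of the K dimension.
    bounds = [round(s * K / splits) for s in range(splits + 1)]
    partials = []
    for s in range(splits):
        k0, k1 = bounds[s], bounds[s + 1]
        # Partial accumulator for this split (its slice of K only).
        P = [[sum(A[i][k] * B[k][j] for k in range(k0, k1))
              for j in range(N)] for i in range(M)]
        partials.append(P)
    # Reduction step: sum the partial accumulators elementwise.
    # This is the stage that split-K / stream-K kernels perform
    # through global memory in the implementations discussed above.
    return [[sum(P[i][j] for P in partials) for j in range(N)]
            for i in range(M)]
```

For example, `matmul_split_k([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2)` gives the same result as an ordinary matrix multiply, regardless of how many splits are used.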
-
I was just hopelessly scrolling the PTX ISA 8.3 page and found a section on the cp.reduce.async.bulk instructions. From my very amateur look, these instructions seem useful for many Split-K / Stream-K ideas, such as GEMM, FMHA, etc.
Are there plans to support them in future versions of CuTe / CUTLASS?
Disclaimer: it's totally possible that I am embarrassing myself and these are already in CUTLASS. If that's the case, please educate me 😄.
Thanks!