CUTLASS Ex 77 FMHA Softmax Instruction Interleaving (FMA, EX2, Dtype conversion) #2593
manishucsd
started this conversation in
General
Replies: 1 comment
The idea is that we want only one softmax warpgroup (wg) to execute at a time, but we also want to minimize the gap between the two executing across the barrier. Barrier synchronization has latency, so the arrive will not immediately release the wait. As such, we try to pull the arrive forward so it happens a little bit earlier, i.e. the 10 (kReleasePipeCount) is an attempt at tuning it so that the gap is minimal. Note that in practice synchronizing code like this is tricky, since ptxas often moves instructions around, so the value is somewhat empirical.
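To make the trade-off concrete, here is a toy host-side cost model in C++. It is only a sketch, not CUTLASS code: the names `Handoff` and `handoff`, and the idea of a fixed cost per loop step and a fixed barrier latency, are assumptions made for illustration. It shows why issuing the arrive too late leaves a gap where neither warpgroup runs softmax, while issuing it too early lets both run at once.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical cost model (not from CUTLASS). Times are in arbitrary cycles.
struct Handoff {
    int gap;      // cycles during which neither warpgroup runs softmax
    int overlap;  // cycles during which both warpgroups run softmax
};

// steps_early:     how many loop steps before the end the arrive is issued
//                  (plays the role of kReleasePipeCount; kept even, as the
//                  original comment requires a multiple of 2)
// cycles_per_step: assumed cost of one interleaved loop step
// barrier_latency: assumed delay between the arrive and the waiter resuming
Handoff handoff(int steps_early, int cycles_per_step, int barrier_latency) {
    assert(steps_early >= 0 && steps_early % 2 == 0);
    // Work the releasing warpgroup still has to do after the arrive.
    int remaining = steps_early * cycles_per_step;
    // If the latency is longer than the remaining work, nobody runs for a while.
    int gap = std::max(0, barrier_latency - remaining);
    // If the remaining work is longer than the latency, both run for a while.
    int overlap = std::max(0, remaining - barrier_latency);
    return Handoff{gap, overlap};
}
```

With made-up numbers such as 4 cycles per step and a 40-cycle latency, `handoff(0, 4, 40)` gives a gap of 40, `handoff(10, 4, 40)` gives neither gap nor overlap, and `handoff(16, 4, 40)` gives an overlap of 24. Those inputs were picked so that the crossover lands on 10; the real latency and step cost are not stated in this thread, and since ptxas may reorder instructions, the real value has to be found by measurement.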
Hi @v0i0, can you please explain in detail what is going on in the code below? It is from here.
Specifically, the magic number 6 and const int kReleasePipeCount = 10; // must be multiple of 2. The code sequence interleaves FFMA, 2xexp2 and 2xF32-to-2xB16, but I don't understand why order_s.arrive() is issued based on the magic number kReleasePipeCount.