Why does NamedBarrier in epilogue use NumMmaThreads(256) + NumThreadsPerWarp(32)? #1389

ziyuhuang123 · 2024-12-16T10:11:50Z

In the NamedBarrier call here, the thread count passed is NumMmaThreads (256) + NumThreadsPerWarp (32).

I searched for FwdNamedBarriers::ValueEmpty and found it only in the epilogue's store function. The value 32 seems related to the producer's 32 threads, but I couldn't locate any explicit use of it in the producer. Could someone clarify the rationale behind this?

tridao · 2024-12-16T18:11:33Z

The last warp then syncs on that barrier here, which is why there's an extra 32:

flash-attention/hopper/epilogue_fwd_sm90_tma.hpp, line 259 in 0dfb281:

cutlass::arch::NamedBarrier::sync(

ziyuhuang123 · 2024-12-17T10:13:33Z

Ah... I still don't quite understand. My current understanding of the named barrier mechanism is that sync makes threads wait, while arrive only signals arrival, and the barrier proceeds only once the required number of threads have arrived. In this case, does that mean 256+32 threads must arrive before the barrier can proceed? (The consumer has only 256 threads, so reaching 256+32 seems impossible.)

You mentioned a sync being executed by a single warp. I can see how there could be several separate arrive calls, but how can there be a separate sync call from just one warp?
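The counting question above can be checked with a small host-side model. This is only a sketch, not the flash-attention code. It assumes that all 256 MMA threads call arrive on the barrier and that the last MMA warp then also calls sync on it, so that warp's 32 threads are counted twice (256 + 32 = 288). `NamedBarrierModel` and `last_warp_released` are hypothetical names invented for this illustration, and the model is sequential: it tracks only the arrival count and does not run real threads.

```cpp
#include <cassert>

// Values taken from the issue title.
constexpr int kNumMmaThreads = 256;
constexpr int kNumThreadsPerWarp = 32;

// Hypothetical host-side model of a named barrier's arrival counter.
// It is NOT the CUTLASS class; it only mimics the counting rule:
// the barrier completes once `expected` threads have been counted,
// whether they were counted through arrive or through sync.
struct NamedBarrierModel {
    int expected;              // thread count passed to arrive/sync
    int arrived = 0;           // threads counted in the current phase
    int phases_completed = 0;  // how many times the barrier has completed

    explicit NamedBarrierModel(int expected_threads) : expected(expected_threads) {}

    // arrive: count the warp's threads, never wait.
    void arrive(int num_threads) { add(num_threads); }

    // sync: count the warp's threads; return true if this call completed
    // the barrier (the warp would be released), false if the warp would
    // still be waiting for more threads.
    bool sync(int num_threads) {
        const int before = phases_completed;
        add(num_threads);
        return phases_completed > before;
    }

private:
    void add(int n) {
        arrived += n;
        assert(arrived <= expected);  // over-subscription is a usage error
        if (arrived == expected) {    // barrier completes and resets
            arrived = 0;
            ++phases_completed;
        }
    }
};

// Assumed pattern: every MMA warp arrives, then the last MMA warp
// also syncs. Returns whether that final sync is released.
bool last_warp_released(int barrier_count) {
    NamedBarrierModel bar(barrier_count);
    const int num_mma_warps = kNumMmaThreads / kNumThreadsPerWarp;  // 8 warps
    for (int warp = 0; warp < num_mma_warps; ++warp) {
        bar.arrive(kNumThreadsPerWarp);   // 8 * 32 = 256 threads counted
    }
    return bar.sync(kNumThreadsPerWarp);  // last warp counted again: +32
}
```

Under these assumptions, `last_warp_released(kNumMmaThreads + kNumThreadsPerWarp)` is true: the 256 arrivals plus the 32 threads of the syncing warp add up to exactly 288. With a count of only 256, the arrivals alone would complete the barrier, and the later sync would begin a new phase with 32 of 256 threads and keep waiting. So the count does not need 288 distinct threads, only 288 counted arrive/sync participations.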
