Why does NamedBarrier in epilogue use NumMmaThreads(256) + NumThreadsPerWarp(32)? #1389

Open
ziyuhuang123 opened this issue Dec 16, 2024 · 2 comments

@ziyuhuang123

In the NamedBarrier implementation here, the number passed is NumMmaThreads(256) + NumThreadsPerWarp(32).

I searched for FwdNamedBarriers::ValueEmpty and found it only in the epilogue's store function. The value 32 seems related to the producer's 32 threads, but I couldn't locate any explicit use of it in the producer. Could someone clarify the rationale behind this?

@tridao
Contributor

tridao commented Dec 16, 2024

The last warp then syncs on that barrier here, which is why there's an extra 32:

cutlass::arch::NamedBarrier::sync(

@ziyuhuang123
Author

I still don't quite understand. My current understanding of the named barrier mechanism is that sync makes threads wait, while arrive only signals arrival, and the barrier proceeds only once the required number of threads have arrived. Does that mean 256+32 threads must arrive before the barrier can proceed? (The consumer only has 256 threads, so reaching 256+32 seems impossible.)

You mentioned that a single warp executes a sync. I can see how there could be multiple separate arrive calls, but how could there be separate sync calls?
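One way to read the reply above is as plain counting. The following is a minimal host-side sketch, not the CUTLASS implementation: `NamedBarrierModel` and `run_epilogue_model` are hypothetical names introduced here. It assumes that all 8 MMA warps first call arrive on the barrier and that the last warp then also calls sync on the same barrier, which is my reading of the reply rather than something confirmed in this thread. It also relies on the PTX rule that both `bar.arrive` and `bar.sync` add a full warp (32) to the barrier's arrival count. Under those assumptions the last warp is counted twice, so 256 + 32 is reachable with only 256 threads.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model of a hardware named barrier. This is NOT
// cutlass::arch::NamedBarrier; it only mimics the arrival-count rule.
constexpr int kNumThreadsPerWarp = 32;
constexpr int kNumMmaThreads = 256;

class NamedBarrierModel {
public:
    explicit NamedBarrierModel(int expected_threads) : expected_(expected_threads) {}

    // arrive: the warp is counted and keeps running (non-blocking).
    void arrive(int /*warp_idx*/) { count_warp(); }

    // sync: the warp is counted as well, but it is parked until the
    // expected count is reached.
    void sync(int warp_idx) {
        waiting_.push_back(warp_idx);
        count_warp();
    }

    int arrived() const { return arrived_; }
    const std::vector<int>& waiting() const { return waiting_; }
    const std::vector<int>& released() const { return released_; }

private:
    // Every participating warp adds a full warp (32 threads) to the count.
    void count_warp() {
        arrived_ += kNumThreadsPerWarp;
        if (arrived_ == expected_) {  // barrier completes: wake waiters, reset
            released_ = waiting_;
            waiting_.clear();
            arrived_ = 0;
        }
    }

    int expected_;
    int arrived_ = 0;
    std::vector<int> waiting_;
    std::vector<int> released_;
};

// Assumed epilogue pattern: all MMA warps arrive, then the last warp also syncs.
NamedBarrierModel run_epilogue_model() {
    NamedBarrierModel barrier(kNumMmaThreads + kNumThreadsPerWarp);  // expects 288
    const int num_mma_warps = kNumMmaThreads / kNumThreadsPerWarp;   // 8 warps
    for (int warp = 0; warp < num_mma_warps; ++warp) {
        barrier.arrive(warp);  // 8 * 32 = 256 counted, nobody is waiting yet
    }
    assert(barrier.arrived() == 256);
    barrier.sync(num_mma_warps - 1);  // +32 brings the count to 288
    return barrier;
}
```

In this model the 8 arrive calls leave the count at 256, and the single sync from the last warp supplies the remaining 32 and is released immediately. If the barrier were created with only 256, it would already complete after the arrive calls, and the later sync would start a new phase that nothing else completes.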
