Fix TMEM address read/write race in example 77 #2835
Draft
+20
−20
The MMA warp category writes the TMEM address to shared memory, and that write becomes visible to the epilogue/correction warp categories only implicitly, through the intermediate barriers between the warps that synchronize the MMA output. In the 0 KV tile case, however, there is no such barrier with the epilogue, so the write might not be visible. The epilogue warp can then read undefined data from shared memory for the TMEM address, and the `tcgen05::dealloc` may fail.

This patch attempts to fix the issue by allocating TMEM only when a persistent CTA processes at least one KV tile. If there is at least one KV tile, the TMEM address should be correctly synchronized, even if the final work tile processed has 0 KV tiles.
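To make the control flow concrete, here is a minimal host-side C++ model of one persistent CTA. It is not the example 77 kernel and makes no `tcgen05` calls: `CtaTrace`, `run_persistent_cta`, and the `guard_alloc` flag are hypothetical names introduced for illustration. The model also assumes that the allocating and deallocating warps can both evaluate the same "has any KV tile" predicate, which the actual patch may implement differently.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical trace of what happens to the TMEM allocation in one persistent CTA.
struct CtaTrace {
    bool allocated = false;               // MMA warp allocated TMEM and wrote the address to smem
    bool address_synchronized = false;    // an MMA->epilogue barrier ordered that smem write
    bool deallocated = false;             // epilogue deallocated with a valid address
    bool dealloc_read_undefined = false;  // epilogue read the smem slot without synchronization
};

// kv_tiles[i] is the number of KV tiles in the i-th work tile this CTA processes.
// guard_alloc == false models the original behaviour (always allocate);
// guard_alloc == true models the patched behaviour (allocate only if some tile has KV work).
CtaTrace run_persistent_cta(const std::vector<int>& kv_tiles, bool guard_alloc) {
    CtaTrace trace;
    const bool has_kv_work =
        std::any_of(kv_tiles.begin(), kv_tiles.end(), [](int n) { return n > 0; });

    trace.allocated = guard_alloc ? has_kv_work : true;

    // Any work tile with at least one KV tile passes through the barrier that
    // synchronizes the MMA output, which also orders the TMEM address write.
    for (int n : kv_tiles) {
        if (n > 0 && trace.allocated) {
            trace.address_synchronized = true;
        }
    }

    // Epilogue side: nothing to free if nothing was allocated.
    if (!trace.allocated) {
        return trace;
    }
    if (trace.address_synchronized) {
        trace.deallocated = true;
    } else {
        trace.dealloc_read_undefined = true;  // the race described above
    }
    return trace;
}
```

In this model, a CTA whose work tiles all have 0 KV tiles hits the unsynchronized read when `guard_alloc` is false and skips both allocation and deallocation when it is true. A CTA that sees `{3, 0}` deallocates with a synchronized address even though its final work tile has 0 KV tiles.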
Another potential solution is to perform the `dealloc` in the MMA warp category and add extra synchronization from the TMEM consumers signaling that their usage is complete.

P.S.
PTX ISA Manual states that
which is ambiguously phrased in my opinion, as it could mean either that the same warp must perform both the `alloc` and the `dealloc` (which is not the case here), or that different single warps can perform the `alloc` and the `dealloc` for a given TMEM allocation.