The A and B layouts are projections of the full thread layout, which is difficult to depict in these diagrams. T64 is "missing" from the A layout because it reads the same values that T0 reads in A. Likewise, T32 is "missing" from the B layout because it reads the same values that T0 reads in B. Your understanding is correct: all threads hold parts of the data of matrices A, B, and C, but that data may be replicated across multiple threads.
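To make the projection concrete, here is a minimal sketch in plain C++ (it does not use CuTe itself). It assumes the thread layout from the question, (_32,_2,_2,_1):(_1,_32,_64,_0), and assumes that A is indexed only by the V, M, K modes and B only by the V, N, K modes. The helper names `decompose`, `a_diagram_thread`, and `b_diagram_thread` are made up for illustration. Each function returns the thread that the A or B diagram would show holding the same values as the given thread.

```cpp
// Hypothetical helpers, not CuTe API. Thread layout assumed from the question:
// shape (V,M,N,K) = (32,2,2,1), strides (1,32,64,0), so tid = v + 32*m + 64*n.
struct Coord {
    int v;  // lane within the 32-thread MMA atom
    int m;  // which atom along M
    int n;  // which atom along N
};

// Invert tid = v + 32*m + 64*n for tid in [0, 128).
constexpr Coord decompose(int tid) {
    return Coord{tid % 32, (tid / 32) % 2, (tid / 64) % 2};
}

// A does not depend on the N mode, so drop n: the result lies in 0..63.
constexpr int a_diagram_thread(int tid) {
    Coord c = decompose(tid);
    return c.v + 32 * c.m;
}

// B does not depend on the M mode, so drop m: the result lies in 0..31 or 64..95.
constexpr int b_diagram_thread(int tid) {
    Coord c = decompose(tid);
    return c.v + 64 * c.n;
}

// Compile-time checks of the two cases mentioned above.
static_assert(a_diagram_thread(64) == 0, "T64 reads the same A values as T0");
static_assert(b_diagram_thread(32) == 0, "T32 reads the same B values as T0");
```

Under these assumptions, every thread in 64..127 maps onto some thread in 0..63 for A, and every thread in 32..63 or 96..127 maps onto some thread in 0..31 or 64..95 for B, which matches the thread ranges visible in the diagrams.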
I have read all of the CuTe documentation, and I have always been puzzled by the TiledMMA thread layout ThrLayoutVMNK (_32,_2,_2,_1):(_1,_32,_64,_0). When I print it with print_latex, I see that the data of matrix A is distributed among threads 0-31 and 32-63. Does this mean that the two warps with thread indices 64-127 do not hold any data of matrix A? Similarly, matrix B is distributed among the threads of only two warps (0-31 and 64-95), whereas the data of matrix C is distributed across all four warps (0-127). My current understanding is that all threads hold parts of the data of matrices A, B, and C, and that print_latex simply cannot show all of them. I would be very grateful if someone could answer this!
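For reference, this is how I read the layout, written as a small plain C++ sketch rather than real CuTe code. The names `kStride` and `thread_index` are my own, and the interpretation (thread index is the dot product of the (V, M, N, K) coordinate with the strides) is my assumption about what the shape:stride notation means here.

```cpp
// Hypothetical sketch, not CuTe API. Strides of the (V, M, N, K) modes taken
// from ThrLayoutVMNK (_32,_2,_2,_1):(_1,_32,_64,_0); the shape is (32, 2, 2, 1).
constexpr int kStride[4] = {1, 32, 64, 0};

// Thread index for a coordinate (v, m, n, k), computed as coordinate . stride.
// The K mode has extent 1 and stride 0, so it never changes the result.
constexpr int thread_index(int v, int m, int n, int k) {
    return v * kStride[0] + m * kStride[1] + n * kStride[2] + k * kStride[3];
}

static_assert(thread_index(0, 1, 0, 0) == 32, "second atom along M starts at T32");
static_assert(thread_index(0, 0, 1, 0) == 64, "second atom along N starts at T64");
static_assert(thread_index(31, 1, 1, 0) == 127, "last of the 128 threads");
```

Read this way, the M coordinate selects between threads 0-31 and 32-63, and the N coordinate adds an offset of 64, which is why I expected four warps in total.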
The LaTeX output is as follows:
