-
It should be pretty straightforward to modify the CuTe tutorials to do this. In particular, you want this very common sequence to partition tensors, copy trivially, and apply MMA:
// TUTORIAL: Example of partitioning via a TiledMMA
TiledMMA mma = make_tiled_mma(SM70_8x8x4_F32F16F16F32_TN{});
ThrMMA thr_mma = mma.get_slice(threadIdx.x);
Tensor tCgA = thr_mma.partition_A(gA); // (MMA,MMA_M,MMA_K)
Tensor tCgB = thr_mma.partition_B(gB); // (MMA,MMA_N,MMA_K)
Tensor tCgC = thr_mma.partition_C(gC); // (MMA,MMA_M,MMA_N)
// Allocate the register fragments (A, B, and the accumulator C) -- same size as the partitioned data
Tensor tCrA = thr_mma.make_fragment_A(tCgA); // (MMA,MMA_M,MMA_K)
Tensor tCrB = thr_mma.make_fragment_B(tCgB); // (MMA,MMA_N,MMA_K)
Tensor tCrC = thr_mma.make_fragment_C(tCgC); // (MMA,MMA_M,MMA_N)
...
copy(tCgA, tCrA);
copy(tCgB, tCrB);
clear(tCrC);
gemm(mma, tCrA, tCrB, tCrC); // operate on the register fragments filled by the copies above
copy(tCrC, tCgC);
Just construct a |
-
I am trying to run some experiments that look at the result of a single PTX mma instruction.
What I really want is to manually invoke the particular PTX instruction for a specific architecture (e.g., Volta's mma.sync.m8n8k4 with the proper types).
I have been looking through the CuTe documentation, but the tutorials are too general for what I am trying to accomplish.
That is, I need one kernel to which I provide matrices A, B, C, and D, each with exactly the shape expected by one PTX mma instruction. For SM70's 8x8x4 instruction, A is 8x4, B is 4x8, and both C and D are 8x8.
I know that I can manually invoke the PTX using the MMA atom's call method. However, I also know that this requires each thread to provide the registers.
I know that I can work out by hand, from the PTX spec, which values each thread needs for each tensor, but I'm pretty sure that CuTe gives me the tools to do this. And since I'll need to do this for each different PTX instruction on each generation, it feels cumbersome to write out this register mapping manually, particularly given that CUTLASS already does it.
I've read through https://docs.nvidia.com/cutlass/media/docs/cpp/cute/index.html, and yet I cannot wrap my head around how to do this.
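For checking whatever such a kernel writes back, a host-side reference for the 8x8x4 shapes described above may be useful. The following is only a minimal sketch under stated assumptions: the names `reference_mma_8x8x4` and `max_abs_diff` are hypothetical and not part of CuTe or CUTLASS, all matrices are assumed to be row-major, and plain `float` is used for A and B even though the device instruction takes half-precision inputs, so small rounding differences against a GPU result are to be expected.

```cpp
#include <cassert>
#include <cmath>

// Shape of a single SM70 mma.sync.m8n8k4 operation, as described in the post.
constexpr int kM = 8;
constexpr int kN = 8;
constexpr int kK = 4;

// Hypothetical host reference (not a CuTe API): D = A * B + C.
// Assumed layouts: A is kM x kK, B is kK x kN, C and D are kM x kN, all row-major.
void reference_mma_8x8x4(const float* A, const float* B, const float* C, float* D) {
  assert(A && B && C && D);
  for (int m = 0; m < kM; ++m) {
    for (int n = 0; n < kN; ++n) {
      float acc = C[m * kN + n];  // start from the accumulator input
      for (int k = 0; k < kK; ++k) {
        acc += A[m * kK + k] * B[k * kN + n];
      }
      D[m * kN + n] = acc;
    }
  }
}

// Hypothetical helper: largest element-wise absolute difference between two buffers,
// e.g. a device result copied back to the host and the reference above.
float max_abs_diff(const float* x, const float* y, int n) {
  assert(x && y && n >= 0);
  float worst = 0.0f;
  for (int i = 0; i < n; ++i) {
    worst = std::fmax(worst, std::fabs(x[i] - y[i]));
  }
  return worst;
}
```

Comparing the kernel's D against this reference with a small tolerance would confirm that the per-thread register mapping was set up correctly, independently of which PTX instruction is under test.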