How to use Cute's builtin MMA Atoms and manually invoke ptx #2426

bavalpey · 2025-06-26T17:36:12Z

I am trying to run some experiments that examine the result of a single PTX mma instruction.
What I really want is to manually invoke a particular PTX instruction for a specific architecture (e.g., Volta's mma.sync.m8n8k4 with the proper types).

I have been looking through the CuTe documentation, but the tutorials are too general for what I am trying to accomplish.

Specifically, I need one kernel to which I provide matrices A, B, C, and D, each exactly the size expected by one PTX mma instruction. For SM70's 884, matrix A is 8x4, B is 4x8, and both C and D are 8x8.

I know that I can manually invoke the PTX using the MMA atom's call method.

However, I also know that this requires each thread to provide its own registers:

__global__ void my_volta_884(half *A, half *B, float *C, float *D) {
    auto my_mma = MMA_Atom<SM70_8x8x4_F32F16F16F32_TN>{};
    /* get the thread's laneID, assume I invoke kernel with block dim (32, 1, 1)*/
    auto laneId = threadIdx.x;
    auto local_A = ???; /* load the 4 fp16 values for A for this lane into a tensor*/
    auto local_B = ???; /* load the 4 fp16 values for B needed by this lane */
    auto local_C = ???; /* load the 8 fp32 values for C needed by this lane */
    auto local_D = ???; /* same layout as C, but shaped properly */

    my_mma.call(local_D, local_A, local_B, local_C);

    /* now, D will have the resulting 8x8 matrix */
}
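
The register counts in the kernel sketch above (4 fp16 values each for A and B, 8 fp32 values each for C and D) can be cross-checked on the host. This is a minimal sketch, assuming, as the PTX ISA describes for mma.sync.m8n8k4, that each 8x8x4 product is computed by a "quad pair" of 8 threads (lanes 0-3 together with lanes 16-19 form quad pair 0, and so on). The constant and function names are hypothetical and are not part of CuTe.

```cpp
// Shape of one SM70 mma.sync.m8n8k4 operation.
constexpr int kMmaM = 8;
constexpr int kMmaN = 8;
constexpr int kMmaK = 4;

// Assumption (from the PTX ISA description of m8n8k4): one MMA is
// shared by a quad pair of 8 threads, not by all 32 lanes of the warp.
constexpr int kThreadsPerQuadPair = 8;

// Elements each participating thread holds in registers.
constexpr int kAValuesPerThread = (kMmaM * kMmaK) / kThreadsPerQuadPair;
constexpr int kBValuesPerThread = (kMmaK * kMmaN) / kThreadsPerQuadPair;
constexpr int kCValuesPerThread = (kMmaM * kMmaN) / kThreadsPerQuadPair;

static_assert(kAValuesPerThread == 4, "4 fp16 values of A per thread");
static_assert(kBValuesPerThread == 4, "4 fp16 values of B per thread");
static_assert(kCValuesPerThread == 8, "8 fp32 values of C/D per thread");

// Hypothetical helper: which quad pair (0..3) a warp lane belongs to,
// assuming lanes {4q..4q+3} and {16+4q..16+4q+3} form quad pair q.
constexpr int quad_pair_of_lane(int lane) {
    return (lane % 16) / 4;
}
```

Under that assumption, a launch with block dim (32, 1, 1) performs four independent 8x8x4 products, one per quad pair, and only 8 lanes hold fragments of any one of them.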

I know that I can work out the values each tensor needs by hand from the PTX spec, but I'm pretty sure that CuTe gives me the tools to do this. Since I'll need to do this for each different PTX instruction on each generation, it feels cumbersome to write out the register mapping manually, particularly given that CUTLASS already does it.

I've read through https://docs.nvidia.com/cutlass/media/docs/cpp/cute/index.html, and yet I cannot wrap my head around how to do this.

ccecka · 2025-06-26T18:20:28Z

It should be pretty straightforward to modify the CuTe tutorials to do this.
https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/sgemm_1.cu
http://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/sgemm_2.cu

In particular, you want this very common sequence to partition tensors, copy trivially, and apply MMA:

  // TUTORIAL: Example of partitioning via a TiledMMA

  TiledMMA mma   = make_tiled_mma(SM70_8x8x4_F32F16F16F32_TN{});
  ThrMMA thr_mma = mma.get_slice(threadIdx.x);
  Tensor tCgA = thr_mma.partition_A(gA);                               // (MMA,MMA_M,MMA_K)
  Tensor tCgB = thr_mma.partition_B(gB);                               // (MMA,MMA_N,MMA_K)
  Tensor tCgC = thr_mma.partition_C(gC);                               // (MMA,MMA_M,MMA_N)

  // Allocate the register fragments -- same size as the partitioned data
  Tensor tCrA = thr_mma.make_fragment_A(tCgA);                         // (MMA,MMA_M,MMA_K)
  Tensor tCrB = thr_mma.make_fragment_B(tCgB);                         // (MMA,MMA_N,MMA_K)
  Tensor tCrC = thr_mma.make_fragment_C(tCgC);                         // (MMA,MMA_M,MMA_N)

  ...

  copy(tCgA, tCrA);
  copy(tCgB, tCrB);
  clear(tCrC);

  gemm(mma, tCrA, tCrB, tCrC);

  copy(tCrC, tCgC);

Just construct gA, gB, and gC tensors, each of the right type and shape.

bavalpey Jun 26, 2025
Author

Hmm... this didn't quite work.
I assume I'm doing something wrong.

__global__ void my_MMA(half *A, half *B, float *C, float *D) {
    TiledMMA mma = make_tiled_mma(MMA_884{});
    ThrMMA thr_mma = mma.get_slice(threadIdx.x);
    Tensor tCgA = thr_mma.partition_A(make_tensor(make_gmem_ptr(A), MMA_884::ALayout{}));
    Tensor tCgB = thr_mma.partition_B(make_tensor(make_gmem_ptr(B), MMA_884::BLayout{}));
    Tensor tCgC = thr_mma.partition_C(make_tensor(make_gmem_ptr(C), MMA_884::CLayout{}));
    Tensor tCgD = thr_mma.partition_C(make_tensor(make_gmem_ptr(D), MMA_884::CLayout{}));

    Tensor tCrA = thr_mma.make_fragment_A(tCgA);
    Tensor tCrB = thr_mma.make_fragment_B(tCgB);
    Tensor tCrC = thr_mma.make_fragment_C(tCgC);
    Tensor tCrD = thr_mma.make_fragment_C(tCgD);

    copy(tCgA, tCrA);
    copy(tCgB, tCrB);
    copy(tCgC, tCrC);
    clear(tCgD);

    gemm(mma, tCrD, tCrA, tCrB, tCrC);

    copy(tCrD, tCgD);

}

I just pass in A, B, C, and D, which are contiguously allocated and initialized.

int main() {
    using MMA_884 = MMA_Atom<SM70_8x8x4_F32F16F16F32_TN>;


    thrust::host_vector<half> h_A(size(MMA_884::ALayout{}));
    thrust::host_vector<half> h_B(size(MMA_884::BLayout{}));
    thrust::host_vector<float> h_C(size(MMA_884::CLayout{}));

    // Initialize A and B
    for (size_t i = 0; i < h_A.size(); ++i)
        h_A[i] = __float2half(static_cast<float>(i / 4 + 1));
    
    for (size_t i = 0; i < h_B.size(); ++i)
        h_B[i] = __float2half(1.0f);
    
    // Initialize C and D
    for (size_t i = 0; i < h_C.size(); ++i) {
        h_C[i] = 0.0f;    
    }

    // create device vectors
    thrust::device_vector<half> d_A = h_A;
    thrust::device_vector<half> d_B = h_B;
    thrust::device_vector<float> d_C = h_C;
    thrust::device_vector<float> d_D(h_C.size());

    my_MMA<<<1, 32>>>(thrust::raw_pointer_cast(d_A.data()),
                      thrust::raw_pointer_cast(d_B.data()),
                      thrust::raw_pointer_cast(d_C.data()),
                      thrust::raw_pointer_cast(d_D.data()));
    
    cudaDeviceSynchronize();
    // Copy the result back to host
    thrust::host_vector<float> h_D = d_D;

    // print the values of D
    for (size_t i = 0; i < h_D.size(); ++i) {
        std::cout << "D[" << i << "] = " << h_D[i] << std::endl;
    }

    return EXIT_SUCCESS;
}

My test sets the values of A to their 1-based row number and the values of B to all 1s.
So I expect D[0] to be 4, but I'm getting 12.
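
The expectation above can be checked with a small host-side reference. This is a minimal sketch, assuming A is stored row-major as 8x4 (so element i belongs to row i / 4), B is stored as 8x4 with K contiguous, and C and D are row-major 8x8. These storage layouts are assumptions for illustration and are not taken from CuTe. The names `reference_mma` and `expected_d_for_test` are hypothetical, and the inputs are kept as float because the small integers used here are exactly representable in fp16.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kM = 8;  // rows of A, C, D
constexpr std::size_t kN = 8;  // columns of B, C, D
constexpr std::size_t kK = 4;  // reduction dimension

// Host reference for D = A * B + C.
// Assumed storage: a[m * kK + k], b[n * kK + k], c[m * kN + n].
std::array<float, kM * kN> reference_mma(const std::array<float, kM * kK>& a,
                                         const std::array<float, kN * kK>& b,
                                         const std::array<float, kM * kN>& c) {
    std::array<float, kM * kN> d{};
    for (std::size_t m = 0; m < kM; ++m) {
        for (std::size_t n = 0; n < kN; ++n) {
            float acc = c[m * kN + n];
            for (std::size_t k = 0; k < kK; ++k) {
                acc += a[m * kK + k] * b[n * kK + k];
            }
            d[m * kN + n] = acc;
        }
    }
    return d;
}

// Rebuilds the inputs from the test in this post:
// A[i] = i / 4 + 1, B = all ones, C = all zeros.
std::array<float, kM * kN> expected_d_for_test() {
    std::array<float, kM * kK> a{};
    std::array<float, kN * kK> b{};
    std::array<float, kM * kN> c{};  // value-initialized to zero
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = static_cast<float>(i / kK + 1);
    }
    b.fill(1.0f);
    return reference_mma(a, b, c);
}
```

Under these assumptions every entry in row m of D equals 4 * (m + 1), so D[0] is 4 and the last entry is 32. A device result that differs suggests the elements are reaching the threads in a different order than this storage layout assumes.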

Do I need to do something different with the layout to transfer the raw elements?

(Also, nvcc complains about using a host-side constexpr on the device and asks me to use --expt-relaxed-constexpr. Is this expected?)
