Add dual-GEMM examples for SM90 (Hopper) and SM120 (Blackwell) #2694

Inodayy · 2025-10-13T14:50:57Z

Summary

Implements dual-GEMM examples for SM90 (Hopper) and SM120 (Blackwell) using CUTLASS 3.x.

The dual-GEMM operation implemented is:

  D0 = epilogue0(X @ B0, C0)
  D1 = epilogue1(X @ B1, C1)
  D2 = element_wise(D0, D1)

Based on the single-GEMM examples 48_hopper_warp_specialized_gemm.cu and 79a_blackwell_geforce_nvfp4_bf16_gemm.cu.
B0 and B1 currently share a single layout (they are not decoupled), but each is passed to the builders separately so the layouts can be decoupled later.
(Blackwell supports only the TN layout; Hopper assumes an NK layout for make_tma_copy_B_sm90 etc.)
D2 applies LeftSiLUAndMul, as in example 45_dual_gemm; it is implemented in store() in collective/sm90_epilogue_tma_warpspecialized_dual.hpp.
D0 and D1 are intermediate results only and are not stored.
Added template<class Op0, class Op1> in fusion/sm90_callbacks… to allow distinct operations for D0 and D1.
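
For reviewers who want something to check results against, here is a minimal host-side sketch of the fused operation described above. It is not the CUTLASS kernel. It assumes linear-combination epilogues (D_i = alpha_i * (X @ B_i) + beta_i * C_i), takes LeftSiLUAndMul to mean silu(D0) * D1, and uses a storage layout picked only for illustration; the function name dual_gemm_reference and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU activation: silu(x) = x * sigmoid(x) = x / (1 + exp(-x)).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Naive host reference for the dual GEMM (illustrative only).
// Assumed storage: X is M x K row-major; B0/B1 are stored as N x K row-major
// (the K-vector of each output column is contiguous); C0/C1 and the result
// are M x N row-major.
// Assumed epilogues: D_i = alpha_i * (X @ B_i) + beta_i * C_i.
std::vector<float> dual_gemm_reference(
    int M, int N, int K,
    const std::vector<float>& X,
    const std::vector<float>& B0, const std::vector<float>& B1,
    const std::vector<float>& C0, const std::vector<float>& C1,
    float alpha0, float beta0, float alpha1, float beta1) {
  assert(X.size() == static_cast<std::size_t>(M) * K);
  assert(B0.size() == static_cast<std::size_t>(N) * K && B1.size() == B0.size());
  assert(C0.size() == static_cast<std::size_t>(M) * N && C1.size() == C0.size());

  std::vector<float> D2(static_cast<std::size_t>(M) * N);
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc0 = 0.0f, acc1 = 0.0f;
      for (int k = 0; k < K; ++k) {
        // X is read once and feeds both accumulators.
        const float x = X[static_cast<std::size_t>(m) * K + k];
        acc0 += x * B0[static_cast<std::size_t>(n) * K + k];
        acc1 += x * B1[static_cast<std::size_t>(n) * K + k];
      }
      const std::size_t idx = static_cast<std::size_t>(m) * N + n;
      const float d0 = alpha0 * acc0 + beta0 * C0[idx];  // intermediate, not stored
      const float d1 = alpha1 * acc1 + beta1 * C1[idx];  // intermediate, not stored
      D2[idx] = silu(d0) * d1;                           // only D2 is written out
    }
  }
  return D2;
}
```

Comparing the device output against a reference like this on a small problem (with a tolerance suited to the element types) is one way to sanity-check the fused epilogue.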

SM90 (Hopper)

SM120 (Blackwell)

Problem size: 2048×2048×2048
Avg runtime: 0.155648 ms
GFLOPS: 220,753
≈30% slower than the baseline of running two single GEMMs (I haven't been able to find the root cause yet)
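
For reference, the throughput figure above can be reproduced from the problem size and runtime. The sketch below assumes the count covers both GEMMs at 2·M·N·K FLOPs each and ignores the epilogue work; the helper names gemm_flops and gflops are illustrative and not part of this PR.

```cpp
#include <cassert>
#include <cstdint>

// FLOPs for `num_gemms` GEMMs of size M x N x K, counting one multiply and
// one add per multiply-accumulate (epilogue work is ignored).
inline double gemm_flops(std::int64_t M, std::int64_t N, std::int64_t K, int num_gemms) {
  return 2.0 * num_gemms * static_cast<double>(M) * static_cast<double>(N) *
         static_cast<double>(K);
}

// Convert a FLOP count and a runtime in milliseconds to GFLOP/s.
inline double gflops(double flops, double runtime_ms) {
  assert(runtime_ms > 0.0);
  return flops / (runtime_ms * 1.0e-3) / 1.0e9;
}
```

With M = N = K = 2048, two GEMMs, and 0.155648 ms, gflops(gemm_flops(2048, 2048, 2048, 2), 0.155648) comes out at roughly 220,753, which matches the number reported above under these assumptions.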

I am relatively new to CUTLASS C++; this work was implemented as a learning exercise. I followed a structure similar to example 63_hopper_gemm_with_weight_prefetch.
The SM120 example was an initial local starting point and can be removed if unnecessary.

Closes #1123

Add dual-GEMM examples for SM90 (Hopper) and SM120 (Blackwell)

6f6e8c2