@Inodayy Inodayy commented Oct 13, 2025

Summary

Implements dual-GEMM examples for SM90 (Hopper) and SM120 (Blackwell) using CUTLASS 3.x.

The dual-GEMM operation implemented is:

  D0 = epilogue0(X @ B0, C0)
  D1 = epilogue1(X @ B1, C1)
  D2 = element_wise(D0, D1)
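
For readers who want something to check results against, the three equations above can be written as a naive host-side reference. This is only a sketch under stated assumptions: the function name `dual_gemm_reference`, the row-major storage, and the `float` element type are illustrative choices and not part of this PR, and the epilogues and element-wise step are left as caller-supplied callables.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not code from this PR).
// Assumed shapes: X is M x K, B0 and B1 are K x N, C0 and C1 are M x N,
// all stored row-major as float.
template <class Epilogue0, class Epilogue1, class ElementWise>
std::vector<float> dual_gemm_reference(
    const std::vector<float>& X,
    const std::vector<float>& B0, const std::vector<float>& B1,
    const std::vector<float>& C0, const std::vector<float>& C1,
    std::size_t M, std::size_t N, std::size_t K,
    Epilogue0 epilogue0, Epilogue1 epilogue1, ElementWise element_wise) {
  assert(X.size() == M * K && B0.size() == K * N && B1.size() == K * N);
  assert(C0.size() == M * N && C1.size() == M * N);

  std::vector<float> D2(M * N);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float acc0 = 0.0f;
      float acc1 = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        const float x = X[m * K + k];  // the same X element feeds both products
        acc0 += x * B0[k * N + n];
        acc1 += x * B1[k * N + n];
      }
      // D0 and D1 exist only as temporaries; only D2 is written out.
      const float d0 = epilogue0(acc0, C0[m * N + n]);
      const float d1 = epilogue1(acc1, C1[m * N + n]);
      D2[m * N + n] = element_wise(d0, d1);
    }
  }
  return D2;
}
```

Calling it with lambdas such as `[](float acc, float c) { return acc + c; }` for both epilogues and a product for the element-wise step gives a small, easily hand-checked case.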

Implementation details

  • Based on the single-GEMM examples 48_hopper_warp_specialized_gemm.cu
    and 79a_blackwell_geforce_nvfp4_bf16_gemm.cu

  • B0 and B1 layouts are not decoupled (both currently use the same layout), but each is passed to the builders separately to allow for future flexibility.
    (Blackwell supports only the TN layout; Hopper assumes an NK layout for make_tma_copy_B_sm90 etc.)

  • D2 applies LeftSiLUAndMul, as in example 45_dual_gemm; it is implemented in store() in collective/sm90_epilogue_tma_warpspecialized_dual.hpp

  • D0 and D1 are intermediate results only and are not stored.

  • Added template<class Op0, class Op1> in fusion/sm90_callbacks… to allow distinct operations for D0 and D1.
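
The last two points can be illustrated at scalar level. The sketch below is not the CUTLASS code from this PR: `LinearCombination`, `Passthrough`, and `DualEpilogueSketch` are hypothetical stand-ins, and it assumes LeftSiLUAndMul computes `silu(lhs) * rhs` with `silu(x) = x * sigmoid(x)`, as the name suggests. It only shows how two independent template parameters let D0 and D1 use different operations before they are combined.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar stand-ins; names are illustrative, not CUTLASS API.
struct LinearCombination {
  float alpha;
  float beta;
  float operator()(float acc, float c) const { return alpha * acc + beta * c; }
};

struct Passthrough {  // ignores the C operand entirely
  float operator()(float acc, float /*c*/) const { return acc; }
};

// SiLU: x * sigmoid(x), written as x / (1 + exp(-x)).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Assumed semantics of LeftSiLUAndMul: SiLU on the left operand, then multiply.
inline float left_silu_and_mul(float lhs, float rhs) { return silu(lhs) * rhs; }

// Op0 produces D0 and Op1 produces D1; the two may be different types.
template <class Op0, class Op1>
struct DualEpilogueSketch {
  Op0 op0;
  Op1 op1;

  float operator()(float acc0, float c0, float acc1, float c1) const {
    const float d0 = op0(acc0, c0);    // intermediate, never stored
    const float d1 = op1(acc1, c1);    // intermediate, never stored
    return left_silu_and_mul(d0, d1);  // the only value written out (D2)
  }
};
```

For example, `DualEpilogueSketch<LinearCombination, Passthrough>{{1.0f, 0.0f}, {}}` applies a plain `alpha * acc` on the D0 path and leaves the D1 accumulator untouched.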

Performance (all configurations kept the same as in the single-GEMM examples)

SM90 (Hopper)

  • Problem size: 2048×2048×2048
  • Rasterization: Heuristic with max CTA swizzle 2
  • Avg runtime: 0.20429 ms
  • GFLOPS: 168,191
  • ≈10% faster than the baseline of two single GEMMs

SM120 (Blackwell)

  • Problem size: 2048×2048×2048
  • Avg runtime: 0.155648 ms
  • GFLOPS: 220,753
  • ≈30% slower than the baseline of two single GEMMs (root cause not yet identified)
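
As a sanity check on the throughput figures, the arithmetic can be sketched as below. The helper names are hypothetical, and the FLOP count is an assumption: 2·M·N·K per GEMM, both GEMMs counted, with epilogue and element-wise work ignored. Under that assumption the reported runtimes reproduce the reported GFLOPS to within rounding.

```cpp
#include <cassert>
#include <cstdint>

// Assumed FLOP model: each GEMM costs 2*M*N*K (one multiply and one add per
// multiply-accumulate), and the dual GEMM performs two of them.
inline double dual_gemm_flops(std::int64_t M, std::int64_t N, std::int64_t K) {
  return 2.0 * 2.0 * static_cast<double>(M) * static_cast<double>(N) *
         static_cast<double>(K);
}

// Converts a FLOP count and a runtime in milliseconds to GFLOP/s.
inline double gflops(double flops, double runtime_ms) {
  assert(runtime_ms > 0.0);
  const double seconds = runtime_ms * 1.0e-3;
  return flops / seconds / 1.0e9;
}
```

With M = N = K = 2048, `gflops(dual_gemm_flops(2048, 2048, 2048), 0.20429)` comes out at roughly 168,191 and the same call with 0.155648 ms at roughly 220,753.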

Notes

  • I am relatively new to CUTLASS C++, and this work was implemented as a learning exercise. The example structure follows 63_hopper_gemm_with_weight_prefetch.
  • The SM120 example was an initial local starting point and can be removed if unnecessary.

Closes #1123

Development

Successfully merging this pull request may close these issues.

[QST] Are there plans to add specialisations for Sm90?
