[mosaic-gpu] add multicast ptr support to TMA with overlapped gemm and all reduce examples #28679

Open · wants to merge 2 commits into main

Conversation

@Amir-19 (Contributor) commented May 12, 2025

  • Added multicast (MC) pointer support to copy_smem_to_gmem (see the sketch after this list).
  • The MC pointer currently only supports team 0, which is NVSHMEM_TEAM_WORLD.
  • Added a binding to get the number of SMs on the GPU; this is used in our examples to launch persistent kernels.
  • Added simple GEMM examples overlapped with all-reduce, covering both one-shot and two-shot GEMM+AR.
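
A rough usage sketch of the new parameter from a Pallas kernel (the import path, the wait_smem_to_gmem call, and the exact signature are assumptions based on the diff further down, not necessarily the final API):

```python
from jax.experimental.pallas import mosaic_gpu as plgpu

def kernel(x_smem, out_gmem_ref):
  # Store SMEM to GMEM through a multicast pointer, reducing into the
  # destination on every GPU in team 0 (NVSHMEM_TEAM_WORLD).
  plgpu.copy_smem_to_gmem(x_smem, out_gmem_ref, reduction_op="add", team_id=0)
  plgpu.wait_smem_to_gmem(0)
```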

This PR depends on #28595 in order to use a newer version of the PTX ISA. Without it, JAX picks PTX ISA 8.0 when compiling the kernels, which does not allow us to use multimem.ld_reduce.

cc @apaszke

@apaszke (Member) left a comment:

Please split off the examples into another PR. Adding a Pallas or Mosaic feature should be done on its own, with appropriate tests in tests/mosaic/gpu_test.py and tests/pallas/mosaic_gpu_test.py (which are missing here)

@@ -361,6 +364,7 @@ def copy_smem_to_gmem(
reduction_op: If set, perform the specified reduction operation when storing
to GMEM. For example, using ``"add"`` is conceptually equivalent to
doing ``src += dst``.
team_id: If set, the dst ref will be translated to a multicast memory address.

apaszke (Member):

How does a Pallas user get control over teams? It's not a JAX-level concept so it doesn't make sense to surface it here. How does XLA manage that?

I think what you can do on the JAX level is take an axis name, and perform the reduction along that JAX mesh axis.

Finally: does it ever make sense to use team_id without reduction_op? If not, we should add checks
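
If the answer is "no", a minimal sketch of such a check (parameter names taken from the diff above; where exactly the check should live is an assumption):

```python
# Hypothetical validation inside copy_smem_to_gmem.
if team_id is not None and reduction_op is None:
  raise ValueError("team_id is only supported together with reduction_op")
```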

@@ -639,6 +646,7 @@ def as_gpu_kernel(
kernel_name: str | None = None,
ir_version: int | None = None,
thread_semantics: LoweringSemantics = LoweringSemantics.Lane,
input_output_aliases: tuple[tuple[int, int], ...] = (),

apaszke (Member):

The addition of input_output_aliases is a separate change. Please send an independent PR for this

llvm.inline_asm(
i32,
[mc_ptr, x, y, z, w],
"multimem.st.relaxed.sys.global.v4.f32 [$1], {$2, $3, $4, $5};",

apaszke (Member):

Is this a broadcast?

return return_regs[0], return_regs[1], return_regs[2], return_regs[3]


def multimem_st_128(mc_ptr, x, y, z, w):

apaszke (Member):

Please check that the args have the f32 type
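
A minimal sketch of such a check using the MLIR Python bindings (assuming x, y, z, and w are ir.Value operands):

```python
f32 = ir.F32Type.get()
for operand in (x, y, z, w):
  if operand.type != f32:
    raise ValueError(f"multimem_st_128 expects f32 operands, got {operand.type}")
```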

def multimem_st_128(mc_ptr, x, y, z, w):
i32 = ir.IntegerType.get_signless(32)
llvm.inline_asm(
i32,

apaszke (Member):

This snippet does not have a result. Use ir.Type.parse("!llvm.void") and remove the result register constraint
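
A sketch of the suggested change (the constraint string is an assumption, since it is not visible in the diff):

```python
llvm.inline_asm(
    ir.Type.parse("!llvm.void"),
    [mc_ptr, x, y, z, w],
    # With the result register constraint gone, operand numbering starts at $0.
    "multimem.st.relaxed.sys.global.v4.f32 [$0], {$1, $2, $3, $4};",
    "l,f,f,f,f",
)
```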

return_regs = [
llvm.extractvalue(i32, return_struct, [i]) for i in range(4)
]
return return_regs[0], return_regs[1], return_regs[2], return_regs[3]

apaszke (Member):

If you use f16x2 then you should return the results after bitcasting them to ir.VectorType.get((2,), bf16). You might be able to use the bitcast from this file
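
A sketch of what that could look like, assuming the bitcast helper defined in this file (the helper name and signature are assumptions):

```python
bf16 = ir.BF16Type.get()
vec_ty = ir.VectorType.get((2,), bf16)
# Reinterpret each i32 register as a 2-element bf16 vector before returning.
return tuple(bitcast(reg, vec_ty) for reg in return_regs)
```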

)


def wait_loop(uc_ptr, num_gpus=8, is_relaxed=False):

apaszke (Member):

We already have semaphores in Pallas. Is that not enough?

@@ -1365,3 +1365,80 @@ def vector_concat(vectors: Sequence[ir.Value]) -> ir.Value:
result = vector.insertelement(elem, result, position=c(offset + i, index))
offset += vty.shape[0]
return result


def signal_with_red(mc_ptr, is_relaxed=False):

apaszke (Member):

What's the purpose of this function?

i64 = ir.IntegerType.get_signless(64)
arg_tys = [ptr_ty, ptr_ty, i64, i64, ptr_ty, ptr_ty, i64, ptr_ty]
arg_tys = [ptr_ty, ptr_ty, i64, i64, ptr_ty, ptr_ty, i64, ptr_ty, i32]

apaszke (Member):

Instead of changing the signature of this function, you could just add a new function in runtime.cc and call it to translate a regular pointer to an mc pointer. It doesn't have to be bundled with the TMA desc initialization

@apaszke self-assigned this May 22, 2025

@apaszke (Member) commented May 22, 2025

#28941 might be relevant to you btw (it adds support for TMA to remote addresses)
