Inlining error in Hopper matmul with AxisMapping and grid swizzling #3671

jacobhinkle opened this issue Jan 6, 2025 · 0 comments

The inlining logic for MmaOp with AxisMapping checks that unmapped dimensions are Broadcast. We expect to have something like this:

```
t0 {
  logical: [ iS0{M} iS1{K} ]
  loop: [ iS2{ ceilDiv(M, 256) }  bS5{1}  iS3{256} bS6{256} ... ]
    Split iS0 by 256 -> iS2, iS3
    Split bS4 by 256 -> bS5, bS6
      ...
  additional ids: bS4{1}
}
t1 {
  logical: [ iS10{N} iS11{K} ]
  loop: [ bS13{1} iS15{ ceilDiv(N, 256) } bS14{256} iS16{256} ... ]
    Split bS12 by 256 -> bS13, bS14
    Split iS10 by 256 -> iS15, iS16
      ...
  additional ids: bS12{1}
}
```

In this case, we are able to inline the MmaOp that consumes these two tensors, because the check verifies that the unmapped IDs 5, 6, 13, and 14 are Broadcast and that the consuming operation is an MmaOp.
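
For reference, the condition being discussed is conceptually something like the sketch below. This is only an illustration: `unmappedIdIsInlinable` is a hypothetical helper written for this issue, not the actual function in the nvFuser inlining logic.

```cpp
// Hypothetical sketch of the condition described above; not the actual
// nvFuser code. It assumes the nvFuser IR types IterDomain, Expr, and MmaOp.
bool unmappedIdIsInlinable(const IterDomain* unmapped_id, const Expr* consumer_def) {
  // Inlining past an ID with no producer mapping is only allowed when the
  // consumer is an MmaOp and the unmapped ID is a Broadcast
  // (e.g. bS5, bS6, bS13, bS14 in the example above).
  return consumer_def->isA<MmaOp>() && unmapped_id->isBroadcast();
}
```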

In the case of grid swizzling by a factor of 4, we do some further scheduling here. For example, we will have:

```
t0 {
  logical: [ iS0{M} iS1{K} ]
  loop: [ iS20{ ceilDiv( ceilDiv(M, 256), 4) } iS22{4} iS3{256} bS6{256} ... ]
    Split iS0 by 256 -> iS2, iS3
    Split bS4 by 256 -> bS5, bS6
    Split iS2 by 4 -> iS20, iS21{4}
    Merge bS5 with iS21{4} -> iS22{4}
      ...
  additional ids: bS4{1}
}
```

Now we have mixed the first two outer dimensions with this swizzle, and what used to be a simple split of a loop broadcast (bS5) is now an iteration ID, iS22{4}, resulting from the merge.
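
The extra scheduling corresponds roughly to TensorView calls like the sketch below. The variable name `tv0` (standing in for t0), the axis indices, and the exact merge order are assumptions for illustration, not the actual scheduler code.

```cpp
// Rough sketch of the factor-4 grid swizzle on t0; axis indices and merge
// order are assumptions, not the actual scheduler code.
// Loop domain before: [ iS2{ceilDiv(M,256)}, bS5{1}, iS3{256}, bS6{256}, ... ]
tv0->split(0, 4);  // iS2 -> iS20{ceilDiv(ceilDiv(M,256),4)}, iS21{4}
tv0->merge(1, 2);  // merge iS21{4} with the neighboring broadcast bS5{1} -> iS22{4}
// Loop domain after: [ iS20{...}, iS22{4}, iS3{256}, bS6{256}, ... ]
```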

I am not sure yet how to address this. I don't think we can just inline here without some other changes, since when I disable this check I get errors in expression sorting.

jacobhinkle self-assigned this Jan 6, 2025
jacobhinkle added a commit that referenced this issue Jan 8, 2025
This updates the default (non-plugin) matmul heuristic to support Hopper
matmuls. This change means that we can now run matmuls on Hopper
similarly to how we do it on Ampere and Turing, including using the
Python interface.

I tried to make the default heuristic somewhat thoughtful and not just a
placeholder. Here are some notes about the Hopper heuristic in its
current form:
- I set the macro to Hopper_64_64_16. I intended to always use the
largest macro for which the instruction's N size divides the problem's N,
but this led to lower perf on the handful of examples I looked at. We
should benchmark more and figure out why once we have warp specialization
and register stealing fully plumbed in, but for the time being I simply
left it at N=64.
- Once the instruction tile is set, we set the warp tile equal to the
instruction tile (we can revisit this in the future). Then, to find the
CTA tile, we double the instruction tile in the M or N dimension until we
run out of registers (a rough sketch of this logic appears after this
list).
- We start with 8 circular buffering stages and decrease until the
circular buffers fit into smem.
- We use `use_smem_epilogue` when possible. Whenever that is possible we
_always_ use `promote_prologue_smem_reuse`, even if it's not needed. This
is to try to avoid bugs like #3602.
- I set the tile rasterization order so that the fast axis is the axis
with the fewest tiles, which should encourage more L2 hits unless there
are tons of tiles in each dimension.
- I cannot yet set grid swizzling due to #3671, but I placed a TODO
comment and some code to do the proper swizzling.
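
For concreteness, here is a rough sketch of the CTA-tile growth and circular-buffering stage-count logic described in the list above. Everything in it (the `TileDims` struct, the helper names, the placeholder cost models, and the alternating M/N growth order) is a hypothetical illustration, not the actual heuristic code.

```cpp
#include <cstdint>

// Hypothetical sketch of the tile and circular-buffering heuristic described
// above; the struct, helpers, and cost models are placeholders only.
struct TileDims {
  int64_t m, n, k;
};

// Crude placeholder cost models, for illustration only.
int64_t estimateRegisterUsage(const TileDims& cta) {
  // e.g. proportional to the accumulator tile size
  return cta.m * cta.n;
}
int64_t circularBufferSmemBytes(const TileDims& cta, int64_t stages) {
  // e.g. fp16 A and B operand tiles per stage
  return stages * 2 * (cta.m * cta.k + cta.n * cta.k);
}

// Warp tile == instruction tile; double the CTA tile in M or N (alternating
// here, as one possible strategy) until the register budget would be exceeded.
TileDims pickCtaTile(TileDims instr_tile, int64_t register_budget) {
  TileDims cta = instr_tile;
  bool grow_m = true;
  while (true) {
    TileDims next = cta;
    (grow_m ? next.m : next.n) *= 2;
    if (estimateRegisterUsage(next) > register_budget) {
      break;
    }
    cta = next;
    grow_m = !grow_m;
  }
  return cta;
}

// Start with 8 circular buffering stages and decrease until the buffers fit
// into the shared memory budget.
int64_t pickCircularBufferStages(const TileDims& cta, int64_t smem_budget) {
  int64_t stages = 8;
  while (stages > 1 && circularBufferSmemBytes(cta, stages) > smem_budget) {
    --stages;
  }
  return stages;
}
```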

---------

Co-authored-by: Ryan Spring <[email protected]>