
Prevent 2d block loads with dimensions larger than the tensor block size #4088


Draft · wants to merge 5 commits into main from alex/downscale_2d_block_loads

Conversation

alexbaden (Contributor):

While working to generate shuffle vectors using chained linear layouts, I noticed that the size of the 2D block load can exceed the tensor block size. In that case we generate and provide extra shuffle vectors for the values beyond the block size, but it appears those shuffle vectors are never used. This behavior makes it challenging to use linear layouts, since the layout ends up generating a different sequence of shuffle vectors. I can work around it, but it did not make sense to me to generate large loads if we are not using all the data.

So, this PR attempts to downscale the 2D block load to be equal to or smaller than the block size. I ran the benchmarks and, surprisingly, there appears to be a noticeable uplift in softmax but no change in gemm.
[image: softmax and gemm benchmark results]
There may be a very small uplift in attn as well, or it could just be noise.
[image: attn benchmark results]

I tried interpreting the attn tables, but the numbers are all over the place, with this PR's benchmark tag swinging between roughly 10 TFLOPS better and 10 TFLOPS worse than the default tag.

Regardless, even if this is performance neutral, it makes it much cleaner to generate the 2D block loads from linear layouts, so I'd like to land it if possible.

whitneywhtsang (Contributor) left a comment:

Makes sense. Can you please add a simple lit test?

alexbaden linked an issue on May 5, 2025 that may be closed by this pull request.
alexbaden force-pushed the alex/downscale_2d_block_loads branch from 0876553 to 6ebc5b9 on May 5, 2025 at 14:41.
alexbaden marked this pull request as ready for review on May 5, 2025 at 14:42.
alexbaden (Contributor, Author):

Added a lit test.

alexbaden (Contributor, Author):

There is an issue with transpose, but testing it is blocked by #4100.


// Clamp the number of operands per load along the outer dimension so the
// 2D block load does not exceed the tensor block size in that dimension.
numOperandsOuterDimPerLoad = std::min(
    numOperandsOuterDimPerLoad,
    mlir::ceil<unsigned>(tensorShape[dimOuter], elemsPerDPASInst[0]));
A reviewer (Contributor) commented on this change:

Should we align the dim index with dimOuter when indexing elemsPerDPASInst (i.e., elemsPerDPASInst[dimOuter] rather than elemsPerDPASInst[0])?

%c0_i32 = arith.constant 0 : i32
%c1_i64 = arith.constant 1 : i64
%ptrB = tt.make_tensor_ptr %arg1, [%arg4, %arg3], [%arg7, %c1_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<64x16xi8, #dot1>>
// CHECK-COUNT-1: llvm.call spir_funccc @_Z51intel_sub_group_2d_block_read_transform_8b_32r16x2cPU3AS1viiiDv2_iPj({{.*}}) {{.*}} : (!llvm.ptr<1>{{.*}}, i32, i32, i32, vector<2xi32>, !llvm.ptr{{.*}}) -> ()
Another reviewer (Contributor) commented on the lit test above:

From the semantics of the type <tensor<64x16xi8, #dot1>>, the elements of the N dim (the second dimension) are duplicated across threads.

The triton-tensor-layout output shows that value N and value N + 32 should hold the same element:

[[   T0:0|  T0:32|  T16:0| T16:32|  T32:0| T32:32|  T48:0| T48:32, ...

But the lowering code seems to fill those values with 0 via the block IO.

This will be flaky if the returned values do not match the semantics of the type.

The expected block IO shape seems like it should be 8b_32r16x1c, with the return value used to fill up the [32, 32] tile.

alexbaden marked this pull request as draft on May 7, 2025 at 17:52.
Successfully merging this pull request may close this issue:
Do not generate 2D block loads with sizes > the block size