Prevent 2d block loads with dimensions larger than the tensor block size #4088
Conversation
Makes sense, can you please add a simple lit test?
Force-pushed from 0876553 to 6ebc5b9
Added a lit test.
third_party/intel/lib/TritonIntelGPUToLLVM/LoadStoreOpToLLVM.cpp
Co-authored-by: Whitney Tsang <[email protected]>
There is an issue with transpose, but testing it is blocked by #4100.
numOperandsOuterDimPerLoad = std::min(
    numOperandsOuterDimPerLoad,
    mlir::ceil<unsigned>(tensorShape[dimOuter], elemsPerDPASInst[0]));
Should we align the dim index to dimOuter on elemsPerDPASInst?
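For concreteness, a hypothetical variant of the clamp quoted above with the index aligned; whether elemsPerDPASInst is really indexed in the same [dimOuter, dimInner] order as tensorShape is an assumption here, not something verified against the surrounding code:

```cpp
// Hypothetical variant of the quoted clamp (assumes elemsPerDPASInst uses the
// same [dimOuter, dimInner] ordering as tensorShape).
numOperandsOuterDimPerLoad = std::min(
    numOperandsOuterDimPerLoad,
    mlir::ceil<unsigned>(tensorShape[dimOuter], elemsPerDPASInst[dimOuter]));
```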
%c0_i32 = arith.constant 0 : i32
%c1_i64 = arith.constant 1 : i64
%ptrB = tt.make_tensor_ptr %arg1, [%arg4, %arg3], [%arg7, %c1_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<64x16xi8, #dot1>>
// CHECK-COUNT-1: llvm.call spir_funccc @_Z51intel_sub_group_2d_block_read_transform_8b_32r16x2cPU3AS1viiiDv2_iPj({{.*}}) {{.*}} : (!llvm.ptr<1>{{.*}}, i32, i32, i32, vector<2xi32>, !llvm.ptr{{.*}}) -> ()
From the semantics of the type <tensor<64x16xi8, #dot1>>, the elements of the N dim (the second dimension) are duplicated across the threads. The triton-tensor-layout output shows that value N and value N + 32 should be the same value:
[[ T0:0| T0:32| T16:0| T16:32| T32:0| T32:32| T48:0| T48:32, ...
But the lowering code seems to fill those values with 0 via the block IO. This will be flaky if the returned values are not aligned to the semantics of the type. The expected block IO shape seems like it should be 8b_32r16x1c, with the returned values used to fill up the [32, 32] tile.
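To make the duplication concrete, here is a small standalone sketch of the fill-up described above. The function name, array sizes, and the assumption of a sub-group size of 16 (so an 8b_32r16x1c read of i8 yields 32 values per work-item) are illustrative, not taken from the lowering code:

```cpp
#include <array>
#include <cstdint>

// Illustrative only: replicate the per-work-item result of a 8b_32r16x1c
// block read so that value N and value N + 32 hold the same element, as the
// #dot1 layout requires, instead of leaving the upper half zero-filled.
// (Assumes a sub-group size of 16, so each work-item holds 32 loaded bytes.)
std::array<std::int8_t, 64>
expandLoadedValues(const std::array<std::int8_t, 32> &loaded) {
  std::array<std::int8_t, 64> vals{};
  for (unsigned i = 0; i < 32; ++i) {
    vals[i] = loaded[i];      // value N
    vals[i + 32] = loaded[i]; // value N + 32 duplicates value N
  }
  return vals;
}
```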
While working to generate shuffle vectors using chained linear layouts, I noticed that the size of the 2D block load can exceed the tensor block size. In that case we generate and provide extra shuffle vectors for the values beyond the block size, but it appears those shuffle vectors are never used. This behavior makes it challenging to use linear layouts, since the layout ends up generating a different sequence of shuffle vectors. I can work around it, but it didn't make sense to me to generate big loads if we are not using all the data. So, this PR attempts to downscale a 2D block load to be equal to or less than the block size. I ran the benchmarks and, surprisingly, there appears to be a noticeable uplift in softmax but no change in gemm.


There may be a very, very small uplift in attn as well, or it could just be noise. I tried interpreting the attn tables, but the numbers are all over the place, with this PR's benchmark tag generally being better or worse than the default tag by about 10 TFLOPS. Regardless, even if this is performance neutral, it makes it much cleaner to generate the 2D block loads from linear layouts, so I'd like to land it if possible.
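For reference, a minimal standalone sketch of the clamping idea described above; the function and parameter names are hypothetical, not the ones used in LoadStoreOpToLLVM.cpp:

```cpp
#include <algorithm>

// Illustrative only: cap the number of operands covered by a single 2D block
// load so the load never spans more DPAS-sized tiles than the tensor block
// actually contains along that dimension.
unsigned clampLoadsToTensor(unsigned numOperandsPerLoad, unsigned tensorDimSize,
                            unsigned elemsPerDPASInstDim) {
  // ceil(tensorDimSize / elemsPerDPASInstDim) = tiles available in the tensor.
  unsigned tilesInTensor =
      (tensorDimSize + elemsPerDPASInstDim - 1) / elemsPerDPASInstDim;
  return std::min(numOperandsPerLoad, tilesInTensor);
}
```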