Prevent 2d block loads with dimensions larger than the tensor block size #4088
Conversation
Makes sense, can you please add a simple lit test?
Force-pushed from 0876553 to 6ebc5b9
Added a lit test.
third_party/intel/lib/TritonIntelGPUToLLVM/LoadStoreOpToLLVM.cpp
Co-authored-by: Whitney Tsang <[email protected]>
There is an issue with transpose, but testing it is blocked by #4100.
numOperandsOuterDimPerLoad = std::min(
    numOperandsOuterDimPerLoad,
    mlir::ceil<unsigned>(tensorShape[dimOuter], elemsPerDPASInst[0]));
Should we align the dim index to dimOuter on elemsPerDPASInst?
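For concreteness, a hypothetical variant of the clamp quoted above with the index aligned; whether elemsPerDPASInst is really indexed in the same [dimOuter, dimInner] order as tensorShape is an assumption here, not something verified against the surrounding code:

```cpp
// Hypothetical variant of the quoted clamp (assumes elemsPerDPASInst uses the
// same [dimOuter, dimInner] ordering as tensorShape).
numOperandsOuterDimPerLoad = std::min(
    numOperandsOuterDimPerLoad,
    mlir::ceil<unsigned>(tensorShape[dimOuter], elemsPerDPASInst[dimOuter]));
```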
%c0_i32 = arith.constant 0 : i32
%c1_i64 = arith.constant 1 : i64
%ptrB = tt.make_tensor_ptr %arg1, [%arg4, %arg3], [%arg7, %c1_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<64x16xi8, #dot1>>
// CHECK-COUNT-1: llvm.call spir_funccc @_Z51intel_sub_group_2d_block_read_transform_8b_32r16x2cPU3AS1viiiDv2_iPj({{.*}}) {{.*}} : (!llvm.ptr<1>{{.*}}, i32, i32, i32, vector<2xi32>, !llvm.ptr{{.*}}) -> ()
From the semantics of the type <tensor<64x16xi8, #dot1>>, the elements of the N dim (the second dimension) are duplicated across the threads. The triton-tensor-layout output shows that value N and value N + 32 should be the same value:
[[ T0:0| T0:32| T16:0| T16:32| T32:0| T32:32| T48:0| T48:32, ...
But the lowering code seems to fill those values with 0 via the block IO. This will be flaky if the returned values are not aligned to the semantics of the type. The expected block IO shape seems like it should be 8b_32r16x1c, with the returned values used to fill up the [32, 32] tile.
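To make the duplication concrete, here is a small standalone sketch of the fill-up described above. The function name, array sizes, and the assumption of a sub-group size of 16 (so an 8b_32r16x1c read of i8 yields 32 values per work-item) are illustrative, not taken from the lowering code:

```cpp
#include <array>
#include <cstdint>

// Illustrative only: replicate the per-work-item result of a 8b_32r16x1c
// block read so that value N and value N + 32 hold the same element, as the
// #dot1 layout requires, instead of leaving the upper half zero-filled.
// (Assumes a sub-group size of 16, so each work-item holds 32 loaded bytes.)
std::array<std::int8_t, 64>
expandLoadedValues(const std::array<std::int8_t, 32> &loaded) {
  std::array<std::int8_t, 64> vals{};
  for (unsigned i = 0; i < 32; ++i) {
    vals[i] = loaded[i];      // value N
    vals[i + 32] = loaded[i]; // value N + 32 duplicates value N
  }
  return vals;
}
```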
While working to generate shuffle vectors using chained linear layouts, I noticed that the size of the 2D block load can exceed the tensor block size. In that case we generate and provide extra shuffle vectors for the values beyond the block size, but it appears those shuffle vectors are never used. This behavior makes it challenging to use linear layouts, since the layout ends up generating a different sequence of shuffle vectors. I can work around it, but it didn't make sense to me to generate big loads if we are not using all the data. So, this PR attempts to downscale a 2D block load to be equal to or less than the block size. I ran the benchmarks and, surprisingly, there appears to be a noticeable uplift in softmax but no change in gemm.


There may be a very, very small uplift in attn as well, or it could just be noise. I tried interpreting the attn tables, but the numbers are all over the place, with this PR's benchmark tag generally being better or worse than the default tag by about 10 TFLOPS. Regardless, even if this is performance neutral, it makes it much cleaner to generate the 2D block loads from linear layouts, so I'd like to land it if possible.
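For reference, a minimal standalone sketch of the clamping idea described above; the function and parameter names are hypothetical, not the ones used in LoadStoreOpToLLVM.cpp:

```cpp
#include <algorithm>

// Illustrative only: cap the number of operands covered by a single 2D block
// load so the load never spans more DPAS-sized tiles than the tensor block
// actually contains along that dimension.
unsigned clampLoadsToTensor(unsigned numOperandsPerLoad, unsigned tensorDimSize,
                            unsigned elemsPerDPASInstDim) {
  // ceil(tensorDimSize / elemsPerDPASInstDim) = tiles available in the tensor.
  unsigned tilesInTensor =
      (tensorDimSize + elemsPerDPASInstDim - 1) / elemsPerDPASInstDim;
  return std::min(numOperandsPerLoad, tilesInTensor);
}
```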