[Performance] Add the support of tensor of pointer in the prefetching and loop pipelining #3634
base: main
Conversation
Can you please show the performance impact in % instead of the new TFlops?
We may want to spend some time on code refactoring: a similar code pattern seems to be repeating.
third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUOps.td
third_party/intel/lib/TritonIntelGPUToLLVM/LoadStoreOpToLLVM.cpp
Note: this change is split from #3634. Signed-off-by: Whitney Tsang <[email protected]>
This PR limits prefetching to dense memory only, to avoid polluting the cache. Benchmark CI: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14647344471 No performance impact to key benchmarks. Note: this change comes partially from #3634. --------- Signed-off-by: Whitney Tsang <[email protected]>
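As a hypothetical illustration of the "dense memory only" restriction described above (the function name and the check are mine, not the actual pass logic): a prefetcher might only emit a block prefetch when the candidate addresses are provably contiguous, since a gathered or strided region would drag unrelated cache lines in.

```python
def is_dense(ptrs, elem_size):
    """Return True if a flat list of addresses covers one contiguous
    memory block, i.e. consecutive pointers differ by elem_size."""
    return all(b - a == elem_size for a, b in zip(ptrs, ptrs[1:]))

# A dense region is safe to prefetch as a single block; a region with
# non-uniform stride would pollute the cache with unwanted lines.
base = 0x1000
dense = [base + 4 * i for i in range(8)]       # fp32-sized, contiguous
gather = [base + 4 * i * i for i in range(8)]  # non-uniform stride

print(is_dense(dense, 4))   # → True
print(is_dense(gather, 4))  # → False
```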
This PR adds a new argument `mask` to the prefetch operation. It prepares for prefetching tensors of pointers, since loads from tensors of pointers can be masked. Note: this change comes partially from #3634. --------- Signed-off-by: Whitney Tsang <[email protected]>
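To illustrate why the mask operand matters (a scalar Python sketch with invented names, not the actual Triton IR): a masked prefetch must skip lanes whose mask bit is off, mirroring masked-load semantics, so out-of-bounds addresses are never touched.

```python
def masked_prefetch(addrs, mask):
    """Return the addresses that would actually be prefetched.
    Lanes with mask == False (e.g. out-of-bounds tail elements)
    are skipped, just as a masked load skips them."""
    return [a for a, m in zip(addrs, mask) if m]

# A 6-element row at the tail of a tensor whose true width is 4:
addrs = [0x100 + 4 * i for i in range(6)]
mask = [i < 4 for i in range(6)]  # boundary check on the column index
print(masked_prefetch(addrs, mask))  # only the 4 in-bounds addresses remain
```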
… operation. Add a mask operand for boundary check.
Signed-off-by: Whitney Tsang <[email protected]>
Signed-off-by: Tiotto, Ettore <[email protected]>
Signed-off-by: Tiotto, Ettore <[email protected]>
…r of pointers (#4064) Loads with tensor-of-pointers operands that have been proven to reference memory in row-major order (and are contained in an scf.for loop) are now prefetched using 2D prefetch intrinsics. Note: this PR is derived from PR #3634. --------- Signed-off-by: Tiotto, Ettore <[email protected]>
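A hedged sketch of the analysis this commit describes (the helper name and representation are mine): if every row of a pointer tensor is contiguous and consecutive rows are separated by one constant pitch, the whole tensor can be summarized as (base, pitch, rows, cols) and served by a single 2D block prefetch instead of many scalar ones.

```python
def as_2d_block(ptrs, elem_size):
    """If `ptrs` (a list of rows of addresses) is row major with a
    uniform row pitch, return (base, pitch, rows, cols); else None."""
    rows, cols = len(ptrs), len(ptrs[0])
    # Each row must be contiguous in memory.
    for row in ptrs:
        if any(b - a != elem_size for a, b in zip(row, row[1:])):
            return None
    # All rows must be separated by the same constant pitch.
    pitches = {ptrs[i + 1][0] - ptrs[i][0] for i in range(rows - 1)}
    if len(pitches) != 1:
        return None
    return ptrs[0][0], pitches.pop(), rows, cols

pitch = 1024  # bytes per row of the underlying buffer
ptrs = [[0x2000 + r * pitch + 4 * c for c in range(16)] for r in range(8)]
print(as_2d_block(ptrs, 4))  # one 2D block: (base, pitch, 8, 16)
```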
@@ -427,6 +427,9 @@ struct PrefetchOpConversion
// Swap the shape to make it row major and then get the tiling
// size base on row major shape.
std::swap(tensorShape[0], tensorShape[1]);
tensorType = RankedTensorType::get(
This change will cause GEMM with A transpose to fail.
Yup, just for @chengjunlu to know which changes are not merged from the original change.
Good idea!
Where is the failing case? I'd like to check the bug.
It is likely the one reported in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14777632147/job/41503097292.
It is that one
This is the first implementation of prefetching the memory referred to by tensors of pointers.
There are no degradations, and the targeted workload (gemm-tensor-of-ptr) shows a performance improvement close to 100% (geomean from 17.1 TFlops to 33 TFlops).
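The quoted geomean numbers do work out to roughly a doubling, answering the earlier request to state the impact as a percentage:

```python
# Improvement implied by the quoted geomean numbers (17.1 → 33 TFlops).
before, after = 17.1, 33.0
pct = (after / before - 1) * 100
print(f"{pct:.0f}%")  # → 93%, i.e. "close to 100%"
```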