[Performance] Add the support of tensor of pointer in the prefetching and loop pipelining #3634

Draft: wants to merge 15 commits into main

Conversation


@chengjunlu chengjunlu commented Mar 10, 2025

This is the first implementation of prefetching memory referenced by tensors of pointers.
There are no regressions, and the targeted workload (gemm-tensor-of-ptr) shows a performance improvement of close to 100% (geomean from 17.1 TFLOPS to 33 TFLOPS).
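For reference, the relative improvement implied by those geomean numbers can be computed directly (a reviewer below asks for the impact in % rather than raw TFLOPS):

```python
# Relative speedup implied by the geomean TFLOPS reported above.
before_tflops = 17.1
after_tflops = 33.0
speedup = after_tflops / before_tflops          # ~1.93x
improvement_pct = (speedup - 1.0) * 100.0       # ~93%
print(f"{speedup:.2f}x ({improvement_pct:.0f}% improvement)")  # 1.93x (93% improvement)
```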

@chengjunlu chengjunlu force-pushed the chengjun/tensorptr_prefetch branch 2 times, most recently from ab40c73 to ddd2c17 on March 12, 2025 at 06:51
@chengjunlu chengjunlu changed the title [Draft] Add the support of tensor of pointer in the prefetching and loop pipelining [Performance] Add the support of tensor of pointer in the prefetching and loop pipelining Mar 12, 2025
@chengjunlu chengjunlu force-pushed the chengjun/tensorptr_prefetch branch 2 times, most recently from 2a5b53e to f6b90ca on March 13, 2025 at 07:38
@whitneywhtsang (Contributor):

Can you please show the performance impact in % instead of the new TFlops?

@whitneywhtsang (Contributor) left a comment:

We may want to spend some time refactoring; similar code patterns seem to be repeated.

@etiotto etiotto requested a review from a team April 15, 2025 14:15
@whitneywhtsang whitneywhtsang force-pushed the chengjun/tensorptr_prefetch branch from f6b90ca to 6334ed5 on April 24, 2025 at 17:28
whitneywhtsang added a commit that referenced this pull request Apr 24, 2025
Note: this change was split from #3634.

Signed-off-by: Whitney Tsang <[email protected]>
whitneywhtsang added a commit that referenced this pull request Apr 25, 2025
This PR limits prefetching to dense memory only, to avoid polluting the
cache.
Benchmark CI:
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14647344471
No performance impact to key benchmarks.

Note: this change comes partially from #3634.

---------

Signed-off-by: Whitney Tsang <[email protected]>
@whitneywhtsang whitneywhtsang force-pushed the chengjun/tensorptr_prefetch branch from 3b523ab to 1f05df2 on April 25, 2025 at 01:39
whitneywhtsang added a commit that referenced this pull request Apr 25, 2025
This PR adds a new `mask` argument to the prefetch operation, in
preparation for prefetching tensors of pointers, since loads from
tensors of pointers can be masked.

Note: this change comes partially from #3634.

---------

Signed-off-by: Whitney Tsang <[email protected]>
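The semantics the new `mask` argument has to respect can be sketched in plain Python (illustrative only; `masked_load` and the dict-as-memory model are not part of the actual IR or Triton API):

```python
def masked_load(memory, ptrs, mask, other=0.0):
    """Model of a masked load from a tensor of pointers: lanes whose mask
    bit is off must not dereference their pointer (it may be invalid) and
    yield `other` instead. A masked prefetch likewise skips those lanes."""
    return [memory[p] if m else other for p, m in zip(ptrs, mask)]

# 3-lane example: the last pointer is past the end of the buffer,
# so its lane is masked off and never dereferenced.
memory = {0x100: 1.0, 0x104: 2.0}
ptrs = [0x100, 0x104, 0x108]
mask = [True, True, False]
print(masked_load(memory, ptrs, mask))  # [1.0, 2.0, 0.0]
```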
@whitneywhtsang whitneywhtsang force-pushed the chengjun/tensorptr_prefetch branch 2 times, most recently from 2a0f6b8 to 4bbb4f9 on April 27, 2025 at 22:23
@whitneywhtsang whitneywhtsang force-pushed the chengjun/tensorptr_prefetch branch 2 times, most recently from 75eb507 to ad37157 on April 28, 2025 at 01:02
@whitneywhtsang whitneywhtsang force-pushed the chengjun/tensorptr_prefetch branch from 08e88f3 to e3d441a on April 29, 2025 at 03:59
Signed-off-by: Tiotto, Ettore <[email protected]>
etiotto added a commit that referenced this pull request May 2, 2025
…r of pointers (#4064)

Loads with tensor-of-pointers operands that have been proven to
reference memory in row-major order (and are contained in an scf.for
loop) are now prefetched using 2D prefetch intrinsics.

Note: This PR is derived from PR #3634

---------

Signed-off-by: Tiotto, Ettore <[email protected]>
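The condition "proven to reference memory in row-major order" can be illustrated on concrete addresses. This is only a sketch: the real analysis runs on the IR that defines the pointer tensor, not on runtime addresses, and `is_row_major_dense` is a hypothetical name.

```python
def is_row_major_dense(addrs, elem_size):
    """addrs: 2D list of byte addresses produced by a tensor of pointers.
    Returns True when every row is contiguous (innermost stride equals
    elem_size) and rows advance by one constant pitch, i.e. an access
    pattern a 2D block prefetch intrinsic can cover."""
    rows, cols = len(addrs), len(addrs[0])
    for r in range(rows):
        for c in range(cols - 1):
            if addrs[r][c + 1] - addrs[r][c] != elem_size:
                return False  # row is not densely packed
    pitch = addrs[1][0] - addrs[0][0] if rows > 1 else 0
    return all(addrs[r + 1][0] - addrs[r][0] == pitch for r in range(rows - 1))

# A 4x8 fp32 tile with a 256-byte row pitch is row-major dense.
base, pitch, esz = 0x1000, 256, 4
addrs = [[base + r * pitch + c * esz for c in range(8)] for r in range(4)]
print(is_row_major_dense(addrs, esz))  # True
```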
@@ -427,6 +427,9 @@ struct PrefetchOpConversion
// Swap the shape to make it row major and then compute the tiling
// size based on the row-major shape.
std::swap(tensorShape[0], tensorShape[1]);
tensorType = RankedTensorType::get(
Contributor:

This change will cause GEMM with A transpose to fail.

Contributor:

Yup, just so @chengjunlu knows which changes are not merged from the original change.

Contributor:

Good idea!

Contributor (author):

Where is the failing case? I'd like to check the bug.

Contributor:

It is that one

3 participants