[Performance] Add the support of tensor of pointer in the prefetching and loop pipelining #3634
base: main
Conversation
Can you please show the performance impact in % instead of the new TFlops?
We may want to spend some time on code refactoring: a similar code pattern seems to be repeating.
third_party/intel/include/Dialect/TritonIntelGPU/IR/TritonIntelGPUOps.td
third_party/intel/lib/TritonIntelGPUToLLVM/LoadStoreOpToLLVM.cpp
Note: this change is split from #3634. Signed-off-by: Whitney Tsang <[email protected]>
This PR limits prefetching to dense memory only, to avoid polluting the cache. Benchmark CI: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14647344471 No performance impact to key benchmarks. Note: this change comes partially from #3634. --------- Signed-off-by: Whitney Tsang <[email protected]>
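As a hypothetical illustration of the "dense memory only" restriction described above (the function name and the check are mine, not the actual pass logic): a prefetcher might only emit a block prefetch when the candidate addresses are provably contiguous, since a gathered or strided region would drag unrelated cache lines in.

```python
def is_dense(ptrs, elem_size):
    """Return True if a flat list of addresses covers one contiguous
    memory block, i.e. consecutive pointers differ by elem_size."""
    return all(b - a == elem_size for a, b in zip(ptrs, ptrs[1:]))

# A dense region is safe to prefetch as a single block; a region with
# non-uniform stride would pollute the cache with unwanted lines.
base = 0x1000
dense = [base + 4 * i for i in range(8)]       # fp32-sized, contiguous
gather = [base + 4 * i * i for i in range(8)]  # non-uniform stride

print(is_dense(dense, 4))   # → True
print(is_dense(gather, 4))  # → False
```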
This PR adds a new argument `mask` to the prefetch operation. It prepares for prefetching tensors of pointers, since loads from tensors of pointers can be masked. Note: this change comes partially from #3634. --------- Signed-off-by: Whitney Tsang <[email protected]>
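To illustrate why the mask operand matters (a scalar Python sketch with invented names, not the actual Triton IR): a masked prefetch must skip lanes whose mask bit is off, mirroring masked-load semantics, so out-of-bounds addresses are never touched.

```python
def masked_prefetch(addrs, mask):
    """Return the addresses that would actually be prefetched.
    Lanes with mask == False (e.g. out-of-bounds tail elements)
    are skipped, just as a masked load skips them."""
    return [a for a, m in zip(addrs, mask) if m]

# A 6-element row at the tail of a tensor whose true width is 4:
addrs = [0x100 + 4 * i for i in range(6)]
mask = [i < 4 for i in range(6)]  # boundary check on the column index
print(masked_prefetch(addrs, mask))  # only the 4 in-bounds addresses remain
```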
… operation. Add a mask operand for boundary check.
Signed-off-by: Whitney Tsang <[email protected]>
Signed-off-by: Tiotto, Ettore <[email protected]>
Signed-off-by: Tiotto, Ettore <[email protected]>
…r of pointers (#4064) Loads with tensor-of-pointers operands that have been proven to reference memory in row-major order (and are contained in an scf.for loop) are now prefetched using 2D prefetch intrinsics. Note: this PR is derived from PR #3634. --------- Signed-off-by: Tiotto, Ettore <[email protected]>
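A hedged sketch of the analysis this commit describes (the helper name and representation are mine): if every row of a pointer tensor is contiguous and consecutive rows are separated by one constant pitch, the whole tensor can be summarized as (base, pitch, rows, cols) and served by a single 2D block prefetch instead of many scalar ones.

```python
def as_2d_block(ptrs, elem_size):
    """If `ptrs` (a list of rows of addresses) is row major with a
    uniform row pitch, return (base, pitch, rows, cols); else None."""
    rows, cols = len(ptrs), len(ptrs[0])
    # Each row must be contiguous in memory.
    for row in ptrs:
        if any(b - a != elem_size for a, b in zip(row, row[1:])):
            return None
    # All rows must be separated by the same constant pitch.
    pitches = {ptrs[i + 1][0] - ptrs[i][0] for i in range(rows - 1)}
    if len(pitches) != 1:
        return None
    return ptrs[0][0], pitches.pop(), rows, cols

pitch = 1024  # bytes per row of the underlying buffer
ptrs = [[0x2000 + r * pitch + 4 * c for c in range(16)] for r in range(8)]
print(as_2d_block(ptrs, 4))  # one 2D block: (base, pitch, 8, 16)
```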
@@ -427,6 +427,9 @@ struct PrefetchOpConversion
// Swap the shape to make it row major and then get the tiling
// size base on row major shape.
std::swap(tensorShape[0], tensorShape[1]);
tensorType = RankedTensorType::get(
This change will cause GEMM with A transpose to fail.
Yup, just for @chengjunlu to know which changes are not merged from the original change.
Good idea!
Where is the failing case? I'd like to check the bug.
It is likely the one reported in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/14777632147/job/41503097292.
It is that one
This is the first implementation of prefetching the memory referred to by tensors of pointers.
There are no degradations, and the targeted workload (gemm-tensor-of-ptr) shows a performance improvement close to 100% (geomean from 17.1 TFlops to 33 TFlops).
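The quoted geomean numbers do work out to roughly a doubling, answering the earlier request to state the impact as a percentage:

```python
# Improvement implied by the quoted geomean numbers (17.1 → 33 TFlops).
before, after = 17.1, 33.0
pct = (after / before - 1) * 100
print(f"{pct:.0f}%")  # → 93%, i.e. "close to 100%"
```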