[midend][tests] Add broadcast BatchMatMul optimization pass and tests. #187
Conversation
//===- BatchMatMulOptimize.cpp
//-------------------------------------------------===//
Please reformat the title comment for neatness.
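For reference, the conventional LLVM-style banner would look roughly like this (the description line is hypothetical, not taken from this PR):

```cpp
//===- BatchMatMulOptimize.cpp --------------------------------*- C++ -*-===//
//
// This file implements the optimization for the BatchMatMul operation.
//
//===----------------------------------------------------------------------===//
```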
Value N = rewriter.create<memref::DimOp>(loc, B, 2); // b_col
Value K = rewriter.create<memref::DimOp>(loc, B, 1); // b_row

// build loop body
Please make sure the comments are decent (i.e., capitalized and expressed formally). You should also double-check the other comments in this PR.
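For instance, the quoted snippet above might read along these lines (a sketch of the requested style, not the final code):

```cpp
// Obtain the dimensions of input B: N is the number of columns (b_col)
// and K is the number of rows (b_row).
Value N = rewriter.create<memref::DimOp>(loc, B, 2);
Value K = rewriter.create<memref::DimOp>(loc, B, 1);

// Build the loop body.
```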
Thanks! @EllisLambda I have left some suggestions to make it better, and I think these are enough for this PR. More verifications in real scenarios (e.g., simultaneously process ...). @zhanghb97 you may also see buddy-compiler/buddy-benchmark#73.
@EllisLambda Why add a batch_matmul pass to opt? I think it is inappropriate to add a pass for one specific op in opt.
Nevertheless, it would be better if ...
In fact, there is some truth to this, but I think this PR should explain what optimizations have been made and where they were applied.
The optimization styles of Matmul and BatchMatmul are quite distinct, and the performance of these two operations may differ across hardware. Keeping them separate might be better?
Ok, I will add the explanation.
Yes, we decided to develop ...

Yumin (@EllisLambda), about the "bug" you mention in your description: we can unify the support of floating point and integer with another interface. If not, starting a new discussion about integer operations would be more appropriate, so that the relevant comments in your description could be removed, as this PR only needs to take floating-point operations into account.
Ok! I see.
@xlinsist I have amended my PR.
… optimization pass and tests.
Add "-batchmatmul-optimize" option in buddy-opt, the Matmul part is similar to MatMulBroadcast.mlir in buddy-benchmark. The batch-level loop utilizes affine.parallel. In the case of multiple batches, OpenMP can be utilized for multi-threaded acceleration. The A[batch_idx, 0, 0] which broadcasts in the Matmul will use
affine.prefetch
to prefetch in the outer loop. It could make the Op slightly faster. There's a bug,if (A_elementType.isIntOrFloat())
cannot work properly. Even if the input type is floating-point, the integer branch will still execute. So I modified toif (A_elementType.isIntOrFloat() && 0)
to skip the branch. So far the integer matrix is unavailable.
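For context, `isIntOrFloat()` returns true for both integer and floating-point types, so it cannot by itself select the integer branch. A minimal sketch of a dispatch that distinguishes the two cases (hypothetical structure, not the exact code in this pass):

```cpp
// Sketch only: A_elementType stands for the element type of input A.
// isIntOrFloat() is true for BOTH IntegerType and FloatType, so a branch
// guarded by it alone also fires for floating-point inputs.
if (A_elementType.isa<IntegerType>()) {
  // Integer lowering path, e.g. arith.muli / arith.addi.
} else if (A_elementType.isa<FloatType>()) {
  // Floating-point lowering path, e.g. arith.mulf / arith.addf.
}
```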