Initial resize scheduler #3556
Conversation
Force-pushed from 5bde3d4 to 7e7db61.
```diff
@@ -4096,64 +4108,85 @@ TEST_F(ResizeTest, PropagateSliceToInputs) {
   auto tv0 = makeConcreteTensor(shape);
   fusion.addInput(tv0);

-  auto tv1 = set(tv0);
+  // Dont't use set here as it gets taken by the no-op scheduler
+  auto tv1 = sin(tv0);
```
The changes from `set` to `sin` or `cos` are just to avoid the preseg transformation from kicking in.
Nothing changed with the tests here (except replacing `set` with `sin`, and one disabled test); I just extended some of the existing tests to use the resize scheduler as well. Not all patterns are supported yet, so those tests just call `GTEST_SKIP` for now.
csrc/scheduler/tools/domain_map.h (outdated)
This is just moved from `pointwise_utils.h`.
csrc/scheduler/tools/domain_map.cpp (outdated)
Just moved from `pointwise_utils` to `domain_map`.
```cpp
namespace nvfuser {

bool ResizeScheduler::canScheduleCompileTime(Fusion* fusion) {
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this initial version, I'm trying to make it very restrictive. Will have several follow-up PRs to schedule the whole RoPE module.
```cpp
#include <scheduler/utils.h>

namespace nvfuser {
namespace pointwise_utils {

// DomainMap uses the ComputeAtMap to find a reference TensorView
```
This part is moved to `scheduler/tools/domain_map.h`.
```diff
@@ -29,37 +29,6 @@ namespace {
 // Unused at the moment, commenting for clang tidy
 constexpr int64_t kThreadX = 128;

-class DomainMap : public pointwise_utils::DomainMap {
```
This part is moved to `pointwise_utils.h` so that it can also be used from the resize scheduler.
csrc/scheduler/pointwise_utils.h (outdated)
```diff
@@ -74,5 +30,44 @@ inline int64_t nRootDims(const TensorView* tv) {
   return tv_n_dims;
 }

+class DomainMap : public scheduler_tools::DomainMap {
```
This is moved from `pointwise.cpp`.
```diff
@@ -432,19 +403,11 @@ std::unique_ptr<PointwiseParams> getPointwiseHeuristics(
   return params;
 }

-// Return reference tensor view.
```
Just moved to `pointwise_utils`.
csrc/scheduler/pointwise_utils.h (outdated)
```cpp
};

// Return reference tensor view.
inline TensorView* getReferenceTensor(Fusion* fusion) {
```
Moved from `pointwise.cpp`. Also shortened the name a bit (was `getReferenceTensorView`).
!test
tests/cpp/test_alias.cpp (outdated)
```diff
@@ -520,6 +520,9 @@ TEST_F(AliasTest, AliasOutputBeforeNonAliasOutput) {
   testValidate(
       executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);

+  // TODO: Fix the alias support
```
This is broken for now. Need to understand how it actually works before this PR.
```diff
@@ -959,34 +962,6 @@ TEST_F(AliasTest, SourceIsBothInputAndOutput) {
   EXPECT_EQ(in_tensor.data_ptr(), out_tensors[1].data_ptr());
 }

-TEST_F(AliasTest, SegmentBoundary) {
```
Probably not relevant as this isn't segmented anymore
tests/cpp/test_gpu3.cpp (outdated)
```diff
   const auto num_segments = kernel_runtime->fusionSegments()->groups().size();
-  NVF_CHECK(num_segments == 3, "Expect 3 segments, got: ", num_segments);
+  EXPECT_EQ(num_segments, 2) << "Expect 2 segments, got: " << num_segments;
   for (const auto& exec : kernel_runtime->executors()) {
```
This is now just segmented to two kernels
```cpp
    if (!exec->isA<KernelExecutor>()) {
      continue;
    }
    if (kernel_runtime->schedulerHeuristics()
```
The gmem requirement isn't relevant for the resize scheduler
Force-pushed from 4ad2ff7 to e8cb381.
!test
…bolicSizes) (#3578)

Stacked on #3585. `StmtSort::getStmtsTo` may not grab all active iter domains if IDs are connected in an unconventional way. For example, we can set the loop domain of a tensor as a producer of its logical domain, but due to the nature of `IterVisitor`, such ID dependency patterns are not supported, meaning `StmtSort::getStmtsTo` would fail to grab all valid IDs and their exprs.

I just recently noticed this issue while working on #3556; specifically, it got exposed as an inconsistent replacement of extent vals. I've been experimenting with such patterns of domains, but I hadn't seen this before, likely because I was using just static-shape tensors for convenience.

To fix the issue, I added a variation of `StmtSort::getStmtsTo` that traverses a fusion as usual but stops at each TensorView. For each TensorView, instead of using `IterVisitor`, it uses `TensorDomain::getAllStatements()`, which combines both `TensorDomain::allIDs()` and `TensorDomain::allExprs()`, and traverses the IDs and exprs in the returned order. It's a bit of a naive implementation, but I think it's good enough for now, and I don't have any other immediate idea to try.

I changed `ValReplacementMutator` to use the new interface. That's the only use for now.

Co-authored-by: Jacob Hinkle <[email protected]>
!test |
All tests passed as of c264867. I decided to make the scheduler opt-in for now since it's unlikely to give any benefit yet.
Followup to #3556. Currently, the resize scheduler only allows a single slice or pad. This PR allows fusing multiple such ops as long as they don't conflict; please see the [comment](https://github.com/NVIDIA/Fuser/pull/3611/files#diff-b066c49d399243d3be36a44f1221490b9a2f50e41074feab836bc9bb6ee71180R25-R100) for `getNonExclusiveResizeInfo`. In this PR, if there's a conflict, the fusion is simply rejected. A followup PR will address this limitation by replicating computations.
This is a very preliminary version of a new scheduler mainly targeted at RoPE. I will incrementally extend this scheduler to be more flexible and performant, but for now it only handles a fusion that has pointwise ops and a single `Resize`-based tensor op such as `SliceOp` or `PadOp`. The scheduling strategy is currently pretty naive too and is manually demonstrated at #3549 and #3555, but the main point is that resize-based tensor ops like `SliceOp` and `PadOp` no longer need to have their inputs as fusion inputs.

The new scheduler is currently placed after the reduction scheduler and before the transpose and pointwise schedulers:
https://github.com/NVIDIA/Fuser/pull/3556/files#diff-c0d261d44c61935fa2d5398f0ac52bd6ea077c6892fb5629c03a425a55fc32f2R64-R74
There are several small changes to some of the existing tests, mainly those on segmentation and alias support, since this new scheduler may change how a fusion is segmented when resize is used. There's one thing I haven't addressed (#3556 (comment)), which I'm tracking in a separate issue.