
Initial resize scheduler #3556

Merged: 38 commits into main, Dec 17, 2024

Conversation

@naoyam (Collaborator) commented Dec 10, 2024

This is a very preliminary version of a new scheduler mainly targeted at RoPE. I will incrementally extend this scheduler to be more flexible and performant, but for now it only handles a fusion that has pointwise ops and a single resize-based tensor op such as SliceOp or PadOp. The scheduling strategy is also pretty naive at this point and is manually demonstrated at #3549 and #3555, but the main point is that resize-based tensor ops like SliceOp or PadOp no longer need to have their inputs as fusion inputs.

The new scheduler is currently placed after the reduction scheduler and before the transpose and pointwise schedulers:

SchedulerType::ExprEval,
    SchedulerType::NoOp,
    SchedulerType::Matmul,
    SchedulerType::Reduction,
    SchedulerType::Resize, <-- New
    SchedulerType::Transpose,
    SchedulerType::PointWise,
    SchedulerType::InnerPersistent,
    SchedulerType::OuterPersistent,
    SchedulerType::InnerOuterPersistent};

https://github.com/NVIDIA/Fuser/pull/3556/files#diff-c0d261d44c61935fa2d5398f0ac52bd6ea077c6892fb5629c03a425a55fc32f2R64-R74

There are several small changes to some of the existing tests, mainly those on segmentation and alias support, since this new scheduler may change how a fusion is segmented when resize is used. There's one thing I haven't addressed (#3556 (comment)), which I'm tracking with a separate issue.
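
For illustration, here is a rough sketch of the kind of fusion this initial version is meant to accept: pointwise ops plus a single resize-based op such as PadOp, whose input is an intermediate tensor rather than a fusion input. This is not code from the PR; the builder calls (makeConcreteTensor, the pad widths created via IrBuilder) follow the style of the existing resize tests and may differ in detail.

// Sketch only: one pointwise producer, one PadOp, one pointwise consumer.
auto fusion_ptr = std::make_unique<Fusion>();
Fusion& fusion = *fusion_ptr;
FusionGuard fg(&fusion);

auto tv0 = makeConcreteTensor({16, 32});
fusion.addInput(tv0);

// Pointwise producer of the resize-based op; note it is not a fusion input.
auto tv1 = sin(tv0);

// The single resize-based op: pad the innermost dimension by one on each side.
auto tv2 = pad(tv1, {IrBuilder::create<Val>(1L), IrBuilder::create<Val>(1L)});

auto tv3 = cos(tv2);
fusion.addOutput(tv3);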

@naoyam force-pushed the resize_scheduler_initial_version branch 2 times, most recently from 5bde3d4 to 7e7db61 on December 10, 2024 20:05
@@ -4096,64 +4108,85 @@ TEST_F(ResizeTest, PropagateSliceToInputs) {
auto tv0 = makeConcreteTensor(shape);
fusion.addInput(tv0);

auto tv1 = set(tv0);
// Don't use set here as it gets taken by the no-op scheduler
auto tv1 = sin(tv0);

@naoyam (Collaborator, Author):

The changes from set to sin or cos are just to keep the preseg transformation from kicking in.

@naoyam (Collaborator, Author) commented Dec 10, 2024:

Nothing changed in the tests here (except replacing set with sin and one disabled test); I just extended some of the existing tests to also use the resize scheduler. Not all patterns are supported yet, so those cases just call GTEST_SKIP for now.

@naoyam (Collaborator, Author):

This is just moved from pointwise_utils.h

@naoyam (Collaborator, Author):

Just moved from pointwise_utils to domain_map


namespace nvfuser {

bool ResizeScheduler::canScheduleCompileTime(Fusion* fusion) {

@naoyam (Collaborator, Author):

In this initial version, I'm trying to make it very restrictive. Will have several follow-up PRs to schedule the whole RoPE module.
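
As a rough sketch of the restriction described here (this is not the actual body of canScheduleCompileTime from the PR; the specific checks are assumptions based on the PR description, which limits the scheduler to pointwise ops plus a single resize-based op):

// Illustration only: accept a fusion consisting of pointwise ops plus exactly
// one resize-based op such as SliceOp or PadOp.
bool canScheduleResizeSketch(Fusion* fusion) {
  int64_t num_resize_ops = 0;
  for (Expr* expr : fusion->exprs()) {
    if (expr->isA<SliceOp>() || expr->isA<PadOp>()) {
      ++num_resize_ops;
    } else if (
        !expr->isA<UnaryOp>() && !expr->isA<BinaryOp>() &&
        !expr->isA<TernaryOp>()) {
      // Anything that is not a simple pointwise op is rejected for now.
      return false;
    }
  }
  // Exactly one resize-based tensor op is allowed in this initial version.
  return num_resize_ops == 1;
}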

#include <scheduler/utils.h>

namespace nvfuser {
namespace pointwise_utils {

// DomainMap uses the ComputeAtMap to find a reference TensorView

@naoyam (Collaborator, Author):

This part is moved to scheduler/tools/domain_map.h

@@ -29,37 +29,6 @@ namespace {
// Unused at the moment, commenting for clang tidy
constexpr int64_t kThreadX = 128;

class DomainMap : public pointwise_utils::DomainMap {

@naoyam (Collaborator, Author):

This part is moved to pointwise_utils.h so that it can also be used from the resize scheduler.

@@ -74,5 +30,44 @@ inline int64_t nRootDims(const TensorView* tv) {
return tv_n_dims;
}

class DomainMap : public scheduler_tools::DomainMap {

@naoyam (Collaborator, Author):

This is moved from pointwise.cpp

Base automatically changed from rotation_residual_support to main December 10, 2024 22:46
@@ -432,19 +403,11 @@ std::unique_ptr<PointwiseParams> getPointwiseHeuristics(
return params;
}

// Return reference tensor view.

@naoyam (Collaborator, Author):

Just moved to pointwise_utils

};

// Return reference tensor view.
inline TensorView* getReferenceTensor(Fusion* fusion) {

@naoyam (Collaborator, Author):

Moved from pointwise.cpp. Also shortened the name a bit (was getReferenceTensorView)
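
For instance, a caller can now obtain the reference tensor through the shared helper; the header path and namespace below are assumptions based on the comments above rather than code from the PR.

// Sketch only: use the shared helper so the pointwise and resize schedulers
// rely on the same reference-tensor logic.
#include <scheduler/pointwise_utils.h>

void scheduleWithReference(Fusion* fusion) {
  TensorView* reference = pointwise_utils::getReferenceTensor(fusion);
  if (reference == nullptr) {
    // No valid reference tensor could be found; bail out.
    return;
  }
  // ... schedule the reference tensor and propagate to the rest of the fusion ...
}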

@naoyam (Collaborator, Author) commented Dec 11, 2024: !test

@@ -520,6 +520,9 @@ TEST_F(AliasTest, AliasOutputBeforeNonAliasOutput) {
testValidate(
executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);

// TODO: Fix the alias support

@naoyam (Collaborator, Author):

This is broken for now. I need to understand how it actually worked before this PR.

@@ -959,34 +962,6 @@ TEST_F(AliasTest, SourceIsBothInputAndOutput) {
EXPECT_EQ(in_tensor.data_ptr(), out_tensors[1].data_ptr());
}

TEST_F(AliasTest, SegmentBoundary) {

@naoyam (Collaborator, Author):

Probably not relevant as this isn't segmented anymore

const auto num_segments = kernel_runtime->fusionSegments()->groups().size();
NVF_CHECK(num_segments == 3, "Expect 3 segments, got: ", num_segments);
EXPECT_EQ(num_segments, 2) << "Expect 2 segments, got: " << num_segments;
for (const auto& exec : kernel_runtime->executors()) {

@naoyam (Collaborator, Author):

This is now segmented into just two kernels.

if (!exec->isA<KernelExecutor>()) {
continue;
}
if (kernel_runtime->schedulerHeuristics()

@naoyam (Collaborator, Author):

The gmem requirement isn't relevant for the resize scheduler

@naoyam force-pushed the resize_scheduler_initial_version branch from 4ad2ff7 to e8cb381 on December 11, 2024 09:22
@naoyam changed the base branch from main to enable_id_model_for_resize on December 11, 2024 09:22

@naoyam (Collaborator, Author) commented Dec 13, 2024: !test

@naoyam (Collaborator, Author) commented Dec 13, 2024: !test

@naoyam (Collaborator, Author) commented Dec 15, 2024: !test

@naoyam (Collaborator, Author) commented Dec 15, 2024: !test

@naoyam (Collaborator, Author) commented Dec 15, 2024: !test

@naoyam (Collaborator, Author) commented Dec 16, 2024: !test

jacobhinkle added a commit that referenced this pull request Dec 16, 2024
…bolicSizes) (#3578)

Stacked on #3585 

`StmtSort::getStmtsTo` may not grab all active iter domains if IDs are
connected in an unconventional way. For example, we can set the loop
domain of a tensor as a producer of its logical domain, but due to the
nature of `IterVisitor`, such ID dependency patterns are not supported,
meaning `StmtSort::getStmtsTo` would fail to grab all valid IDs and
their exprs.

I only recently noticed this issue while working on #3556; specifically, it was exposed as an inconsistent replacement of extent vals. I've been experimenting with such patterns of domains, but I hadn't seen this before, likely because I was using only static-shape tensors for convenience.

To fix the issue, I added a variation of `StmtSort::getStmtsTo`, which traverses a fusion as usual but stops at TensorView. For each TensorView, instead of using `IterVisitor`, it uses `TensorDomain::getAllStatements()`, which combines both `TensorDomain::allIDs()` and `TensorDomain::allExprs()`, and traverses the IDs and exprs in the returned order.

It's a somewhat naive implementation, but I think it's good enough for now, and I don't have any other immediate idea to try.

I changed `ValReplacementMutator` to use the new interface. That's the
only use for now.

---------

Co-authored-by: Jacob Hinkle <[email protected]>
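
A rough sketch of the traversal variant described above (signatures simplified; this is not the actual implementation): traverse toward the given values as usual, but expand each TensorView through TensorDomain::getAllStatements() so that iter domains connected in unconventional ways, such as loop domains defined as producers of the logical domain, are not missed.

// Sketch only; StmtSort::getStmtsTo is assumed to provide the usual traversal.
std::vector<Statement*> getAllStmtsToSketch(const std::vector<Val*>& to) {
  std::vector<Statement*> result;
  for (Statement* stmt : StmtSort::getStmtsTo(to)) {
    result.push_back(stmt);
    if (stmt->isA<TensorView>()) {
      auto* tv = stmt->as<TensorView>();
      // getAllStatements() combines TensorDomain::allIDs() and
      // TensorDomain::allExprs() and returns them in a consistent order.
      for (Statement* id_stmt : tv->domain()->getAllStatements()) {
        result.push_back(id_stmt);
      }
    }
  }
  return result;
}
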
naoyam added a commit that referenced this pull request Dec 17, 2024
Followup to #3514. Use `compareDomainWithReference` from the
TensorDomain constructors too.

This change is required for #3556.

@naoyam (Collaborator, Author) commented Dec 17, 2024: !test

@naoyam (Collaborator, Author) commented Dec 17, 2024:

All tests passed as of c264867. I decided to make the scheduler opt-in for now since it's unlikely to give any benefit yet. The NVFUSER_ENABLE option resize_scheduler can be used to enable it.
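
For example, a test run might opt in like this (the option name comes from the comment above; the binary and filter shown are just placeholders):

# Enable the opt-in resize scheduler via the NVFUSER_ENABLE environment option.
NVFUSER_ENABLE=resize_scheduler ./bin/test_nvfuser --gtest_filter='ResizeTest*'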

@naoyam merged commit a880557 into main Dec 17, 2024
48 checks passed
@naoyam deleted the resize_scheduler_initial_version branch December 17, 2024 09:27
naoyam added a commit that referenced this pull request Dec 20, 2024
Followup to #3556. Currently, the resize scheduler is only allowed with
a single slice or pad. This PR allows for fusing multiple ops as long as
they don't conflict. Please see the
[comment](https://github.com/NVIDIA/Fuser/pull/3611/files#diff-b066c49d399243d3be36a44f1221490b9a2f50e41074feab836bc9bb6ee71180R25-R100)
for `getNonExclusiveResizeInfo`.

In this PR, if there's a conflict, the fusion is simply rejected. A
followup PR will address this limitation by replicating computations.