
Bug: dependency waiting blocks too aggressively #455

Open
karlkfi opened this issue Oct 28, 2021 · 11 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@karlkfi
Contributor

karlkfi commented Oct 28, 2021

Problems

The applier/destroyer pipelines pair Apply/Prune tasks with Wait tasks, waiting for each apply/prune group to become Current (after apply) or NotFound (after prune). Coupled with complex resource sets that have multiple dependency branches, this behavior has the following impact (illustrated by the sketch after this list):

  1. apply/prune can be blocked for some objects even when their own dependencies are ready
  2. each Wait task blocks on the slowest reconciliation in the previous apply phase, even if it doesn't time out
  3. because a wait timeout currently causes the pipeline to terminate, any object that causes a wait timeout blocks all objects in subsequent phases from being applied/pruned
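
A minimal sketch of the blocking behavior, in Go. The task struct and pipeline contents below are invented for illustration; this is not the actual cli-utils task runner:

```go
// Sketch: a flattened pipeline runs tasks strictly in order, so every Wait
// task must finish before the next Apply task starts, and one slow or failed
// reconciliation holds up unrelated objects in later phases.
package main

import (
	"errors"
	"fmt"
)

// task is a hypothetical stand-in for the applier's Apply/Wait tasks.
type task struct {
	name string
	run  func() error
}

func main() {
	pipeline := []task{
		{"apply-0 (namespace, rbac, clusters)", func() error { return nil }},
		{"wait-0 (block until Current)", func() error {
			// cluster2 never reconciles, so this wait times out.
			return errors.New("timed out waiting for GKE cluster2")
		}},
		{"apply-1 (node pools)", func() error { return nil }},
		{"wait-1 (block until Current)", func() error { return nil }},
	}

	for _, t := range pipeline {
		if err := t.run(); err != nil {
			// Before #451, the pipeline terminated here, so node pool1 was
			// never applied even though cluster1 reconciled fine.
			fmt.Printf("%s failed: %v; remaining tasks skipped\n", t.name, err)
			return
		}
		fmt.Printf("%s succeeded\n", t.name)
	}
}
```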

Example 1

For example, here's an infra dependency chain with branches (just deploying two GKE clusters with a Config Connector (CC) cluster):

  • namespace
    • rbac
    • GKE cluster1
      • node pool1 (depends-on cluster1)
    • GKE cluster2
      • node pool2 (depends-on cluster2)

If either cluster fails to apply (ex: an error in its config), then both node pools are blocked, even though only one of them depends on the failing cluster.
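
For reference, the "depends-on cluster1" relationship above would be declared with the depends-on annotation on the node pool object. A minimal sketch in Go, assuming Config Connector kinds; the apiVersion, kind, namespace, and names below are illustrative, not taken from a real config:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func main() {
	// Hypothetical node pool object that declares a dependency on its cluster.
	nodePool := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "container.cnrm.cloud.google.com/v1beta1",
		"kind":       "ContainerNodePool",
		"metadata": map[string]interface{}{
			"name":      "node-pool1",
			"namespace": "infra",
			"annotations": map[string]interface{}{
				// Value format: <group>/namespaces/<namespace>/<kind>/<name>
				"config.kubernetes.io/depends-on": "container.cnrm.cloud.google.com/namespaces/infra/ContainerCluster/cluster1",
			},
		},
	}}
	fmt.Println(nodePool.GetAnnotations()["config.kubernetes.io/depends-on"])
}
```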

Example 2

Another example is a single apply that includes multiple namespaces, or CRDs and namespaces along with the resources in them:

  • CRD
  • namespace1
    • deployment1
  • namespace2
    • deployment2

If the CRD or either namespace fails to apply (ex: blocked by a policy webhook or a config error), then both deployments, and everything else in those namespaces, are blocked.

Possible solutions

Continue on wait timeout

  • This helps with problems 1 and 3, but not problem 2, and it makes failures take longer to surface, because all resources get applied even when we know their dependencies aren't ready.

Dependency filter

  • This would help remediate the side effect of "continue on wait timeout" by skipping apply/prune for objects whose dependencies are not reconciled (see the sketch after this list)
  • This would need to be used on both apply and prune
  • This would need to share logic with the graph sorter, which currently handles identifying dependencies (depends-on, apply-time-mutation, CRDs, and namespaces)
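
A rough sketch of what such a filter could look like, assuming a hypothetical dependencyFilter type fed by the graph sorter and the status poller; this is not the filter interface cli-utils actually exposes:

```go
package main

import "fmt"

// objectID and the maps below are stand-ins for the real object metadata
// and status cache.
type objectID string

type dependencyFilter struct {
	dependencies map[objectID][]objectID // from the graph sorter
	reconciled   map[objectID]bool       // fed by the status poller
}

// skip reports whether obj should be skipped and why.
func (f dependencyFilter) skip(obj objectID) (bool, string) {
	for _, dep := range f.dependencies[obj] {
		if !f.reconciled[dep] {
			return true, fmt.Sprintf("dependency %s is not reconciled", dep)
		}
	}
	return false, ""
}

func main() {
	f := dependencyFilter{
		dependencies: map[objectID][]objectID{"node-pool1": {"cluster1"}},
		reconciled:   map[objectID]bool{"cluster1": false},
	}
	if skip, reason := f.skip("node-pool1"); skip {
		fmt.Println("skipping node-pool1:", reason)
	}
}
```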

Parallel apply

  • Applying objects in parallel during each Apply task (and deleting in parallel during Prune tasks) would speed up end-to-end apply time significantly, helping to mitigate the cost of dependencies, but it does not actually solve problem 1 or problem 2 (see the sketch below)
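
A minimal sketch of bounded parallelism within one apply phase, using golang.org/x/sync/errgroup. Objects in the same phase have no dependencies on each other, so they can be sent to the server concurrently; applyObject is a hypothetical stand-in for the per-object apply call:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// applyObject is a placeholder for applying a single object to the cluster.
func applyObject(ctx context.Context, name string) error {
	fmt.Println("applying", name)
	return nil
}

func main() {
	objects := []string{"deployment1", "deployment2", "deployment3"}

	g, ctx := errgroup.WithContext(context.Background())
	g.SetLimit(4) // bound concurrency so we don't overload the API server
	for _, name := range objects {
		name := name // capture loop variable (needed before Go 1.22)
		g.Go(func() error { return applyObject(ctx, name) })
	}
	if err := g.Wait(); err != nil {
		fmt.Println("apply phase failed:", err)
	}
}
```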

Async (graph) apply

  • Building a branching pipeline, instead of flattening into a single synchronous pipeline, would isolate dependencies so that they only block the objects that depend on them (see the sketch after this list).
  • This is probably the best solution, but also the most complex and risky.
  • This would probably require changing how events are tested/used. Consuming them in a single for loop might not be the best strategy any more; async listeners might be better.
  • This would make both Dependency filter and Parallel apply obsolete.
  • This might also make Continue on wait timeout obsolete, because there might be more than one task branch executing and one could terminate even if the others continue.
  • This is the only proposal that solves problem 2, which will become a bigger problem as soon as people start using depends-on and apply-time-mutation on infra resources (like GKE clusters) that take a long time to reconcile.
  • This solution might make it easier to add support for advanced features like blocking on Job completion or lifecycle directives (ex: delete before apply)
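
A rough sketch of the graph execution idea, not a design and not how the cli-utils event stream would actually be wired up: each object gets its own worker that blocks only on the objects it declares as dependencies, so a slow branch does not stall unrelated branches.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Dependency edges from Example 1 (object -> objects it depends on).
	deps := map[string][]string{
		"namespace":  {},
		"rbac":       {"namespace"},
		"cluster1":   {"namespace"},
		"cluster2":   {"namespace"},
		"node-pool1": {"cluster1"},
		"node-pool2": {"cluster2"},
	}

	done := make(map[string]chan struct{}, len(deps))
	for name := range deps {
		done[name] = make(chan struct{})
	}

	var wg sync.WaitGroup
	for name, requires := range deps {
		wg.Add(1)
		go func(name string, requires []string) {
			defer wg.Done()
			for _, dep := range requires {
				<-done[dep] // block only on this object's own dependencies
			}
			fmt.Println("applying and waiting for reconciliation:", name)
			// A real implementation would propagate failures to dependents
			// instead of simply signaling success here.
			close(done[name])
		}(name, requires)
	}
	wg.Wait()
}
```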
@karlkfi
Contributor Author

karlkfi commented Oct 28, 2021

@mortent @haiyanmeng @seans3 I've tried to document the motivation behind these proposals we discussed, but haven't gone into a lot of detail about how they might work.

Should we tackle all of them in top-down order?
Or should we skip straight to the hardest one, which makes the others obsolete?
Got other ideas that would resolve the same issues?

@mortent
Member

mortent commented Oct 29, 2021

So I don't think we should do "Continue on wait timeout" (where we essentially continue applying all resources after a timeout) as the default at all. It effectively breaks depends-on and apply-time mutations.

The dependency filter seems like a good temporary fix if we feel async apply is complex. It should have correct behavior, although there are some drawbacks.

The parallel apply is ok if we think it significantly improves performance. Otherwise I'm not convinced it is worth the effort.

Long term I think the async apply is the best solution. It does address several issues with both behavior and performance. It probably needs a design doc to get into some more detail about how it should work.

@seans3
Contributor

seans3 commented Oct 29, 2021

Morten is wrong, and we should "Continue on wait timeout". While it does not work for depends-on and apply-time mutations, these are ALREADY BROKEN. We already continue on error, just not for wait timeouts. This change would NOT make the current situation worse, but it will be strictly better than the current behavior of stopping task execution on a wait timeout. And it will fix the long-standing bug of not running the final inventory task on wait timeout.

The correct error handling for depends-on and apply-time mutation is something we should work toward. But in the short term, "continuing on wait timeout" is clearly better than the current state. I will be merging it shortly.

As for the other priorities, I believe the async apply, since it will be the longest and hardest, should not be our first effort.

@mortent
Member

mortent commented Oct 29, 2021

The PR in question doesn't make it the default, so I'm OK with that change. It has another issue around how we use the error event, which I have pointed out on the PR.

I don't disagree that some of the behavior is broken today; I just think we should do the more permanent fix with the dependency filter (since we all agree the async apply is a bigger change). The current behavior has been that way for a while, so I think we can take the time to fix it properly.

@karlkfi
Contributor Author

karlkfi commented Nov 1, 2021

Per discussion elsewhere, we decided to change the behavior with #451 to always continue on wait timeouts (and on other apply/destroy errors, as was the previous behavior).

This is a change in behavior, but the previous behavior was considered broken (or at least internally inconsistent).

Work on the Dependency Filter and Parallel Apply is now unblocked and mostly a question of priority. It sounds like there's disagreement on whether to do those incremental changes first or skip straight to the Async Apply.

@karlkfi
Contributor Author

karlkfi commented Dec 8, 2021

Status update...

Continue on wait timeout has already been implemented (#451).

We also want to add an optional Continue on Validation Error feature as part of the WIP validation redesign: #488

Dependency Filter is planned and will become more urgent once we have Continue on Validation Error.

Parallel Apply is also planned (Q1 2022), but is primarily a speed concern, not a correctness concern.

Async (graph) apply will also hopefully be implemented eventually, because it solves multiple problems, but it hasn't been scheduled yet and may take a long time to get right.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 8, 2022
@karlkfi
Contributor Author

karlkfi commented Mar 25, 2022

Remaining fixes are still TODO

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 25, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 23, 2022
@karlkfi
Contributor Author

karlkfi commented Jul 24, 2022

/remove-lifecycle rotten
/lifecycle frozen

FWIW, the dependency filter was added. Also, now that client-side QPS throttling has been disabled, there's not much value in adding parallel apply unless you want to add even more server load. The ideal solution is still to eventually rewrite the apply scheduler to use an asynchronous graph.

Unfortunately, it’s not high on the priority list yet, because of other issues. But hopefully we’ll get around to it soon.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jul 24, 2022