-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug: dependency waiting blocks too aggressively #455
Comments
@mortent @haiyanmeng @seans3 I've tried to document the motivation behind these proposals we discussed, but haven't gone into a lot of detail about how they might work. Should we tackle all of them in top down order? |
So I don't think we should do "Continue on wait timeout" (where we essentially continue applying all resources after a timeout) as the default at all. It effectively breaks depends-on and apply-time mutations. The dependency filter seems like a good temporary fix if we feel async apply is complex. It should have correct behavior, although there are some drawbacks. The parallel apply is ok if we think it significantly improves performance. Otherwise I'm not convinced it is worth the effort. Long term I think the async apply is the best solution. It does address several issues with both behavior and performance. It probably needs a design doc to get into some more detail about how it should work. |
Morten is wrong, and we should "Continue on wait timeout". While it does not work for depends-on and apply-time mutations, these are ALREADY BROKEN. We already continue on error--just not for wait timeouts. This change would NOT make the current situation worse. But it will be strictly better than the current task execution stoppage for a wait timeout. And it will fix the long-standing bug of not running the final inventory task on wait timeout. The correct error handling for depends on and apply time mutation is something we should work for. But in the short term "continuing on wait timeout" is clearly better than the current state. I will be merging it shortly. As far as the other priorities, I believe the async apply, since it will the the longest and hardest should not be our first effort. |
So the PR in question doesn't make it the default, so I'm ok with that change. It has another issue around how we use the error event that I have pointed out on the PR. I don't disagree that some of the behavior is broken today, I just think we should do the more permanent (since we all agree the async is a bigger change) fix with the dependency filter. The current behavior has been that way for a while, so I think we can take the time to fix it properly. |
Per discussion elsewhere, we decided to change the behavior with #451 to always continue on wait timeouts (and other apply/destory errors, as was previous behavior). This is a change in behavior, but the previous behavior was considered broken (or at least internally inconsistent). Work on Dependency Filter and Parallel Apply should be unblocked. Work on them is mostly a question of priority. It sounds like there's disagreement on whether to do those incremental changes first or just skip to the Async Apply. |
Status update... Continue on wait timeout has already been implemented: We also want to add an optional Continue on Validation Error feature as part of the WIP validation redesign: #488 Dependency Filter is planned and will become more urgent once we have Continue on Validation Error. Parallel Apply is also planned (Q1 2022), but is primarily a speed concern, not a correctness concern. Async (graph) apply will also hopefully be eventually implemented, because it solves multiple problems, but it hasn't been scheduled yet, and may take a long time to get right. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
Remaining fixes are still TODO /remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
/remove-lifecycle rotten FWIW, the dependency filter was added. Also, now that QPS has been disabled there’s not much value in adding parallel apply unless you want to add even more server load. The ideal solution will be to eventually rewrite the apply scheduler to use an asynchronous graph. Unfortunately, it’s not high on the priority list yet, because of other issues. But hopefully we’ll get around to it soon. |
Problems
The applier/destroyer pipelines add Apply/Prune tasks with Wait tasks, waiting for each apply/prune group to become Current (after apply) or NotFound (after prune). This behavior coupled with complex resource sets with multiple dependency branches has the following impact:
Example 1
For example, here's an infra dependency chain with branches (just deploying two GKEs with a CC cluster):
If any cluster fails to apply (ex: error in config), then both node pools are blocked.
Example 2
Another example is just using the same apply for multiple namespaces or CRDs and namespace:
If any CRD or any namespace fails to apply (ex: blocked by policy webhook or config error), then all the deployments and everything in the namespaces are blocked.
Possible solutions
Continue on wait timeout
Dependency filter
Parallel apply
Async (graph) apply
The text was updated successfully, but these errors were encountered: