Operator does not cleanly allow node drains to complete #1701

Open
braunsonm opened this issue Feb 12, 2024 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

braunsonm commented Feb 12, 2024

Describe the bug
On AWS EKS, nodes are set to SchedulingDisabled and pods are evicted in batches (not cordoned). With Knative Serving deployed using the operator, some workloads never drain when HA is set to 3.

Expected behavior
The Knative Operator should allow these components to drain without user interaction.

To Reproduce

  1. Deploy Knative Serving on EKS with HA set to 3 using the Knative Operator
  2. Upgrade the version of Kubernetes by updating the AMI template. This will trigger AWS to do a rolling upgrade of the nodes
  3. Notice that knative-serving components cause the operation to hang indefinitely until a human forcibly kills the pod in question.

Knative release version
1.13.0

Additional context
I have enough nodes that the PDB shouldn't be violated.

braunsonm added the kind/bug label Feb 12, 2024
@braunsonm
Author

I found the problem. When HA is set to 3, the operator creates a PDB where minAvailable is set to 80%. With 3 replicas, this never allows any of those pods to be evicted, since 1 unavailable pod would leave only 66% available.
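
The arithmetic behind this can be checked with a minimal sketch. It assumes the documented Kubernetes behavior of rounding a percentage minAvailable up to a whole number of pods, and that all replicas are currently healthy. The function name `disruptions_allowed` is hypothetical and is not part of any Knative or Kubernetes API.

```python
def disruptions_allowed(replicas: int, min_available_percent: int) -> int:
    """Voluntary evictions a PDB would permit when every replica is healthy.

    Assumption: a percentage minAvailable is rounded up to whole pods.
    """
    # Integer ceiling of replicas * percent / 100 (avoids float rounding).
    required = -(-replicas * min_available_percent // 100)
    return max(0, replicas - required)


# Compare a few replica counts against an 80% minAvailable budget.
for replicas in (2, 3, 5, 10):
    print(f"replicas={replicas} allowed={disruptions_allowed(replicas, 80)}")
```

Under these assumptions, 80% of 3 replicas rounds up to 3 required pods, so the budget leaves no room for even a single eviction; the same percentage only starts permitting evictions at 5 replicas.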

@houshengbo
Contributor

@braunsonm Thanks for reporting the issue. Do you have any suggestions on how the operator could change to avoid this issue?

@braunsonm
Author

@houshengbo I think the operator should set maxUnavailable to 1 as a sensible default (roll each of these critical components one at a time), and continue to allow the user to override that.
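
As a rough illustration of why an absolute maxUnavailable of 1 behaves differently from a percentage minAvailable, here is a small sketch. It assumes the budget requires `replicas - maxUnavailable` healthy pods; the function name `allowed_with_max_unavailable` is hypothetical and only models the arithmetic, not the actual eviction API.

```python
def allowed_with_max_unavailable(replicas: int, healthy: int,
                                 max_unavailable: int = 1) -> int:
    """Evictions permitted right now under an absolute maxUnavailable budget."""
    # The budget needs at least (replicas - max_unavailable) healthy pods.
    required = max(0, replicas - max_unavailable)
    return max(0, healthy - required)


# With 3 replicas: one eviction is allowed while all pods are healthy,
# and none while a previously evicted pod is still being rescheduled.
print(allowed_with_max_unavailable(replicas=3, healthy=3))
print(allowed_with_max_unavailable(replicas=3, healthy=2))
```

In this model a drain can always make progress one pod at a time, regardless of the replica count, while never taking down more than one replica of a component at once.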

@houshengbo
Contributor

Is maxUnavailable for Knative Serving or Knative Eventing? I would rather say this configuration belongs to them, right? The operator does not configure them by default; instead, it reads their manifests and uses the default values from Serving or Eventing. You can use operator CRs to configure the PodDisruptionBudget.

@braunsonm
Author

Yes and no. Max unavailable would be set for serving I think. But if you're configuring HA it would make sense that the operator creates a PDB so that HA is actually guaranteed. Otherwise you could still have an outage if the pods are evicted at the same time.

I agree with allowing overrides, though, like you currently do.

github-actions bot commented Jun 7, 2024

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

github-actions bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Jun 7, 2024
@9numbernine9

/remove-lifecycle stale

knative-prow bot removed the lifecycle/stale label Jun 7, 2024

github-actions bot commented Sep 6, 2024

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

github-actions bot added the lifecycle/stale label Sep 6, 2024
@braunsonm
Author

Still a problem

github-actions bot removed the lifecycle/stale label Sep 7, 2024

github-actions bot commented Dec 6, 2024

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

github-actions bot added the lifecycle/stale label Dec 6, 2024
@braunsonm
Author

/remove-lifecycle stale

knative-prow bot removed the lifecycle/stale label Dec 6, 2024
3 participants