Ensure ContainerHealthy condition is set back to True #15503
base: main
Conversation
Hi @SaschaSchwarze0. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
cc @dprotaso for review.
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##             main   #15503      +/-   ##
==========================================
- Coverage   83.53%   83.48%   -0.06%
==========================================
  Files         219      219
  Lines       17427    17456      +29
==========================================
+ Hits        14558    14573      +15
- Misses       2498     2507       +9
- Partials      371      376       +5

☔ View full report in Codecov by Sentry.
Force-pushed from 94af51c to 1e89b8d
@dprotaso gentle ping, any objections on this one? It seems ok to me.
@@ -117,6 +117,10 @@ func (c *Reconciler) reconcileDeployment(ctx context.Context, rev *v1.Revision)
	}
}

if *deployment.Spec.Replicas > 0 && *deployment.Spec.Replicas == deployment.Status.ReadyReplicas {
@SaschaSchwarze0 hi, should we relax the condition?
We set the revision as containerUnhealthy when a "permanent"-like failure is detected:
// If a container keeps crashing (no active pods in the deployment although we want some)
if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {
Resetting this should happen when we have at least one pod up (deployment.Status.AvailableReplicas > 0), no?
I am thinking of a bursty load scenario where not all pods become ready (e.g. some new ones stay in pending state and some old ones recover from the earlier issue and become ready). However, we then keep the revision in a false Ready status if we don't reach the desired number of pods as set in the deployment's replicas field.
Could this be true if we have a bursty load where the deployment is set to replicas > currentScale >= minScale >= 1 directly due to some autoscaling decision? 🤔
Hi @skonto, sorry for the late response. I was out for some time.
Can you please help me and clarify what you're after? The code change I am making is to set ContainerHealthy to true. You now bring up a discussion, and a related condition, about when ContainerHealthy should be set to false?
Hi @SaschaSchwarze0, I am saying that a revision may be serving traffic even when not all pods are ready, right? So resetting this to true only when *deployment.Spec.Replicas == deployment.Status.ReadyReplicas seems a bit too strict, no?
I see.
Correct, a revision may be serving traffic even if it does have unhealthy containers in some of the replicas.
Whether that means that it is okay to set ContainerHealthy to True if already one of the replicas is running fine, I do not know.
My proposed condition was basically the point at which I think ContainerHealthy should definitely be set to true (because all replicas are fully ready).
How strict or relaxed we can be, I do not know. At some point I tried to find a spec on the exact meaning of the conditions of the different resources, but I did not find one.
What your condition can btw lead to is the following:
- You have a revision with two replicas, one is not healthy, the other one is.
- The code in https://github.com/knative/serving/blob/v0.42.2/pkg/reconciler/revision/reconcile_resources.go#L86 checks the first pod only. If that one is not healthy, it will set ContainerHealthy=False.
- I think the code here runs later. That one will notice that there are two replicas and one is ready, and set ContainerHealthy to True.
This would not happen with my proposed code, because if one pod is not ready, not all replicas of the deployment are ready. Anyway, just checking the first pod to determine whether ContainerHealthy is set to False is another questionable piece of code, because it leads to different results, depending on the order in which the pods are returned, when the pods have different statuses.
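For illustration, a minimal Go sketch of an order-independent variant of that check, assuming the existing pods list; podHasCrashedContainer is a hypothetical helper standing in for the current per-container check (LastTerminationState.Terminated != nil on a user container):

```go
// Only conclude ContainerHealthy=False when every listed pod shows a crashed
// container, so the outcome no longer depends on the order in which the pods
// are returned by the list call.
allCrashed := len(pods.Items) > 0
for i := range pods.Items {
	if !podHasCrashedContainer(&pods.Items[i]) { // hypothetical helper
		allCrashed = false
		break
	}
}
if allCrashed {
	// only now mark ContainerHealthy=False, as the existing code does for the first pod
}
```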
I agree in general. Btw I am also trying to figure out the semantics. Maybe we should require at least one ready pod to reset to true (assuming we list all pods and we have containerHealthy=false). 🤔
Anyway, just checking the first pod to determine whether ContainerHealthy is set to False is another questionable piece of code because it leads to different results when one has different pod statuses depending on the order in which the pods are returned.
This should happen only when all pods fail, probably for the same reason; that is my understanding of the intention, hence the check for availableReplicas == 0.
if *deployment.Spec.Replicas > 0 && deployment.Status.AvailableReplicas == 0 {
@dprotaso any background info to add for the limitations here?
The semantics here are terrible, because distributed systems. Sorry!
What users want:
- When my application is able to handle requests, Ready should be true
- If Ready is false, there's a clear condition as to why
- If a container crashes on startup, users would like to be able to see the exit message
The real world laughs at this:
- Containers may pass an initial health check and then have a subsequent failure in application code (for example, if a backend dependency becomes unavailable)
- Kubernetes mostly assumes that nodes are interchangeable, but what if pods on one node have different behavior than pods on another node (e.g. network partition)?
- There are a bunch of pre-container-executing failures that can happen, like failure to schedule, pull images, mount volumes, etc. I'm not sure that's relevant here, but worth considering in the big picture
The end result is that "does a container become ready" is not really deterministic (and we don't have a good diagnostic on whether failures are permanent or transient), so all we have here are heuristics.
My gut would be to relax the check to something closer to what Stavros is suggesting: ContainerHealthy can become true if there is at least one ready container. While it's trivial to build counter-example containers for this rule, I think it's probably close enough for most users.
Sorry if all that is unsatisfying...
I think @skonto is suggesting:
if *deployment.Spec.Replicas > 0 && *deployment.Spec.Replicas == deployment.Status.ReadyReplicas {
if deployment.Status.ReadyReplicas >= 1 {
Is that right?
Thanks for the good discussion.
From the comments, I think I can conclude the following:
- There is currently no definition of the ContainerHealthy condition of a Revision.
- We acknowledge that there may be temporary glitches for which we do not necessarily want ContainerHealthy to turn False.
To address this, based on what I personally think and what I read as opinions from others, would we agree on the following?
1. If there is no Pod, we do not change the current status.
2. If any Pod is ready, we set it to True.
3. If no Pod is ready, and if there is at least one non-terminating Pod, and if all non-terminating Pods have a container with LastTerminationState.Terminated != nil, then we set it to False.
4. If neither (2) nor (3) applies, we also do not change the current status.
The code for (2) would be simple because it can be evaluated just by looking at the deployment status:
if deployment.Status.AvailableReplicas > 0 {
	rev.Status.MarkContainerHealthyTrue()
}
The code for (3) is more complex as it requires a look at the Pods. I guess we would need to make https://github.com/knative/serving/blob/v0.42.2/pkg/reconciler/revision/reconcile_resources.go#L84-L117 more complex (= not just look at the first pod, though it can break out immediately if a pod is ready). (2) can probably also be solved here.
This should be safe for scale-up from 0 to 1. In that case, we will have one Pod, none are ready, but none have containers with restarts. We would not change the current status until this one Pod crashes.
Scale-down from 1 to 0 also works. The one Pod may have a restart from long ago (OOM or similar); it will not be Ready while it terminates, but because it is terminating, that is ignored.
Wdyt?
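To make the proposal concrete, here is a minimal Go sketch of rules (1)-(4), assuming the pods list and rev from the existing reconciler code and the MarkContainerHealthy* revision status helpers; isPodReady is a hypothetical helper that checks the PodReady condition, and the reason/message passed on failure are placeholders:

```go
readyPodSeen := false
nonTerminating := 0
crashed := 0

for i := range pods.Items {
	pod := &pods.Items[i]
	// Terminating pods are expected not to be ready, so ignore them.
	if pod.DeletionTimestamp != nil {
		continue
	}
	nonTerminating++

	if isPodReady(pod) { // hypothetical helper checking the PodReady condition
		readyPodSeen = true
		break
	}

	// Count non-terminating pods that show a previously terminated user container.
	for _, status := range pod.Status.ContainerStatuses {
		if status.Name == resources.QueueContainerName {
			continue
		}
		if status.LastTerminationState.Terminated != nil {
			crashed++
			break
		}
	}
}

switch {
case readyPodSeen:
	rev.Status.MarkContainerHealthyTrue() // rule (2)
case nonTerminating > 0 && crashed == nonTerminating:
	// rule (3): reason and message would be derived from the terminated state,
	// as the existing code does; these strings are placeholders.
	rev.Status.MarkContainerHealthyFalse("ExitCode", "container crashed")
default:
	// rules (1) and (4): leave the current ContainerHealthy status unchanged.
}
```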
For reference, more discussion in https://cloud-native.slack.com/archives/C04LMU0AX60/p1733333864902479
I think @skonto is suggesting:
Is that right?
Yes. We tag a revision with containerHealthy=false because all pods are down for a specific condition. After that, the only thing that can happen is for this condition to be reverted to unblock the revision. If the condition is reverted for one pod, it implies that it has been reverted for all pods, independently of when those pods will become ready again. Afaik that was the initial design. So we just need a signal that the blocking condition no longer holds. Serving should handle traffic as usual depending on the number of pods that are ready at any given time (my assumption).
There can be a case where some pods are not scheduled yet, and we do cover that. If there are no pods (some failure has prevented the pods from being created), I think we also cover that, since we propagate the deployment status a few lines above (replicaset failures etc.).
Force-pushed from 1e89b8d to f2356b7
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SaschaSchwarze0

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from e0cc076 to fb7e958
Co-authored-by: Matthias Diester <[email protected]>
Force-pushed from fb7e958 to 3575484
I updated the PR based on the discussion in https://cloud-native.slack.com/archives/C04LMU0AX60/p1733333864902479.
@@ -94,29 +93,47 @@ func (c *Reconciler) reconcileDeployment(ctx context.Context, rev *v1.Revision)
	}
}

// if a Pod is terminating already, we do not care about it being not ready because this is expected to be the case
if pod.DeletionTimestamp != nil {
We want to report a potential error during the exit though, no?
I am actually not sure anymore if this is needed. What I had in mind was the following:
- a Pod is created and runs fine
- at some point the user-container goes out of memory and is restarted; from now on it has LastTerminationState.Terminated set
- and now it runs fine, for an hour, a day maybe
- and then it gets scaled down
- the loop runs
The Pod will not be ready, so the code will check the container status. The idea was to prevent reporting a failure that happened a long time ago.
Though, the only relevant case would be something like a scale-down from 2 to 1 of the ready Pod that had this OOM a long time ago, while the other Pod must be down so that the deployment has zero available replicas. That is very unlikely to happen and definitely not a homogeneous state of the workload across pods.
continue
}

// if a Pod is ready, then do not check it. The fact it is ready may not yet be reflected on the Deployment status.
The previous code covered that already: if a pod is ready, if t := status.LastTerminationState.Terminated;
would never be true for a container, no?
Not sure how often this inconsistency (deployment status vs pod status) will happen, tbh, so I'm unsure about the benefits of the optimization, but it does not hurt either, I suppose.
The scenario I have in mind is the following:
- a Pod is created and runs fine
- at some point the user-container goes out of memory and is restarted; from now on it has LastTerminationState.Terminated set
- the container finishes restarting and becomes ready; though, this may not yet be reflected in the deployment status due to the asynchronous nature of the controller
- now this loop runs
The Pod will be Ready, but if we never check this, we would still iterate the container statuses. And there, status.LastTerminationState.Terminated is not nil.
I could probably still get rid of it, if I change the loop to:
// Iterate the container status and report the first failure
for _, status := range pod.Status.ContainerStatuses {
if status.Name == resources.QueueContainerName || status.Ready {
continue
}
...
}
Note the added || status.Ready at the beginning. This might actually be nicer because we then generally do not report a potentially days-old failure on a ready container (which would matter if multiple containers on a single pod have restarts). Wdyt?
I see. Thinking out loud.
If a pod is ready, all containers are running. Let's say we keep crashing/restarting at some point. The question then is whether we ever distinguish between old and recent failures.
I think if we wanted to detect this reliably without reporting old failures, we would need a pod watcher; check this, which explains it. This way we would know whether we have old issues or whether some container keeps restarting, but that requires tracking restart counts over some time (keeping state).
An implicit way to detect the same, I suspect, is to check whether we are in a waiting state when status.LastTerminationState.Terminated != nil; check status.State.Waiting.
Before a container is restarted there is a backoff time, and the container is in a waiting state with reason = "CrashLoopBackOff". If this holds, then we are in a bad state.
Checking if a pod is ready (at the top) is also a way to avoid old terminations, since if we are not ready something is probably wrong and the LastTerminationState is recent (meaning it is related). So thinking about it again, your approach seems fine to me.
The question for any approach is whether reconciliation will hit the time period where pods are ready only (imagine a crashloop scenario which lasts for some time). We have a test for this: TestContainerExitingMsg, but we don't test the scenario where the container is not exiting permanently but randomly. 🤔
Btw there is a KEP to make crashloopbackoff tunable kubernetes/kubernetes#57291 (comment), which may affect stuff long term.
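A minimal sketch of that implicit check, assuming the status variable from the container loop discussed above; the "CrashLoopBackOff" reason string is what the kubelet currently sets while backing off a restart:

```go
// A container that has already terminated once and is now waiting to be
// restarted (kubelet backoff) is likely in an ongoing crash loop, so its
// last termination is a recent, relevant failure rather than an old one.
crashLooping := status.LastTerminationState.Terminated != nil &&
	status.State.Waiting != nil &&
	status.State.Waiting.Reason == "CrashLoopBackOff"
if crashLooping {
	// report this container's exit message, as the existing code does
}
```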
podsLoop:
for _, pod := range pods.Items {
Are we sure that we will not be triggered enough times to go through a reconciliation phase where all pods are actually in the same status, so that choosing an arbitrary pod would not hurt us? And that would signal a more permanent issue?
Afaik a revision reconciliation will be triggered by updates to any of the following: "Certificate, PA, K8s Deployment". I expect multiple reconciliations that will eventually capture a permanent, blocking issue. 🤔 I guess we are shifting the semantics to capture less permanent errors (one failing pod will be enough to report anything)? Do we still want to go for:
// Arbitrarily grab the very first pod, as they all should be crashing
and make sure all crash (and only then report an issue)? 🤔 Or is it that with this PR we capture the error a bit earlier, due to some container exiting, and eventually they will all be crashing anyway?
I am also worried (maybe not an issue?) about relying on the fact that, on the one hand, we use an informer-based lister for the deployments and, on the other, a direct API server call for pods using the Go client, see for example. 🤔
In this scenario, I would probably be fine with not trying to be as accurate as possible, because it is just about the revision status.
What I tried to do is properly handle the eventual consistency that Kubernetes has due to its persistence and the asynchronous nature of informers and controllers. Reducing those considerations to the minimum needed to prevent crashes (the old code knew from the deployment spec that there should be more than zero replicas, but still ensured that there is an item in the pod list before accessing the first one) would also be okay for me here.
I am also worried (maybe not an issue?) about relying on the fact that on one hand we use an informer based lister for the deployments and direct api server call for pods using the go client, kubernetes/client-go#1383 (comment). 🤔
This would be independent of my changes, but still interesting, as I had not considered that an issue. I am not too familiar with how Knative sets up the Kubernetes client. In the case of controller-runtime, any object access will immediately be cached unless you configure controller-runtime to not cache that object type.
If that pod list call really goes to the Kubernetes API and if we go back to just looking at the first item, we could at least run the list call with limit=1.
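A sketch of what that could look like, reusing the existing list call quoted below; this only makes sense if we go back to inspecting a single, arbitrary pod:

```go
// Ask the API server for at most one pod matching the deployment's selector,
// since only the first returned pod would be inspected anyway.
pods, err := c.kubeclient.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
	LabelSelector: metav1.FormatLabelSelector(deployment.Spec.Selector),
	Limit:         1,
})
if err != nil {
	// propagate the error as the existing code does
}
```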
The call is:
pods, err := c.kubeclient.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: metav1.FormatLabelSelector(deployment.Spec.Selector)})
This is not informer-based, afaik.
In general, Knative projects do **NOT** use the controller-runtime library to create clients or to access Kubernetes objects.
Knative uses:
- kubeclient to access the kube core or apps objects, if we know the group/version/kind info.
- dynamicClient to access kube objects, if we do not know the group/version/kind info but need to figure it out dynamically in the code. For example, https://github.com/knative/serving/blob/main/pkg/reconciler/autoscaling/kpa/scaler.go#L318
- The generated client to access the CRs customized/created for Knative. For example, to get the route, we use https://github.com/knative/serving/blob/main/pkg/reconciler/service/service.go#L248
@SaschaSchwarze0 Are you able to address the comments and wrap up the PR? We are looking forward to this fix.
@SaschaSchwarze0 -- would you like some help getting this over the finish line? I can PR or commit to your branch if you won't have time to complete it; this is apparently also causing pain for @houshengbo and @yuzisun.
Fixes #15487
Proposed Changes
This changes the Revision reconciler to contain a code path that sets the ContainerHealthy condition back from False to True, as the old code path is no longer active (see linked issue). The criterion chosen is whether the deployment has replicas and whether all of them are ready.
Release Note