
Make queue-proxy NOT count the health check requests by the users #14581

Open

houshengbo opened this issue Nov 1, 2023 · 12 comments · May be fixed by #15653
Labels
kind/feature Well-understood/specified features, ready for coding. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. triage/accepted Issues which should be fixed (post-triage)

Comments

@houshengbo
Contributor

houshengbo commented Nov 1, 2023

Describe the feature

We run a platform that manages all of our Knative services. Each Knative service reports its health status via a separate URL/path, and that path is NOT consistent across the different Knative services (just confirmed internally with the team).

For example, the health check path of a Knative service is determined by the model deployed in it; different models use different URLs for their health checks.

Every time the health check is called, the request is counted by the queue-proxy, which makes the metrics data inaccurate, since all the health check requests are counted as well.

We are wondering whether Knative Serving can distinguish between normal requests and health check requests, so that health check requests are not counted by the queue-proxy.

@skonto @dprotaso

@houshengbo houshengbo added the kind/feature Well-understood/specified features, ready for coding. label Nov 1, 2023
@skonto
Contributor

skonto commented Nov 2, 2023

Hi @houshengbo

Every time the health check is called, the request is counted by the queue-proxy, which makes the metrics data inaccurate, since all the health check requests are counted as well.

Could you specify which metrics are not accurate? Are these Knative metrics or external ones?
I see that we filter probes here, and locally I didn't see this type of request being counted.

By the way, do you use KServe and https://github.com/kserve/kserve/blob/master/qpext/cmd/qpext/main.go?
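
For context, the probe filter linked above appears to boil down to a header/User-Agent check that runs before a request is counted for metrics. Below is a minimal standalone sketch of that kind of check — not the actual Knative code; the real queue-proxy wiring and the exact set of headers it inspects may differ:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"sync/atomic"
)

// isProbe approximates the check queue-proxy performs before counting a
// request: kubelet readiness/liveness probes carry a "kube-probe/" User-Agent,
// and Knative-originated probes set the K-Kubelet-Probe or K-Network-Probe
// headers. Matching requests are treated as probes and skipped by metrics.
func isProbe(r *http.Request) bool {
	if strings.HasPrefix(r.Header.Get("User-Agent"), "kube-probe/") {
		return true
	}
	return r.Header.Get("K-Kubelet-Probe") != "" ||
		r.Header.Get("K-Network-Probe") != ""
}

func main() {
	var requestCount atomic.Int64

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if !isProbe(r) {
			requestCount.Add(1) // only "real" traffic contributes to the count
		}
		fmt.Fprintf(w, "counted requests so far: %d\n", requestCount.Load())
	})

	http.ListenAndServe(":8080", nil)
}
```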

@houshengbo
Contributor Author

@skonto We are using KServe to launch the inference services, which in turn launch the Knative services that back them. The health check comes from outside the inference services: each inference service usually exposes an endpoint that is called to check its health. When a request reaches the inference service, whether it is a real inference call or just a health check, it is always counted as a valid request for that service and added to the metrics. For us it would be more meaningful to count only the requests that call the inference service itself, excluding the requests that hit the health check endpoint.

I saw the filter here. Is there a specific header key we can set so that the request will not be counted?

@skonto
Contributor

skonto commented Jan 23, 2024

The health check comes from outside the inference services.

Is it the kubelet that checks the health of the isvc? The filter mentioned above checks for the K-Kubelet-Probe header (among others) and does not count such requests for metrics reporting. Could you provide more details about how you use KServe and a reproducer for the issue? Could you show an example with KServe that I can test?
For example, do you use https://github.com/kserve/kserve/tree/master/qpext? On the opendatahub side we have been discussing probes for KServe lately, but we haven't noticed anything related.

@yuzisun

yuzisun commented Jan 27, 2024

@skonto We use consul for service discovery, and consul sends health checks to the KServe inference services, which are currently counted as normal requests by Knative. Are you suggesting adding the K-Kubelet-Probe header to the consul health checks?

@skonto
Contributor

skonto commented Jan 29, 2024

Hi @yuzisun, I am not familiar with your architecture, but I see a couple of options:

  • Use consul capabilities if you are using their service mesh feature. I haven't tested that, but it seems similar to the Istio case, where probes are rewritten; in any case it should unify the probes with the kubelet's. It could be, though, that you are running Consul elsewhere and only checking remote services, in which case you could define a script check that gets the readiness status of the pods on K8s.
  • Try setting K-Kubelet-Probe in the consul health check definition, along with some other custom header (so you can distinguish the consul health check requests from the real kubelet ones), in order to work around the current probe mechanism in Knative (needs testing).
  • Extend the probing mechanism for cases like this by allowing user-defined/external probes that are handled by the QP. Knative should not be affected by these requests; they should just trigger the K8s-defined probe for the user container (this extension would not allow defining new probes for external health checkers, though). For example, we could extend the list here (also mentioned previously): https://github.com/knative/serving/blob/main/pkg/queue/sharedmain/handlers.go#L92 with user-provided headers that trigger pod probing (see the sketch after this list). However, I am not sure it would work in all cases, such as with Istio mTLS (Attempt to fix our probes with mTLS enabled. #4017).
    cc @dprotaso, @ReToCode might offer more options.
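
To make the third option more concrete, here is a rough, purely illustrative sketch of a user-extensible probe-header filter placed in front of the queue-proxy handler chain. The EXTRA_PROBE_HEADERS environment variable and the wrapper below are hypothetical, not an existing queue-proxy API:

```go
package main

import (
	"net/http"
	"os"
	"strings"
)

// extraProbeHeaders is a hypothetical, user-provided list of header names
// (e.g. EXTRA_PROBE_HEADERS="X-Consul-Health-Check") whose presence should
// make queue-proxy treat a request as a probe and keep it out of the
// request metrics.
var extraProbeHeaders = strings.Split(os.Getenv("EXTRA_PROBE_HEADERS"), ",")

// withExternalProbeFilter wraps the normal (metrics-counting) handler chain
// and short-circuits requests that carry one of the configured headers,
// sending them to the probe handler instead of the counted path.
func withExternalProbeFilter(probe, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for _, h := range extraProbeHeaders {
			if h != "" && r.Header.Get(h) != "" {
				probe.ServeHTTP(w, r) // answered like a probe, not counted
				return
			}
		}
		next.ServeHTTP(w, r) // normal traffic, counted as usual
	})
}

func main() {
	// Stand-ins for the real probe handler and the user-container proxy.
	probe := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	app := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello\n"))
	})

	http.ListenAndServe(":8080", withExternalProbeFilter(probe, app))
}
```

An external checker such as consul would then be configured to send that header (assuming the checker supports setting custom headers), while real kubelet probes keep going through the existing K-Kubelet-Probe path.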


This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2024
@skonto
Contributor

skonto commented May 15, 2024

/remove-lifecycle stale

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 16, 2024

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 14, 2024
@skonto
Contributor

skonto commented Aug 28, 2024

/remove-lifecycle stale

@knative-prow knative-prow bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 28, 2024

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 27, 2024
@dprotaso
Member

/lifecycle frozen

@knative-prow knative-prow bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 27, 2024
@dprotaso
Member

/triage accepted

@knative-prow knative-prow bot added the triage/accepted Issues which should be fixed (post-triage) label Nov 27, 2024
@jan-stanek jan-stanek linked a pull request Dec 13, 2024 that will close this issue