Reducing hcloud API calls for hcloudmachines that are up and running #1336
I do think all of the above requests should be fine. My question would be how often you are triggering reconciles of the HCloudMachines controller and whether that can be optimized. I described my previous approach to investigating this here: #926 (comment)
I think we can probably improve slightly on the work you have already started, @apricote. The reason I have written down my thoughts here is that a large installation is more likely to run into a rate limit than a smaller one. This is in general a question of optimization rather than of fixing a specific bug or issue. By default we currently reconcile all objects every three minutes; this is a controller-runtime setting that can also be exposed as a parameter to reduce the number of reconciliations.
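For reference, the resync interval mentioned above is set through the controller-runtime manager options. The following is only a minimal sketch, not the actual caph setup; the 30-minute value is an arbitrary example, and the exact location of the option differs between controller-runtime versions (older versions have `Options.SyncPeriod`, newer ones moved it under the cache options):

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Resync all objects every 30 minutes instead of the default
	// (three minutes, as described above) to reduce periodic
	// reconciles and, with them, periodic hcloud API calls.
	syncPeriod := 30 * time.Minute

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Field name and placement vary by controller-runtime version.
		SyncPeriod: &syncPeriod,
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // controllers would be registered with mgr here
}
```

Raising the interval only affects the periodic full resync; event-driven reconciles (object changes, requeues) still happen immediately.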
Just for the record, you can get the metrics like this: run port-forwarding to the caph pod, then show the metrics of the hcloud API client.
Maybe this helps. At least on my current system, the number of hcloud API calls for a machine that is up and running was low.
/kind proposal
Sometimes we hit the rate limit because the caph controller makes too many calls to the hcloud API.
Checks we do for a running HCloudMachine
The following checks are based on an action of the user:
- The user removes a label from the server in the HCloud UI, and we cannot validate it.
- The user deletes a server manually, and we notice that.
- The user removes the server from the load balancer or the network, and we add it again.
These are potentially valid use cases - the question is whether they are so relevant that we need to keep them.
Possible ways of handling these checks
One extreme (the current behavior): do all API calls to check everything in every reconcile loop.
The other extreme: stop doing any API calls once the server is up and running. If something is wrong with the server, the Machine Health Checks should discover that. We don't do anything if the user actively misconfigures something, for example removes a server from the load balancer.
Middle way 1: Do specific checks and stop doing all others
We could, for example, stop checking that the server is part of the network while continuing to check that it is added as a target to the load balancer. Any combination of the things that are important to us is possible.
Middle way 2: Heavily cache API calls once a server is running
We could also use a cache so that we do not call the API on every reconcile. If something goes wrong, we would notice it later, but eventually we would.
Any thoughts?
I'm curious to hear opinions, also from people outside of Syself! The overall goal is to reduce the number of API calls, which can be rather high: hundreds of calls per hour for a stable (not scaling) cluster are normal.
A similar question could also be asked for the general load balancer, placement group, and network configuration, which we reconcile in the hetznercluster-controller. I'm looking forward to opinions there as well!