Reducing hcloud API calls for hcloudmachines that are up and running #1336
I do think all of the above requests should be fine. My question would be how often you are triggering reconciles of the HCloudMachines controller and whether that can be optimized. I described my previous approach to investigating this here: #926 (comment)
I think we can probably improve slightly on the work you have already started, @apricote. The reason I have written down my thoughts here is that a large installation is more likely to run into a rate limit than a smaller one. This is in general a question of optimization rather than of fixing a specific bug or issue. By default we currently reconcile all objects every three minutes; this is a controller-runtime setting that can also be exposed as a parameter to reduce the number of reconciliations.
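For reference, the resync interval mentioned above is set through the controller-runtime manager options. The following is only a minimal sketch, not the actual caph setup; the 30-minute value is an arbitrary example, and the exact location of the option differs between controller-runtime versions (older versions have `Options.SyncPeriod`, newer ones moved it under the cache options):

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Resync all objects every 30 minutes instead of the default
	// (three minutes, as described above) to reduce periodic
	// reconciles and, with them, periodic hcloud API calls.
	syncPeriod := 30 * time.Minute

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Field name and placement vary by controller-runtime version.
		SyncPeriod: &syncPeriod,
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // controllers would be registered with mgr here
}
```

Raising the interval only affects the periodic full resync; event-driven reconciles (object changes, requeues) still happen immediately.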
Just for the record, you can get the metrics like this: run port-forwarding to the caph pod, then show the metrics of the hcloud API client.
Maybe this helps. At least on my current system, the number of hcloud API calls for a machine that is up and running was low.
/kind proposal
Sometimes we hit the rate limit because the caph controller makes too many calls to the hcloud API.
Checks we do for a running HCloudMachine
The following checks are based on an action of the user:
- The user removes a label from the server in the HCloud UI, and we cannot validate it.
- The user deletes a server manually, and we notice that.
- The user removes the server from the load balancer or the network, and we add it again.
These are potentially valid use cases - the question is whether they are so relevant that we need to keep them.
Possible ways of handling these checks
One extreme (the current behavior): do all API calls to check everything in every reconcile loop.
The other extreme: stop doing any API calls once the server is up and running. If something is wrong with the server, the Machine Health Checks should discover that. We don't do anything if the user actively misconfigures something, for example removes a server from the load balancer.
Middle way 1: Do specific checks and stop doing all others
We could, for example, stop checking that the server is part of the network while continuing to check that it is added as a target to the load balancer. Any combination of the things that are important to us is possible.
Middle way 2: Heavily cache API calls once a server is running
We could also use a cache so that we do not call the API on every reconcile. If something goes wrong, we would notice it later, but eventually we would.
Any thoughts?
I'm curious to hear opinions, also from people outside of Syself! The overall goal is to reduce the number of API calls, which can be rather high: hundreds of calls per hour for a stable (not scaling) cluster are normal.
A similar question could also be asked for the general load balancer, placement group, and network configuration, which we reconcile in the hetznercluster-controller. I'm looking forward to opinions there as well!