Improve robustness during outages #127

jooola · 2024-11-14T09:25:47Z

Yesterday, the Hetzner Cloud API had an outage, and it appears that the docker machine driver did not handle it well.

You can see that from 2024-11-13 17:00:00 to 2024-11-14 08:00:00, the number of requests to /server_types, /images and /locations is unexpectedly high. The number of requests for single actions was also very high.

This leads to rate limiting while waiting for servers to be created.

I see a few possible improvements:

- When waiting for an action, use an exponential backoff algorithm to spread the requests over time. You can cap the maximum waiting time at a sensible value. https://pkg.go.dev/github.com/hetznercloud/hcloud-go/v2/hcloud#WithPollOpts https://pkg.go.dev/github.com/hetznercloud/hcloud-go/v2/hcloud#ExponentialBackoffWithOpts
- Use a single API call to wait for multiple related actions, using https://pkg.go.dev/github.com/hetznercloud/hcloud-go/v2/hcloud#ActionClient.WaitFor or https://pkg.go.dev/github.com/hetznercloud/hcloud-go/v2/hcloud#ActionClient.WaitForFunc (note that the Watch* API is deprecated).
- Maybe cache the calls to /locations, /server_types and /images, since those shouldn't change that often. Unless you are checking for server type availability?


JonasProgrammer · 2024-11-14T22:36:23Z

Free stress testing, I don't see the issue.

Bad jokes aside, sorry this caused you headaches. I'll have a look at getting the exponential backoff implemented soon. Regarding error handling in general, I am somewhat torn as to what the best approach is. We do have explicit retry with a set timeout, which was implemented as a feature request. The default behaviour is to fail fast, as it always was, but it could be changed in a major version bump. When using the CLI this would be what I expect, but I do see the issue with some applications that talk docker-machine RPC, such as Rancher, going for a request storm in fail-fast mode.
As for the caching, I do get the point of them being stable. However, I cannot really be sure in which environment the driver is running. Granted, vanilla docker-machine would be useless without a writeable home directory. But given its RPC nature, it could be run with any kind of restrictions, so long as one takes care that it can access the provided SSH key files.
