-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concurrent local extraction of binaries fails sometimes #457
Comments
Actually maybe |
The problem with But a possible solution to this issue might be to use the creates parameter of the unarchive module as it would make the extraction skip completely if the binaries already exist. |
Right, but what I'm confused about is that the task title already embeds the architecture as in my log above:
It already says arm64, so is the arch not already baked in? If I had a mix of host types, would this title include multiple binaries? I'm only a casual observer though, so don't pay this too much heed.
This still seems racy maybe? First host starts unpacking then this root specified in Maybe it should be creates + |
To answer my own question, yes this task does run with different variable values for different architectures, but the output in the task name just appears to use the variable names from the first host, so it simply appears that the binaries for one architecture is downloaded while under the covers we are downloading for all architectures. |
During the common install tasks, we download and unpack the binary to install on localhost and eventually upload it to each host (among many other things). Currently the "unpack" step, which extracts the gzipped tar archive, is perform N times if there are N hosts in the inventory, but the target directory (something like /tmp/node_exporter-linux-arm64/1.8.2) is the same for every host with the same architecture. This means that all the unarchive tasks are extracting in parallel in an unsynchronized manner to the same directory. It a miracle that this works, but at least usually it like it does, but sometimes tar will complain about a missing file, failing the install. I've confirmed locally that this is due to the race described above. To fix this, we use `throttle: 1` on the unpack task. This means that the first task (for each architecture) will do the unpack and the other tasks are effectively no-ops, they are reported as "ok" rather than changed in the output. Fixes prometheus-community#457.
During the common install tasks, we download and unpack the binary to install on localhost and eventually upload it to each host (among many other things). Currently the "unpack" step, which extracts the gzipped tar archive, is perform N times if there are N hosts in the inventory, but the target directory (something like /tmp/node_exporter-linux-arm64/1.8.2) is the same for every host with the same architecture. This means that all the unarchive tasks are extracting in parallel in an unsynchronized manner to the same directory. It a miracle that this works, but at least usually it like it does, but sometimes tar will complain about a missing file, failing the install. I've confirmed locally that this is due to the race described above. To fix this, we use `throttle: 1` on the unpack task. This means that the first task (for each architecture) will do the unpack and the other tasks are effectively no-ops, they are reported as "ok" rather than changed in the output. Fixes prometheus-community#457.
During the common install tasks, we download and unpack the binary to install on localhost and eventually upload it to each host (among many other things). Currently the "unpack" step, which extracts the gzipped tar archive, is perform N times if there are N hosts in the inventory, but the target directory (something like /tmp/node_exporter-linux-arm64/1.8.2) is the same for every host with the same architecture. This means that all the unarchive tasks are extracting in parallel in an unsynchronized manner to the same directory. It a miracle that this works, but at least usually it like it does, but sometimes tar will complain about a missing file, failing the install. I've confirmed locally that this is due to the race described above. To fix this, we use `throttle: 1` on the unpack task. This means that the first task (for each architecture) will do the unpack and the other tasks are effectively no-ops, they are reported as "ok" rather than changed in the output. Fixes prometheus-community#457. Signed-off-by: Travis Downs <[email protected]>
I have pushed a patch that fixes this issue. |
During the common install tasks, we download and unpack the binary to install on localhost and eventually upload it to each host (among many other things). Currently the "unpack" step, which extracts the gzipped tar archive, is perform N times if there are N hosts in the inventory, but the target directory (something like /tmp/node_exporter-linux-arm64/1.8.2) is the same for every host with the same architecture. This means that all the unarchive tasks are extracting in parallel in an unsynchronized manner to the same directory. It a miracle that this works, but at least usually it like it does, but sometimes tar will complain about a missing file, failing the install. I've confirmed locally that this is due to the race described above. To fix this, we use `throttle: 1` on the unpack task. This means that the first task (for each architecture) will do the unpack and the other tasks are effectively no-ops, they are reported as "ok" rather than changed in the output. Fixes prometheus-community#457. Co-authored-by: Ben Kochie <[email protected]> Signed-off-by: Travis Downs <[email protected]>
When the node_exporter role is used to install the exporter on > 1 hosts we see occasional failures of binary extraction after download like so:
It's probably less than 1% of total extraction executions, but with many hosts the overall failure rate is quite high.
We delegate the download & extraction to localhost, so this runs in parallel N times for N hosts but the paths are the same in every case, so it's concurrently extracting to the same destination path which I guess causes this, e.g, from the manual:
So there will be brief periods where a written file (in this case, the top level dir) will be deleted and replaced by a concurrent extraction and if some earlier writer tries to list it it will see it missing.
Besides the race, it's also inefficient to extract the same file N times.
I guess
run_once
would fix this if it wasn't for the complication of multi-arch. Maybeserial
execution + skip if already exists would work with little refactoring of the existing approach?I think this race existed before the big refactor in 0.20.0, but in the refactor
list_files
was added which may be the thing that triggers the specific failure above.The text was updated successfully, but these errors were encountered: