-
Couldn't load subscription status.
- Fork 60
Description
While testing #3868, I artificially induced some I/O errors when creating zone bundles. To test removal of bundles above the storage limit, I lowered the quota on the ZFS dataset used to store the zone bundles. This was only to keep from having to create a zillion bundles, with the original quota of 100GiB.
When I did this, I saw I/O errors when creating the zone bundles. Here is the sled-agent log at that time:
22:03:17.256Z INFO SledAgent (dropshot (SledAgent)): accepted connection
file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/79c2a98/dropshot/src/server.rs:769
local_addr = [fd00:1122:3344:101::1]:12345
remote_addr = [fd00:1122:3344:101::1]:57748
22:03:17.256Z INFO SledAgent (BootstrapAgent): creating zone bundle
file = sled-agent/src/zone_bundle.rs:264
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.257Z DEBG SledAgent (BootstrapAgent): creating bundle directory
dir = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.257Z DEBG SledAgent (BootstrapAgent): creating bundle directory
dir = /pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.257Z DEBG SledAgent (BootstrapAgent): created bundle tarball file
path = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/8bceed20-01e2-445e-849f-69a1980ae1c0.tar.gz
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.264Z DEBG SledAgent (BootstrapAgent): wrote zone bundle metadata
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.264Z DEBG SledAgent (BootstrapAgent): running zone bundle command
command = ["ptree"]
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.277Z DEBG SledAgent (BootstrapAgent): running zone bundle command
command = ["uptime"]
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.284Z DEBG SledAgent (BootstrapAgent): running zone bundle command
command = ["last"]
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.290Z DEBG SledAgent (BootstrapAgent): running zone bundle command
command = ["who"]
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.296Z DEBG SledAgent (BootstrapAgent): running zone bundle command
command = ["svcs", "-p"]
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.338Z DEBG SledAgent (BootstrapAgent): running zone bundle command
command = ["netstat", "-an"]
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.395Z DEBG SledAgent (BootstrapAgent): enumerated service processes
procs = [ServiceProcess { service_name: "svc:/oxide/crucible/agent:default", binary: "/opt/oxide/crucible/bin/crucible-agent", pid: 7577, log_file: "/pool/ext/14b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/root/var/svc/log/oxide-crucible-agent:default.log", rotated_log_files: [] }]
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.398Z DEBG SledAgent (BootstrapAgent): running zone bundle command
command = ["pfiles", "7577"]
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.407Z DEBG SledAgent (BootstrapAgent): running zone bundle command
command = ["pstack", "7577"]
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.514Z DEBG SledAgent (BootstrapAgent): running zone bundle command
command = ["pargs", "7577"]
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.616Z DEBG SledAgent (BootstrapAgent): appending current log file to zone bundle
log_file = /pool/ext/14b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/root/var/svc/log/oxide-crucible-agent:default.log
zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.634Z DEBG SledAgent (BootstrapAgent): copying bundle
from = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/8bceed20-01e2-445e-849f-69a1980ae1c0.tar.gz
to = /pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/8bceed20-01e2-445e-849f-69a1980ae1c0.tar.gz
22:03:17.642Z INFO SledAgent (dropshot (SledAgent)): request completed
error_message_external = Internal Server Error
error_message_internal = Zone bundle error: I/O error
file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/79c2a98/dropshot/src/server.rs:841
latency_us = 385450
local_addr = [fd00:1122:3344:101::1]:12345
method = POST
remote_addr = [fd00:1122:3344:101::1]:57748
req_id = e671797f-163c-4401-8bd7-b6c6a1b113e9
response_code = 500
uri = /zones/bundles/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
This is really unfortunate because this one actually occurred after we creating the bundle, and are replicating it to the other dataset. In particular, we can see things like this, where the tarball appears corrupted or otherwise kinda sad:
22:07:55.226Z INFO SledAgent (dropshot (SledAgent)): accepted connection
file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/79c2a98/dropshot/src/server.rs:769
local_addr = [fd00:1122:3344:101::1]:12345
remote_addr = [fd00:1122:3344:101::1]:48554
22:07:55.226Z DEBG SledAgent: searching zone bundle directory
directory = "/pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone"
sled_id = 9e6db1dc-5dca-4e24-9eb2-de99345bd9f6
22:07:55.227Z DEBG SledAgent: searching zone bundle directory
directory = "/pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone"
sled_id = 9e6db1dc-5dca-4e24-9eb2-de99345bd9f6
22:07:55.228Z ERRO SledAgent: failed to read zone bundle metadata
error = BundleFailed(Zone bundle is missing metadata file)
file = sled-agent/src/zone_bundle.rs:743
sled_id = 9e6db1dc-5dca-4e24-9eb2-de99345bd9f6
22:07:55.228Z INFO SledAgent (dropshot (SledAgent)): request completed
error_message_external = Internal Server Error
error_message_internal = Zone bundle error: Failed to create bundle: Zone bundle is missing metadata file
file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/79c2a98/dropshot/src/server.rs:841
latency_us = 1895
local_addr = [fd00:1122:3344:101::1]:12345
method = DELETE
remote_addr = [fd00:1122:3344:101::1]:48554
req_id = d7878ec7-5f29-4351-8c2c-21d4ec29f631
response_code = 500
uri = /zones/bundles/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/f8706a09-5f97-4a86-817f-5180b0dd19df
The downside of this is that during removal of a bundle, we also check for valid metadata, to avoid removing things that we don't really own. That's what we're doing here, trying to remove the corrupted file.
So there are two options I see here:
- Remove bundles that are at the right or expected path, regardless of whether we can extract metadata, including during automatic cleanup. This path is something like
/pool/int/<UUID>/debug/bundle/zone/<ZONE_NAME>/<BUNDLE_UUID>.tar.gz. One issue with this is that we don't actually know the timestamp if we can't read the metadata. We could use the filesystem modification time, maybe, but that feels fraught. - Try harder to avoid storing corrupted files in the first place. One strategy would be: (1) write the bundle into a temporary directory; (2) copy that to all the storage directories; (3) remove all that we've successfully copied if we fail to copy any of them.