Skip to content

Zone bundle creation can fail, leaving unremovable files #3876

@bnaecker

Description

@bnaecker

While testing #3868, I artificially induced some I/O errors when creating zone bundles. To test removal of bundles above the storage limit, I lowered the quota on the ZFS dataset used to store the zone bundles. This was only to keep from having to create a zillion bundles, with the original quota of 100GiB.

When I did this, I saw I/O errors when creating the zone bundles. Here is the sled-agent log at that time:

22:03:17.256Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/79c2a98/dropshot/src/server.rs:769
    local_addr = [fd00:1122:3344:101::1]:12345
    remote_addr = [fd00:1122:3344:101::1]:57748
22:03:17.256Z INFO SledAgent (BootstrapAgent): creating zone bundle
    file = sled-agent/src/zone_bundle.rs:264
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.257Z DEBG SledAgent (BootstrapAgent): creating bundle directory
    dir = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.257Z DEBG SledAgent (BootstrapAgent): creating bundle directory
    dir = /pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.257Z DEBG SledAgent (BootstrapAgent): created bundle tarball file
    path = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/8bceed20-01e2-445e-849f-69a1980ae1c0.tar.gz
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.264Z DEBG SledAgent (BootstrapAgent): wrote zone bundle metadata
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.264Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["ptree"]
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.277Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["uptime"]
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.284Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["last"]
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.290Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["who"]
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.296Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["svcs", "-p"]
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.338Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["netstat", "-an"]
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.395Z DEBG SledAgent (BootstrapAgent): enumerated service processes
    procs = [ServiceProcess { service_name: "svc:/oxide/crucible/agent:default", binary: "/opt/oxide/crucible/bin/crucible-agent", pid: 7577, log_file: "/pool/ext/14b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/root/var/svc/log/oxide-crucible-agent:default.log", rotated_log_files: [] }]
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.398Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pfiles", "7577"]
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.407Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pstack", "7577"]
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.514Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pargs", "7577"]
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.616Z DEBG SledAgent (BootstrapAgent): appending current log file to zone bundle
    log_file = /pool/ext/14b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/root/var/svc/log/oxide-crucible-agent:default.log
    zone = oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d
22:03:17.634Z DEBG SledAgent (BootstrapAgent): copying bundle
    from = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/8bceed20-01e2-445e-849f-69a1980ae1c0.tar.gz
    to = /pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/8bceed20-01e2-445e-849f-69a1980ae1c0.tar.gz
22:03:17.642Z INFO SledAgent (dropshot (SledAgent)): request completed
    error_message_external = Internal Server Error
    error_message_internal = Zone bundle error: I/O error
    file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/79c2a98/dropshot/src/server.rs:841
    latency_us = 385450
    local_addr = [fd00:1122:3344:101::1]:12345
    method = POST
    remote_addr = [fd00:1122:3344:101::1]:57748
    req_id = e671797f-163c-4401-8bd7-b6c6a1b113e9
    response_code = 500
    uri = /zones/bundles/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d

This is really unfortunate because this one actually occurred after we creating the bundle, and are replicating it to the other dataset. In particular, we can see things like this, where the tarball appears corrupted or otherwise kinda sad:

22:07:55.226Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/79c2a98/dropshot/src/server.rs:769
    local_addr = [fd00:1122:3344:101::1]:12345
    remote_addr = [fd00:1122:3344:101::1]:48554
22:07:55.226Z DEBG SledAgent: searching zone bundle directory
    directory = "/pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone"
    sled_id = 9e6db1dc-5dca-4e24-9eb2-de99345bd9f6
22:07:55.227Z DEBG SledAgent: searching zone bundle directory
    directory = "/pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone"
    sled_id = 9e6db1dc-5dca-4e24-9eb2-de99345bd9f6
22:07:55.228Z ERRO SledAgent: failed to read zone bundle metadata
    error = BundleFailed(Zone bundle is missing metadata file)
    file = sled-agent/src/zone_bundle.rs:743
    sled_id = 9e6db1dc-5dca-4e24-9eb2-de99345bd9f6
22:07:55.228Z INFO SledAgent (dropshot (SledAgent)): request completed
    error_message_external = Internal Server Error
    error_message_internal = Zone bundle error: Failed to create bundle: Zone bundle is missing metadata file
    file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/79c2a98/dropshot/src/server.rs:841
    latency_us = 1895
    local_addr = [fd00:1122:3344:101::1]:12345
    method = DELETE
    remote_addr = [fd00:1122:3344:101::1]:48554
    req_id = d7878ec7-5f29-4351-8c2c-21d4ec29f631
    response_code = 500
    uri = /zones/bundles/oxz_crucible_0fb3bb32-aaa5-49a7-9501-b194292d941d/f8706a09-5f97-4a86-817f-5180b0dd19df

The downside of this is that during removal of a bundle, we also check for valid metadata, to avoid removing things that we don't really own. That's what we're doing here, trying to remove the corrupted file.

So there are two options I see here:

  1. Remove bundles that are at the right or expected path, regardless of whether we can extract metadata, including during automatic cleanup. This path is something like /pool/int/<UUID>/debug/bundle/zone/<ZONE_NAME>/<BUNDLE_UUID>.tar.gz. One issue with this is that we don't actually know the timestamp if we can't read the metadata. We could use the filesystem modification time, maybe, but that feels fraught.
  2. Try harder to avoid storing corrupted files in the first place. One strategy would be: (1) write the bundle into a temporary directory; (2) copy that to all the storage directories; (3) remove all that we've successfully copied if we fail to copy any of them.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Sled AgentRelated to the Per-Sled Configuration and Management

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions