
Conversation

@mkeeter mkeeter commented Oct 14, 2025

⚠️ Do not actually merge this, it's for debugging #1788

This calls Extent::validate after repair and when opening a region, panicking on failures.

See #1792 for details about extent validation.
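
Roughly, the startup check looks like the sketch below. The types are simplified stand-ins for the real `crucible-downstairs` code, and the body of `validate` (which recomputes per-block hashes against the on-disk metadata, per #1792) is elided:

```rust
// Simplified stand-ins for the real downstairs types; illustrative only.
#[derive(Debug)]
struct ValidationError(String);

struct Extent {
    number: u32,
}

impl Extent {
    fn open(number: u32) -> Extent {
        // The real code opens the raw extent file under the region directory.
        Extent { number }
    }

    fn validate(&self) -> Result<(), ValidationError> {
        // Placeholder: the real check walks every block in the extent and
        // compares stored hashes against recomputed ones (see #1792).
        Ok(())
    }
}

fn open_region(extent_count: u32) -> Vec<Extent> {
    (0..extent_count)
        .map(|number| {
            let extent = Extent::open(number);
            if let Err(err) = extent.validate() {
                // Debugging behavior only: stop immediately rather than
                // serve data from an extent that failed validation.
                panic!("validation failed for extent {}: {err:?}", extent.number);
            }
            extent
        })
        .collect()
}
```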

@mkeeter mkeeter requested review from jmpesp and leftwo October 14, 2025 21:02
@leftwo

This comment was marked as outdated.

@mkeeter mkeeter force-pushed the mkeeter/check-hashes branch 2 times, most recently from 01ac0e5 to db67a9f on October 15, 2025 at 15:25
@mkeeter mkeeter force-pushed the mkeeter/check-hashes branch from db67a9f to 13b1577 on October 15, 2025 at 15:34
mkeeter added a commit that referenced this pull request Oct 15, 2025
This is a spin-off from #1791; it adds validation to
`crucible-downstairs` but does not validate extents after receiving them
during repair operations.

---------

Co-authored-by: Alan Hanson <[email protected]>
@mkeeter mkeeter force-pushed the mkeeter/check-hashes branch from d54ad9a to 6b2b366 on October 16, 2025 at 19:53
@mkeeter mkeeter changed the title from "Check hashes in crucible-downstairs and during repair" to "Check hashes during repair" on Oct 16, 2025
@mkeeter mkeeter force-pushed the mkeeter/check-hashes branch from 6b2b366 to 4f84cf7 on October 16, 2025 at 20:14
@mkeeter mkeeter changed the title from "Check hashes during repair" to "Check hashes during repair and upon startup" on Oct 16, 2025
"validation falied for extent {number}: {err:?}"
);
}
panic!("validation failed");
Contributor

I worry about this panic because we can't recover from it the way we can from the "check during repair" panic. Eventually the downstairs service will be in maintenance and the Upstairs can't do anything to fix this.

Contributor

Given we are "hunting" with this change, and not living with it forever, I think I like that we can't recover. I want things to stop as soon as we have detected a problem.

Looking for a downstairs in maintenance (as well as core files) would help determine when we saw a problem.

I worry that perhaps we have a bad extent, but it gets "repaired" before we find it and we miss our window.

}
};
if let Err(e) = new_extent.validate() {
panic!("Failed to validate live-repair extent {eid}: {e:?}");
Contributor

👍

Contributor

I believe when we fail here, it will leave behind a copy_dir.
That should be handled "properly" when the downstairs restarts, as it should discard it and make a new one. We would lose the "bad" file, but the downstairs log should tell us what we need to know.
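
For reference, the cleanup behavior described above might look roughly like this (purely illustrative: the directory name and layout here are made up, not the actual crucible on-disk format):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Illustrative sketch: on restart, discard any partially-received repair
/// copy so the next live-repair attempt starts from a clean slate. The
/// "bad" file is lost here; the downstairs log is what records why the
/// previous attempt panicked.
fn discard_stale_copy_dir(extent_dir: &Path) -> io::Result<()> {
    let copy_dir = extent_dir.join("copy"); // name is hypothetical
    if copy_dir.exists() {
        fs::remove_dir_all(&copy_dir)?;
    }
    fs::create_dir_all(&copy_dir)
}
```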

Err(CrucibleError::GenericError(
"`validate` is not implemented for Sqlite extent".to_owned(),
))
// SQLite databases are always perfect and have no problems
Contributor

😭

extent_dir(dir, number).join(extent_file_name(number, ExtentType::Data))
let e = extent_file_name(number, ExtentType::Data);

// XXX terrible hack: if someone has already provided a full directory tree
Contributor

We need this so we can verify the incoming copy of an extent?

Contributor Author

Yeah, opening an extent takes the extent's root directory and number, then builds the extent file path internally. This is annoying if we want to open a specific raw file!

Since we're building a test image directly from this PR (and probably not merging it), I didn't bother to clean this up further.
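
For readers following along, the API shape being described is roughly this (simplified stand-ins, not the real crucible-downstairs helpers): the extent file path is always derived from the region directory plus the extent number, so there is no entry point that opens an arbitrary raw file directly.

```rust
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};

// Simplified stand-in for the real helpers: callers hand over a region
// directory and an extent number, and the file path is derived internally.
fn extent_path(region_dir: &Path, number: u32) -> PathBuf {
    region_dir.join(format!("{number:03X}"))
}

fn open_extent(region_dir: &Path, number: u32) -> io::Result<File> {
    // To validate a file that lives somewhere else (e.g. a freshly received
    // repair copy), the caller has to arrange a directory layout so that
    // this derived path lands on the right file -- hence the hack above.
    File::open(extent_path(region_dir, number))
}
```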

@mkeeter mkeeter commented Oct 17, 2025

The CI failure is because validating extents on startup makes the downstairs slower to start, so there's a race between dsc and the downstairs actually coming up. In this case, dsc tried to talk to the downstairs at 20:02:30, but it didn't start serving until 20:02:42. We can fix this, but it's not a red flag about the rest of the checking.
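
One way the harness could close that race (a sketch only; dsc's actual retry behavior may differ) is to poll until the downstairs is listening before issuing commands:

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Illustrative sketch: wait for the downstairs to start accepting
/// connections, giving slower startup (e.g. extent validation) time to
/// finish before the test harness talks to it.
fn wait_for_downstairs(addr: SocketAddr, timeout: Duration) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    loop {
        match TcpStream::connect_timeout(&addr, Duration::from_secs(1)) {
            Ok(_) => return Ok(()),
            Err(_) if Instant::now() < deadline => sleep(Duration::from_millis(250)),
            Err(e) => return Err(format!("downstairs at {addr} never came up: {e}")),
        }
    }
}
```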
