
Conversation

@mkeeter mkeeter commented Oct 14, 2025

⚠️ Do not actually merge this, it's for debugging #1788

This calls Extent::validate after repair and when opening a region, panicking on failures.

See #1792 for details about extent validation.
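
Roughly, the startup check looks like the sketch below. The types are simplified stand-ins for the real `crucible-downstairs` code, and the body of `validate` (which recomputes per-block hashes against the on-disk metadata, per #1792) is elided:

```rust
// Simplified stand-ins for the real downstairs types; illustrative only.
#[derive(Debug)]
struct ValidationError(String);

struct Extent {
    number: u32,
}

impl Extent {
    fn open(number: u32) -> Extent {
        // The real code opens the raw extent file under the region directory.
        Extent { number }
    }

    fn validate(&self) -> Result<(), ValidationError> {
        // Placeholder: the real check walks every block in the extent and
        // compares stored hashes against recomputed ones (see #1792).
        Ok(())
    }
}

fn open_region(extent_count: u32) -> Vec<Extent> {
    (0..extent_count)
        .map(|number| {
            let extent = Extent::open(number);
            if let Err(err) = extent.validate() {
                // Debugging behavior only: stop immediately rather than
                // serve data from an extent that failed validation.
                panic!("validation failed for extent {}: {err:?}", extent.number);
            }
            extent
        })
        .collect()
}
```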

@mkeeter mkeeter requested review from jmpesp and leftwo October 14, 2025 21:02
@leftwo

This comment was marked as outdated.

@mkeeter mkeeter force-pushed the mkeeter/check-hashes branch 2 times, most recently from 01ac0e5 to db67a9f on October 15, 2025 at 15:25
@mkeeter mkeeter force-pushed the mkeeter/check-hashes branch from db67a9f to 13b1577 on October 15, 2025 at 15:34
mkeeter added a commit that referenced this pull request Oct 15, 2025
This is a spin-off from #1791; it adds validation to
`crucible-downstairs` but does not validate extents after receiving them
during repair operations.

---------

Co-authored-by: Alan Hanson <[email protected]>
@mkeeter mkeeter force-pushed the mkeeter/check-hashes branch from d54ad9a to 6b2b366 on October 16, 2025 at 19:53
@mkeeter mkeeter changed the title from "Check hashes in crucible-downstairs and during repair" to "Check hashes during repair" on Oct 16, 2025
@mkeeter mkeeter force-pushed the mkeeter/check-hashes branch from 6b2b366 to 4f84cf7 on October 16, 2025 at 20:14
@mkeeter mkeeter changed the title from "Check hashes during repair" to "Check hashes during repair and upon startup" on Oct 16, 2025
"validation falied for extent {number}: {err:?}"
);
}
panic!("validation failed");
Contributor

I worry about this panic because we can't recover from it the way we can from the "check during repair" panic. Eventually the downstairs service will be in maintenance and the Upstairs can't do anything to fix this.

Contributor

Given we are "hunting" with this change, and not living with it forever, I think I like that we can't recover. I want things to stop as soon as we have detected a problem.

Looking for a downstairs in maintenance (as well as core files) would help determine when we saw a problem.

I worry that perhaps we have a bad extent, but it gets "repaired" before we find it and we miss our window.

}
};
if let Err(e) = new_extent.validate() {
panic!("Failed to validate live-repair extent {eid}: {e:?}");
Contributor

👍

Contributor

I believe when we fail here, it will leave behind a copy_dir.
That should be handled "properly" when the downstairs restarts, as it should discard it and make a new one. We would lose the "bad" file, but the downstairs log should tell us what we need to know.
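
For reference, the cleanup behavior described above might look roughly like this (purely illustrative: the directory name and layout here are made up, not the actual crucible on-disk format):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Illustrative sketch: on restart, discard any partially-received repair
/// copy so the next live-repair attempt starts from a clean slate. The
/// "bad" file is lost here; the downstairs log is what records why the
/// previous attempt panicked.
fn discard_stale_copy_dir(extent_dir: &Path) -> io::Result<()> {
    let copy_dir = extent_dir.join("copy"); // name is hypothetical
    if copy_dir.exists() {
        fs::remove_dir_all(&copy_dir)?;
    }
    fs::create_dir_all(&copy_dir)
}
```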

Err(CrucibleError::GenericError(
"`validate` is not implemented for Sqlite extent".to_owned(),
))
// SQLite databases are always perfect and have no problems
Contributor

😭

extent_dir(dir, number).join(extent_file_name(number, ExtentType::Data))
let e = extent_file_name(number, ExtentType::Data);

// XXX terrible hack: if someone has already provided a full directory tree
Contributor

We need this so we can verify the incoming copy of an extent?

Contributor Author

Yeah, opening an extent takes the extent's root directory and number, then builds the extent file path internally. This is annoying if we want to open a specific raw file!

Since we're building a test image directly from this PR (and probably not merging it), I didn't bother to clean this up further.
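
For readers following along, the API shape being described is roughly this (simplified stand-ins, not the real crucible-downstairs helpers): the extent file path is always derived from the region directory plus the extent number, so there is no entry point that opens an arbitrary raw file directly.

```rust
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};

// Simplified stand-in for the real helpers: callers hand over a region
// directory and an extent number, and the file path is derived internally.
fn extent_path(region_dir: &Path, number: u32) -> PathBuf {
    region_dir.join(format!("{number:03X}"))
}

fn open_extent(region_dir: &Path, number: u32) -> io::Result<File> {
    // To validate a file that lives somewhere else (e.g. a freshly received
    // repair copy), the caller has to arrange a directory layout so that
    // this derived path lands on the right file -- hence the hack above.
    File::open(extent_path(region_dir, number))
}
```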

@mkeeter mkeeter commented Oct 17, 2025

The CI failure is because validating extents on startup makes the downstairs slower to start, so there's a race between dsc and the downstairs actually coming up. In this case, dsc tried to talk to the downstairs at 20:02:30, but it didn't start serving until 20:02:42. We can fix this, but it's not a red flag about the rest of the checking.
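
One way the harness could close that race (a sketch only; dsc's actual retry behavior may differ) is to poll until the downstairs is listening before issuing commands:

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Illustrative sketch: wait for the downstairs to start accepting
/// connections, giving slower startup (e.g. extent validation) time to
/// finish before the test harness talks to it.
fn wait_for_downstairs(addr: SocketAddr, timeout: Duration) -> Result<(), String> {
    let deadline = Instant::now() + timeout;
    loop {
        match TcpStream::connect_timeout(&addr, Duration::from_secs(1)) {
            Ok(_) => return Ok(()),
            Err(_) if Instant::now() < deadline => sleep(Duration::from_millis(250)),
            Err(e) => return Err(format!("downstairs at {addr} never came up: {e}")),
        }
    }
}
```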
