Check hashes during repair and upon startup #1791
This is a spin-off from #1791; it adds validation to `crucible-downstairs` but does not validate extents after receiving them during repair operations. --------- Co-authored-by: Alan Hanson <[email protected]>
"validation failed for extent {number}: {err:?}"
);
}
panic!("validation failed");
I worry about this panic because we can't recover from it like we can the "check during repair" panic. Eventually the downstairs service will be in maintenance and the Upstairs can't do anything to fix this.
Given we are "hunting" with this change, and not living with it forever, I think I like that we can't recover. I want things to stop as soon as we have detected a problem.
Looking for downstairs in maintenance (as well as core files) would help determine when we saw a problem.
I worry that perhaps we have a bad extent, but it gets "repaired" before we find it and we miss our window.
}
};
if let Err(e) = new_extent.validate() {
panic!("Failed to validate live-repair extent {eid}: {e:?}");
👍
I believe when we fail here, it will leave behind a copy_dir.
That should be handled "properly" when the downstairs restarts, as it should discard it and make a new one. We would lose the "bad" file, but the downstairs log should tell us what we need to know
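A minimal sketch of the "discard it and make a new one" behaviour described above, assuming the leftover directory is a plain subdirectory of the extent directory. The function name `fresh_copy_dir` and the directory name `copy` are hypothetical and are not taken from the Crucible source:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: throw away any copy directory left behind by an
/// interrupted repair, then create an empty one to receive new files.
fn fresh_copy_dir(extent_dir: &Path) -> io::Result<PathBuf> {
    // "copy" is an assumed name used only for this illustration.
    let copy_dir = extent_dir.join("copy");
    if copy_dir.exists() {
        // Whatever the previous (failed) repair left here is discarded,
        // including the extent file that failed validation.
        fs::remove_dir_all(&copy_dir)?;
    }
    fs::create_dir_all(&copy_dir)?;
    Ok(copy_dir)
}
```

Under this scheme the bad file does not survive a restart, so the panic message in the log is the only record of what failed.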
Err(CrucibleError::GenericError(
"`validate` is not implemented for Sqlite extent".to_owned(),
))
// SQLite databases are always perfect and have no problems
😭
extent_dir(dir, number).join(extent_file_name(number, ExtentType::Data))
let e = extent_file_name(number, ExtentType::Data);

// XXX terrible hack: if someone has already provided a full directory tree
We need this so we can verify the incoming copy of an extent?
Yeah, opening an extent takes the extent's root directory and number, then builds the extent file path internally. This is annoying if we want to open a specific raw file!
Since we're building a test image directly from this PR (and probably not merging it), I didn't bother to clean this up further.
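To illustrate the problem, here is a rough sketch of a path builder that takes a root directory and an extent number, plus the kind of shortcut the hack adds. The directory layout, the simplified one-argument `extent_file_name`, and the file-name check are assumptions made for this example and may not match the real code:

```rust
use std::path::{Path, PathBuf};

// Assumed layout for illustration: <root>/<2 hex digits>/<3 hex digits>/<3 hex digits>
fn extent_file_name(number: u32) -> String {
    format!("{:03X}", number & 0xFFF)
}

fn extent_dir(root: &Path, number: u32) -> PathBuf {
    root.join(format!("{:02X}", (number >> 24) & 0xFF))
        .join(format!("{:03X}", (number >> 12) & 0xFFF))
}

fn extent_path(dir: &Path, number: u32) -> PathBuf {
    let name = extent_file_name(number);
    // The shortcut: if the caller already passed the full path to the raw
    // file (its last component matches the extent file name), use it as-is.
    if dir.file_name().and_then(|f| f.to_str()) == Some(name.as_str()) {
        return dir.to_path_buf();
    }
    // Normal case: build the path from the region root and extent number.
    extent_dir(dir, number).join(name)
}
```

The shortcut is ambiguous if a region root happens to end in a component that looks like an extent file name, which is one reason it would need cleaning up before merging.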
The CI failure is because checking downstairs on startup is slower, so there's a race between
Updated the verify command to allow you to specify how many threads to use while verifying.
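As a sketch of how a thread-count option could be used, the example below splits the extent numbers into chunks and validates each chunk on its own scoped thread. `verify_all` and the `validate` callback are hypothetical names for this illustration; the actual verify command may divide the work differently:

```rust
use std::thread;

/// Hypothetical helper: run `validate` over every extent number using at
/// most `threads` worker threads, returning the failures sorted by extent.
fn verify_all<F>(extents: &[u32], threads: usize, validate: F) -> Vec<(u32, String)>
where
    F: Fn(u32) -> Result<(), String> + Sync,
{
    let threads = threads.max(1);
    // Ceiling division so that we spawn no more than `threads` workers.
    let chunk = ((extents.len() + threads - 1) / threads).max(1);
    let validate = &validate;
    let mut failures: Vec<(u32, String)> = thread::scope(|s| {
        // Spawn every worker before joining any, so they run concurrently.
        let handles: Vec<_> = extents
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    part.iter()
                        .filter_map(|&n| validate(n).err().map(|e| (n, e)))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("verify worker panicked"))
            .collect()
    });
    failures.sort();
    failures
}
```

Collecting all failures before reporting means one slow or bad extent does not hide problems in the others.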
This calls `Extent::validate` after repair and when opening a region, panicking on failures. See #1792 for details about extent validation.
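A simplified sketch of the idea, assuming each block has a stored hash that is recomputed and compared on validation. The `Extent` struct here is a stand-in rather than the real type, and `DefaultHasher` from the standard library is used only to keep the example dependency-free; it is not the hash the project uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Stand-in hash function for this sketch only.
fn block_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(data);
    h.finish()
}

// Simplified stand-in for an extent: block data plus one stored hash per block.
struct Extent {
    number: u32,
    blocks: Vec<Vec<u8>>,
    hashes: Vec<u64>,
}

impl Extent {
    fn new(number: u32, blocks: Vec<Vec<u8>>) -> Self {
        let hashes = blocks.iter().map(|b| block_hash(b)).collect();
        Extent { number, blocks, hashes }
    }

    /// Recompute each block's hash and compare it with the stored value.
    fn validate(&self) -> Result<(), String> {
        for (i, (block, &expected)) in self.blocks.iter().zip(&self.hashes).enumerate() {
            let actual = block_hash(block);
            if actual != expected {
                return Err(format!(
                    "block {i}: stored hash {expected:#x}, computed {actual:#x}"
                ));
            }
        }
        Ok(())
    }
}

/// Validate every extent when opening a region: log each failure first,
/// then panic so that nothing keeps running on top of bad data.
fn open_region(extents: &[Extent]) {
    let failures: Vec<(u32, String)> = extents
        .iter()
        .filter_map(|e| e.validate().err().map(|err| (e.number, err)))
        .collect();
    for (number, err) in &failures {
        eprintln!("validation failed for extent {number}: {err:?}");
    }
    if !failures.is_empty() {
        panic!("validation failed");
    }
}
```

Logging every failing extent before the panic keeps the full list in the log even though the process stops at the first call.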