Enable restoring an earlier snapshot if latest fails #4917
Open
pcholakov wants to merge 1 commit into
This was referenced Jun 11, 2026
Test Results: 5 files, 5 suites, 1m 41s ⏱️ Results for commit 169d0cc.
Test Results: 8 files, 8 suites, 4m 46s ⏱️ Results for commit 0966914. ♻️ This comment has been updated with latest results.
pcholakov commented Jun 11, 2026
    #[error("a partition snapshot is required")]
    SnapshotRequired,
    #[error("partition snapshot was found but unsuitable; it was taken before the log trim point")]
    SnapshotUnsuitable,
Contributor
Author
Nothing branches on this variant; collapsed into SnapshotRequired.
When the latest snapshot fails to download (network error, corrupt metadata, missing files), the system now automatically tries older retained snapshots in descending LSN order until one succeeds or all candidates are exhausted.

resolve: zmkrmxtk (restore fallback) conflicts
resolve: zmkrmxtk absorbed chaos test conflict
resolve: zmkrmxtk fmt absorb
doc: clarify retained_snapshots ordering invariant
wip: resolve fallback-fix conflicts on main
fix: address review findings on fallback restore
review round 2: enrich diagnostics, tighten visibility, fix docstring, metric
169d0cc to 0966914
This change makes partition store restore resilient to a missing or corrupt latest snapshot: when the most recent retained snapshot fails to download (e.g. missing or corrupt SSTs), the partition processor now falls back to trying older retained snapshots instead of giving up.
Key changes:
- SnapshotRepository::get_snapshot_candidates(partition_id) returns the partition's retained snapshot references (descending LSN). Selecting a snapshot suitable for a given target LSN is the caller's job.
- SnapshotRepository::get_snapshot(partition_id, snapshot_ref) downloads one specific snapshot (refactored out of the old download_latest_snapshot).
- Snapshots::download_snapshot(partition_id, target_lsn) filters candidates by the required LSN and loops over them, returning the first that downloads successfully.
- The partition_store_manager open path handles the new semantics.

Correctness hardening (from review):
- download_snapshot returns Result<Option<..>>: Ok(None) means "no snapshot present" (provision an empty store), while a transient repository error or an all-candidates-failed outcome returns Err, so the open fails and retries rather than silently provisioning an empty partition.
- min_applied_lsn is re-verified against the required target LSN; a mismatch falls back to the next candidate.
- … Ok(None)).

Metrics
- restate.partition_store.snapshots.download.fallback.total (SNAPSHOT_DOWNLOAD_FALLBACK): incremented when a restore succeeds using a non-latest (fallback) snapshot.
- restate.partition_store.snapshots.fast_forward.failed.total (SNAPSHOT_FAST_FORWARD_FAILED): incremented when a fast-forward past a log trim gap cannot be satisfied (no suitable snapshot available).
- SNAPSHOT_DOWNLOAD_FAILED is now also incremented when a required snapshot (target LSN present) cannot be obtained, i.e. every candidate failed to download, or none reaches the target LSN.

This is the "quick fix" half of the original #4222, split out so it can merge against main on its own. The incremental-snapshot chaos test that stresses this path is in a separate follow-up PR (#4918). (#4222 was inadvertently auto-closed as merged during the stack reorder; this PR supersedes it.)

Fixes #3930
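
For readers skimming the description, the fallback loop and the three-way result described above can be sketched roughly as follows. This is an illustrative, self-contained sketch and not the code in this PR: SnapshotRef, RestoreError, download_with_fallback and the download closure are hypothetical stand-ins, the real functions are async and take a partition_id, and the suitability check (min_applied_lsn >= target_lsn) plus the choice to return Err when no candidate reaches the target are assumptions.

```rust
/// Hypothetical stand-in for a log sequence number.
type Lsn = u64;

/// Hypothetical stand-in for a retained snapshot reference.
#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotRef {
    pub id: String,
    pub min_applied_lsn: Lsn,
}

/// Hypothetical error type; the PR's real error variants may differ.
#[derive(Debug, PartialEq)]
pub enum RestoreError {
    /// Snapshots exist, but none reaches the required target LSN (assumed to be an error).
    NoSuitableSnapshot,
    /// Every eligible candidate failed to download or failed re-verification.
    AllCandidatesFailed { attempted: usize },
}

/// Tries candidates in the order given, which is assumed to be descending LSN
/// (latest first). `download` stands in for fetching one specific snapshot.
///
/// - `Ok(None)`: no snapshot present, the caller may provision an empty store.
/// - `Ok(Some(_))`: the first candidate that downloaded and passed re-verification.
/// - `Err(_)`: snapshots exist but none could be used, so the open should fail and retry.
pub fn download_with_fallback<F>(
    candidates: &[SnapshotRef],
    target_lsn: Option<Lsn>,
    mut download: F,
) -> Result<Option<SnapshotRef>, RestoreError>
where
    F: FnMut(&SnapshotRef) -> Result<SnapshotRef, String>,
{
    if candidates.is_empty() {
        return Ok(None);
    }

    // Assumed suitability rule: the snapshot must reach the target LSN, if one is required.
    let suitable = |s: &SnapshotRef| target_lsn.map_or(true, |t| s.min_applied_lsn >= t);

    let eligible: Vec<&SnapshotRef> = candidates.iter().filter(|c| suitable(c)).collect();
    if eligible.is_empty() {
        return Err(RestoreError::NoSuitableSnapshot);
    }

    let mut attempted = 0;
    for candidate in eligible {
        attempted += 1;
        match download(candidate) {
            // Re-verify the downloaded snapshot against the target before accepting it.
            Ok(snapshot) if suitable(&snapshot) => return Ok(Some(snapshot)),
            // Download failure or LSN mismatch: fall back to the next (older) candidate.
            Ok(_) | Err(_) => continue,
        }
    }
    Err(RestoreError::AllCandidatesFailed { attempted })
}
```

In this sketch a caller would bump a fallback counter whenever the returned snapshot is not the first candidate, and a failure counter on Err; where the real metrics are incremented is not shown here.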