Enable restoring an earlier snapshot if latest fails#4917

Open
pcholakov wants to merge 1 commit into main from restore-earlier-snapshot

@pcholakov pcholakov commented Jun 11, 2026

Contributor

This change makes partition store restore resilient to a failing latest snapshot: when the most recent retained snapshot cannot be downloaded (e.g. because of missing or corrupt SSTs), the partition processor now falls back to older retained snapshots instead of giving up.

Key changes:

  • SnapshotRepository::get_snapshot_candidates(partition_id) returns the partition's retained snapshot references (descending LSN). Selecting a snapshot suitable for a given target LSN is the caller's job.
  • SnapshotRepository::get_snapshot(partition_id, snapshot_ref) downloads one specific snapshot (refactored out of the old download_latest_snapshot).
  • Snapshots::download_snapshot(partition_id, target_lsn) filters candidates by the required LSN and loops over them, returning the first that downloads successfully.
  • The partition_store_manager open path is updated to handle the new semantics.
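
As a rough illustration of the filter-then-loop behaviour described for download_snapshot, here is a minimal standalone sketch. The SnapshotRef struct, its fields, the first_downloadable function and the try_download callback are all hypothetical stand-ins, not the types or signatures in this PR. It also assumes a candidate is suitable when its LSN is at or above the target, and it collapses every unsuccessful outcome into None for brevity.

```rust
/// Hypothetical stand-in for a retained snapshot reference; the real type
/// and its fields are not shown in this PR description.
#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotRef {
    pub snapshot_id: String,
    pub min_applied_lsn: u64,
}

/// Returns the first candidate that both reaches `target_lsn` and downloads
/// successfully. `candidates` is assumed to be sorted by descending LSN.
pub fn first_downloadable<F>(
    candidates: &[SnapshotRef],
    target_lsn: Option<u64>,
    mut try_download: F,
) -> Option<SnapshotRef>
where
    F: FnMut(&SnapshotRef) -> Result<(), String>,
{
    for candidate in candidates {
        // Assumption: a candidate is suitable when its LSN is at or above the target.
        let reaches_target = target_lsn.map_or(true, |t| candidate.min_applied_lsn >= t);
        if !reaches_target {
            continue;
        }
        match try_download(candidate) {
            Ok(()) => return Some(candidate.clone()),
            // A failed download is not fatal here: move on to the next, older candidate.
            Err(_reason) => continue,
        }
    }
    None
}
```

Keeping the download step behind a callback makes the ordering and fallback logic testable without a real repository, which is roughly how the unit tests mentioned under correctness hardening could exercise it.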

Correctness hardening (from review):

  • download_snapshot returns Result<Option<..>>: Ok(None) means "no snapshot present" (provision an empty store), while a transient repository error or an all-candidates-failed outcome returns Err so the open fails and retries rather than silently provisioning an empty partition.
  • Defense-in-depth: after downloading a candidate, its min_applied_lsn is re-verified against the required target LSN; a mismatch falls back to the next candidate.
  • Diagnostics distinguish "no snapshot present in the repository" from "snapshots exist but all predate the required LSN (log trimmed past them)" — different operator remediations.
  • Unit tests for candidate ordering and the fallback path (latest fails → older restored; all fail → error; target above all retained → Ok(None)).
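
The Result<Option<..>> contract and the re-verification step above can be sketched roughly as follows. This is an illustrative model only: Downloaded, Absence, AllCandidatesFailed, download_with_fallback and classify_absence are hypothetical names, candidates are modelled as plain (id, advertised LSN) pairs, and the transient-repository-error case is left out because the candidate list is passed in already fetched.

```rust
/// Hypothetical result of downloading one candidate; `min_applied_lsn` is
/// what the downloaded snapshot actually reports.
#[derive(Debug, Clone, PartialEq)]
pub struct Downloaded {
    pub id: String,
    pub min_applied_lsn: u64,
}

/// Hypothetical error: suitable candidates existed, but none could be restored.
#[derive(Debug, PartialEq)]
pub struct AllCandidatesFailed {
    pub attempted: usize,
}

/// Hypothetical diagnostic label explaining an `Ok(None)` result.
#[derive(Debug, PartialEq)]
pub enum Absence {
    NoSnapshotInRepository,
    AllPredateTarget,
}

/// `candidates` holds (id, advertised LSN) pairs, assumed sorted by descending LSN.
pub fn download_with_fallback<F>(
    candidates: &[(String, u64)],
    target_lsn: Option<u64>,
    mut fetch: F,
) -> Result<Option<Downloaded>, AllCandidatesFailed>
where
    F: FnMut(&str) -> Result<Downloaded, String>,
{
    let mut attempted = 0;
    for (id, advertised_lsn) in candidates {
        // Skip candidates that cannot satisfy the target without downloading them.
        if target_lsn.map_or(false, |t| *advertised_lsn < t) {
            continue;
        }
        attempted += 1;
        match fetch(id.as_str()) {
            // Defense-in-depth: re-check the LSN of what was actually downloaded.
            Ok(snap) if target_lsn.map_or(true, |t| snap.min_applied_lsn >= t) => {
                return Ok(Some(snap));
            }
            // Download error or LSN mismatch: fall back to the next candidate.
            _ => {}
        }
    }
    if attempted == 0 {
        // Nothing suitable to try: the caller decides how to proceed.
        Ok(None)
    } else {
        // Candidates existed but all failed: surface an error so the open retries.
        Err(AllCandidatesFailed { attempted })
    }
}

/// Meant to be called only after `download_with_fallback` returned `Ok(None)`.
pub fn classify_absence(candidates: &[(String, u64)]) -> Absence {
    if candidates.is_empty() {
        Absence::NoSnapshotInRepository
    } else {
        Absence::AllPredateTarget
    }
}
```

Returning Err only when at least one candidate was attempted keeps "nothing to restore" distinguishable from "restore was possible but failed", which is the distinction that prevents silently provisioning an empty partition.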

Metrics

  • restate.partition_store.snapshots.download.fallback.total (SNAPSHOT_DOWNLOAD_FALLBACK) — incremented when a restore succeeds using a non-latest (fallback) snapshot.
  • restate.partition_store.snapshots.fast_forward.failed.total (SNAPSHOT_FAST_FORWARD_FAILED) — incremented when a fast-forward past a log trim gap cannot be satisfied (no suitable snapshot available).
  • SNAPSHOT_DOWNLOAD_FAILED is now also incremented when a required snapshot (target LSN present) cannot be obtained — i.e. every candidate failed to download, or none reaches the target LSN.
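
To make the increment conditions for the fallback and download-failed counters concrete, the sketch below maps restore outcomes to two in-memory counters. It is an assumption-laden model: RestoreCounters, RestoreOutcome and record_outcome are hypothetical names, plain atomics stand in for the real metrics facade, "non-latest" is taken to mean any position after the first in the descending candidate list, and the pre-existing SNAPSHOT_DOWNLOAD_FAILED increments and the fast-forward counter are out of scope.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical in-memory stand-ins for the fallback and download-failed counters.
#[derive(Debug, Default)]
pub struct RestoreCounters {
    pub download_fallback: AtomicU64,
    pub download_failed: AtomicU64,
}

/// Hypothetical summary of one restore attempt.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RestoreOutcome {
    /// Restored from the candidate at this position; 0 is the latest snapshot.
    Restored { position: usize },
    /// No retained snapshot reaches the target LSN.
    NoneReachesTarget,
    /// Every attempted candidate failed to download.
    AllFailed,
}

pub fn record_outcome(
    counters: &RestoreCounters,
    outcome: RestoreOutcome,
    target_lsn_present: bool,
) {
    match outcome {
        // Success via an older snapshot counts as a fallback.
        RestoreOutcome::Restored { position } if position > 0 => {
            counters.download_fallback.fetch_add(1, Ordering::Relaxed);
        }
        // Success via the latest snapshot touches neither counter.
        RestoreOutcome::Restored { .. } => {}
        // A required snapshot (target LSN present) could not be obtained.
        RestoreOutcome::NoneReachesTarget | RestoreOutcome::AllFailed if target_lsn_present => {
            counters.download_failed.fetch_add(1, Ordering::Relaxed);
        }
        // Increments that predate this change are not modelled here.
        _ => {}
    }
}
```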

This is the "quick fix" half of the original #4222, split out so it can merge against main on its own. The incremental-snapshot chaos test that stresses this path is in a separate follow-up PR (#4918). (#4222 was inadvertently auto-closed as merged during the stack reorder; this PR supersedes it.)

Fixes #3930

@github-actions

Test Results

 5 files   5 suites   1m 41s ⏱️
45 tests 45 ✅ 0 💤 0 ❌
68 runs  68 ✅ 0 💤 0 ❌

Results for commit 169d0cc.

github-actions Bot commented Jun 11, 2026

Test Results

  8 files    8 suites   4m 46s ⏱️
 60 tests  60 ✅ 0 💤 0 ❌
267 runs  267 ✅ 0 💤 0 ❌

Results for commit 0966914.

♻️ This comment has been updated with latest results.

#[error("a partition snapshot is required")]
SnapshotRequired,
#[error("partition snapshot was found but unsuitable; it was taken before the log trim point")]
SnapshotUnsuitable,

Contributor Author

Nothing branches on this variant, so it has been collapsed into SnapshotRequired.

When the latest snapshot fails to download (network error, corrupt metadata,
missing files), the system now automatically tries older retained snapshots
in descending LSN order until one succeeds or all candidates are exhausted.

resolve: zmkrmxtk (restore fallback) conflicts

resolve: zmkrmxtk absorbed chaos test conflict

resolve: zmkrmxtk fmt absorb

doc: clarify retained_snapshots ordering invariant

wip: resolve fallback-fix conflicts on main

fix: address review findings on fallback restore

review round 2: enrich diagnostics, tighten visibility, fix docstring, metric
@pcholakov pcholakov force-pushed the restore-earlier-snapshot branch from 169d0cc to 0966914 Compare June 11, 2026 10:31
@pcholakov pcholakov marked this pull request as ready for review June 11, 2026 11:19
@pcholakov pcholakov requested a review from MohamedBassem June 11, 2026 11:25
Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow to resume partition processor not only from the latest snapshot

1 participant