ask-kamal-nayan commented on Dec 24, 2025

Description

This PR implements the remote store recovery flow for CompositeEngine/DataFusion, enabling indices that use non-Lucene data formats (Parquet, Arrow, etc.) to recover from the remote store after node failures, restarts, and replica promotions.

Key Features

1. Format-Aware Recovery from Remote Store

  • Recovery flow now preserves FileMetadata format information (e.g., "parquet", "arrow") when syncing segments from remote store
  • syncSegmentsFromRemoteSegmentStore() uses format-aware FileMetadata keys instead of string-based keys
  • Format-aware checksum validation ensures data integrity during recovery
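
To illustrate why format-aware keys matter, here is a minimal sketch. It is not the PR's code: `FormatKey` and `FormatAwareSync` are hypothetical stand-ins for the real FileMetadata handling, and CRC32 is used only as an example checksum. The point is that keying by (format, file name) keeps two same-named files from different formats apart, so each is validated against its own checksum.

```java
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical stand-in for a format-aware key; NOT the real FileMetadata class.
record FormatKey(String dataFormat, String fileName) {}

final class FormatAwareSync {
    // CRC32 over the file bytes; an illustrative checksum, assumed for this sketch.
    static long checksum(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes);
        return crc.getValue();
    }

    // True only if a checksum was recorded for this exact (format, name) pair
    // and the downloaded bytes match it.
    static boolean verify(Map<FormatKey, Long> expected, FormatKey key, byte[] downloaded) {
        Long want = expected.get(key);
        return want != null && want == checksum(downloaded);
    }
}
```

With a plain string key such as the file name alone, the second `put` for a same-named file would overwrite the first, and one of the two files would be checked against the wrong checksum.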

2. CompositeEngine Empty Store Handling

  • Handles FileNotFoundException during recovery when the local store is empty (common in remote store recovery scenarios)
  • Reads the translog UUID directly from the translog header and creates an initial empty commit
  • Properly initializes LocalCheckpointTracker even when no prior commits exist
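
The empty-store fallback described above can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's implementation: `CommitReader`, `TranslogHeaderReader`, and `loadOrCreateCommitData` are hypothetical names, the map keys are illustrative strings, and -1 is used as the "no operations performed" checkpoint value.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

final class EmptyStoreBootstrap {
    // Assumed sentinel meaning "no operations have been processed yet".
    static final long NO_OPS_PERFORMED = -1L;

    // Hypothetical abstraction over reading the last commit's user data.
    interface CommitReader {
        Map<String, String> readLastCommitUserData() throws IOException;
    }

    // Hypothetical abstraction over reading the UUID stored in the translog header.
    interface TranslogHeaderReader {
        String readTranslogUuid() throws IOException;
    }

    // Returns the last commit's user data, or builds initial empty-commit data
    // when the local store has no commit files yet.
    static Map<String, String> loadOrCreateCommitData(CommitReader commits, TranslogHeaderReader header)
        throws IOException {
        try {
            return commits.readLastCommitUserData();
        } catch (FileNotFoundException e) {
            // Empty local store: seed the commit data from the translog header.
            Map<String, String> initial = new HashMap<>();
            initial.put("translog_uuid", header.readTranslogUuid());
            initial.put("local_checkpoint", Long.toString(NO_OPS_PERFORMED));
            initial.put("max_seq_no", Long.toString(NO_OPS_PERFORMED));
            return initial;
        }
    }
}
```

Only FileNotFoundException triggers the fallback in this sketch; any other IOException still propagates, so genuine I/O failures are not mistaken for an empty store.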

3. Engine Reset for Recovery

  • Refactored resetEngineToGlobalCheckpoint() to properly initialize CompositeEngine with fresh translog BEFORE creating InternalEngine
  • Ensures checkpoint data is preserved in CatalogSnapshot.userData before serialization for recovery consistency

4. Checkpoint Tracking

  • Added LastRefreshedCheckpointListener to CompositeEngine for tracking refresh checkpoints (required by RemoteStoreRefreshListener)
  • New APIs: lastRefreshedCheckpoint(), currentOngoingRefreshCheckpoint()
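
As a rough picture of how a refresh checkpoint listener of this kind can work, here is a minimal sketch. `RefreshCheckpointTracker` is a hypothetical class, not the listener added in this PR; only the two accessor names are taken from the description above, and the processed-checkpoint source is passed in as a plain `LongSupplier`.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch of a refresh listener that tracks checkpoints.
final class RefreshCheckpointTracker {
    private final LongSupplier processedCheckpoint; // assumed source of the processed local checkpoint
    private final AtomicLong refreshed;             // highest checkpoint made visible by a completed refresh
    private volatile long pending;                  // checkpoint captured when the current refresh started

    RefreshCheckpointTracker(LongSupplier processedCheckpoint, long initialCheckpoint) {
        this.processedCheckpoint = processedCheckpoint;
        this.refreshed = new AtomicLong(initialCheckpoint);
        this.pending = initialCheckpoint;
    }

    // Capture the checkpoint before the refresh begins.
    void beforeRefresh() {
        pending = processedCheckpoint.getAsLong();
    }

    // Publish the captured checkpoint only if the refresh actually happened;
    // the max keeps the refreshed checkpoint from moving backwards.
    void afterRefresh(boolean didRefresh) {
        if (didRefresh) {
            refreshed.accumulateAndGet(pending, Math::max);
        }
    }

    long lastRefreshedCheckpoint() {
        return refreshed.get();
    }

    long currentOngoingRefreshCheckpoint() {
        return pending;
    }
}
```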

5. CatalogSnapshot Recovery Support

  • Added setUserData() method to CatalogSnapshot for recovery scenarios
  • Checkpoint data (LOCAL_CHECKPOINT_KEY, MAX_SEQ_NO, TRANSLOG_UUID_KEY, HISTORY_UUID_KEY) now properly preserved through recovery
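
One way to picture the userData handling is a pre-serialization check that all four checkpoint entries are present. The sketch below is hypothetical: `SnapshotUserDataSketch` is not the real CatalogSnapshot, the key strings are illustrative values for the constants named above, and `missingCheckpointKeys()` is a helper invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a snapshot that carries commit user data.
final class SnapshotUserDataSketch {
    // Illustrative key strings, assumed for this sketch.
    static final List<String> REQUIRED_KEYS =
        List.of("local_checkpoint", "max_seq_no", "translog_uuid", "history_uuid");

    private Map<String, String> userData = new HashMap<>();

    // Replace the user data wholesale, as a recovery path might do.
    void setUserData(Map<String, String> data) {
        this.userData = new HashMap<>(data); // defensive copy
    }

    Map<String, String> getUserData() {
        return Map.copyOf(userData);
    }

    // Keys that would be lost if the snapshot were serialized right now.
    List<String> missingCheckpointKeys() {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            if (!userData.containsKey(key)) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

A caller could refuse to serialize the snapshot while `missingCheckpointKeys()` is non-empty, which makes a dropped checkpoint entry fail fast instead of surfacing later during recovery.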

Test Coverage

Added an integration test suite, DataFusionRemoteStoreRecoveryTests, covering:

  • Basic remote store recovery with format-aware metadata preservation
  • Recovery with multiple Parquet generation files
  • Replica promotion to primary with format preservation
  • Primary restart with extra local commits (commit conflict resolution)

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

coderabbitai bot commented Dec 24, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

❌ Gradle check result for 7d12a6a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for db21058: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

ask-kamal-nayan changed the title from "Recovery flow prod" to "Remote store recovery support for DataFusion indices" on Dec 29, 2025

@github-actions
Contributor

❌ Gradle check result for cd942d4: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

❌ Gradle check result for 9bb11f3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
