@ask-kamal-nayan

Description

This PR implements the remote store recovery flow for CompositeEngine/DataFusion, enabling indices using non-Lucene data formats (Parquet, Arrow, etc.) to properly recover from remote store after node failures, restarts, and replica promotions.

Key Features

1. Format-Aware Recovery from Remote Store

  • Recovery flow now preserves FileMetadata format information (e.g., "parquet", "arrow") when syncing segments from remote store
  • syncSegmentsFromRemoteSegmentStore() uses format-aware FileMetadata keys instead of string-based keys
  • Format-aware checksum validation ensures data integrity during recovery
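
One way to picture the switch from string keys to format-aware keys is the minimal sketch below. It assumes a hypothetical `FileKey` record standing in for the PR's FileMetadata (whose real fields are not shown in this description) and a hypothetical `checksumMatches` helper; neither name comes from the PR. The point is only that a key combining format and file name keeps entries apart that a bare file-name string would merge.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for FileMetadata: the key is (data format, file name),
// so equality and hashing take the format into account.
record FileKey(String dataFormat, String fileName) {}

final class FormatAwareChecksums {
    // True only if the remote metadata holds an entry for this exact
    // (format, file) pair and its checksum equals the locally computed one.
    static boolean checksumMatches(Map<FileKey, Long> remote, FileKey key, long localChecksum) {
        Long expected = remote.get(key);
        return expected != null && expected == localChecksum;
    }

    public static void main(String[] args) {
        Map<FileKey, Long> remote = new HashMap<>();
        // Illustrative file names and checksums, not taken from the PR.
        remote.put(new FileKey("parquet", "gen_1.data"), 111L);
        remote.put(new FileKey("arrow", "gen_1.data"), 222L);

        // Same file name under two formats: both entries are retained.
        System.out.println(remote.size());
        System.out.println(checksumMatches(remote, new FileKey("parquet", "gen_1.data"), 111L));
        System.out.println(checksumMatches(remote, new FileKey("arrow", "gen_1.data"), 111L));
    }
}
```

With a plain `Map<String, Long>` keyed on file name, the second `put` would overwrite the first, and the checksum check could not tell which format it was validating.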

2. CompositeEngine Empty Store Handling

  • Handles FileNotFoundException during recovery when local store is empty (common in remote store recovery scenarios)
  • Reads translog UUID directly from translog header and creates initial empty commit
  • Properly initializes LocalCheckpointTracker even when no prior commits exist
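
The empty-store path described above can be sketched roughly as follows. This is not the CompositeEngine code: `Commit`, `CommitStore`, `InMemoryCommitStore`, and `loadOrCreateInitialCommit` are hypothetical names, and the user-data keys and the `-1` "no operations" sentinel are assumptions made for illustration. The sketch shows the control flow only: try to read the last commit, and on FileNotFoundException create an initial empty commit from the translog UUID read from the header.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;

// Hypothetical, simplified commit point: only its user data matters here.
record Commit(Map<String, String> userData) {}

// Hypothetical store abstraction, not an OpenSearch interface.
interface CommitStore {
    Commit readLastCommit() throws IOException; // FileNotFoundException when no commit exists
    void writeCommit(Commit commit);
}

// In-memory store used only to exercise the sketch.
final class InMemoryCommitStore implements CommitStore {
    private Commit last;

    @Override
    public Commit readLastCommit() throws IOException {
        if (last == null) {
            throw new FileNotFoundException("no commit point in store");
        }
        return last;
    }

    @Override
    public void writeCommit(Commit commit) {
        last = commit;
    }
}

final class EmptyStoreBootstrap {
    // Assumed sentinel meaning "no operations have been performed yet".
    static final long NO_OPS_PERFORMED = -1L;

    // Returns the existing commit, or creates and persists an initial empty one
    // when the local store has no commit (the remote-store recovery case).
    static Commit loadOrCreateInitialCommit(CommitStore store, String translogUuidFromHeader) {
        try {
            return store.readLastCommit();
        } catch (FileNotFoundException e) {
            // Key names below are assumptions for this sketch.
            Commit initial = new Commit(Map.of(
                "translog_uuid", translogUuidFromHeader,
                "local_checkpoint", Long.toString(NO_OPS_PERFORMED),
                "max_seq_no", Long.toString(NO_OPS_PERFORMED)
            ));
            store.writeCommit(initial);
            return initial;
        } catch (IOException e) {
            // Any other I/O failure is not an "empty store" signal; surface it.
            throw new UncheckedIOException(e);
        }
    }
}
```

Seeding the checkpoint values in the initial commit is what lets a checkpoint tracker start from a defined state even though no prior commit exists.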

3. Engine Reset for Recovery

  • Refactored resetEngineToGlobalCheckpoint() to properly initialize CompositeEngine with fresh translog BEFORE creating InternalEngine
  • Ensures checkpoint data is preserved in CatalogSnapshot.userData before serialization for recovery consistency

4. Checkpoint Tracking

  • Added LastRefreshedCheckpointListener to CompositeEngine for tracking refresh checkpoints (required by RemoteStoreRefreshListener)
  • New APIs: lastRefreshedCheckpoint(), currentOngoingRefreshCheckpoint()
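
For readers unfamiliar with refresh checkpoint tracking, here is a minimal sketch of the general pattern: capture the processed checkpoint before a refresh, and publish it as "last refreshed" only if the refresh actually happened. The class `RefreshCheckpointTracker` and its constructor are hypothetical; only the two accessor names mirror the APIs listed above, and how the PR wires its listener into CompositeEngine is not shown in this description.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical, self-contained tracker; not the PR's LastRefreshedCheckpointListener.
final class RefreshCheckpointTracker {
    private final LongSupplier processedCheckpoint; // source of the current processed checkpoint
    private final AtomicLong pending;               // checkpoint captured for the in-flight refresh
    private final AtomicLong refreshed;             // checkpoint covered by the last completed refresh

    RefreshCheckpointTracker(long initialCheckpoint, LongSupplier processedCheckpoint) {
        this.processedCheckpoint = processedCheckpoint;
        this.pending = new AtomicLong(initialCheckpoint);
        this.refreshed = new AtomicLong(initialCheckpoint);
    }

    // Called just before a refresh: remember what the refresh will cover.
    void beforeRefresh() {
        pending.set(processedCheckpoint.getAsLong());
    }

    // Called after a refresh attempt: advance only if the refresh took effect,
    // and never move the refreshed checkpoint backwards.
    void afterRefresh(boolean didRefresh) {
        if (didRefresh) {
            long captured = pending.get();
            refreshed.updateAndGet(current -> Math.max(current, captured));
        }
    }

    long lastRefreshedCheckpoint() {
        return refreshed.get();
    }

    long currentOngoingRefreshCheckpoint() {
        return pending.get();
    }
}
```

Capturing the value in `beforeRefresh()` rather than reading it in `afterRefresh()` matters: operations processed while the refresh is running are not guaranteed to be visible, so they should not be reported as refreshed.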

5. Lucene Index Recovery Support

  • Updated the SyncSegmentsFromRemoteStore API and the recovery flow to handle both optimized and non-optimized indices during recovery

Test Coverage

Added comprehensive integration test suite DataFusionRemoteStoreRecoveryTests covering:

  • Basic remote store recovery with format-aware metadata preservation
  • Recovery with multiple Parquet generation files
  • Replica promotion to primary with format preservation
  • Primary restart with extra local commits (commit conflict resolution)

Known Issues

  • Most RemoteIndexRecoveryIT tests pass, but RemoteIndexRecoveryIT::TestSnapshotRecovery, RemoteIndexRecoveryIT::TestRerouteRecovery, and 2 more tests fail with a similar AlreadySetException while creating SearchContext.
  • Refresh does not happen after a replica is promoted to primary for a non-optimized index.

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Kamal Nayan and others added 29 commits December 24, 2025 18:43
Signed-off-by: Kamal Nayan <[email protected]>

github-actions bot commented Jan 6, 2026

❌ Gradle check result for d288377: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


github-actions bot commented Jan 6, 2026

❌ Gradle check result for ba64eea: FAILURE

github-actions bot commented Jan 6, 2026

❌ Gradle check result for 4ba76b2: FAILURE

github-actions bot commented Jan 6, 2026

❌ Gradle check result for 0d2ced5: FAILURE

github-actions bot commented Jan 6, 2026

❌ Gradle check result for 77ed2a2: FAILURE

github-actions bot commented Jan 7, 2026

❌ Gradle check result for a9f5058: FAILURE

github-actions bot commented Jan 8, 2026

❌ Gradle check result for 9de9311: FAILURE

github-actions bot commented Jan 9, 2026

❌ Gradle check result for da95ed4: FAILURE

