Skip to content

sync/fix: Clear gap sync on known imported blocks #8445

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

lexnv
Copy link
Contributor

@lexnv lexnv commented May 6, 2025

This PR ensures that warp sync gaps are properly cleared when known blocks are imported. Previously, gaps were only removed in response to ImportedUnknown events.

This limitation caused issues for asset-hub and bridge-hub collators, which remained stuck in the "Block history" state without progressing.

The root cause lies in the client.info() reporting a gap during node startup or restart (ie block verification fails). In some cases, a peer may respond with the missing blocks after we’ve already imported them locally, leaving the gap open.

Grafana link: https://grafana.teleport.parity.io/goto/jCcsBLxNg?orgId=1

Traces from production:

2025-05-06 12:55:34.251 DEBUG                 main sync: [Parachain] Starting gap sync #4935955 - #4935955    

2025-05-06 12:55:34.558 TRACE tokio-runtime-worker sync: [Parachain] New gap block request for 12D3KooWAVQMhkXmc5ueSYasdsRWQbKus2YGZ6HDZUB4ViJMCxXy, (best:5103253, common:5103253) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Number(4935955), direction: Descending, max: Some(1) }    

2025-05-06 12:55:34.558 TRACE tokio-runtime-worker sync: [Parachain] Processed `SyncingAction::StartRequest` to 12D3KooWAVQMhkXmc5ueSYasdsRWQbKus2YGZ6HDZUB4ViJMCxXy with strategy key StrategyKey("ChainSync").    

2025-05-06 12:55:34.608 TRACE tokio-runtime-worker sync: [Parachain] BlockResponse 0 from 12D3KooWAVQMhkXmc5ueSYasdsRWQbKus2YGZ6HDZUB4ViJMCxXy with 1 blocks  (4935955)    

2025-05-06 12:55:34.608 DEBUG tokio-runtime-worker sync: [Parachain] Drained 1 gap blocks from 4935954    
	
2025-05-06 12:55:35.511 TRACE tokio-runtime-worker sync::import-queue: [Parachain] Starting import of 1 blocks  (4935955)    

2025-05-06 12:55:35.517 TRACE tokio-runtime-worker sync::import-queue: [Parachain] Block already in chain 4935955: 0x63db2b40cccac020fbc922e5e98bb3955f4cdaa823a2be85ecf22776745ccacc    

2025-05-06 12:55:35.517 TRACE tokio-runtime-worker sync::import-queue: [Parachain] Block imported successfully Some(4935955) (0x63db…cacc)    

2025-05-06 12:55:35.517 TRACE tokio-runtime-worker sync: [Parachain] Cleared blocks from 4935955 to 4935956    

Testing Done

Added two tests to verify that warp sync gaps are correctly cleared under both block import scenarios. The first test closely follows the operations performed by the node, while the second one emulates the imports.

Next Steps

Added extra debug logs to monitor if the issue persists (pointing towards a corupt database -- ie client.info() always has the gap present).

Closes: #8416

cc @paritytech/networking

@lexnv lexnv self-assigned this May 6, 2025
@lexnv lexnv added T0-node This PR/Issue is related to the topic “node”. I2-bug The node fails to follow expected behavior. labels May 6, 2025
@lexnv lexnv added this to Networking May 6, 2025
@lexnv lexnv requested a review from a team May 6, 2025 14:39
@paritytech-workflow-stopper
Copy link

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/14862527642
Failed job name: test-linux-stable-no-try-runtime

@lexnv
Copy link
Contributor Author

lexnv commented May 6, 2025

Unrelated test failing

 ──── TRY 1 STDOUT:       polkadot-sdk-docs guides::your_first_node::tests::guide_first_runtime_works

running 1 test
test guides::your_first_node::tests::guide_first_runtime_works has been running for over 60 seconds
test guides::your_first_node::tests::guide_first_runtime_works ... FAILED

Copy link
Contributor

@skunert skunert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. Just one more sanity check:

  • When we warp sync to block x, the target of the gap sync will be block x - 1.
  • There is not reasonable way that we import a known block x - 1 without completing the gap sync.

I am just wondering why it was originally done like this.

@@ -992,6 +987,18 @@ where
Ok(sync)
}

/// Complete the gap sync if the target number is reached and there is a gap.
fn maybe_complete_gap_sync(&mut self, number: NumberFor<B>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe complete_gap_if_target?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I2-bug The node fails to follow expected behavior. T0-node This PR/Issue is related to the topic “node”.
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

Investigate why collators constantly are in "Block history" state
4 participants