
kvserver: clear RHS when learning about a split through a snapshot #148781

Open
wants to merge 1 commit into master
Conversation

arulajmani
Collaborator

Previously, it was possible for us to leak the (post-split) RHS keyspans if we learned about a split through a snapshot. This leak could be long-lived if the RHS replica was moved away from the store, because the store would then never receive a snapshot for the RHS that would force those keyspans to be cleared out. This hazard was shown in #148707.

We know there's been a split if the store receives a snapshot that's narrower than the replica it was previously tracking. In the general case, when merges are involved, a store knows that there have been N merges followed by a split if the snapshot overlaps N (necessarily contiguous) existing replicas but its keyspan is still narrower than the combined keyspan of those replicas and the one it was previously tracking.

Interestingly, we were already handling the general case correctly, but it required there to be at least one merge before the split. This patch lifts that logic out of clearSubsumedReplicaDiskData and moves it into a similarly named clearSplitReplicaDiskData, tying the narrowing logic more directly to splits instead of conflating it with merges.
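
To make the narrowing rule concrete, here is a minimal sketch (a hypothetical helper, not the code in this patch; the name leakedRHSSpan is made up) of how a store could derive the leaked RHS span from the descriptor it was tracking, the overlapping (subsumed) descriptors, and the snapshot's descriptor:

package kvserver

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// leakedRHSSpan returns the post-split RHS span that this store would
// otherwise leak, if any. prevDesc is the descriptor the store was previously
// tracking, subsumedDescs are the overlapping replicas (sorted by start key),
// and snapDesc is the incoming snapshot's descriptor.
func leakedRHSSpan(
    prevDesc *roachpb.RangeDescriptor,
    subsumedDescs []*roachpb.RangeDescriptor,
    snapDesc *roachpb.RangeDescriptor,
) (roachpb.RSpan, bool) {
    // The widest end key the store currently has data for: the replica it was
    // tracking plus any replicas the snapshot subsumes.
    widestEndKey := prevDesc.EndKey
    if n := len(subsumedDescs); n > 0 {
        if last := subsumedDescs[n-1].EndKey; last.Compare(widestEndKey) > 0 {
            widestEndKey = last
        }
    }
    // If the snapshot ends before that, a split happened elsewhere and
    // [snapDesc.EndKey, widestEndKey) is RHS data that must be cleared here.
    if snapDesc.EndKey.Compare(widestEndKey) < 0 {
        return roachpb.RSpan{Key: snapDesc.EndKey, EndKey: widestEndKey}, true
    }
    return roachpb.RSpan{}, false
}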

Closes #73462

Release note: None

@arulajmani arulajmani requested review from pav-kv and tbg June 25, 2025 01:04
@arulajmani arulajmani requested a review from a team as a code owner June 25, 2025 01:04

Comment on lines +587 to +590
// NB: subsumedDescs in prepareSnapApplyInput must be sorted by start key.
sort.Slice(subsumedDescs, func(i, j int) bool {
    return subsumedDescs[i].StartKey.Compare(subsumedDescs[j].StartKey) < 0
})
Collaborator

Looks unnecessary. This slice is already sorted, by construction here. If you like, assert slices.IsSorted here, or in prepareSnapApply?

The code in clearSubsumedReplicaDiskData might have been misleading before, giving the impression that this slice can be unsorted.
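
For reference, a sketch of the suggested assertion, assuming Go's standard slices package and that subsumedDescs is a []*roachpb.RangeDescriptor (adjust to the actual element type):

// Assert, rather than re-establish, the ordering that holds by construction.
if !slices.IsSortedFunc(subsumedDescs, func(a, b *roachpb.RangeDescriptor) int {
    return a.StartKey.Compare(b.StartKey)
}) {
    log.Fatalf(ctx, "subsumedDescs not sorted by start key: %v", subsumedDescs)
}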

// that either). Something is severely wrong for this to happen, so perform
// a sanity check.
if subsumedDescs[0].StartKey.Compare(desc.StartKey) < 0 { // subsumedDescs are sorted by StartKey
    log.Fatalf(ctx, "subsuming replica to our left; key span: %v; total key span %v",
Collaborator

The log line no longer prints key spans. Update the wording?
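
One possible rewording, purely illustrative (the exact arguments depend on what the Fatalf now passes):

// Name what is actually printed (descriptors) instead of key spans.
log.Fatalf(ctx, "subsuming replica to our left; subsumed descriptor: %v; snapshot descriptor: %v",
    subsumedDescs[0], desc)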

Comment on lines +307 to +308
snapDesc *roachpb.RangeDescriptor,
rightMostDesc *roachpb.RangeDescriptor,
Collaborator

@pav-kv pav-kv Jun 25, 2025

Out of these two descriptors, the only information we need is the two EndKeys, which form the key span that we want to clear. So we can instead just take this span and make this func straightforward:

for _, span := range rditer.Select(... {
	RSpan: <this "delta" RSpan>
}) {
	_ = span // generate SST
}

I started doing this in https://github.com/cockroachdb/cockroach/pull/147378/files#diff-dd96d448893ec224dc7a1ca0e9056a6ec9d8aa830b80e1b6a44e2e3a8eab29abR262-R266, but it fell through the cracks.
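
A short sketch of the caller-side computation this implies (hypothetical; the rditer.Select usage is left elided as in the suggestion above): the two end keys bound the span to clear, and the clearing func can take just that span.

// The "delta" between the snapshot's end key and the rightmost pre-existing
// end key is exactly what needs to be cleared; callers would only invoke the
// clearing func when this span is non-empty.
deltaSpan := roachpb.RSpan{
    Key:    snapDesc.EndKey,
    EndKey: rightMostDesc.EndKey,
}
// e.g. clearSplitReplicaDiskData(ctx, deltaSpan, ...) instead of passing both
// descriptors.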

// will also clear all subsumed replicas.
continue
}
// clearSplitReplicaDiskData clears the on disk data of any replica that has
Collaborator

I suggested below how to make this func a lower-level, straightforward primitive. Correspondingly, these comments might need to move one level up, to the caller.

// current keySpan).
rightMostDesc = input.subsumedDescs[len(input.subsumedDescs)-1]

// In the other case, where the subsumed replicas are fully contained in the
Collaborator

It likely makes sense to move the shorter/longer logic up the stack from clearSplitReplicaDiskData, so that these comments are interleaved with the checks as they happen. I made some suggestions in clearSplitReplicaDiskData on how to achieve it.

@@ -595,16 +600,20 @@ func (r *Replica) applySnapshotRaftMuLocked(
truncState: truncState,
hardState: hs,
desc: desc,
origDesc: r.descRLocked(),
Collaborator

Tripped the mutex assertion here, since mu is not locked.

Consider r.Desc(). I can see at least one call to it above, so maybe reuse it to avoid extra mutex locks?

Also, how does this work when origDesc is empty, i.e. when this snapshot just initializes our replica?

Collaborator

We probably don't need to lock in the first place. Desc is in "shMu", i.e. it is always changed with both mu and raftMu held, and is readable while holding either mutex.
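
A minimal sketch of what the two comments above suggest (assumed placement; the prepareSnapApplyInput fields are as shown in the diff excerpt):

// Read the pre-snapshot descriptor without tripping the mu assertion. r.Desc()
// locks internally; alternatively, since Desc lives in shMu and raftMu is held
// here, it should be readable without taking mu at all.
origDesc := r.Desc()
// If the replica was uninitialized, this snapshot is creating it: there is no
// wider, previously tracked keyspace on disk, so the split-clearing path can
// be skipped.
if !origDesc.IsInitialized() {
    // skip clearSplitReplicaDiskData / leave origDesc handling to the caller
}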

Successfully merging this pull request may close these issues.

kvserver: clear RHS state machine when moving past a split using a snapshot
3 participants