Skip to content
This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Allow validators to reset to the slot which matches their last voted slot #28172

Merged
merged 9 commits into from
Oct 3, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
60 changes: 28 additions & 32 deletions core/src/heaviest_subtree_fork_choice.rs
Original file line number Diff line number Diff line change
Expand Up @@ -858,37 +858,32 @@ impl HeaviestSubtreeForkChoice {
tower
.last_voted_slot_hash()
.and_then(|last_voted_slot_hash| {
let heaviest_slot_hash_on_same_voted_fork = self.best_slot(&last_voted_slot_hash);
if heaviest_slot_hash_on_same_voted_fork.is_none() {
if !tower.is_stray_last_vote() {
// Unless last vote is stray and stale, self.bast_slot(last_voted_slot) must return
// Some(_), justifying to panic! here.
// Also, adjust_lockouts_after_replay() correctly makes last_voted_slot None,
// if all saved votes are ancestors of replayed_root_slot. So this code shouldn't be
// touched in that case as well.
// In other words, except being stray, all other slots have been voted on while this
// validator has been running, so we must be able to fetch best_slots for all of
// them.
panic!(
"a bank at last_voted_slot({:?}) is a frozen bank so must have been \
added to heaviest_subtree_fork_choice at time of freezing",
last_voted_slot_hash,
)
} else {
// fork_infos doesn't have corresponding data for the stale stray last vote,
// meaning some inconsistency between saved tower and ledger.
// (newer snapshot, or only a saved tower is moved over to new setup?)
return None;
match self.is_candidate(&last_voted_slot_hash) {
Some(true) => self.best_slot(&last_voted_slot_hash),
Copy link
Contributor

@ryoqun ryoqun Oct 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: mentioned at the different review thread, but i'd prefer provable .unwrap() here for less conditions: Some(self.best_slot(&last_voted_slot_hash).unwrap()).

Some(false) => None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a fine change, but trying to minimize as many logical changes as possible.

In this new code is_candidate(&last_voted_slot_hash) == false, this returns None. This seems like it might be a deviation from the original code.

To maintain the same behavior would require that in the new code if self.is_candidate(&last_voted_slot_hash) == Some(false) then that implies inthe old code that heaviest_slot_hash_on_same_voted_fork == None in the old code, which I don't think is necessarily true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time understanding what you're suggesting that I change here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To maintain the same behavior would require that in the new code if self.is_candidate(&last_voted_slot_hash) == Some(false) then that implies in the old code that heaviest_slot_hash_on_same_voted_fork == None in the old code, which I don't think is necessarily true?

I think he's saying that this is a new condition that wasn't checked in the old code that might now return None.
I don't think this implies in the old code that heaviest_slot_hash_on_same_voted_fork needed to be None, this just requires that there is an invalid ancestor which wasn't being checked before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed that we should try to reduce the change set to the bare minimum here. we need to fix the known bug now, further hardening can wait.

kinda confusing that a comment like this would result in an approved review. seems blocking

This comment was marked as outdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is already at the bare minimum in my opinion. Take note that self.is_candidate(..).is_none() and self.best_slot(..).is_none() is the same thing because each of them is only None when the queried slot is not in fork_infos.

The only logical change is switching from filtering the returned heaviest slot by fork_info.is_candidate rather than checking if it is the same slot as the last vote. Checking is_candidate is crucial here because otherwise we would build blocks from heaviest bank even if it was a duplicate.

Copy link
Contributor Author

@jstarry jstarry Oct 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@t-nelson @carllin if you disagree on what the minimum change, please let me know. I'm also in favor of keeping the scope of this change as low as possible.

If we remove the last vote equality check and don't add the is_candidate check, then during the mainnet-beta issue, validators would have kept building off of 221 before it was determined to be confirmed. This might be a good thing actually but just wanted to call out that the last vote equality check happened to prevent this from happening.

The bug I intended to fix was that validators should've started building off of 221 after they detected it to be a confirmed duplicate, but didn't. Without is_candidate, the bug being fixed is that validators should've continued building off of 221 as soon as they started trying to switch to separate fork (225).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After more digging and discussion with @carllin, it looks like we have to add the is_candidate check after all because otherwise we could get into situations where the heaviest_slot_on_same_voted_fork was purged (because the cluster confirmed a different version) and we try to reset to a purged bank. I think that avoiding resetting to invalid forks should be avoided at all costs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that avoiding resetting to invalid forks should be avoided at all costs.

this would be separate from the condition that halted MB though, right? i'm not saying that we shouldn't harden further, just that we can run anything extra through the usual (priority) stabilization/backport process

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did agree to add the is_candidate check in the normal process but without it we expose validators to increased risk of panicking due to purged slots.

None => {
if !tower.is_stray_last_vote() {
// Unless last vote is stray and stale, self.is_candidate(last_voted_slot_hash) must return
// Some(_), justifying to panic! here.
// Also, adjust_lockouts_after_replay() correctly makes last_voted_slot None,
// if all saved votes are ancestors of replayed_root_slot. So this code shouldn't be
// touched in that case as well.
// In other words, except being stray, all other slots have been voted on while this
// validator has been running, so we must be able to fetch best_slots for all of
// them.
panic!(
"a bank at last_voted_slot({:?}) is a frozen bank so must have been \
added to heaviest_subtree_fork_choice at time of freezing",
last_voted_slot_hash,
)
} else {
// fork_infos doesn't have corresponding data for the stale stray last vote,
// meaning some inconsistency between saved tower and ledger.
// (newer snapshot, or only a saved tower is moved over to new setup?)
None
}
}
}
let heaviest_slot_hash_on_same_voted_fork =
heaviest_slot_hash_on_same_voted_fork.unwrap();

if heaviest_slot_hash_on_same_voted_fork == last_voted_slot_hash {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(note to self; still reviewing; i wonder removing this could result in double-voting on the same slot_hash? if that happens, vote program does misbehave or not?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid concern. This method is only used for resetting to a new poh bank, it's not used to determine which block to vote for

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, seem so. I'm saying this after having dug refresh_last_vote() code path...

None
} else {
Some(heaviest_slot_hash_on_same_voted_fork)
}
})
}

Expand Down Expand Up @@ -2957,9 +2952,10 @@ mod test {

// After marking the last vote in the tower as invalid, `heaviest_slot_on_same_voted_fork()`
// should disregard all descendants of that invalid vote
assert!(heaviest_subtree_fork_choice
.heaviest_slot_on_same_voted_fork(&tower)
.is_none());
assert_eq!(
heaviest_subtree_fork_choice.heaviest_slot_on_same_voted_fork(&tower),
None
);

// Adding another descendant to the invalid candidate won't
// update the best slot, even if it contains votes
Expand Down
151 changes: 146 additions & 5 deletions core/src/replay_stage.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2901,9 +2901,18 @@ impl ReplayStage {
);
}

// Given a heaviest bank, `heaviest_bank` and the next votable bank
// `heaviest_bank_on_same_voted_fork` as the validator's last vote, return
// a bank to vote on, a bank to reset to,
/// Given a `heaviest_bank` and a `heaviest_bank_on_same_voted_fork`, return
/// a bank to vote on, a bank to reset to, and a list of switch failure
/// reasons.
///
/// If `heaviest_bank_on_same_voted_fork` is `None` due to that fork no
/// longer being valid to vote on, it's possible that a validator will not
/// be able to reset away from the invalid fork that they last voted on. To
/// resolve this scenario, validators need to wait until they can create a
/// switch proof for another fork or until the invalid fork is be marked
/// valid again if it was confirmed by the cluster.
/// Until this is resolved, leaders will build each of their
/// blocks from the last reset bank on the invalid fork.
Comment on lines +2908 to +2915
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

pub fn select_vote_and_reset_forks(
heaviest_bank: &Arc<Bank>,
// Should only be None if there was no previous vote
Expand Down Expand Up @@ -5605,6 +5614,139 @@ pub(crate) mod tests {
);
}

#[test]
fn test_unconfirmed_duplicate_slots_and_lockouts_for_non_heaviest_fork() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ textbook-level quite readable unit test.

/*
Build fork structure:

slot 0
|
slot 1
/ \
slot 2 |
| |
slot 3 |
| |
slot 4 |
slot 5
*/
let forks = tr(0) / (tr(1) / (tr(2) / (tr(3) / (tr(4)))) / tr(5));

let mut vote_simulator = VoteSimulator::new(1);
vote_simulator.fill_bank_forks(forks, &HashMap::<Pubkey, Vec<u64>>::new(), true);
let (bank_forks, mut progress) = (vote_simulator.bank_forks, vote_simulator.progress);
let ledger_path = get_tmp_ledger_path!();
let blockstore = Arc::new(
Blockstore::open(&ledger_path).expect("Expected to be able to open database ledger"),
);
let mut tower = Tower::new_for_tests(8, 2.0 / 3.0);

// All forks have same weight so heaviest bank to vote/reset on should be the tip of
// the fork with the lower slot
let (vote_fork, reset_fork) = run_compute_and_select_forks(
&bank_forks,
&mut progress,
&mut tower,
&mut vote_simulator.heaviest_subtree_fork_choice,
&mut vote_simulator.latest_validator_votes_for_frozen_banks,
);
assert_eq!(vote_fork.unwrap(), 4);
assert_eq!(reset_fork.unwrap(), 4);
Comment on lines +5653 to +5654
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can just resort to () for adhoc type:

assert_eq!((vote_fork, reset_fork), (Some(4), Some(4));

this should give most helpful assert.

...in other words, rust rocks

same for below this paired asserts.


// Record the vote for 5 which is not on the heaviest fork.
tower.record_bank_vote(
&bank_forks.read().unwrap().get(5).unwrap(),
&Pubkey::default(),
);

// 4 should be the heaviest slot, but should not be votable
// because of lockout. 5 is the heaviest slot on the same fork as the last vote.
let (vote_fork, reset_fork) = run_compute_and_select_forks(
&bank_forks,
&mut progress,
&mut tower,
&mut vote_simulator.heaviest_subtree_fork_choice,
&mut vote_simulator.latest_validator_votes_for_frozen_banks,
);
assert!(vote_fork.is_none());
assert_eq!(reset_fork, Some(5));

// Mark 5 as duplicate
blockstore.store_duplicate_slot(5, vec![], vec![]).unwrap();
let mut duplicate_slots_tracker = DuplicateSlotsTracker::default();
let mut gossip_duplicate_confirmed_slots = GossipDuplicateConfirmedSlots::default();
let mut epoch_slots_frozen_slots = EpochSlotsFrozenSlots::default();
let bank5_hash = bank_forks.read().unwrap().bank_hash(5).unwrap();
assert_ne!(bank5_hash, Hash::default());
let duplicate_state = DuplicateState::new_from_state(
5,
&gossip_duplicate_confirmed_slots,
&mut vote_simulator.heaviest_subtree_fork_choice,
|| progress.is_dead(5).unwrap_or(false),
|| Some(bank5_hash),
);
let (ancestor_hashes_replay_update_sender, _ancestor_hashes_replay_update_receiver) =
unbounded();
check_slot_agrees_with_cluster(
5,
bank_forks.read().unwrap().root(),
&blockstore,
&mut duplicate_slots_tracker,
&mut epoch_slots_frozen_slots,
&mut vote_simulator.heaviest_subtree_fork_choice,
&mut DuplicateSlotsToRepair::default(),
&ancestor_hashes_replay_update_sender,
SlotStateUpdate::Duplicate(duplicate_state),
);

// 4 should be the heaviest slot, but should not be votable
// because of lockout. 5 is no longer valid due to it being a duplicate.
let (vote_fork, reset_fork) = run_compute_and_select_forks(
&bank_forks,
&mut progress,
&mut tower,
&mut vote_simulator.heaviest_subtree_fork_choice,
&mut vote_simulator.latest_validator_votes_for_frozen_banks,
);
assert!(vote_fork.is_none());
assert!(reset_fork.is_none());

// If slot 5 is marked as confirmed, it becomes the heaviest bank on same slot again
let mut duplicate_slots_to_repair = DuplicateSlotsToRepair::default();
gossip_duplicate_confirmed_slots.insert(5, bank5_hash);
let duplicate_confirmed_state = DuplicateConfirmedState::new_from_state(
bank5_hash,
|| progress.is_dead(5).unwrap_or(false),
|| Some(bank5_hash),
);
check_slot_agrees_with_cluster(
5,
bank_forks.read().unwrap().root(),
&blockstore,
&mut duplicate_slots_tracker,
&mut epoch_slots_frozen_slots,
&mut vote_simulator.heaviest_subtree_fork_choice,
&mut duplicate_slots_to_repair,
&ancestor_hashes_replay_update_sender,
SlotStateUpdate::DuplicateConfirmed(duplicate_confirmed_state),
);
// The confirmed hash is detected in `progress`, which means
// it's confirmation on the replayed block. This means we have
// the right version of the block, so `duplicate_slots_to_repair`
// should be empty
assert!(duplicate_slots_to_repair.is_empty());
let (vote_fork, reset_fork) = run_compute_and_select_forks(
&bank_forks,
&mut progress,
&mut tower,
&mut vote_simulator.heaviest_subtree_fork_choice,
&mut vote_simulator.latest_validator_votes_for_frozen_banks,
);
// Should now pick 5 as the heaviest fork from last vote again.
assert!(vote_fork.is_none());
assert_eq!(reset_fork.unwrap(), 5);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also would fail here if it didn't get stuck on the previous failure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice gud test I like

}

#[test]
fn test_unconfirmed_duplicate_slots_and_lockouts() {
/*
Expand Down Expand Up @@ -5697,7 +5839,7 @@ pub(crate) mod tests {
&mut vote_simulator.latest_validator_votes_for_frozen_banks,
);
assert!(vote_fork.is_none());
assert_eq!(reset_fork.unwrap(), 3);
assert_eq!(reset_fork, Some(3));

// Now mark 2, an ancestor of 4, as duplicate
blockstore.store_duplicate_slot(2, vec![], vec![]).unwrap();
Expand Down Expand Up @@ -6778,7 +6920,6 @@ pub(crate) mod tests {
);
let (heaviest_bank, heaviest_bank_on_same_fork) = heaviest_subtree_fork_choice
.select_forks(&frozen_banks, tower, progress, ancestors, bank_forks);
assert!(heaviest_bank_on_same_fork.is_none());
let SelectVoteAndResetForkResult {
vote_bank,
reset_bank,
Expand Down