Conversation

Contributor

@yngvar-antonsson yngvar-antonsson commented Jun 6, 2025

When Raft failover is enabled on replicasets with fewer than 3 instances (typically routers), a bug occurs in leader selection:

  • Leader is changed in the replicaset

  • Router learns about new leader via membership

  • Config update triggers failover.cfg

  • Router incorrectly reverts to the old leader (first lexicographical instance) as if in 'disabled' mode

This causes vshard to return NON_MASTER errors, since the wrong storage instance ends up appointed as master.
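To make the faulty path concrete, here is a minimal sketch (hypothetical and simplified, not the actual cartridge internals) of the 'disabled'-mode fallback the router wrongly applied:

-- Hypothetical sketch of the 'disabled'-mode fallback the router fell into.
-- It ignores the Raft leader learned via membership and simply appoints the
-- first instance in lexicographical order.
local function disabled_mode_appointments(replicasets)
    local appointments = {}
    for replicaset_uuid, instance_names in pairs(replicasets) do
        table.sort(instance_names) -- lexicographical order
        appointments[replicaset_uuid] = instance_names[1]
    end
    return appointments
end

-- Even after storage-2 was promoted via box.ctl.promote(), this still
-- returns storage-1, so vshard requests hit a replica and fail with NON_MASTER.
print(disabled_mode_appointments({
    ['storage-uuid'] = {'storage-2', 'storage-1'},
})['storage-uuid']) --> storage-1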

I didn't forget about

  • Tests
  • Changelog
  • Documentation

Closes https://jira.vk.team/browse/TNT-1415

@yngvar-antonsson yngvar-antonsson requested a review from Copilot June 6, 2025 15:07

Copilot AI left a comment


Pull Request Overview

This PR fixes incorrect initial leader appointments during Raft failover when a replicaset has fewer than three instances, and adds a test to verify leader selection history under Raft mode.

  • Introduce get_leaders_history in the test service to capture applied leader snapshots.
  • Enhance the Raft failover logic to appoint a leader in small clusters on initial configuration.
  • Replace disabled-mode appointment logic with the unified Raft get_appointments and add integration tests.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

File                              Description
test/integration/raft_test.lua    Add raft_leaders_history test group and a new test
test/entrypoint/srv_basic.lua     Track and expose leaders_history in the test module
cartridge/failover/raft.lua       Insert fallback leader appointment for small clusters
cartridge/failover.lua            Use raft_failover.get_appointments in disabled mode
CHANGELOG.rst                     Document “Invalid leader appointment in Raft failover”
Comments suppressed due to low confidence (3)

cartridge/failover.lua:961

  • Ensure that raft_failover is properly imported (e.g. local raft_failover = require('cartridge.failover.raft')) at the top of this file; otherwise this call will result in a nil-reference error.
first_appointments = raft_failover.get_appointments(topology_cfg)
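For reference, a minimal sketch of the wiring this comment asks about (the exact placement in cartridge/failover.lua is an assumption; only the get_appointments call is shown in the diff):

-- Sketch: the module must be required once near the top of cartridge/failover.lua;
-- without it the call below would fail with a nil-reference error.
local raft_failover = require('cartridge.failover.raft')

-- ... later, in the disabled-mode branch:
first_appointments = raft_failover.get_appointments(topology_cfg)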

test/integration/raft_test.lua:345

  • Verify that the Cluster helper defines wait_until_healthy(). If not, calling this method will raise an undefined-method error in the test.
g.cluster:wait_until_healthy()

test/entrypoint/srv_basic.lua:35

  • [nitpick] The leaders_history table persists across service restarts and tests. Consider clearing it in a stop or init hook to avoid cross-test contamination.
local leaders_history = {}
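One way to address the nitpick, assuming leaders_history is a module-local table in srv_basic.lua, is to wipe it in the role's stop callback (a sketch, not the actual test code):

-- Sketch: clear the accumulated history when the test service stops,
-- so one test run cannot contaminate the next.
local leaders_history = {}

local function stop()
    -- wipe in place so any existing references also see an empty history
    for k in pairs(leaders_history) do
        leaders_history[k] = nil
    end
end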

appointments[vars.replicaset_uuid] = vars.leader_uuid
-- If there is still no leader in the replicaset, appoint the first
-- electable instance as a leader.
for replicaset_uuid, _ in pairs(replicasets) do

Copilot AI Jun 6, 2025


The threshold '3' for small clusters is a magic number; consider extracting it into a named constant (e.g. MIN_RAFT_REPLICASET_SIZE) or a config parameter for clarity and easier adjustment.
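For example (a sketch; MIN_RAFT_REPLICASET_SIZE is a suggested name, not an existing identifier in the codebase):

-- Sketch: lift the magic number into a named module-level constant.
local MIN_RAFT_REPLICASET_SIZE = 3

if vars.disable_raft_on_small_clusters
and #leaders < MIN_RAFT_REPLICASET_SIZE then
    appointments[replicaset_uuid] = leaders[1]
end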

Comment on lines 64 to 72
    if appointments[replicaset_uuid] == nil then
        local leaders = topology.get_leaders_order(
            topology_cfg, replicaset_uuid, nil,
            {only_electable = false, only_enabled = true}
        )
        if vars.disable_raft_on_small_clusters and #leaders < 3 then
            appointments[replicaset_uuid] = leaders[1]
        end
    end
end
Contributor

we need another test for this case
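A rough sketch of what such a test could look like in luatest (the group name, cluster helpers, and get_active_leaders usage mirror existing cartridge test patterns, but their exact availability here is an assumption):

local t = require('luatest')
-- g is the raft test group defined earlier in raft_test.lua (assumption)

-- Sketch: after initial configuration, a small replicaset should still
-- get some leader appointed instead of being left leaderless.
g.test_small_replicaset_initial_appointment = function()
    g.cluster:wait_until_healthy()
    local leaders = g.cluster.main_server:exec(function()
        return require('cartridge.failover').get_active_leaders()
    end)
    t.assert_not_equals(next(leaders), nil)
end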

-- If there is still no leader in the replicaset, appoint the first
-- electable instance as a leader.
for replicaset_uuid, _ in pairs(replicasets) do
    if appointments[replicaset_uuid] == nil then
Contributor

failover.disable doesn't change the membership payload.

If we disable an instance, its raft_term and leader_uuid remain the same.

If the replicaset previously had instances running Raft, they keep the same raft_term and leader_uuid, so the old leader is still reported as it was under Raft.

Therefore, it's better not to check for nil.
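In other words, a disabled instance's membership payload might still carry stale Raft fields, roughly like this (an illustrative shape, not the literal payload format):

-- Illustrative payload of a disabled instance: the Raft fields persist,
-- so `appointments[replicaset_uuid] == nil` is not a reliable signal.
local payload = {
    state = 'Disabled',
    raft_term = 5,                           -- unchanged after failover.disable
    leader_uuid = 'uuid-of-the-old-leader',  -- still points at the pre-disable leader
}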

@Satbek Satbek force-pushed the incorrect-appointments branch from 6d2137e to 5027c9d on June 10, 2025 15:07
@a1div0 a1div0 self-requested a review June 11, 2025 05:43
@Satbek Satbek force-pushed the incorrect-appointments branch from 4599156 to 887c75c on June 11, 2025 06:26
Satbek added 2 commits June 11, 2025 11:43
Reproducer:
1. Start a cluster with one router and a storage replicaset of three instances
   (storage-1 is the initial leader).
2. Call `box.ctl.promote()` on storage-2.
3. Wait until the router learns the new leader via membership.
4. Patch the cluster-wide config to trigger `failover.cfg`.
5. The router reverts to the old master because it is lexicographically first
   in the topology, mimicking ‘disabled’ mode.

Fix: always build `first_appointments` via
`raft_failover.get_appointments(topology_cfg)` so routers do not fall back to
stale Raft information when the replicaset has fewer than three instances.
…ments in Raft mode

If `disable_raft_on_small_clusters` is enabled and a replicaset contains fewer
than three instances, treat it like ‘disabled’ failover and pick the first
enabled instance (even a non-electable one) as leader.
@Satbek Satbek force-pushed the incorrect-appointments branch from 887c75c to 4532fb9 on June 11, 2025 06:51
@Satbek Satbek merged commit 100105d into master Jun 11, 2025
22 checks passed
@Satbek Satbek deleted the incorrect-appointments branch June 11, 2025 07:19