Split-brain when etcd connectivity is fragmented: routers write to different masters after failover #2342

@Satbek

Description

Failover mode: stateful with an etcd v3 backend
Replication: synchronous

Problem description

Under certain network partitions a router keeps an outdated leader map and continues to write to the former master, while other routers have already switched to the new master elected on the etcd quorum side. As a result, two masters accept writes concurrently → split-brain and a risk of data loss.

How to reproduce

  1. Topology
    • one replicaset: 3 storages (storage-1, storage-2, storage-3)
    • 2 routers (router-1, router-2)
    • 3-node etcd cluster (etcd-A, etcd-B, etcd-C)
    • synchronous replication enabled
  2. Create a network partition
    • Break traffic between storage-1 + router-1 and two of the etcd nodes, leaving them connected only to etcd-A.
    • Then break traffic between etcd-A and the remaining quorum (etcd-B, etcd-C).
      Result: router-1 and the current master storage-1 can see only a single etcd node (etcd-A), and that node has lost quorum.
  3. Trigger master switch (from the healthy side):
         membership.events.generate('storage-1-uri', membership.options.DEAD)
    The majority side (storage-2, storage-3 and router-2, which can still reach the quorum) promotes a new master.
  4. Observe (a console sketch for this step follows the list)
    • the synchronous queue on storage-1 is abandoned, but the instance keeps read_only = false;
    • router-1, which lost quorum, still sees storage-1 as the leader and sends writes there;
    • router-2, which has quorum, switches to storage-2 or storage-3.
      Two independent masters are now serving writes.
      (The same scenario without synchronous replication has not yet been checked.)
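
A minimal console sketch of the observation step, assuming Tarantool ≥ 2.10 with synchronous replication; who_am_i is a hypothetical helper defined on every storage purely for illustration, and BUCKET_ID stands for any existing bucket id:

    -- Storage side: hypothetical helper so a router can report which instance
    -- actually served its read-write call.
    function who_am_i()
        return {uuid = box.info.uuid, listen = box.cfg.listen, ro = box.info.ro}
    end

    -- On storage-1 (minority side) and the newly promoted master (majority side):
    box.info.ro              -- observed: false on BOTH instances
    box.info.synchro.queue   -- storage-1 still holds the abandoned synchronous queue

    -- On router-1 and router-2:
    vshard.router.callrw(BUCKET_ID, 'who_am_i', {}, {timeout = 5})
    -- router-1 answers from storage-1, router-2 from storage-2/storage-3:
    -- two different instances are serving read-write traffic.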

Behaviour after the storage / router regain access to the etcd quorum

Once storage-1 and router-1 can reach etcd-B/etcd-C again, the cluster still does not converge:

  • storage-1 keeps read_only = false and retains the abandoned synchronous queue (it still thinks it is the leader).
  • router-1 continues to read the stale leader map and sends writes to storage-1.
  • router-2 (and the other storages) still direct writes to storage-2 (or storage-3).

So two masters remain active indefinitely; manual intervention is required to restore a single writable leader.
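
For reference, a hedged sketch of what that manual intervention could look like (the exact steps are not prescribed by this issue and depend on the setup; this assumes Tarantool ≥ 2.10 synchronous replication):

    -- On the stale master storage-1: fence it by hand so it stops accepting writes
    box.cfg{read_only = true}   -- make the instance read-only
    box.ctl.demote()            -- give up ownership of the synchronous queue

    -- On router-1: stop routing until it can see the etcd quorum again,
    -- then let it rebuild the leader map
    vshard.router.disable()
    -- ... verify connectivity to etcd-B / etcd-C ...
    vshard.router.enable()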

Expected behaviour

Routers (and storages) should never proceed with a leader map that cannot be confirmed by the etcd quorum; they should either converge to the same master or refuse writes until they can obtain a consistent view.

Actual behaviour

The leader map is read from a stale, minority etcd node, so different routers disagree on the master and a split-brain occurs.

Possible directions / open questions

  1. Quorum (linearizable) reads for active_leaders.
    Always fetch the leader map with an etcd linearizable read to avoid stale values held by a partitioned minority node.
  2. Add network partition tests.
    Cover scenarios with asymmetric connectivity and verify that all instances either agree on a single master or disable writes.
  3. Router self-protection.
    If a router cannot obtain the leader map with quorum, it should temporarily disable itself (vshard.router.disable()) to prevent writes to an obsolete master; a rough sketch of this and of item 4 follows the list.
  4. Storage fencing fallback.
    A storage that loses etcd connectivity during fencing should immediately switch to read_only = true, regardless of replica connectivity, until it regains quorum.
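
A rough sketch of ideas 3 and 4, assuming a fiber-based watchdog on each instance; leader_map_with_quorum() and has_etcd_quorum() are hypothetical stand-ins for a linearizable etcd read of active_leaders (idea 1) and a quorum health check, whose implementation depends on the etcd client in use:

    local fiber = require('fiber')
    local vshard = require('vshard')

    -- Hypothetical helpers: a linearizable read of active_leaders and a check
    -- that this instance can currently reach the etcd quorum. Both depend on
    -- the etcd client in use and return nil / false when quorum is unreachable.
    local function leader_map_with_quorum() --[[ etcd-client specific ]] end
    local function has_etcd_quorum() --[[ etcd-client specific ]] end

    -- Idea 3 (runs on a router): if the leader map cannot be confirmed by the
    -- quorum, disable the router instead of routing writes with a stale map.
    local router_disabled = false
    fiber.create(function()
        while true do
            if leader_map_with_quorum() == nil then
                if not router_disabled then
                    vshard.router.disable()
                    router_disabled = true
                end
            elseif router_disabled then
                vshard.router.enable()
                router_disabled = false
            end
            fiber.sleep(1)
        end
    end)

    -- Idea 4 (runs on a storage): a writable master that loses the etcd quorum
    -- goes read-only at once, regardless of replica connectivity.
    fiber.create(function()
        while true do
            if not box.info.ro and not has_etcd_quorum() then
                box.cfg{read_only = true}
            end
            fiber.sleep(1)
        end
    end)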
