Description
Failover mode: stateful with an etcd v3 backend
Replication: synchronous
Problem description
Under certain network partitions, a router keeps an outdated leader map and continues to write to the former master, while other routers have already switched to the new master elected by the etcd quorum. As a result, two masters accept writes concurrently → split-brain and a risk of data loss.
How to reproduce
- Topology
- one replicaset: 3 storages (storage-1, storage-2, storage-3)
- 2 routers (router-1, router-2)
- 3-node etcd cluster (etcd-A, etcd-B, etcd-C)
- synchronous replication enabled
- Create a network partition
- Break traffic between storage-1 + router-1 and two of the etcd nodes, leaving them connected only to etcd-A.
- Then break traffic between etcd-A and the remaining quorum (etcd-B, etcd-C).
Result: router-1 and the current master storage-1 see only one etcd node (etcd-A), which has lost quorum.
- Trigger master switch (from the healthy side):
The majority side (storage-2, storage-3 and router-2, which still reach the etcd quorum) promotes a new master:
membership.events.generate('storage-1-uri', membership.options.DEAD)
- Observe
- the synchronous queue on storage-1 is abandoned, but it keeps `read_only = false`;
- router-1, which lost quorum, still sees storage-1 as leader and sends writes there;
- router-2, which still has quorum, switches to storage-2 or storage-3.
Two independent masters are now serving writes.
(The same scenario without synchronous replication has not yet been checked.)
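One way to confirm the two-master state from the instance consoles (a minimal sketch; the exact fields returned depend on the Tarantool and vshard versions in use):

```lua
-- In the console of each storage: both the stale master (storage-1) and
-- the newly promoted master (storage-2 or storage-3) report being writable.
box.info.ro  -- false on both masters at the same time

-- In the console of each router: the master reported for the replicaset
-- differs between router-1 and router-2.
vshard.router.info()
```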
Behaviour after storage / router regain access to the etcd quorum
Once storage-1 and router-1 can reach etcd-B/etcd-C again, the cluster still does not converge:
- storage-1 keeps `read_only = false` and retains the abandoned synchronous queue (it still thinks it is the leader);
- router-1 continues to read the stale leader map and sends writes to storage-1;
- router-2 (and the other storages) still direct writes to storage-2 (or storage-3).
So two masters remain active indefinitely; manual intervention is required to restore a single writable leader.
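One possible shape of that manual intervention, as a sketch only (it assumes direct console access to the instances, no interference from other failover automation, and that the vshard version in use provides the paired `vshard.router.enable()` call):

```lua
-- On storage-1, the stale master: demote it explicitly so it stops
-- accepting writes.
box.cfg{read_only = true}

-- On router-1: block routing until it has picked up a consistent leader
-- map, then allow requests again.
vshard.router.disable()
-- ... wait for the leader map / configuration to converge ...
vshard.router.enable()
```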
Expected behaviour
Routers (and storages) should never proceed with a leader map that cannot be confirmed by the etcd quorum; they should either converge to the same master or refuse writes until they can obtain a consistent view.
Actual behaviour
The leader map is read from a stale minority etcd node, so different routers disagree on the master and split-brain occurs.
Possible directions / open questions
- Quorum (linearizable) reads for active_leaders.
Always fetch the leader map with an etcd linearizable read to avoid stale values held by a partitioned minority node.
- Add network partition tests.
Cover scenarios with asymmetric connectivity and verify that all instances either agree on a single master or disable writes.
- Router self-protection.
If a router cannot obtain the leader map with quorum, it should temporarily disable itself (`vshard.router.disable()`) to prevent writes to an obsolete master; see the sketch after this list.
- Storage fencing fallback.
A storage that loses etcd connectivity during fencing should immediately switch to `read_only = true`, regardless of replica connectivity, until it regains quorum.
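To make the router self-protection idea concrete, here is a rough sketch (not the vshard implementation) of a background fiber that disables the router whenever a quorum-confirmed leader map cannot be obtained; `fetch_leader_map_quorum()` is a hypothetical placeholder standing in for a linearizable etcd read of `active_leaders`:

```lua
local fiber = require('fiber')
local vshard = require('vshard')

-- Hypothetical helper: performs a linearizable (quorum) etcd read of the
-- active_leaders key and returns nil when the quorum cannot be reached.
local function fetch_leader_map_quorum()
    return nil -- placeholder
end

local router_disabled = false

fiber.create(function()
    while true do
        local leaders = fetch_leader_map_quorum()
        if leaders == nil and not router_disabled then
            -- No quorum-confirmed leader map: refuse to route requests
            -- instead of keeping writes flowing to a possibly stale master.
            vshard.router.disable()
            router_disabled = true
        elseif leaders ~= nil and router_disabled then
            vshard.router.enable()
            router_disabled = false
        end
        fiber.sleep(1)
    end
end)
```

The storage-side fencing fallback would be the analogous check on the storage itself, calling `box.cfg{read_only = true}` when it loses the etcd quorum, regardless of replica connectivity.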