Split-brain when etcd connectivity is fragmented: routers write to different masters after failover #2342

@Satbek

Description

Failover mode: stateful with an etcd v3 backend
Replication: synchronous

Problem description

Under certain network partitions a router keeps an outdated leader map and continues to write to the former master, while other routers have already switched to the new master elected on the etcd quorum side. As a result, two masters accept writes concurrently → split-brain and a risk of data loss.

How to reproduce

  1. Topology
    • one replicaset: 3 storages (storage-1, storage-2, storage-3)
    • 2 routers (router-1, router-2)
    • 3-node etcd cluster (etcd-A, etcd-B, etcd-C)
    • synchronous replication enabled
  2. Create a network partition
    • Break traffic between storage-1 + router-1 and two of the etcd nodes, leaving them connected only to etcd-A.
    • Then break traffic between etcd-A and the remaining quorum (etcd-B, etcd-C).
      Result: router-1 and the current master storage-1 can see only a single etcd node (etcd-A), and that node has lost quorum.
  3. Trigger master switch (from the healthy side):
         membership.events.generate('storage-1-uri', membership.options.DEAD)
    The majority side (storage-2, storage-3 and router-2, which can still reach the quorum) promotes a new master.
  4. Observe (a console sketch for this step follows the list)
    • the synchronous queue on storage-1 is abandoned, but the instance keeps read_only = false;
    • router-1, which lost quorum, still sees storage-1 as the leader and sends writes there;
    • router-2, which has quorum, switches to storage-2 or storage-3.
      Two independent masters are now serving writes.
      (The same scenario without synchronous replication has not yet been checked.)
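
A minimal console sketch of the observation step, assuming Tarantool ≥ 2.10 with synchronous replication; who_am_i is a hypothetical helper defined on every storage purely for illustration, and BUCKET_ID stands for any existing bucket id:

    -- Storage side: hypothetical helper so a router can report which instance
    -- actually served its read-write call.
    function who_am_i()
        return {uuid = box.info.uuid, listen = box.cfg.listen, ro = box.info.ro}
    end

    -- On storage-1 (minority side) and the newly promoted master (majority side):
    box.info.ro              -- observed: false on BOTH instances
    box.info.synchro.queue   -- storage-1 still holds the abandoned synchronous queue

    -- On router-1 and router-2:
    vshard.router.callrw(BUCKET_ID, 'who_am_i', {}, {timeout = 5})
    -- router-1 answers from storage-1, router-2 from storage-2/storage-3:
    -- two different instances are serving read-write traffic.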

Behaviour after the storage / router regain access to the etcd quorum

Once storage-1 and router-1 can reach etcd-B/etcd-C again, the cluster still does not converge:

  • storage-1 keeps read_only = false and retains the abandoned synchronous queue (it still thinks it is the leader).
  • router-1 continues to read the stale leader map and sends writes to storage-1.
  • router-2 (and the other storages) still direct writes to storage-2 (or storage-3).

So two masters remain active indefinitely; manual intervention is required to restore a single writable leader.
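
For reference, a hedged sketch of what that manual intervention could look like (the exact steps are not prescribed by this issue and depend on the setup; this assumes Tarantool ≥ 2.10 synchronous replication):

    -- On the stale master storage-1: fence it by hand so it stops accepting writes
    box.cfg{read_only = true}   -- make the instance read-only
    box.ctl.demote()            -- give up ownership of the synchronous queue

    -- On router-1: stop routing until it can see the etcd quorum again,
    -- then let it rebuild the leader map
    vshard.router.disable()
    -- ... verify connectivity to etcd-B / etcd-C ...
    vshard.router.enable()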

Expected behaviour

Routers (and storages) should never proceed with a leader map that cannot be confirmed by the etcd quorum; they should either converge to the same master or refuse writes until they can obtain a consistent view.

Actual behaviour

The leader map is read from a stale, minority etcd node, so different routers disagree on the master and a split-brain occurs.

Possible directions / open questions

  1. Quorum (linearizable) reads for active_leaders.
    Always fetch the leader map with an etcd linearizable read to avoid stale values held by a partitioned minority node.
  2. Add network partition tests.
    Cover scenarios with asymmetric connectivity and verify that all instances either agree on a single master or disable writes.
  3. Router self-protection.
    If a router cannot obtain the leader map with quorum, it should temporarily disable itself (vshard.router.disable()) to prevent writes to an obsolete master; a rough sketch of this and of item 4 follows the list.
  4. Storage fencing fallback.
    A storage that loses etcd connectivity during fencing should immediately switch to read_only = true, regardless of replica connectivity, until it regains quorum.
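
A rough sketch of ideas 3 and 4, assuming a fiber-based watchdog on each instance; leader_map_with_quorum() and has_etcd_quorum() are hypothetical stand-ins for a linearizable etcd read of active_leaders (idea 1) and a quorum health check, whose implementation depends on the etcd client in use:

    local fiber = require('fiber')
    local vshard = require('vshard')

    -- Hypothetical helpers: a linearizable read of active_leaders and a check
    -- that this instance can currently reach the etcd quorum. Both depend on
    -- the etcd client in use and return nil / false when quorum is unreachable.
    local function leader_map_with_quorum() --[[ etcd-client specific ]] end
    local function has_etcd_quorum() --[[ etcd-client specific ]] end

    -- Idea 3 (runs on a router): if the leader map cannot be confirmed by the
    -- quorum, disable the router instead of routing writes with a stale map.
    local router_disabled = false
    fiber.create(function()
        while true do
            if leader_map_with_quorum() == nil then
                if not router_disabled then
                    vshard.router.disable()
                    router_disabled = true
                end
            elseif router_disabled then
                vshard.router.enable()
                router_disabled = false
            end
            fiber.sleep(1)
        end
    end)

    -- Idea 4 (runs on a storage): a writable master that loses the etcd quorum
    -- goes read-only at once, regardless of replica connectivity.
    fiber.create(function()
        while true do
            if not box.info.ro and not has_etcd_quorum() then
                box.cfg{read_only = true}
            end
            fiber.sleep(1)
        end
    end)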
