[upgrade] Phase 3: HA replica-failover rolling upgrade (RPO=0, clustered) #392

Description

@ELares

Part of the upgrade epic, Phase 3: the RPO=0, zero-downtime upgrade for CLUSTERED deployments, built on the raft and replication layers that already exist.

Rolling node-by-node upgrade (the etcd/Consul/ElastiCache/Redis pattern): first upgrade an in-sync REPLICA (it catches up from the primary), then PROMOTE it, then upgrade the old primary as a replica. Ownership moves only via a committed raft log entry plus a monotonic epoch; this fence IS the synchronous promotion, so no acknowledged write is lost. Clients redirect on failover; the dataset is never down.
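To make the ordering concrete, here is a minimal in-memory sketch of the sequence described above. It is not ironcache code: `Node`, `Cluster`, `rolling_upgrade`, and the log strings are hypothetical names, and the raft commit is reduced to a role swap plus an epoch bump performed in one step.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    version: str
    role: str  # "primary" or "replica"


@dataclass
class Cluster:
    nodes: Dict[str, Node]
    epoch: int = 1  # monotonic ownership epoch (assumed to start at 1)
    log: List[str] = field(default_factory=list)

    def primary(self) -> Node:
        return next(n for n in self.nodes.values() if n.role == "primary")

    def upgrade(self, name: str, version: str) -> None:
        node = self.nodes[name]
        # The primary is never upgraded in place; it must be demoted first.
        if node.role != "replica":
            raise RuntimeError(f"{name} is not a replica")
        node.version = version
        self.log.append(f"upgrade {name}")

    def promote(self, name: str) -> None:
        # Stand-in for the committed PromoteReplica entry: the role swap and
        # the epoch bump happen together, so the old primary is fenced.
        old = self.primary()
        old.role = "replica"
        self.nodes[name].role = "primary"
        self.epoch += 1
        self.log.append(f"promote {name} epoch={self.epoch}")


def rolling_upgrade(cluster: Cluster, version: str) -> None:
    old_primary = cluster.primary().name
    replicas = sorted(n.name for n in cluster.nodes.values() if n.role == "replica")
    if not replicas:
        raise RuntimeError("no replica to fail over to")
    for name in replicas:                   # 1. upgrade replicas first
        cluster.upgrade(name, version)
    cluster.promote(replicas[0])            # 2. promote an upgraded replica
    cluster.upgrade(old_primary, version)   # 3. old primary goes last


cluster = Cluster({
    "a": Node("a", "1.0", "primary"),
    "b": Node("b", "1.0", "replica"),
    "c": Node("c", "1.0", "replica"),
})
rolling_upgrade(cluster, "1.1")
print(cluster.log)
# ['upgrade b', 'upgrade c', 'promote b epoch=2', 'upgrade a']
```

The sketch leaves out catch-up and the in-sync wait; it only shows that the old primary is touched after ownership has moved under a higher epoch.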

Scope: an ironcache upgrade mode/flag that orchestrates the cluster-aware rolling upgrade: drive the replica upgrade via the spine, wait for in-sync, trigger the committed PromoteReplica, verify, then upgrade the old primary. Guardrails: refuse to promote a replica that is not in sync (the lag gate) and require quorum. NOT applicable to the single-node prod box without first standing up a replica (raft mode is a boot-only cluster decision). Async replication has a small loss window, bounded by the in-sync gate; document it.
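The two guardrails could be sketched as a single pre-promotion check. Everything here is an assumption for illustration: `ReplicaStatus`, `has_quorum`, `check_promotion`, and the idea of measuring lag as a difference of log indices are not taken from ironcache. With `max_lag=0` the gate requires the replica to be fully caught up; a non-zero value would correspond to the bounded loss window mentioned for async replication.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReplicaStatus:
    name: str
    applied_index: int  # last log index the replica has applied (assumed metric)


def has_quorum(healthy_voters: int, total_voters: int) -> bool:
    # Strict majority of the voting members.
    return healthy_voters >= total_voters // 2 + 1


def check_promotion(
    replica: ReplicaStatus,
    primary_commit_index: int,
    healthy_voters: int,
    total_voters: int,
    max_lag: int = 0,
) -> List[str]:
    """Return the reasons to refuse promotion; an empty list means proceed."""
    reasons = []
    lag = primary_commit_index - replica.applied_index
    if lag > max_lag:
        reasons.append(f"lag gate: {replica.name} is {lag} entries behind")
    if not has_quorum(healthy_voters, total_voters):
        reasons.append(f"no quorum: {healthy_voters}/{total_voters} voters healthy")
    return reasons


print(check_promotion(ReplicaStatus("b", 100), 100, 3, 3))
# []
print(check_promotion(ReplicaStatus("b", 90), 100, 1, 3))
# ['lag gate: b is 10 entries behind', 'no quorum: 1/3 voters healthy']
```

Returning every refusal reason, rather than failing on the first, lets the upgrade command report the full picture before an operator retries.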

Acceptance: in a raft cluster, an ironcache upgrade rolls the whole cluster to a new version with zero downtime and zero acknowledged-write loss, with the primary upgraded last.

Depends on: the self-updater spine; prod/test running clustered.
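One way to turn the acceptance criteria into a test-harness check is sketched below. The function name, its inputs, and the choice to measure "zero downtime" as a count of failed client requests during the roll are assumptions, not part of the issue.

```python
from typing import Dict, List, Set


def acceptance_violations(
    versions: Dict[str, str],   # node name -> version after the roll
    target: str,
    upgrade_order: List[str],   # node names in the order they were upgraded
    original_primary: str,
    acked_writes: Set[str],     # keys whose writes were acknowledged during the roll
    readable_after: Set[str],   # keys readable once the roll has finished
    failed_requests: int,       # assumed downtime proxy
) -> List[str]:
    violations = []
    stale = sorted(n for n, v in versions.items() if v != target)
    if stale:
        violations.append(f"not upgraded: {stale}")
    if not upgrade_order or upgrade_order[-1] != original_primary:
        violations.append("original primary was not upgraded last")
    lost = sorted(acked_writes - readable_after)
    if lost:
        violations.append(f"acknowledged writes lost: {lost}")
    if failed_requests:
        violations.append(f"{failed_requests} client requests failed during the roll")
    return violations


ok = acceptance_violations(
    versions={"a": "1.1", "b": "1.1", "c": "1.1"},
    target="1.1",
    upgrade_order=["b", "c", "a"],
    original_primary="a",
    acked_writes={"k1", "k2"},
    readable_after={"k1", "k2", "k3"},
    failed_requests=0,
)
print(ok)
# []
```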

Metadata

Assignees

No one assigned

    Labels

    area:upgrade - Binary self-upgrade (ironcache upgrade) workstream
    sub-issue - Granular child task split out from a parent design issue
