OCPEDGE-2280: mutable topology#2008
Conversation
|
@jeff-roche: This pull request references OCPEDGE-2280 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.
438e03c to
98a1ba4
brandisher
left a comment
There was a problem hiding this comment.
I'm missing a "why" statement covering why a day 2, out-of-payload operator is the right choice for this. The CVO section towards the bottom hints at the why a bit but more explicit detail is needed.
With that in mind, I haven't reviewed the EP fully because I don't understand why this is the approach we're taking. The assessment of CVO seems very light and not enough to exclude that as a potential option to meet the goals.
| - The CLI would need direct access to operator internals, violating separation of concerns | ||
| - Error recovery and retry logic is better suited to an operator's reconciliation loop | ||
|
|
||
| ### Controller in CVO |
There was a problem hiding this comment.
Is CVO the only option in the core operators where this might make sense?
There was a problem hiding this comment.
I expanded to include some other operators, none of which fit the bill in my opinion. This is an entirely new process and shoehorning it into another operator that wasn't designed for tackling this type of procedure seems irresponsible to me
There was a problem hiding this comment.
Which operator handles adding nodes to clusters?
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
@brandisher I've added a new paragraph under the |
JoelSpeed
left a comment
There was a problem hiding this comment.
🤖 Generated with Claude Code
There are significant portions of this proposal that assume behaviour of OpenShift that either doesn't exist, or doesn't work in the way proposed. I'm assuming here that this is hallucination of Claude?
The EP as it stands today doesn't actually make sense for implementation. It also doesn't align with what I thought we had agreed on the architecture call.
Has anyone tried to manually take a cluster and scale up and manually transition from a single replica to multiple replicas? IMO this is the most important next step for this project
What I thought we had agreed:
- To scale from SNO to HA, the user must create two new control plane nodes and join them to the cluster
- On HighlyAvailable topology - KAS, KCM, etcd, etc all get scheduled automatically as static pods on these nodes - I don't see anything that prevents this based on if it's a SNO cluster today, this needs to be checked (it probably should)
- MCO still serves ignition for control plane nodes on SNO, so user needs to create the control plane nodes somehow to ignite from here
- New fields are added to the infrastructure spec to allow the user to say "I intend for this cluster to be HA going forward"
- A controller is added to cluster config operator
- This checks that the precondition of having additional control plane nodes in the cluster is met
- Once the precondition is met, it updates the status to reflect spec
- Operators now react to the change in status and transition from single to HA
- etcd operator promotes learners to full members, quorum goes from 1->3 (I don't know if this guard is in place today, we should add if not)
- KAS/KCM - no change, it already scheduled new KAS/KCM pods
- Others - Those that previously deploy a single replica of their operand now move to 2 replicas, other changes might be needed on a per operator basis, I was expecting those details in the EP but don't see them yet
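The precondition-then-status step in the list above can be sketched roughly as follows. This is a minimal Python sketch, not the cluster-config-operator implementation: the `Infrastructure` dataclass, `REQUIRED_CONTROL_PLANE_NODES`, and `reconcile_topology` are hypothetical names, and the threshold of 3 control plane nodes for `HighlyAvailable` is an assumption based on the SNO to 3-node compact scope.

```python
from dataclasses import dataclass

# Assumed number of control plane nodes required before status may change.
REQUIRED_CONTROL_PLANE_NODES = {"SingleReplica": 1, "HighlyAvailable": 3}


@dataclass
class Infrastructure:
    """Illustrative stand-in for the Infrastructure CR (not the real API type)."""
    desired_topology: str        # spec: what the administrator intends
    control_plane_topology: str  # status: what operators react to


def reconcile_topology(infra: Infrastructure, ready_control_plane_nodes: int) -> bool:
    """Copy spec into status once the node precondition is met.

    Returns True if status changed, False otherwise.
    """
    if infra.desired_topology == infra.control_plane_topology:
        return False  # no transition requested
    needed = REQUIRED_CONTROL_PLANE_NODES[infra.desired_topology]
    if ready_control_plane_nodes < needed:
        return False  # precondition not met yet; status stays untouched
    infra.control_plane_topology = infra.desired_topology
    return True
```

Operators would only see the new topology after the final assignment, which is the ordering this comment asks for.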
| - Installed either manually or via the `oc adm transition topology` command | ||
| - Owns the transition graph — the directed graph defining which topology transitions are supported | ||
| - Owns the validation criteria for each transition (required nodes, certificates, secrets, operator states) | ||
| - Orchestrates transitions by interacting with cluster operators via their existing APIs |
There was a problem hiding this comment.
I don't think this actually exists
There was a problem hiding this comment.
This statement is just objectively incorrect. :( I can see why you'd be confused reading this.
This never happens. What we can do is look to update things like ingress and console to be more adaptable like etcd/api-server such that they update their replicas when more infrastructure nodes become available and do firmer pre-flight checks so that the "transition" piece becomes a no-op, but I think it's OK for some operators to continue to treat the topology field as the source of truth for desired behavior.
An alternative would be to key off the infrastructure topologies "desiredTopology" and update the hooks for ingress to try to update its replicas when it detects an update to that field. Then the pre-flight checks actually verify that it has the right number of replicas and we update the topology after it's already succeeded. I guess it depends on whether we're treating the topology field or the desired topology field as the answer to "what should the operator be doing right now".
There was a problem hiding this comment.
it's OK for some operators to continue to treat the topology field as the source of truth for desired behavior.
Absolutely.
In an ideal world, most operators would not scale up their operands until the status topology fields were updated. We know that's not true today but I don't think we necessarily need to fix most of the controllers. The one controller that does concern me is etcd operator. Would be good to understand why it acts the way it does today (will just scale up and add the member to quorum on SNO) and whether there's a way we can change that behaviour so that it would treat new members as learners until the status topology transitions
@jeff-roche BTW can we get rid of the objectively incorrect statements at some point please
|
|
||
| #### Risk: Platform Bare Metal May Not Support Single-Node Clusters | ||
|
|
||
| **Risk**: If keepalived networking cannot be enabled, `platform: baremetal` will be limited to 2+ nodes, reducing the value of mutable topology for this platform. |
There was a problem hiding this comment.
limited to 2+? Isn't that the success criteria?
There was a problem hiding this comment.
baremetal platform doesn't support SNO because having a load balancer for 1 node doesn't make sense.
In order for users to get the benefit of not having to manually deploy a load balancer (i.e. what they primarily save in terms of effort when deploying on platform: baremetal), we need to investigate if we can allow baremetal as a platform for SNO first (with load balancing disabled), and change that operator so that load balancing can be introduced post-transition.
Otherwise we need to introduce a new, scarier feature: platform transitions.
That's a Pandora's box I don't want to look at.
There was a problem hiding this comment.
That's a Pandora's box I don't want to look at.
You and me both
So are we tying this EP to not only supporting topology transitions, but also SNO on baremetal? I would have expected a SNO on baremetal project to be sufficiently large and warrant its own EP?
| - Error recovery and retry logic is better suited to an operator's reconciliation loop than imperative CLI code | ||
| - The CLI would need direct access to operator internals, violating separation of concerns | ||
|
|
||
| ### Extending an Existing Core Operator |
There was a problem hiding this comment.
Or cluster config operator which would make a very natural home for this as long as we have commitment of ownership from folks writing the new controller
There was a problem hiding this comment.
At risk of going on a tangent, currently the installer has problems calculating topology when laying down manifests. We have a bug for this in the backlog, and I left #1905 (comment) on the previous enhancement.
I would like to see that calculation moved to the cluster config operator in bootkube during bootstrapping. That solution could co-exist with this one (and my team will push it forward as priorities allow); but it could potentially also tie into this solution.
There was a problem hiding this comment.
I'm fine with us using CCO for this. I will take the blame for miscommunicating this to Jeff - it didn't strike me as obvious that a controller for this transition would belong there. My instincts were that new code in the core operators is expensive, especially for a controller that doesn't need to be running 99% of the time. That said, I think it's fine for this to be a controller that is installed with zero replicas and the replicas are scaled up during transition events. That fits the main sentiment of what Jeff and I were trying to solve - minimizing the tax on clusters that will never use this feature (i.e. the vast majority of them).
There was a problem hiding this comment.
IMHO the solution for the installer is to get the user to specify their intent in the install-config (this should be passed through to the cluster). This enhancement is a good opportunity to define what that input should look like.
There was a problem hiding this comment.
Resolving this thread as I've re-scoped this to be a new CCO controller
patrickdillon
left a comment
There was a problem hiding this comment.
I know the scope is limited to baremetal/platform:none, but I know there is interest in mutable topologies on cloud platforms as well, so as much as appropriate I would like to ensure the design leaves a path forward for those cloud platforms.
Also, like the other enhancement I don't see any mention of mastersSchedulable which affects the calculation for infrastructureTopology. How is the mastersSchedulable field handled/taken into account for this solution?
zaneb
left a comment
There was a problem hiding this comment.
This one looks directionally correct 👍
|
|
||
| ##### Pre-Transition | ||
|
|
||
| 1. The cluster administrator prepares the additional control-plane nodes (hardware, network, OS) |
There was a problem hiding this comment.
Does 'OS' here imply that the user joins the hosts to the cluster as control plane nodes at this stage? If not, at what stage is that expected to happen?
There was a problem hiding this comment.
Not sure what this means? Does this just mean getting the HW in place? Or does this mean adding the node as a worker node to the cluster?
That would have the benefit that we can rely on all the existing docs and procedures on how to add a worker node to an existing cluster.
| OTTO maintains a directed graph of supported transitions. For the initial implementation: | ||
|
|
||
| ```text | ||
| SingleReplica (SNO, platform: none) → HighlyAvailable (3-node compact) |
There was a problem hiding this comment.
I think it's a mistake to define the supported topologies in terms of the controlPlaneTopology field. There are at least 6 use cases I can think of that users have articulated:
- single-node (1 schedulable control plane, 0+ workers, no load balancer)
- compact (3 schedulable control plane, 0+ workers)
- standby (3 non-schedulable control plane, 0 workers)
- HA (3 non-schedulable control plane, 2+ workers)
- TNA (2 non-schedulable control plane, 1 arbiter, 2+ workers)
- TNF (2 schedulable control plane w/ STONITH, 0 workers)
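One way to make the six use cases above concrete is to encode them as data and match a cluster's node counts against them. The sketch below is illustrative only: `TopologyProfile`, `PROFILES`, and `matching_profiles` are hypothetical names, and attributes such as the load balancer or STONITH are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TopologyProfile:
    name: str
    control_plane_nodes: int
    control_plane_schedulable: bool
    arbiter_nodes: int
    min_workers: int
    max_workers: Optional[int]  # None means no upper bound


# Node counts taken from the list above; other attributes are omitted.
PROFILES = [
    TopologyProfile("single-node", 1, True, 0, 0, None),
    TopologyProfile("compact", 3, True, 0, 0, None),
    TopologyProfile("standby", 3, False, 0, 0, 0),
    TopologyProfile("HA", 3, False, 0, 2, None),
    TopologyProfile("TNA", 2, False, 1, 2, None),
    TopologyProfile("TNF", 2, True, 0, 0, 0),
]


def matching_profiles(control_plane: int, schedulable: bool,
                      arbiters: int, workers: int) -> List[str]:
    """Return the names of all profiles consistent with the given counts."""
    names = []
    for p in PROFILES:
        if p.control_plane_nodes != control_plane:
            continue
        if p.control_plane_schedulable != schedulable:
            continue
        if p.arbiter_nodes != arbiters:
            continue
        if workers < p.min_workers:
            continue
        if p.max_workers is not None and workers > p.max_workers:
            continue
        names.append(p.name)
    return names
```

Note that 3 non-schedulable control plane nodes with exactly 1 worker matches none of the six, which is the kind of gap a `controlPlaneTopology`-only definition would not surface.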
There was a problem hiding this comment.
I've expanded the detail around CP and infra topology, as well as some validation rules around number of workers. For the first pass, we will report an error prior to transitioning if there are any worker nodes.
There was a problem hiding this comment.
Can you point me to this expansion? I have the same question as Zane still having re-read the EP. This IMO needs more expansion unless I missed a section
|
|
||
| The initial implementation targets `platform: none` clusters. On `platform: none`, the administrator is responsible for managing their own load balancing configuration (VIPs, DNS) when scaling beyond a single node. | ||
|
|
||
| `platform: baremetal` support is planned for a subsequent phase. Bare metal networking uses keepalived for ingress load balancing, which is not useful and creates a point of failure for SNO deployments. The Bare Metal Networking team will be consulted to determine if this networking setup can be enabled for single-node clusters transitioning to HA. |
There was a problem hiding this comment.
I find it weird that we are going to add single-node support to platform:baremetal just so that we can say we are not preventing it from later transitioning to HA.
Who is asking for this?
I would prefer that any effort from the on-prem networking team were instead directed toward adding optional on-prem networking to platform:external.
There was a problem hiding this comment.
As said in a previous comment, it's crucial to get support for "platform:baremetal" and keepalived load balancing (for ingress AND API) in the medium term. We should validate that there is no technical obstacle and this can be added in the next release. Rationale: at the edge, there is hardly ever an external load balancer available.
| 10. OTTO updates the Infrastructure status fields: | ||
| - `controlPlaneTopology` transitions from `SingleReplica` to `HighlyAvailable` | ||
| - `infrastructureTopology` transitions from `SingleReplica` to `HighlyAvailable` | ||
| 11. Operators reconcile against the new topology values and adjust their deployment strategies, replica counts, and placement policies |
There was a problem hiding this comment.
Are we going to try to e.g. restart OLM operators (which previously have treated the topology as fixed)?
There was a problem hiding this comment.
Do you have a view into how many/which olm operators are reading this value? Are they reading it at startup, or watching the resource? The expected pattern would be that the operator sees the change, and then reacts by updating the operand (e.g. scaling from 1 to 2 replicas now that it's been told the cluster is HA)
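The "watch and react" pattern described here might look like the following minimal sketch. The replica counts (1 for `SingleReplica`, 2 for `HighlyAvailable`) follow the example in this comment and are an assumption for any given operator; `REPLICAS_BY_TOPOLOGY` and `on_topology_change` are hypothetical names.

```python
from typing import Callable

# Assumed operand replica counts per control plane topology (illustrative only).
REPLICAS_BY_TOPOLOGY = {"SingleReplica": 1, "HighlyAvailable": 2}


def on_topology_change(old: str, new: str, scale: Callable[[int], None]) -> None:
    """Watch-style handler: rescale the operand only when the target count changes."""
    old_replicas = REPLICAS_BY_TOPOLOGY[old]
    new_replicas = REPLICAS_BY_TOPOLOGY[new]
    if new_replicas != old_replicas:
        scale(new_replicas)
```

An operator that only reads the topology at startup never runs a handler like this, which is why it would need a restart to pick up the change.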
| | cluster-etcd-operator | Coordinate with OTTO for sequential etcd scaling during transitions | | ||
| | Ingress, networking, monitoring operators | Respond to OTTO coordination signals during transitions; reconcile on Infrastructure config changes | | ||
|
|
||
| #### Platform Support Constraints |
There was a problem hiding this comment.
We need to mention that IBI clusters cannot be converted from SNO (and have some mechanism for preventing that).
There was a problem hiding this comment.
What's the technical blocker there?
|
Big update coming next week to realign this with CCO instead of a dedicated operator, add some more technical detail around the flow, and address masters schedulable. Thank you everyone for the quick and thorough reviews, I believe we are rapidly converging on a solid solution! |
| 4. CEO promotes the learner to a voting member — the cluster now has 2 voting members (quorum=2) | ||
| 5. CEO adds an etcd learner on the third control-plane node | ||
| 6. The learner syncs data from an existing voter | ||
| 7. CEO promotes the learner to a voting member — the cluster now has 3 voting members (quorum=2) |
There was a problem hiding this comment.
| 7. CEO promotes the learner to a voting member — the cluster now has 3 voting members (quorum=2) | |
| 7. CEO promotes the learner to a voting member — the cluster now has 3 voting members (quorum=3) |
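As a cross-check on the quorum figures in these steps, assuming etcd's usual Raft majority rule, quorum for n voting members is floor(n/2) + 1. A small sketch (the function names are illustrative):

```python
def quorum(voting_members: int) -> int:
    """Majority quorum for a Raft-style cluster: floor(n/2) + 1."""
    if voting_members < 1:
        raise ValueError("a cluster needs at least one voting member")
    return voting_members // 2 + 1


def failures_tolerated(voting_members: int) -> int:
    """Number of voting members that can be lost while keeping quorum."""
    return voting_members - quorum(voting_members)


# Walk the 1 -> 2 -> 3 scale-up described in the steps above.
for n in (1, 2, 3):
    print(n, quorum(n), failures_tolerated(n))
# prints:
# 1 1 0
# 2 2 0
# 3 2 1
```

Under that rule a 3-member cluster has a quorum of 2 and tolerates one failure, while the transient 2-member state tolerates none.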
| - The 2-member state is transient and follows the same sequential pattern used during cluster bootstrapping — a well-exercised code path | ||
| - Learner instances are used before promoting members to minimize the promotion window | ||
| - No availability guarantee during transitions; administrators should treat scaling operations as a maintenance window | ||
| - CEO will attempt rollback if scaling fails (e.g., rollback to 1 member if the 1→2→3 scale-up fails partway through) |
There was a problem hiding this comment.
What happens if the loss of quorum=2 was created by a split brain situation? Will both etcd attempt rollback to 1? This could lead to two individual clusters of one. I would be fine with specifying a simple heuristic to resolve this situation, e.g. dropping the younger etcd instance in favour of the older one or something like that. Or maybe a special command for the admin to resolve this situation
| - Learner instances are used before promoting members to minimize the promotion window | ||
| - No availability guarantee during transitions; administrators should treat scaling operations as a maintenance window | ||
| - CEO will attempt rollback if scaling fails (e.g., rollback to 1 member if the 1→2→3 scale-up fails partway through) | ||
| - Future iterations may explore admitting two learners simultaneously and promoting only when both are ready, eliminating the 2-member voting window entirely but that is out of scope for this enhancement |
There was a problem hiding this comment.
maybe worth addressing this directly, instead of dealing with the potential split brain situation from my previous comment?
Introduce the Mutable Topology enhancement, which replaces the previous Adaptable Topology proposal. Instead of a new topology enum that all operators must interpret, this approach uses a dedicated operator (OTTO) to orchestrate transitions between existing fixed topology modes. Initial scope: SNO to HA compact on platform: none.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the topology transition controller from a standalone operator (OTTO) into cluster-config-operator. CCO owns the config.openshift.io API group and infrastructure CR lifecycle, making it the natural home.
Key design decisions:
- desiredTopology initialized by installer to match controlPlaneTopology (no kubebuilder default; value is cluster-specific)
- Controller triggers on desiredTopology != status.controlPlaneTopology
- On failure, controller resets desiredTopology to current topology
- Upgrade blocked via Upgradeable=False during transitions
- Condition types: TopologyTransitionProgressing, Completed, Failed
- Per-operator topology audit required for Dev Preview entry
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
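The trigger, failure-reset, and upgrade-blocking decisions in the commit message above can be sketched as a small state machine. This is a hedged illustration, not the controller's code: `InfraState`, `start_transition`, and `finish_transition` are hypothetical names, the condition keys `TopologyTransitionCompleted` and `TopologyTransitionFailed` are assumed expansions of "Completed" and "Failed", and restoring `Upgradeable` after a failed transition is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class InfraState:
    """Illustrative stand-in for the Infrastructure CR plus operator conditions."""
    desired_topology: str
    control_plane_topology: str
    conditions: Dict[str, bool] = field(default_factory=dict)


def start_transition(state: InfraState) -> bool:
    """Begin a transition when spec and status disagree; block upgrades meanwhile."""
    if state.desired_topology == state.control_plane_topology:
        return False
    state.conditions["TopologyTransitionProgressing"] = True
    state.conditions["Upgradeable"] = False
    return True


def finish_transition(state: InfraState, succeeded: bool) -> None:
    """Record the outcome; on failure, reset spec so the controller does not retry forever."""
    state.conditions["TopologyTransitionProgressing"] = False
    if succeeded:
        state.control_plane_topology = state.desired_topology
    else:
        # Deliberate spec mutation: desired falls back to the current topology.
        state.desired_topology = state.control_plane_topology
    state.conditions["TopologyTransitionCompleted"] = succeeded
    state.conditions["TopologyTransitionFailed"] = not succeeded
    state.conditions["Upgradeable"] = True  # assumption: unblocked either way
```

After a failure, spec and status match again, so `start_transition` returns False until the administrator sets `desiredTopology` again.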
a8d48b3 to
22b3682
|
@jeff-roche: all tests passed!
| - Resolution: CEO should attempt automatic rollback. If rollback fails, follow standard etcd disaster recovery procedures. | ||
|
|
||
| ### Recovery Procedures | ||
|
|
There was a problem hiding this comment.
As part of this transition, will backups scale?
If I take a backup on SNO, will it work on TNA or do I need to take a fresh/new backup?
There was a problem hiding this comment.
I haven't thought through backups. @jaypoulz have you given this any thought? My initial thought is you would need to do a new backup as I'm not sure of how we would scale the backup.
|
Are there limitations for a SNO to TNF transition ? TNF requires BMC/Redfish so if the SNO bare metal hardware does not have it, does it block the transition? I could see this being a problem trying to match hardware in general (BMC firmware versions, vendor types, etc. ). |
JoelSpeed
left a comment
There was a problem hiding this comment.
This is much better than the previous iteration. I still feel like there's some disconnect between the new and old stuff, some stuff may still be hanging over from the previous iteration that doesn't quite make sense now, PTAL at my comments
|
|
||
| This enhancement enables OpenShift clusters to transition between topology modes as a Day 2 operation. This changes the existing OpenShift assumption that topologies are immutable after installation. | ||
|
|
||
| A new `desiredTopology` field in the infrastructure spec expresses the administrator's intent to transition. A topology transition controller in cluster-config-operator watches for changes to this field, validates preconditions, coordinates the transition, and updates the existing topology status fields when the cluster is ready. |
There was a problem hiding this comment.
Is this for infrastructure or control plane, or both?
There was a problem hiding this comment.
This is a fair question. In my head this entire process is about control plane scaling. I think we already have the necessary mechanisms in place to scale workers, right?
| This enhancement enables OpenShift clusters to transition between topology modes as a Day 2 operation. This changes the existing OpenShift assumption that topologies are immutable after installation. | ||
|
|
||
| A new `desiredTopology` field in the infrastructure spec expresses the administrator's intent to transition. A topology transition controller in cluster-config-operator watches for changes to this field, validates preconditions, coordinates the transition, and updates the existing topology status fields when the cluster is ready. | ||
| A new `oc adm transition topology` CLI command provides an interface for cluster administrators to initiate transitions. |
There was a problem hiding this comment.
Is this a common addition to the CLI? I have nothing against extending the CLI, but do question if it is strictly required
There was a problem hiding this comment.
I think it is not strictly required, this is more of a usability thing. In theory a cluster admin could go in and update the desired topology and manually monitor progress but that might feel disconnected. Through the CLI we could give some structure to the process
|
|
||
| A new `desiredTopology` field in the infrastructure spec expresses the administrator's intent to transition. A topology transition controller in cluster-config-operator watches for changes to this field, validates preconditions, coordinates the transition, and updates the existing topology status fields when the cluster is ready. | ||
| A new `oc adm transition topology` CLI command provides an interface for cluster administrators to initiate transitions. | ||
| The initial implementation supports transitioning Single Node OpenShift (SNO) clusters to HA compact (3-node) on `platform: none`. |
There was a problem hiding this comment.
Hoping to see somewhere a documented reason for why we are only considering platform none
There was a problem hiding this comment.
ack, I think this is covered a couple times in this doc but I can find a more explicit place to mention the reasoning
|
|
||
| This enhancement introduces a new infrastructure API field and a topology transition controller in cluster-config-operator (CCO; not to be confused with cloud-credential-operator) to enable topology transitions as Day 2 operations. | ||
|
|
||
| The approach follows the standard Kubernetes spec/status contract and mirrors the pattern used by `oc adm upgrade`: |
There was a problem hiding this comment.
It's more an openshift thing rather than a kube thing this pattern
|
|
||
| 3. **`oc adm transition topology` CLI command** — A command that validates preconditions before patching `spec.desiredTopology` on the infrastructure CR, then monitors transition progress. | ||
|
|
||
| The transition controller is proposed to live in cluster-config-operator because CCO is the canonical location for config.openshift.io CRD manifests and bootstrap CR rendering, and the topology transition logic is tightly coupled to the Infrastructure CR schema it ships. This is a deliberate expansion of CCO's scope since historically the repo has been limited to CRD manifests and bootstrap rendering. The controller is feature-gated using the standard library-go FeatureGateAccess pattern: when the gate is disabled the controller is not registered with the manager and incurs negligible runtime overhead; a gate change triggers an operator restart via ForceExit so the new state is picked up cleanly. |
There was a problem hiding this comment.
the repo has been limited to CRD manifests and bootstrap rendering
This is not really true, but also doesn't materially affect what you're trying to say in this EP
TBH, this whole paragraph is fluff IMO
There was a problem hiding this comment.
Good with me dropping it? It was a recommendation from chai-bot to add it and I figured it didn't hurt but agree it's fluff
|
|
||
| #### Risk: Cannot Validate External Requirements | ||
|
|
||
| **Risk**: On `platform: none`, the topology transition controller cannot validate external requirements such as correct load balancer configuration or DNS setup. An administrator may initiate a transition with misconfigured networking, leading to a partially functional cluster. |
There was a problem hiding this comment.
This is the first time load balancers are mentioned. Is this still something we expect the CCO to validate? Feels like that's up to the admin to set up before they initiate the transition, and not something we should be caring about IMO
| **Why it was rejected**: | ||
| - The scope does not warrant a new operator — cluster-config-operator is the natural home for this logic since it already owns the `config.openshift.io` API group and infrastructure CR lifecycle | ||
| - A standalone operator adds payload size, requires its own upgrade/lifecycle management, and introduces another component to monitor | ||
| - The transition controller can live in CCO with zero overhead when not in use, gated by the `MutableTopology` feature gate |
|
|
||
| ## Open Questions | ||
|
|
||
| 1. **HyperShift considerations**: Since the scope has broadened from edge-specific deployments to changing the topology assumption for OpenShift as a whole, do we need to consider HyperShift support? Initial answer is no — this would be future work and require its own enhancement. |
There was a problem hiding this comment.
This doesn't feel like an open question if it has an answer
| | ---- | ----------- | | ||
| | Precondition validation | Verify controller rejects transitions with missing nodes, invalid platforms, or unsupported source topologies | | ||
| | CLI interaction | Verify `oc adm transition topology` correctly patches `spec.desiredTopology` and monitors progress | | ||
| | Feature gate gating | Verify the controller is inactive when `MutableTopology` feature gate is disabled | |
There was a problem hiding this comment.
The API won't exist when the gate is disabled, so you won't be able to drive the controller even if it were running. I think this test is probably impossible if not superfluous
Summary
Introduces the Mutable Topology enhancement proposal, which enables OpenShift clusters to transition between topology modes as a Day 2 operation. This replaces the previous Adaptable Topology proposal.
Key Design Decisions
- `spec.desiredTopology` on the Infrastructure CR, validates preconditions, coordinates the transition across operators, and updates topology status fields when complete. CCO was chosen over CVO, CEO, and MCO (and over a standalone operator) because it owns the `config.openshift.io` API group and the Infrastructure CR lifecycle. See Alternatives in the proposal for the full placement analysis.
- `TopologyMode` values (`SingleReplica`, `HighlyAvailable`, etc.). Operators continue reacting to fixed topology values they already understand. Transition complexity is concentrated in a single controller rather than distributed across 30+ operators.
- `spec.desiredTopology` expresses administrator intent; `status.controlPlaneTopology` reflects observed state. Mirrors the `oc adm upgrade` pattern (patch spec, controller does the work).
- `MutableTopology` gate progresses through DevPreview → TechPreview → GA. Controller is not registered when the gate is disabled (zero runtime overhead).
Scope
- `platform: none`
- `oc adm transition topology HighlyAvailable`
- `desiredTopology`; ValidatingAdmissionPolicy (fail-closed) protects topology status fields from direct edits outside CCO
- `desiredTopology` on failure (deliberate spec mutation to prevent infinite retry loops); CEO attempts etcd rollback
- `Upgradeable=False` while a transition is in progress
What Changed (Revision History)
The proposal was revised to base the controller in CCO rather than proposing a dedicated standalone operator (OTTO). Key changes from the prior revision:
- `desiredTopology` on failure with rationale for the spec-mutation deviation
- `Upgradeable=False` enforcement during transitions to prevent concurrent upgrades
Out of Scope
- `platform: baremetal` (pending keepalived resolution)
🤖 Generated with Claude Code