Skip to content
172 changes: 171 additions & 1 deletion enhancements/kube-apiserver/kms-encryption-foundations.md
Original file line number Diff line number Diff line change
Expand Up @@ -453,6 +453,60 @@ Credentials and ConfigMap data (`kms-secret-{key}-{keyID}` and `kms-configmap-{k

During KMS-to-KMS migration, the encryption-configuration secret contains provider configs for all active keys. The operator creates a separate sidecar for each key, listening on its own unix domain socket (e.g., `kms-1.sock`, `kms-2.sock`).

#### Health Reporter Sidecar
Copy link
Copy Markdown
Contributor

@p0lyn0mial p0lyn0mial May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that in general we miss:

  1. Specifying the probe interval
  2. Specifying the creds used to publish the condition
  3. Explaining what happens with orphaned conditions
  4. A plugin identity, when multiple plugins are specified - otherwise we cannot map status to a specific plugin.
  5. Listing the alternatives we discussed
  6. Specifying how the reporter connects to the apiserver. Perhaps for KAS we could go via localhost

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes, this is worthy to elaborate to balance detection and hammering KAS with SSA request.
  2. I assumed we would have a SA with RBAC.
  3. https://github.com/openshift/enhancements/pull/2005/changes#diff-c0e8034f61b2841cee9b2b1d07403961148c3a30f096b0d35720a57ead8d3697R660
  4. In the status of the condition each health report has its own line. I assumed that the keyID would make it unique. Does it not?
  5. Ok.
  6. See 2.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Set to 30s.
  2. legacy SA token with RBAC for now.
  3. Node + keyID
  4. oh damn. I will add that.
  5. service is slightly better, but localhost is worth mentioning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a section in alternatives to cover the options discussed: b411a09


When KMS encryption is enabled, a health reporter sidecar runs alongside every API server pod replica. The sidecar probes the colocated KMS plugin(s) and publishes the outcome to the owning operator's CR as a per-node condition. A separate aggregator controller picks up these conditions and emits one or more `ClusterOperator`-bound rollups (starting with `KMSPluginsDegraded`, see [Aggregator behavior](#aggregator-behavior)).

The sidecar's lifecycle (injection into the pod spec, image, mounts, RBAC) is managed by the same mechanism that handles KMS plugin sidecars; see [KMS Plugin Lifecycle Management](#kms-plugin-lifecycle-management-tech-preview-v2).

The reporter receives the set of UDS sockets to probe as flags at injection time. The `pluginlifecycle` package in library-go already enumerates the active KMS plugins from the encryption-config secret when it builds the pod spec, so passing the same socket paths into the reporter is essentially free. Plugin additions and removals always trigger a pod-spec change, which restarts the pod, so there is no live-discovery requirement.

##### Topology

One sidecar per API server pod replica, scaling with control-plane HA replica count:

- One per kube-apiserver static pod (typically 3 in HA, one per control-plane node)
- One per `openshift-oauth-apiserver` Deployment replica
- One per `openshift-apiserver` Deployment replica

During KMS-to-KMS migration, the same sidecar probes every active KMS plugin in its pod (see [Multiple Concurrent Sidecars](#multiple-concurrent-sidecars)) and reports their combined state in the Message field of its single per-node condition.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by checking the current EncryptionConfiguration?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The health reporter is given the UDS paths as a flag argument. Technically it could also scan for mounted UDS sockets.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh there's no multiple-concurrent-sidecars section :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah, that was two weeks ago. I think you just forgot to push a few commits back then.


##### Probe contract

Each sidecar probes its colocated KMS plugin(s) over the local UDS at `unix:///var/run/kmsplugin/kms-{keyID}.sock` (the same socket path scheme described in [Sidecar Injection](#sidecar-injection)).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, this states each sidecar probes its colocated KMS Plugins

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and is it the keyID though? I would assume we would allocate a unique socket path on each node like:

/var/run/kmsplugin/kms-kube-apiserver/{encryptionconfig.Name (?) }/plugin.sock

not sure whether we need to separate via folders here though

there's also no sidecar-injection section either 😄

Copy link
Copy Markdown
Contributor Author

@ibihim ibihim May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ardaguclu, is this a question? Each sidecar probes its colocated KMS Plugins. Yes, this is correct.

I added a "Naming caveat" section.

nodeName + keyID should be unique.

There is a sidecar injection section: https://github.com/ibihim/enhancements/blob/2bb10c537d05893622c0340a2b2322d6d0765abd/enhancements/kube-apiserver/kms-encryption-foundations.md#sidecar-injection


**Naming caveat.** `{keyID}` in the socket path is **not** an id of a cryptographic key. It is the id of the encryption key secret managed by the encryption controllers, a monotonically incrementing sequence number (a new one per key rotation). The KMS v2 plugin separately reports the id of the remote KEK it currently uses in its `StatusResponse.key_id`. This document keeps `keyID` for the socket-path id and `kekID` for the plugin-reported KEK. Conflating them will misbehave in any consumer that assumes `keyID` names a key.

##### Per-tick emission

Each probe produces one `PluginHealthCondition` (defined in [Message format](#message-format)) for the plugin it targeted. The sidecar collects one entry per colocated plugin into a `PluginHealthConditions` array and writes the minified JSON to the condition's `Message`. Each entry's `lastChecked` is the wall-clock time of that probe.

##### Destination

The sidecar writes one condition per pod replica to the owning operator's `*.operator.openshift.io/cluster` CR via Server-Side Apply (per-entry ownership via `+listType=map` on `OperatorStatus.Conditions`). The aggregator controller reads these conditions and emits the `KMSPluginsDegraded` rollup. See [KMS Plugin Health Conditions](#kms-plugin-health-conditions) for the exact naming, status mapping, and rollup behavior, and [KMS Health Reporter Connectivity](#kms-health-reporter-connectivity) for how the reporter authenticates and connects to perform the write.

```
within each apiserver pod (3 in HA):

KMS plugin (kms-2.sock, write) ──┐
KMS plugin (kms-1.sock, read) ──┤ UDS
health-reporter sidecar
│ SSA (per-node fieldManager)
operator CR (kubeapiservers.operator.openshift.io/cluster):
├─ KMSHealthReporter_<nodeName> ◄─ written by each per-pod reporter
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example about how kms-2.sock and kms-1.sock will be represented in operator CR?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw example below.

│ (one per node, multi-plugin state in Message)
└─ KMSPluginsDegraded ◄─ written by aggregator controller
│ (reads the per-node entries above)
│ matches _Degraded suffix
ClusterOperator: Degraded
```

### User Stories

- As a cluster admin, I want to enable KMS encryption by updating the APIServer resource, so I can declaratively configure encryption without manually managing keys.
Expand Down Expand Up @@ -508,6 +562,100 @@ This feature does not depend on the features that are excluded from the OKE prod
- keyController uses provider-specific field-level comparison (not simple equality) to determine migration necessity
- UDS path convention: `unix:///var/run/kmsplugin/kms-{keyID}.sock` — keyID appended for uniqueness

#### KMS Health Reporter Connectivity

**Auth**: the reporter uses a **legacy ServiceAccount token** (mounted from a Secret) bound to a minimal Role that only permits applying its single per-node condition entry on the operator CR. The projected SA tokens available in API-server-adjacent namespaces are admin-grade, as are the auth client certificates on disk; both would over-privilege a sidecar whose only job is one SSA apply. The legacy SA token keeps the blast radius minimal if the sidecar is compromised. The tradeoff is lifetime: a legacy token does not expire, whereas a projected token rotates. We accept this, since a token scoped to one verb on one resource is a far smaller prize than an admin-grade token, expiring or not.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the justification here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, same here


**Connection**: all reporters reach the kube-apiserver through the in-cluster Service `kubernetes.default.svc`.

In an HA control plane this survives KMS failure: if one node's KMS plugin breaks, that node's KAS degrades, but the Service still has healthy endpoints on the other nodes, so the affected node's reporter can still deliver its condition. The Service approach only fully breaks when every KMS plugin is down, and that is a cluster-down event already surfaced by far louder signals (`ClusterOperator`, etcd, kubelet probes) than a missing reporter condition. On Single-Node OpenShift the Service has a single endpoint, so a broken local KMS plugin does leave the reporter unable to write; this is acceptable for the same reason: the cluster is already hard-down and the condition would be redundant.

Dialing `127.0.0.1:6443` directly (the kube-apiserver static pod uses `hostNetwork: true`) was considered and rejected. It would bridge the post-start window where the local KAS accepts TLS connections but is still absent from `kubernetes.default` `Endpoints` (the Service reconciler self-gates on `/readyz`; see [`kubernetesservice/controller.go`](https://github.com/kubernetes/kubernetes/blob/master/pkg/controlplane/controller/kubernetesservice/controller.go)). But reporting KMS plugin health is not on KAS's critical startup path, and a not-ready KAS is already surfaced with higher signal-to-noise by `ClusterOperator`, kubelet probes, and KAS's own readiness machinery.

#### KMS Plugin Health Conditions

##### Naming convention

Each reporter sidecar writes one condition per pod replica to the owning operator's CR (`kubeapiservers.operator.openshift.io/cluster`, etc.), keyed by the node:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please mention about who is going to inject this side car.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


```
KMSHealthReporter_<nodeName>
```

The Type has no `_Available` or `_Degraded` suffix, so library-go's `StatusSyncer` ignores it and it does not propagate to the `ClusterOperator`. The aggregator controller consumes these conditions and emits the `KMSPluginsDegraded` rollup separately (see [Aggregator behavior](#aggregator-behavior)).

This is a **temporary mechanism**. Long term, we plan to add first-class status fields for KMS plugin health to the operator CR API, so this signal lives in a typed shape rather than a string-encoded condition. Until then, encoding it in `KMSHealthReporter_<nodeName>` avoids an API change and keeps the design reversible.

##### Status mapping

While this condition is a temporary solution (see [Naming convention](#naming-convention)), the `Status` and `Reason` are hardcoded to avoid library-go's `StatusSyncer` or other consumers reacting to per-pod transitions:

- `Status: True`
- `Reason: AsExpected`

All structured probe outcomes (per-plugin health, KEK ID, timestamps, error detail) live in the `Message` field and are parsed by the aggregator (see [Message format](#message-format) and [Aggregator behavior](#aggregator-behavior)).

##### Message format

The `Message` field carries the structured probe outcomes that the aggregator parses. It holds a single minified JSON array, one element per probed plugin:

```go
type PluginHealthConditions []PluginHealthCondition

type PluginHealthCondition struct {
KeyID string `json:"keyID"` // encryption-key-secret id from the socket path (kms-{keyID}.sock); not a cryptographic key
KEKID string `json:"kekID,omitempty"` // remote KEK id from the plugin's KMS v2 StatusResponse.key_id; omitted when the probe errors (no StatusResponse)
Status string `json:"status"` // healthy | unhealthy | error
LastChecked time.Time `json:"lastChecked"` // RFC 3339 timestamp of this probe
Detail string `json:"detail,omitempty"` // error/health detail; omitted when healthy
}
```

##### Example

A three-node control plane during KMS-to-KMS migration. Each pod carries two plugins (six total across the cluster): `keyID=2` is the new key handling writes and reads, `keyID=1` is the previous key kept read-only to decrypt in-flight data. In this snapshot:

- `master-0`: both plugins healthy
- `master-1`: the new plugin (`id=2`) has a misconfigured cloud credential
- `master-2`: cannot reach either plugin

`Status` and `Reason` are uniform per the [Status mapping](#status-mapping); the actionable signal lives in `Message`:

```yaml
status:
conditions:
- type: KMSHealthReporter_master-0
status: "True"
reason: AsExpected
message: '[{"kekID":"kek-9f2c","keyID":"2","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"},{"kekID":"kek-4a17","keyID":"1","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"}]'
- type: KMSHealthReporter_master-1
status: "True"
reason: AsExpected
message: '[{"kekID":"kek-9f2c","keyID":"2","status":"unhealthy","lastChecked":"2026-05-08T12:34:56Z","detail":"credential lacks decrypt permission"},{"kekID":"kek-4a17","keyID":"1","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"}]'
- type: KMSHealthReporter_master-2
status: "True"
reason: AsExpected
message: '[{"keyID":"2","status":"error","lastChecked":"2026-05-08T12:34:56Z","detail":"connection refused"},{"keyID":"1","status":"error","lastChecked":"2026-05-08T12:34:56Z","detail":"connection refused"}]'
```

See [Aggregator behavior](#aggregator-behavior) for how these conditions roll up to the `ClusterOperator`.

##### Probe interval

Each reporter probes and emits on a fixed interval, **default 30 seconds**, passed as a sidecar flag at injection time (alongside the UDS socket paths). The exact value is not load-bearing: a probe cycle is `n` local UDS gRPC calls, one per `n` colocated plugins, followed by a single SSA write carrying all `n` results to the operator CR. Both are cheap, and tens-of-seconds detection latency is acceptable for KMS plugin health (key rotation and credential expiry are minutes-to-hours events).

Emission is unconditional and best-effort. The reporter writes its condition every tick even when nothing changed: the stale-reporter mitigation (see [Risks and Mitigations](#risks-and-mitigations)) relies on `lastChecked` advancing, so a write-on-change-only reporter would leave a healthy steady-state condition indistinguishable from a hung one. The reporter attempts the write every interval no matter the cluster state. If it cannot reach the kube-apiserver within the interval, it discards that result rather than queuing it: the reporter only ever needs its freshest probe on the CR, so once the next interval produces a result with a newer `lastChecked`, the un-written previous one is outdated and pointless to retry. A reporter that keeps failing to write stops advancing `lastChecked` and is caught by the staleness threshold.

The aggregator's staleness threshold is derived from the interval rather than configured independently: a condition whose `lastChecked` is older than `4 × interval` (120 s at the default) is treated as `Unknown`. Four intervals give enough data points that one or two dropped probes do not flip the rollup. Reporters apply small random jitter so replicas do not write the operator CR in lockstep.

##### Aggregator behavior

An aggregator controller reads the per-node `KMSHealthReporter_<nodeName>` conditions on the operator's CR and emits rollup conditions on the same CR. The first rollup is `KMSPluginsDegraded`; its `_Degraded` suffix routes it into the `ClusterOperator`'s `Degraded` slot via library-go's `StatusSyncer`. Additional rollups (e.g. `KMSPluginsAvailable`, `KMSPluginsProgressing`) may be added so the `ClusterOperator`'s `Available` and `Progressing` slots also reflect KMS plugin health. Each suffix maps to its matching `ClusterOperator` field via the same `StatusSyncer` convention, so each new type slots in without additional plumbing.

These rollup conditions (`KMSPluginsDegraded`, and any future `KMSPluginsAvailable` / `KMSPluginsProgressing`) are the admin-facing signal: they surface through `ClusterOperator` so that `oc get co kube-apiserver` is sufficient to learn KMS plugin health. The per-node `KMSHealthReporter_<nodeName>` conditions are plumbing for the aggregator and tooling, not intended for direct admin consumption.

The plan is to extend the existing [`conditionController`](https://github.com/openshift/library-go/blob/master/pkg/operator/encryption/controllers/condition_controller.go) in library-go's encryption controllers, which already emits the `Encrypted` condition on the same operator CR. It sits in the right call path (operator CR → ClusterOperator) and runs on the informer set the rollup needs. If extending it turns out to be a poor fit (conflicting sync triggers, unrelated dependencies that make the rollup hard to reason about), a dedicated controller will be introduced instead.

### Risks and Mitigations

**Risk: KMS Plugin Unavailable During Controller Sync**
Expand All @@ -526,6 +674,18 @@ This feature does not depend on the features that are excluded from the OKE prod
- **Impact:** Conflict with in-progress state machine
- **Mitigation:** keyController blocks new encryption key generation during promotion

**Risk: Stale Reporter Conditions**
- **Impact:** A reporter that hangs leaves its last `KMSHealthReporter_<nodeName>` condition in etcd unchanged.
- **Mitigation:** Per-plugin `lastChecked` timestamps in Message expose staleness. The aggregator controller treats a condition whose `lastChecked` exceeds the freshness threshold (`4 × probe interval`; see [Probe interval](#probe-interval)) as effectively `Unknown`.

**Risk: Orphaned Conditions on encryption type change and node replacement**
- **Impact:** When KMS is disabled (e.g., switching to `aescbc`), reporter sidecars are removed. Without explicit cleanup, `KMSHealthReporter_<nodeName>` and `KMSPluginsDegraded` entries remain stale on the operator CR.
- **Mitigation:** The aggregator controller owns cleanup. It removes orphaned `KMSHealthReporter_<nodeName>` entries (when their owning sidecar is no longer present) and removes its own `KMSPluginsDegraded` entry on KMS disable.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In SSA, will there be a fight between health reporter that tries to update the condition and aggregator controller that tries to cleanup the condition?.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's going to be a big problem when they stomp on each other, we might have some flapping during non-ready/deleted states though.

I wonder if we rather have top level condition of the KMSHealthReporter_<nodeName> owned by the aggregator controller.

Kinda similar to how the node controller works:
https://github.com/openshift/library-go/blob/3a6f949c22c3ffc0a2b28ee50e78173086e7adae/pkg/operator/staticpod/controller/node/node_controller.go

the sidecar would only ever write to it when the condition for said node exists. Sounds like too much coordination for a few conflicts between node state transitions...

for the later API, we could just use a map and key it off the nodename directly https://github.com/openshift/api/pull/2850/changes#diff-616d67895c3421c2d091662d30ed47b4b6f0b57db9411e29618d22b964ddb9efR362

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know better than me those conditions. In SSA, it is always better a controller fully owns the field.

Copy link
Copy Markdown
Contributor Author

@ibihim ibihim May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The *Degraded, *Available and so on are owned by the aggregator controller.
The aggregator controller itself would remove KMSHealthReporter_ that are owned by the KMS health reporters, not write into them. So there is some flapping, but it shouldn't be for too long / it shouldn't be a problem. Because the removal happens during garbage collection.

We could tear down the health reporter down before we tear down the KMS plugin, so we don't have wrong error conditions.


**Risk: Cold-Start Window**
- **Impact:** KMS plugin starts first (KAS depends on it), KAS starts second, reporter starts last. During the window between KAS readiness and reporter readiness, no `KMSHealthReporter_<nodeName>` condition exists even though KMS is functional.
- **Mitigation:** Consumers must not infer "KMS broken" from condition absence; missing means "not yet observed". KMS plugin lifecycle and KAS startup do not depend on reporter conditions existing.

#### KMS Key Loss Considerations

If the KMS key (the KEK used to encrypt the cluster seed, which Kubernetes then uses to generate DEKs for encrypting cluster data) is deleted externally, all encrypted resources in etcd become unreadable.
Expand Down Expand Up @@ -630,7 +790,7 @@ No special handling required.
## Operational Aspects of API Extensions

**Monitoring:**
- Operator conditions: `EncryptionControllerDegraded`, `EncryptionMigrationControllerProgressing`, `KMSPluginDegraded`
- Operator conditions: `EncryptionControllerDegraded`, `EncryptionMigrationControllerProgressing`, plus per-node `KMSHealthReporter_<nodeName>` and the aggregated `KMSPluginsDegraded` (rolled into the `ClusterOperator`'s `Degraded` condition; see [Health Reporter Sidecar](#health-reporter-sidecar))
- Metrics: `apiserver_storage_transformation_operations_total`, `apiserver_storage_transformation_duration_seconds`

**Impact:**
Expand Down Expand Up @@ -721,6 +881,16 @@ Instead of extending existing controllers, create new KMS-only controllers.

**Why not chosen:** Adds complexity to plugin lifecycle (must detect identical providers), breaks isolation, and complicates rollback scenarios.

### Alternative: Exposing KMS Plugin Status Outside the Pod

The plugin's `Status` gRPC is reachable only over the in-pod UDS. Three projections outward were considered:

1. **`kube-rbac-proxy` in front of the plugin's `Status` RPC.** Adds a new exposed port on the kube-apiserver pod.
2. **Carry patch in kube-apiserver** to expose plugin status. Grows the kube-apiserver carry set we are trying to shrink.
3. **`kube-rbac-proxy` in front of the kube-apiserver pod.** No new port, but inserts a single point of failure in front of kube-apiserver.

**Chosen:** the in-pod [Health Reporter Sidecar](#health-reporter-sidecar) consumes `Status` locally and pushes the result to the operator CR.

## Infrastructure Needed

None - extends existing library-go code.