From b719627e9fb4879b95abc68590cdbe9491422f5b Mon Sep 17 00:00:00 2001 From: Krzysztof Ostrowski Date: Fri, 8 May 2026 19:42:35 +0200 Subject: [PATCH 1/7] kms: design health reporter sidecar and aggregator - Per-node health reporter sidecar publishes one advisory KMSHealthReporter_ condition on the apiserver operator CR. - Aggregator controller reads those conditions and emits a single KMSPluginsDegraded rollup; library-go's StatusSyncer routes the _Degraded suffix into the ClusterOperator's Degraded condition. - Message format: one key=value line per probed plugin (keyID, status, lastChecked, optional trailing detail). - Risks: stale reporter conditions, orphaned conditions on KMS disable, cold-start window. --- .../kms-encryption-foundations.md | 141 +++++++++++++++++- 1 file changed, 140 insertions(+), 1 deletion(-) diff --git a/enhancements/kube-apiserver/kms-encryption-foundations.md b/enhancements/kube-apiserver/kms-encryption-foundations.md index 16635717d9..434bd35b7e 100644 --- a/enhancements/kube-apiserver/kms-encryption-foundations.md +++ b/enhancements/kube-apiserver/kms-encryption-foundations.md @@ -453,6 +453,58 @@ Credentials and ConfigMap data (`kms-secret-{key}-{keyID}` and `kms-configmap-{k During KMS-to-KMS migration, the encryption-configuration secret contains provider configs for all active keys. The operator creates a separate sidecar for each key, listening on its own unix domain socket (e.g., `kms-1.sock`, `kms-2.sock`). +#### Health Reporter Sidecar + +When KMS encryption is enabled, a health reporter sidecar runs alongside every API server pod replica. The sidecar probes the colocated KMS plugin(s) and publishes the outcome to the owning operator's CR as a per-node condition. A separate aggregator controller picks up these conditions and emits a single `KMSPluginsDegraded` rollup, which propagates to the `ClusterOperator`'s `Degraded` condition. 
+ +##### Topology + +One sidecar per API server pod replica, scaling with control-plane HA replica count: + +- One per kube-apiserver static pod (typically 3 in HA, one per control-plane node) +- One per `openshift-oauth-apiserver` Deployment replica +- One per `openshift-apiserver` Deployment replica + +During KMS-to-KMS migration, the same sidecar probes every active KMS plugin in its pod (see [Multiple Concurrent Sidecars](#multiple-concurrent-sidecars)) and reports their combined state in the Message field of its single per-node condition. + +##### Probe contract + +Each sidecar probes its colocated KMS plugin(s) over the local UDS at `unix:///var/run/kmsplugin/kms-{keyID}.sock` (the same socket path scheme described in [Sidecar Injection](#sidecar-injection)). + +##### Per-tick emission + +Each probe yields a HealthStatus carrying: + +- `State`: healthy / unhealthy / RPC error +- `KeyID`: keyID currently active in the probed plugin +- `Timestamp`: time of this check (last-check time; not the condition's `LastTransitionTime`, which records state flips only) + +##### Destination + +The sidecar writes one advisory condition per pod replica to the owning operator's `*.operator.openshift.io/cluster` CR via Server-Side Apply. The aggregator controller reads these advisory conditions and emits the `KMSPluginsDegraded` rollup. See [KMS Plugin Health Conditions](#kms-plugin-health-conditions) for the exact naming, status mapping, and rollup behavior. 
+ +``` +within each apiserver pod (3 in HA): + + KMS plugin (kms-2.sock, write) ──┐ + KMS plugin (kms-1.sock, read) ──┤ UDS + │ + ▼ + health-reporter sidecar + │ + │ SSA (per-node fieldManager) + ▼ +operator CR (kubeapiservers.operator.openshift.io/cluster): + ├─ KMSHealthReporter_ ◄─ written by each per-pod reporter + │ (advisory, one per node, multi-plugin state in Message) + │ + └─ KMSPluginsDegraded ◄─ written by aggregator controller + │ (reads the per-node entries above) + │ matches _Degraded suffix + ▼ + ClusterOperator: Degraded +``` + ### User Stories - As a cluster admin, I want to enable KMS encryption by updating the APIServer resource, so I can declaratively configure encryption without manually managing keys. @@ -508,6 +560,81 @@ This feature does not depend on the features that are excluded from the OKE prod - keyController uses provider-specific field-level comparison (not simple equality) to determine migration necessity - UDS path convention: `unix:///var/run/kmsplugin/kms-{keyID}.sock` — keyID appended for uniqueness +#### KMS Plugin Health Conditions + +##### Naming convention + +Each reporter sidecar writes one condition per pod replica to the owning operator's CR (`kubeapiservers.operator.openshift.io/cluster`, etc.), keyed by the node: + +``` +KMSHealthReporter_ +``` + +The Type has no `_Available` or `_Degraded` suffix, so it stays advisory on the operator CR (library-go's `StatusSyncer` ignores it). The aggregator controller consumes these conditions and emits the `KMSPluginsDegraded` rollup separately (see [Aggregator behavior](#aggregator-behavior)). 
+ +##### Status mapping + +The condition's `Status` is determined by the probe outcomes across all plugins in the sidecar's pod: + +| Probe outcome | Status | Reason | +|---|---|---| +| All plugins healthy | True | `AsExpected` | +| Any plugin returns not-ok | False | `Unhealthy` | +| Any plugin RPC error, no explicit not-ok | Unknown | `Unreachable` | + +##### Message format + +The Message is the structured input the aggregator controller consumes. One line per probed plugin: + +``` +keyID= status=healthy lastChecked= +``` + +When a plugin is unhealthy, a trailing `detail=` is appended: + +``` +keyID= status=unhealthy lastChecked= detail= +``` + +Each line is a sequence of `key=value` pairs separated by spaces. `detail=...` is the only field whose value may contain spaces; it is always last on its line. `lastChecked` is per-plugin so partial probe failures (one plugin stuck while others probe normally) are visible. + +##### Example + +A three-node control plane during KMS-to-KMS migration. 
One pod is healthy (master-0); another has a misconfigured cloud credential affecting one of its plugins (master-1); the third cannot reach its plugins at all (master-2): + +```yaml +status: + conditions: + - type: KMSHealthReporter_master-0 + status: "True" + reason: AsExpected + message: | + keyID=2 status=healthy lastChecked=2026-05-08T12:34:56Z + keyID=1 status=healthy lastChecked=2026-05-08T12:34:56Z + - type: KMSHealthReporter_master-1 + status: "False" + reason: Unhealthy + message: | + keyID=2 status=unhealthy lastChecked=2026-05-08T12:34:56Z detail=credential lacks decrypt permission + keyID=1 status=healthy lastChecked=2026-05-08T12:34:56Z + - type: KMSHealthReporter_master-2 + status: "Unknown" + reason: Unreachable + message: | + keyID=2 status=unreachable lastChecked=2026-05-08T12:34:56Z detail=connection refused + keyID=1 status=unreachable lastChecked=2026-05-08T12:34:56Z detail=connection refused +``` + +See [Aggregator behavior](#aggregator-behavior) for how these conditions roll up to the `ClusterOperator`. + +##### SSA mechanism + +Two classes of writers share the conditions array: the operator (writing its own conditions like `NodeControllerDegraded` plus the aggregator's `KMSPluginsDegraded` rollup) and N reporter sidecars (each writing its `KMSHealthReporter_`). Per-entry ownership is enabled by `+listType=map +listMapKey=type` on `OperatorStatus.Conditions`. Each reporter uses a per-node fieldManager (`kms-health-reporter-`); all writers apply with `Force: true`. + +##### Aggregator behavior + +A separate aggregator controller reads the `KMSHealthReporter_` conditions on the operator's CR and emits a single rollup condition `KMSPluginsDegraded`. The `_Degraded` suffix routes this rollup into the `ClusterOperator`'s `Degraded` slot via library-go's `StatusSyncer`. 
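As an illustration of how a consumer might read the line-oriented Message format above, here is a minimal Go sketch. The helper names (`parsePluginLine`, `anyPluginNotHealthy`) are hypothetical, and the rollup rule (treat any non-healthy plugin as degraded) is an assumption made for this example; the proposal does not define the aggregator's rule at this point.

```go
package main

import (
	"fmt"
	"strings"
)

// parsePluginLine splits one Message line into its key=value fields.
// detail= is always last and is the only value that may contain spaces,
// so it is cut off before the remaining pairs are split on whitespace.
func parsePluginLine(line string) map[string]string {
	fields := map[string]string{}
	if i := strings.Index(line, " detail="); i >= 0 {
		fields["detail"] = line[i+len(" detail="):]
		line = line[:i]
	}
	for _, pair := range strings.Fields(line) {
		if k, v, ok := strings.Cut(pair, "="); ok {
			fields[k] = v
		}
	}
	return fields
}

// anyPluginNotHealthy is a hypothetical rollup rule (an assumption, not part
// of the proposal): report true if any plugin line in any per-node Message
// has a status other than "healthy".
func anyPluginNotHealthy(messages map[string]string) bool {
	for _, msg := range messages {
		for _, line := range strings.Split(msg, "\n") {
			if strings.TrimSpace(line) == "" {
				continue // skip blank lines, e.g. the trailing newline of a block scalar
			}
			if parsePluginLine(line)["status"] != "healthy" {
				return true
			}
		}
	}
	return false
}

func main() {
	messages := map[string]string{
		"KMSHealthReporter_master-0": "keyID=2 status=healthy lastChecked=2026-05-08T12:34:56Z\n" +
			"keyID=1 status=healthy lastChecked=2026-05-08T12:34:56Z\n",
		"KMSHealthReporter_master-1": "keyID=2 status=unhealthy lastChecked=2026-05-08T12:34:56Z detail=credential lacks decrypt permission\n" +
			"keyID=1 status=healthy lastChecked=2026-05-08T12:34:56Z\n",
	}
	fmt.Println(anyPluginNotHealthy(messages))
}
```

Cutting `detail=` off first keeps the parser independent of what the free-text detail contains, which is the reason the format places it last on the line.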
+ ### Risks and Mitigations **Risk: KMS Plugin Unavailable During Controller Sync** @@ -526,6 +653,18 @@ This feature does not depend on the features that are excluded from the OKE prod - **Impact:** Conflict with in-progress state machine - **Mitigation:** keyController blocks new encryption key generation during promotion +**Risk: Stale Reporter Conditions** +- **Impact:** A reporter that hangs leaves its last `KMSHealthReporter_` condition in etcd unchanged. +- **Mitigation:** Per-plugin `lastChecked` timestamps in Message expose staleness. The aggregator controller can treat conditions whose `lastChecked` exceeds a freshness threshold as effectively `Unknown`. + +**Risk: Orphaned Conditions on Mode Switch** +- **Impact:** When KMS is disabled (e.g., switching to `aescbc`), reporter sidecars are removed. Without explicit cleanup, `KMSHealthReporter_` and `KMSPluginsDegraded` entries remain stale on the operator CR. +- **Mitigation:** The aggregator controller owns cleanup. It removes orphaned `KMSHealthReporter_` entries (when their owning sidecar is no longer present) and removes its own `KMSPluginsDegraded` entry on KMS disable. + +**Risk: Cold-Start Window** +- **Impact:** KMS plugin starts first (KAS depends on it), KAS starts second, reporter starts last. During the window between KAS readiness and reporter readiness, no `KMSHealthReporter_` condition exists even though KMS is functional. +- **Mitigation:** Consumers must not infer "KMS broken" from condition absence; missing means "not yet observed". KMS plugin lifecycle and KAS startup do not depend on reporter conditions existing. + #### KMS Key Loss Considerations If the KMS key (the KEK used to encrypt the cluster seed, which Kubernetes then uses to generate DEKs for encrypting cluster data) is deleted externally, all encrypted resources in etcd become unreadable. @@ -630,7 +769,7 @@ No special handling required. 
## Operational Aspects of API Extensions **Monitoring:** -- Operator conditions: `EncryptionControllerDegraded`, `EncryptionMigrationControllerProgressing`, `KMSPluginDegraded` +- Operator conditions: `EncryptionControllerDegraded`, `EncryptionMigrationControllerProgressing`, plus per-node `KMSHealthReporter_` and the aggregated `KMSPluginsDegraded` (rolled into the `ClusterOperator`'s `Degraded` condition; see [Health Reporter Sidecar](#health-reporter-sidecar)) - Metrics: `apiserver_storage_transformation_operations_total`, `apiserver_storage_transformation_duration_seconds` **Impact:** From 7cb7c431c7de23c1721bdd89beb098e0fdfb0c52 Mon Sep 17 00:00:00 2001 From: Krzysztof Ostrowski Date: Wed, 20 May 2026 14:16:53 +0200 Subject: [PATCH 2/7] kms: tighten health reporter section for coherence Reconcile terminology and arguments across the Health Reporter Sidecar and KMS Plugin Health Conditions sections: - Add a naming caveat distinguishing the socket-path keyID (encryption key secret id) from the plugin-reported kekID. - Collapse the duplicate per-tick struct into a single PluginHealthCondition definition; rename KMSPluginID to KeyID. - Drop the stale LastTransitionTime contrast (Status is now hardcoded). - Strengthen the connection rationale with the HA-routing argument and a Single-Node OpenShift caveat. - Acknowledge the legacy SA token lifetime tradeoff. 
--- .../kms-encryption-foundations.md | 95 +++++++++++-------- 1 file changed, 53 insertions(+), 42 deletions(-) diff --git a/enhancements/kube-apiserver/kms-encryption-foundations.md b/enhancements/kube-apiserver/kms-encryption-foundations.md index 434bd35b7e..b7fa2277dc 100644 --- a/enhancements/kube-apiserver/kms-encryption-foundations.md +++ b/enhancements/kube-apiserver/kms-encryption-foundations.md @@ -455,7 +455,11 @@ During KMS-to-KMS migration, the encryption-configuration secret contains provid #### Health Reporter Sidecar -When KMS encryption is enabled, a health reporter sidecar runs alongside every API server pod replica. The sidecar probes the colocated KMS plugin(s) and publishes the outcome to the owning operator's CR as a per-node condition. A separate aggregator controller picks up these conditions and emits a single `KMSPluginsDegraded` rollup, which propagates to the `ClusterOperator`'s `Degraded` condition. +When KMS encryption is enabled, a health reporter sidecar runs alongside every API server pod replica. The sidecar probes the colocated KMS plugin(s) and publishes the outcome to the owning operator's CR as a per-node condition. A separate aggregator controller picks up these conditions and emits one or more `ClusterOperator`-bound rollups (starting with `KMSPluginsDegraded`, see [Aggregator behavior](#aggregator-behavior)). + +The sidecar's lifecycle (injection into the pod spec, image, mounts, RBAC) is managed by the same mechanism that handles KMS plugin sidecars; see [KMS Plugin Lifecycle Management](#kms-plugin-lifecycle-management-tech-preview-v2). + +The reporter receives the set of UDS sockets to probe as flags at injection time. 
The `pluginlifecycle` package in library-go already enumerates the active KMS plugins from the encryption-config secret when it builds the pod spec (see [`AddKMSPluginSidecarToPodSpec`](https://github.com/openshift/library-go/blob/master/pkg/operator/encryption/kms/pluginlifecycle/sidecar.go)), so passing the same socket paths into the reporter is essentially free. Plugin additions and removals always trigger a pod-spec change, which restarts the pod, so there is no live-discovery requirement. ##### Topology @@ -471,17 +475,15 @@ During KMS-to-KMS migration, the same sidecar probes every active KMS plugin in Each sidecar probes its colocated KMS plugin(s) over the local UDS at `unix:///var/run/kmsplugin/kms-{keyID}.sock` (the same socket path scheme described in [Sidecar Injection](#sidecar-injection)). -##### Per-tick emission +**Naming caveat.** `{keyID}` in the socket path is **not** an id of a cryptographic key. It is the id of the encryption key secret managed by the encryption controllers, which acts as a **revision number**. The KMS v2 plugin separately reports the id of the remote KEK it currently uses in its `StatusResponse.key_id`. This document keeps `keyID` for the socket-path id and `kekID` for the plugin-reported KEK. Conflating them will misbehave in any consumer that assumes `keyID` names a key. -Each probe yields a HealthStatus carrying: +##### Per-tick emission -- `State`: healthy / unhealthy / RPC error -- `KeyID`: keyID currently active in the probed plugin -- `Timestamp`: time of this check (last-check time; not the condition's `LastTransitionTime`, which records state flips only) +Each probe produces one `PluginHealthCondition` (defined in [Message format](#message-format)) for the plugin it targeted. The sidecar collects one entry per colocated plugin into a `PluginHealthConditions` array and writes the minified JSON to the condition's `Message`. Each entry's `lastChecked` is the wall-clock time of that probe. 
##### Destination -The sidecar writes one advisory condition per pod replica to the owning operator's `*.operator.openshift.io/cluster` CR via Server-Side Apply. The aggregator controller reads these advisory conditions and emits the `KMSPluginsDegraded` rollup. See [KMS Plugin Health Conditions](#kms-plugin-health-conditions) for the exact naming, status mapping, and rollup behavior. +The sidecar writes one advisory condition per pod replica to the owning operator's `*.operator.openshift.io/cluster` CR via Server-Side Apply (per-entry ownership via `+listType=map` on `OperatorStatus.Conditions`). The aggregator controller reads these advisory conditions and emits the `KMSPluginsDegraded` rollup. See [KMS Plugin Health Conditions](#kms-plugin-health-conditions) for the exact naming, status mapping, and rollup behavior. ``` within each apiserver pod (3 in HA): @@ -505,6 +507,16 @@ operator CR (kubeapiservers.operator.openshift.io/cluster): ClusterOperator: Degraded ``` +##### Auth and connection + +**Auth**: the reporter uses a **legacy ServiceAccount token** (mounted from a Secret) bound to a minimal Role that only permits applying its single per-node condition entry on the operator CR. The projected SA tokens available in API-server-adjacent namespaces are admin-grade, as are the auth client certificates on disk; both would over-privilege a sidecar whose only job is one SSA apply. The legacy SA token keeps the blast radius minimal if the sidecar is compromised. The tradeoff is lifetime: a legacy token does not expire, whereas a projected token rotates. We accept this, since a token scoped to one verb on one resource is a far smaller prize than an admin-grade token, expiring or not. + +**Connection**: all reporters reach the kube-apiserver through the in-cluster Service `kubernetes.default.svc`. 
+ +In an HA control plane this survives KMS failure: if one node's KMS plugin breaks, that node's KAS degrades, but the Service still has healthy endpoints on the other nodes, so the affected node's reporter can still deliver its condition. The Service approach only fully breaks when every KMS plugin is down, and that is a cluster-down event already surfaced by far louder signals (`ClusterOperator`, etcd, kubelet probes) than a missing reporter condition. On Single-Node OpenShift the Service has a single endpoint, so a broken local KMS plugin does leave the reporter unable to write; this is acceptable for the same reason: the cluster is already hard-down and the condition would be redundant. + +Dialing `127.0.0.1:6443` directly (the kube-apiserver static pod uses `hostNetwork: true`) was considered and rejected. It would bridge the post-start window where the local KAS accepts TLS connections but is still absent from `kubernetes.default` `Endpoints` (the Service reconciler self-gates on `/readyz`; see [`kubernetesservice/controller.go`](https://github.com/kubernetes/kubernetes/blob/master/pkg/controlplane/controller/kubernetesservice/controller.go)). But reporting KMS plugin health is not on KAS's critical startup path, and a not-ready KAS is already surfaced with higher signal-to-noise by `ClusterOperator`, kubelet probes, and KAS's own readiness machinery. + ### User Stories - As a cluster admin, I want to enable KMS encryption by updating the APIServer resource, so I can declaratively configure encryption without manually managing keys. @@ -572,35 +584,42 @@ KMSHealthReporter_ The Type has no `_Available` or `_Degraded` suffix, so it stays advisory on the operator CR (library-go's `StatusSyncer` ignores it). The aggregator controller consumes these conditions and emits the `KMSPluginsDegraded` rollup separately (see [Aggregator behavior](#aggregator-behavior)). +This is a **temporary mechanism**. 
Long term, we plan to add first-class status fields for KMS plugin health to the operator CR API, so this signal lives in a typed shape rather than an advisory string-encoded condition. Until then, encoding it in `KMSHealthReporter_` avoids an API change and keeps the design reversible. + ##### Status mapping -The condition's `Status` is determined by the probe outcomes across all plugins in the sidecar's pod: +While this advisory condition is a temporary solution (see [Naming convention](#naming-convention)), the `Status` and `Reason` are hardcoded to avoid library-go's `StatusSyncer` or other consumers reacting to per-pod transitions: -| Probe outcome | Status | Reason | -|---|---|---| -| All plugins healthy | True | `AsExpected` | -| Any plugin returns not-ok | False | `Unhealthy` | -| Any plugin RPC error, no explicit not-ok | Unknown | `Unreachable` | +- `Status: True` +- `Reason: AsExpected` + +All structured probe outcomes (per-plugin health, KEK ID, timestamps, error detail) live in the `Message` field and are parsed by the aggregator (see [Message format](#message-format) and [Aggregator behavior](#aggregator-behavior)). ##### Message format -The Message is the structured input the aggregator controller consumes. One line per probed plugin: +The `Message` field carries the structured probe outcomes that the aggregator parses. 
It holds a single minified JSON array, one element per probed plugin: -``` -keyID= status=healthy lastChecked= -``` +```go +type PluginHealthConditions []PluginHealthCondition -When a plugin is unhealthy, a trailing `detail=` is appended: - -``` -keyID= status=unhealthy lastChecked= detail= +type PluginHealthCondition struct { + KeyID string `json:"keyID"` // encryption-key-secret id from the socket path (kms-{keyID}.sock); not a cryptographic key + KEKID string `json:"kekID,omitempty"` // remote KEK id from the plugin's KMS v2 StatusResponse.key_id; omitted when the plugin is unreachable + Status string `json:"status"` // healthy | unhealthy | unreachable + LastChecked time.Time `json:"lastChecked"` // RFC 3339 timestamp of this probe + Detail string `json:"detail,omitempty"` // error/health detail; omitted when healthy +} ``` -Each line is a sequence of `key=value` pairs separated by spaces. `detail=...` is the only field whose value may contain spaces; it is always last on its line. `lastChecked` is per-plugin so partial probe failures (one plugin stuck while others probe normally) are visible. - ##### Example -A three-node control plane during KMS-to-KMS migration. One pod is healthy (master-0); another has a misconfigured cloud credential affecting one of its plugins (master-1); the third cannot reach its plugins at all (master-2): +A three-node control plane during KMS-to-KMS migration. Each pod carries two plugins (six total across the cluster): `keyID=2` is the new key handling writes and reads, `keyID=1` is the previous key kept read-only to decrypt in-flight data. 
In this snapshot: + +- `master-0`: both plugins healthy +- `master-1`: the new plugin (`keyID=2`) has a misconfigured cloud credential +- `master-2`: cannot reach either plugin + +`Status` and `Reason` are uniform per the [Status mapping](#status-mapping); the actionable signal lives in `Message`: ```yaml status: @@ -608,32 +627,24 @@ status: - type: KMSHealthReporter_master-0 status: "True" reason: AsExpected - message: | - keyID=2 status=healthy lastChecked=2026-05-08T12:34:56Z - keyID=1 status=healthy lastChecked=2026-05-08T12:34:56Z + message: '[{"kekID":"projects/p/locations/l/keyRings/r/cryptoKeys/k/cryptoKeyVersions/2","keyID":"2","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"},{"kekID":"projects/p/locations/l/keyRings/r/cryptoKeys/k/cryptoKeyVersions/1","keyID":"1","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"}]' - type: KMSHealthReporter_master-1 - status: "False" - reason: Unhealthy - message: | - keyID=2 status=unhealthy lastChecked=2026-05-08T12:34:56Z detail=credential lacks decrypt permission - keyID=1 status=healthy lastChecked=2026-05-08T12:34:56Z + status: "True" + reason: AsExpected + message: '[{"kekID":"projects/p/locations/l/keyRings/r/cryptoKeys/k/cryptoKeyVersions/2","keyID":"2","status":"unhealthy","lastChecked":"2026-05-08T12:34:56Z","detail":"credential lacks decrypt permission"},{"kekID":"projects/p/locations/l/keyRings/r/cryptoKeys/k/cryptoKeyVersions/1","keyID":"1","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"}]' - type: KMSHealthReporter_master-2 - status: "Unknown" - reason: Unreachable - message: | - keyID=2 status=unreachable lastChecked=2026-05-08T12:34:56Z detail=connection refused - keyID=1 status=unreachable lastChecked=2026-05-08T12:34:56Z detail=connection refused + status: "True" + reason: AsExpected + message: '[{"keyID":"2","status":"unreachable","lastChecked":"2026-05-08T12:34:56Z","detail":"connection 
refused"},{"keyID":"1","status":"unreachable","lastChecked":"2026-05-08T12:34:56Z","detail":"connection refused"}]' ``` See [Aggregator behavior](#aggregator-behavior) for how these conditions roll up to the `ClusterOperator`. -##### SSA mechanism - -Two classes of writers share the conditions array: the operator (writing its own conditions like `NodeControllerDegraded` plus the aggregator's `KMSPluginsDegraded` rollup) and N reporter sidecars (each writing its `KMSHealthReporter_`). Per-entry ownership is enabled by `+listType=map +listMapKey=type` on `OperatorStatus.Conditions`. Each reporter uses a per-node fieldManager (`kms-health-reporter-`); all writers apply with `Force: true`. - ##### Aggregator behavior -A separate aggregator controller reads the `KMSHealthReporter_` conditions on the operator's CR and emits a single rollup condition `KMSPluginsDegraded`. The `_Degraded` suffix routes this rollup into the `ClusterOperator`'s `Degraded` slot via library-go's `StatusSyncer`. +An aggregator controller reads the per-node `KMSHealthReporter_` conditions on the operator's CR and emits rollup conditions on the same CR. The first rollup is `KMSPluginsDegraded`; its `_Degraded` suffix routes it into the `ClusterOperator`'s `Degraded` slot via library-go's `StatusSyncer`. Additional rollups (e.g. `KMSPluginsAvailable`, `KMSPluginsProgressing`) may be added so the `ClusterOperator`'s `Available` and `Progressing` slots also reflect KMS plugin health. Each suffix maps to its matching `ClusterOperator` field via the same `StatusSyncer` convention, so each new type slots in without additional plumbing. + +The plan is to extend the existing [`conditionController`](https://github.com/openshift/library-go/blob/master/pkg/operator/encryption/controllers/condition_controller.go) in library-go's encryption controllers, which already emits the `Encrypted` condition on the same operator CR. 
It sits in the right call path (operator CR → ClusterOperator) and runs on the informer set the rollup needs. If extending it turns out to be a poor fit (conflicting sync triggers, unrelated dependencies that make the rollup hard to reason about), a dedicated controller will be introduced instead. ### Risks and Mitigations From 5796ea309c015f0158ba463788271812330efa2e Mon Sep 17 00:00:00 2001 From: Krzysztof Ostrowski Date: Wed, 20 May 2026 15:00:21 +0200 Subject: [PATCH 3/7] kms: refine health reporter terminology and section placement Address review nits on the health reporter design: - Drop the deep link to AddKMSPluginSidecarToPodSpec; keep the pluginlifecycle package reference (function names and master URLs rot). - Describe the socket-path keyID as a monotonically incrementing sequence number instead of a "revision number", which collides with the kas-o RevisionController concept. - Remove the "advisory" label from the per-node condition: it is misleading since the aggregator does act on it. Describe the StatusSyncer mechanic directly instead. - Move Auth and connection out of the Proposal into a new KMS Health Reporter Connectivity section under Implementation Details. - Rename the unreachable probe status to error, since a failed Status RPC can fail for reasons beyond unreachability. - Replace the GCP-style example kekIDs with short opaque values, decorrelated from keyID to reinforce the naming caveat. 
--- .../kms-encryption-foundations.md | 44 +++++++++---------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/enhancements/kube-apiserver/kms-encryption-foundations.md b/enhancements/kube-apiserver/kms-encryption-foundations.md index b7fa2277dc..2ee31d319c 100644 --- a/enhancements/kube-apiserver/kms-encryption-foundations.md +++ b/enhancements/kube-apiserver/kms-encryption-foundations.md @@ -459,7 +459,7 @@ When KMS encryption is enabled, a health reporter sidecar runs alongside every A The sidecar's lifecycle (injection into the pod spec, image, mounts, RBAC) is managed by the same mechanism that handles KMS plugin sidecars; see [KMS Plugin Lifecycle Management](#kms-plugin-lifecycle-management-tech-preview-v2). -The reporter receives the set of UDS sockets to probe as flags at injection time. The `pluginlifecycle` package in library-go already enumerates the active KMS plugins from the encryption-config secret when it builds the pod spec (see [`AddKMSPluginSidecarToPodSpec`](https://github.com/openshift/library-go/blob/master/pkg/operator/encryption/kms/pluginlifecycle/sidecar.go)), so passing the same socket paths into the reporter is essentially free. Plugin additions and removals always trigger a pod-spec change, which restarts the pod, so there is no live-discovery requirement. +The reporter receives the set of UDS sockets to probe as flags at injection time. The `pluginlifecycle` package in library-go already enumerates the active KMS plugins from the encryption-config secret when it builds the pod spec, so passing the same socket paths into the reporter is essentially free. Plugin additions and removals always trigger a pod-spec change, which restarts the pod, so there is no live-discovery requirement. 
##### Topology @@ -475,7 +475,7 @@ During KMS-to-KMS migration, the same sidecar probes every active KMS plugin in Each sidecar probes its colocated KMS plugin(s) over the local UDS at `unix:///var/run/kmsplugin/kms-{keyID}.sock` (the same socket path scheme described in [Sidecar Injection](#sidecar-injection)). -**Naming caveat.** `{keyID}` in the socket path is **not** an id of a cryptographic key. It is the id of the encryption key secret managed by the encryption controllers, which acts as a **revision number**. The KMS v2 plugin separately reports the id of the remote KEK it currently uses in its `StatusResponse.key_id`. This document keeps `keyID` for the socket-path id and `kekID` for the plugin-reported KEK. Conflating them will misbehave in any consumer that assumes `keyID` names a key. +**Naming caveat.** `{keyID}` in the socket path is **not** an id of a cryptographic key. It is the id of the encryption key secret managed by the encryption controllers, a monotonically incrementing sequence number (a new one per key rotation). The KMS v2 plugin separately reports the id of the remote KEK it currently uses in its `StatusResponse.key_id`. This document keeps `keyID` for the socket-path id and `kekID` for the plugin-reported KEK. Conflating them will misbehave in any consumer that assumes `keyID` names a key. ##### Per-tick emission @@ -483,7 +483,7 @@ Each probe produces one `PluginHealthCondition` (defined in [Message format](#me ##### Destination -The sidecar writes one advisory condition per pod replica to the owning operator's `*.operator.openshift.io/cluster` CR via Server-Side Apply (per-entry ownership via `+listType=map` on `OperatorStatus.Conditions`). The aggregator controller reads these advisory conditions and emits the `KMSPluginsDegraded` rollup. See [KMS Plugin Health Conditions](#kms-plugin-health-conditions) for the exact naming, status mapping, and rollup behavior. 
+The sidecar writes one condition per pod replica to the owning operator's `*.operator.openshift.io/cluster` CR via Server-Side Apply (per-entry ownership via `+listType=map` on `OperatorStatus.Conditions`). The aggregator controller reads these conditions and emits the `KMSPluginsDegraded` rollup. See [KMS Plugin Health Conditions](#kms-plugin-health-conditions) for the exact naming, status mapping, and rollup behavior, and [KMS Health Reporter Connectivity](#kms-health-reporter-connectivity) for how the reporter authenticates and connects to perform the write. ``` within each apiserver pod (3 in HA): @@ -498,7 +498,7 @@ within each apiserver pod (3 in HA): ▼ operator CR (kubeapiservers.operator.openshift.io/cluster): ├─ KMSHealthReporter_ ◄─ written by each per-pod reporter - │ (advisory, one per node, multi-plugin state in Message) + │ (one per node, multi-plugin state in Message) │ └─ KMSPluginsDegraded ◄─ written by aggregator controller │ (reads the per-node entries above) @@ -507,16 +507,6 @@ operator CR (kubeapiservers.operator.openshift.io/cluster): ClusterOperator: Degraded ``` -##### Auth and connection - -**Auth**: the reporter uses a **legacy ServiceAccount token** (mounted from a Secret) bound to a minimal Role that only permits applying its single per-node condition entry on the operator CR. The projected SA tokens available in API-server-adjacent namespaces are admin-grade, as are the auth client certificates on disk; both would over-privilege a sidecar whose only job is one SSA apply. The legacy SA token keeps the blast radius minimal if the sidecar is compromised. The tradeoff is lifetime: a legacy token does not expire, whereas a projected token rotates. We accept this, since a token scoped to one verb on one resource is a far smaller prize than an admin-grade token, expiring or not. - -**Connection**: all reporters reach the kube-apiserver through the in-cluster Service `kubernetes.default.svc`. 
- -In an HA control plane this survives KMS failure: if one node's KMS plugin breaks, that node's KAS degrades, but the Service still has healthy endpoints on the other nodes, so the affected node's reporter can still deliver its condition. The Service approach only fully breaks when every KMS plugin is down, and that is a cluster-down event already surfaced by far louder signals (`ClusterOperator`, etcd, kubelet probes) than a missing reporter condition. On Single-Node OpenShift the Service has a single endpoint, so a broken local KMS plugin does leave the reporter unable to write; this is acceptable for the same reason: the cluster is already hard-down and the condition would be redundant. - -Dialing `127.0.0.1:6443` directly (the kube-apiserver static pod uses `hostNetwork: true`) was considered and rejected. It would bridge the post-start window where the local KAS accepts TLS connections but is still absent from `kubernetes.default` `Endpoints` (the Service reconciler self-gates on `/readyz`; see [`kubernetesservice/controller.go`](https://github.com/kubernetes/kubernetes/blob/master/pkg/controlplane/controller/kubernetesservice/controller.go)). But reporting KMS plugin health is not on KAS's critical startup path, and a not-ready KAS is already surfaced with higher signal-to-noise by `ClusterOperator`, kubelet probes, and KAS's own readiness machinery. - ### User Stories - As a cluster admin, I want to enable KMS encryption by updating the APIServer resource, so I can declaratively configure encryption without manually managing keys. 
@@ -572,6 +562,16 @@ This feature does not depend on the features that are excluded from the OKE prod - keyController uses provider-specific field-level comparison (not simple equality) to determine migration necessity - UDS path convention: `unix:///var/run/kmsplugin/kms-{keyID}.sock` — keyID appended for uniqueness +#### KMS Health Reporter Connectivity + +**Auth**: the reporter uses a **legacy ServiceAccount token** (mounted from a Secret) bound to a minimal Role that only permits applying its single per-node condition entry on the operator CR. The projected SA tokens available in API-server-adjacent namespaces are admin-grade, as are the auth client certificates on disk; both would over-privilege a sidecar whose only job is one SSA apply. The legacy SA token keeps the blast radius minimal if the sidecar is compromised. The tradeoff is lifetime: a legacy token does not expire, whereas a projected token rotates. We accept this, since a token scoped to one verb on one resource is a far smaller prize than an admin-grade token, expiring or not. + +**Connection**: all reporters reach the kube-apiserver through the in-cluster Service `kubernetes.default.svc`. + +In an HA control plane this survives KMS failure: if one node's KMS plugin breaks, that node's KAS degrades, but the Service still has healthy endpoints on the other nodes, so the affected node's reporter can still deliver its condition. The Service approach only fully breaks when every KMS plugin is down, and that is a cluster-down event already surfaced by far louder signals (`ClusterOperator`, etcd, kubelet probes) than a missing reporter condition. On Single-Node OpenShift the Service has a single endpoint, so a broken local KMS plugin does leave the reporter unable to write; this is acceptable for the same reason: the cluster is already hard-down and the condition would be redundant. + +Dialing `127.0.0.1:6443` directly (the kube-apiserver static pod uses `hostNetwork: true`) was considered and rejected. 
It would bridge the post-start window where the local KAS accepts TLS connections but is still absent from `kubernetes.default` `Endpoints` (the Service reconciler self-gates on `/readyz`; see [`kubernetesservice/controller.go`](https://github.com/kubernetes/kubernetes/blob/master/pkg/controlplane/controller/kubernetesservice/controller.go)). But reporting KMS plugin health is not on KAS's critical startup path, and a not-ready KAS is already surfaced with higher signal-to-noise by `ClusterOperator`, kubelet probes, and KAS's own readiness machinery. + #### KMS Plugin Health Conditions ##### Naming convention @@ -582,13 +582,13 @@ Each reporter sidecar writes one condition per pod replica to the owning operato KMSHealthReporter_ ``` -The Type has no `_Available` or `_Degraded` suffix, so it stays advisory on the operator CR (library-go's `StatusSyncer` ignores it). The aggregator controller consumes these conditions and emits the `KMSPluginsDegraded` rollup separately (see [Aggregator behavior](#aggregator-behavior)). +The Type has no `_Available` or `_Degraded` suffix, so library-go's `StatusSyncer` ignores it and it does not propagate to the `ClusterOperator`. The aggregator controller consumes these conditions and emits the `KMSPluginsDegraded` rollup separately (see [Aggregator behavior](#aggregator-behavior)). -This is a **temporary mechanism**. Long term, we plan to add first-class status fields for KMS plugin health to the operator CR API, so this signal lives in a typed shape rather than an advisory string-encoded condition. Until then, encoding it in `KMSHealthReporter_` avoids an API change and keeps the design reversible. +This is a **temporary mechanism**. Long term, we plan to add first-class status fields for KMS plugin health to the operator CR API, so this signal lives in a typed shape rather than a string-encoded condition. Until then, encoding it in `KMSHealthReporter_` avoids an API change and keeps the design reversible. 
##### Status mapping -While this advisory condition is a temporary solution (see [Naming convention](#naming-convention)), the `Status` and `Reason` are hardcoded to avoid library-go's `StatusSyncer` or other consumers reacting to per-pod transitions: +While this condition is a temporary solution (see [Naming convention](#naming-convention)), the `Status` and `Reason` are hardcoded to avoid library-go's `StatusSyncer` or other consumers reacting to per-pod transitions: - `Status: True` - `Reason: AsExpected` @@ -604,8 +604,8 @@ type PluginHealthConditions []PluginHealthCondition type PluginHealthCondition struct { KeyID string `json:"keyID"` // encryption-key-secret id from the socket path (kms-{keyID}.sock); not a cryptographic key - KEKID string `json:"kekID,omitempty"` // remote KEK id from the plugin's KMS v2 StatusResponse.key_id; omitted when the plugin is unreachable - Status string `json:"status"` // healthy | unhealthy | unreachable + KEKID string `json:"kekID,omitempty"` // remote KEK id from the plugin's KMS v2 StatusResponse.key_id; omitted when the probe errors (no StatusResponse) + Status string `json:"status"` // healthy | unhealthy | error LastChecked time.Time `json:"lastChecked"` // RFC 3339 timestamp of this probe Detail string `json:"detail,omitempty"` // error/health detail; omitted when healthy } @@ -627,15 +627,15 @@ status: - type: KMSHealthReporter_master-0 status: "True" reason: AsExpected - message: '[{"kekID":"projects/p/locations/l/keyRings/r/cryptoKeys/k/cryptoKeyVersions/2","keyID":"2","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"},{"kekID":"projects/p/locations/l/keyRings/r/cryptoKeys/k/cryptoKeyVersions/1","keyID":"1","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"}]' + message: '[{"kekID":"kek-9f2c","keyID":"2","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"},{"kekID":"kek-4a17","keyID":"1","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"}]' - type: KMSHealthReporter_master-1 status: "True" reason: 
AsExpected - message: '[{"kekID":"projects/p/locations/l/keyRings/r/cryptoKeys/k/cryptoKeyVersions/2","keyID":"2","status":"unhealthy","lastChecked":"2026-05-08T12:34:56Z","detail":"credential lacks decrypt permission"},{"kekID":"projects/p/locations/l/keyRings/r/cryptoKeys/k/cryptoKeyVersions/1","keyID":"1","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"}]' + message: '[{"kekID":"kek-9f2c","keyID":"2","status":"unhealthy","lastChecked":"2026-05-08T12:34:56Z","detail":"credential lacks decrypt permission"},{"kekID":"kek-4a17","keyID":"1","status":"healthy","lastChecked":"2026-05-08T12:34:56Z"}]' - type: KMSHealthReporter_master-2 status: "True" reason: AsExpected - message: '[{"keyID":"2","status":"unreachable","lastChecked":"2026-05-08T12:34:56Z","detail":"connection refused"},{"keyID":"1","status":"unreachable","lastChecked":"2026-05-08T12:34:56Z","detail":"connection refused"}]' + message: '[{"keyID":"2","status":"error","lastChecked":"2026-05-08T12:34:56Z","detail":"connection refused"},{"keyID":"1","status":"error","lastChecked":"2026-05-08T12:34:56Z","detail":"connection refused"}]' ``` See [Aggregator behavior](#aggregator-behavior) for how these conditions roll up to the `ClusterOperator`. From 2bb10c537d05893622c0340a2b2322d6d0765abd Mon Sep 17 00:00:00 2001 From: Krzysztof Ostrowski Date: Wed, 20 May 2026 15:20:03 +0200 Subject: [PATCH 4/7] kms: specify health reporter probe interval and emission semantics Resolve the open probe-interval question raised in review: - Add a Probe interval subsection under KMS Plugin Health Conditions: fixed 30s default passed as a sidecar flag, n UDS probes per cycle followed by a single SSA write of all results. - Define emission as unconditional and best-effort: the reporter writes every tick so lastChecked keeps advancing, and discards a result it cannot deliver rather than queuing it, since the next interval produces a fresher one. 
- Derive the aggregator staleness threshold as 4 x probe interval and reference it from the Stale Reporter Conditions risk. --- .../kube-apiserver/kms-encryption-foundations.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/enhancements/kube-apiserver/kms-encryption-foundations.md b/enhancements/kube-apiserver/kms-encryption-foundations.md index 2ee31d319c..86e9be9ef2 100644 --- a/enhancements/kube-apiserver/kms-encryption-foundations.md +++ b/enhancements/kube-apiserver/kms-encryption-foundations.md @@ -640,6 +640,14 @@ status: See [Aggregator behavior](#aggregator-behavior) for how these conditions roll up to the `ClusterOperator`. +##### Probe interval + +Each reporter probes and emits on a fixed interval, **default 30 seconds**, passed as a sidecar flag at injection time (alongside the UDS socket paths). The exact value is not load-bearing: a probe cycle is `n` local UDS gRPC calls, one for each of the `n` colocated plugins, followed by a single SSA write carrying all `n` results to the operator CR. Both are cheap, and tens-of-seconds detection latency is acceptable for KMS plugin health (key rotation and credential expiry are minutes-to-hours events). + +Emission is unconditional and best-effort. The reporter writes its condition every tick even when nothing changed: the stale-reporter mitigation (see [Risks and Mitigations](#risks-and-mitigations)) relies on `lastChecked` advancing, so a write-on-change-only reporter would leave a healthy steady-state condition indistinguishable from a hung one. The reporter attempts the write every interval no matter the cluster state. If it cannot reach the kube-apiserver within the interval, it discards that result rather than queuing it: the reporter only ever needs its freshest probe on the CR, so once the next interval produces a result with a newer `lastChecked`, the unwritten previous one is outdated and pointless to retry. 
A reporter that keeps failing to write stops advancing `lastChecked` and is caught by the staleness threshold. + +The aggregator's staleness threshold is derived from the interval rather than configured independently: a condition whose `lastChecked` is older than `4 × interval` (120 s at the default) is treated as `Unknown`. Four intervals give enough data points that one or two dropped probes do not flip the rollup. Reporters apply small random jitter so replicas do not write the operator CR in lockstep. + ##### Aggregator behavior An aggregator controller reads the per-node `KMSHealthReporter_` conditions on the operator's CR and emits rollup conditions on the same CR. The first rollup is `KMSPluginsDegraded`; its `_Degraded` suffix routes it into the `ClusterOperator`'s `Degraded` slot via library-go's `StatusSyncer`. Additional rollups (e.g. `KMSPluginsAvailable`, `KMSPluginsProgressing`) may be added so the `ClusterOperator`'s `Available` and `Progressing` slots also reflect KMS plugin health. Each suffix maps to its matching `ClusterOperator` field via the same `StatusSyncer` convention, so each new type slots in without additional plumbing. @@ -666,7 +674,7 @@ The plan is to extend the existing [`conditionController`](https://github.com/op **Risk: Stale Reporter Conditions** - **Impact:** A reporter that hangs leaves its last `KMSHealthReporter_` condition in etcd unchanged. -- **Mitigation:** Per-plugin `lastChecked` timestamps in Message expose staleness. The aggregator controller can treat conditions whose `lastChecked` exceeds a freshness threshold as effectively `Unknown`. +- **Mitigation:** Per-plugin `lastChecked` timestamps in Message expose staleness. The aggregator controller treats a condition whose `lastChecked` exceeds the freshness threshold (`4 × probe interval`; see [Probe interval](#probe-interval)) as effectively `Unknown`. 
**Risk: Orphaned Conditions on Mode Switch** - **Impact:** When KMS is disabled (e.g., switching to `aescbc`), reporter sidecars are removed. Without explicit cleanup, `KMSHealthReporter_` and `KMSPluginsDegraded` entries remain stale on the operator CR. From 771b367ea77daa49ae3f84b9e886a3f4da0a3667 Mon Sep 17 00:00:00 2001 From: Krzysztof Ostrowski Date: Tue, 26 May 2026 11:12:38 +0200 Subject: [PATCH 5/7] kms: mark rollup conditions as the admin-facing KMS health signal --- enhancements/kube-apiserver/kms-encryption-foundations.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/enhancements/kube-apiserver/kms-encryption-foundations.md b/enhancements/kube-apiserver/kms-encryption-foundations.md index 86e9be9ef2..258d6c26bc 100644 --- a/enhancements/kube-apiserver/kms-encryption-foundations.md +++ b/enhancements/kube-apiserver/kms-encryption-foundations.md @@ -652,6 +652,8 @@ The aggregator's staleness threshold is derived from the interval rather than co An aggregator controller reads the per-node `KMSHealthReporter_` conditions on the operator's CR and emits rollup conditions on the same CR. The first rollup is `KMSPluginsDegraded`; its `_Degraded` suffix routes it into the `ClusterOperator`'s `Degraded` slot via library-go's `StatusSyncer`. Additional rollups (e.g. `KMSPluginsAvailable`, `KMSPluginsProgressing`) may be added so the `ClusterOperator`'s `Available` and `Progressing` slots also reflect KMS plugin health. Each suffix maps to its matching `ClusterOperator` field via the same `StatusSyncer` convention, so each new type slots in without additional plumbing. +These rollup conditions (`KMSPluginsDegraded`, and any future `KMSPluginsAvailable` / `KMSPluginsProgressing`) are the admin-facing signal: they surface through `ClusterOperator` so that `oc get co kube-apiserver` is sufficient to learn KMS plugin health. The per-node `KMSHealthReporter_` conditions are plumbing for the aggregator and tooling, not intended for direct admin consumption. 
+ The plan is to extend the existing [`conditionController`](https://github.com/openshift/library-go/blob/master/pkg/operator/encryption/controllers/condition_controller.go) in library-go's encryption controllers, which already emits the `Encrypted` condition on the same operator CR. It sits in the right call path (operator CR → ClusterOperator) and runs on the informer set the rollup needs. If extending it turns out to be a poor fit (conflicting sync triggers, unrelated dependencies that make the rollup hard to reason about), a dedicated controller will be introduced instead. ### Risks and Mitigations From 022627c4d1df2f1964ca387bd797912bc15a0943 Mon Sep 17 00:00:00 2001 From: Krzysztof Ostrowski Date: Tue, 26 May 2026 11:12:54 +0200 Subject: [PATCH 6/7] kms: broaden orphaned-conditions risk to cover node replacement --- enhancements/kube-apiserver/kms-encryption-foundations.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/kube-apiserver/kms-encryption-foundations.md b/enhancements/kube-apiserver/kms-encryption-foundations.md index 258d6c26bc..f43db44280 100644 --- a/enhancements/kube-apiserver/kms-encryption-foundations.md +++ b/enhancements/kube-apiserver/kms-encryption-foundations.md @@ -678,7 +678,7 @@ The plan is to extend the existing [`conditionController`](https://github.com/op - **Impact:** A reporter that hangs leaves its last `KMSHealthReporter_` condition in etcd unchanged. - **Mitigation:** Per-plugin `lastChecked` timestamps in Message expose staleness. The aggregator controller treats a condition whose `lastChecked` exceeds the freshness threshold (`4 × probe interval`; see [Probe interval](#probe-interval)) as effectively `Unknown`. -**Risk: Orphaned Conditions on Mode Switch** +**Risk: Orphaned Conditions on encryption type change and node replacement** - **Impact:** When KMS is disabled (e.g., switching to `aescbc`), reporter sidecars are removed. 
Without explicit cleanup, `KMSHealthReporter_` and `KMSPluginsDegraded` entries remain stale on the operator CR. - **Mitigation:** The aggregator controller owns cleanup. It removes orphaned `KMSHealthReporter_` entries (when their owning sidecar is no longer present) and removes its own `KMSPluginsDegraded` entry on KMS disable. From b411a0922f54735a7040dc8f60704e99f38beb49 Mon Sep 17 00:00:00 2001 From: Krzysztof Ostrowski Date: Tue, 26 May 2026 11:12:58 +0200 Subject: [PATCH 7/7] kms: record rejected alternatives for exposing plugin Status outside the pod --- .../kube-apiserver/kms-encryption-foundations.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/enhancements/kube-apiserver/kms-encryption-foundations.md b/enhancements/kube-apiserver/kms-encryption-foundations.md index f43db44280..ab54a4f4e5 100644 --- a/enhancements/kube-apiserver/kms-encryption-foundations.md +++ b/enhancements/kube-apiserver/kms-encryption-foundations.md @@ -881,6 +881,16 @@ Instead of extending existing controllers, create new KMS-only controllers. **Why not chosen:** Adds complexity to plugin lifecycle (must detect identical providers), breaks isolation, and complicates rollback scenarios. +### Alternative: Exposing KMS Plugin Status Outside the Pod + +The plugin's `Status` gRPC is reachable only over the in-pod UDS. Three projections outward were considered: + +1. **`kube-rbac-proxy` in front of the plugin's `Status` RPC.** Adds a new exposed port on the kube-apiserver pod. +2. **Carry patch in kube-apiserver** to expose plugin status. Grows the kube-apiserver carry set we are trying to shrink. +3. **`kube-rbac-proxy` in front of the kube-apiserver pod.** No new port, but inserts a single point of failure in front of kube-apiserver. + +**Chosen:** the in-pod [Health Reporter Sidecar](#health-reporter-sidecar) consumes `Status` locally and pushes the result to the operator CR. + ## Infrastructure Needed None - extends existing library-go code.