diff --git a/enhancements/hypershift/etcd-sharding-by-resource-kind.md b/enhancements/hypershift/etcd-sharding-by-resource-kind.md
new file mode 100644
index 0000000000..adbbcdabbc
--- /dev/null
+++ b/enhancements/hypershift/etcd-sharding-by-resource-kind.md
@@ -0,0 +1,1535 @@
+---
+title: etcd-sharding-by-resource-kind
+authors:
+ - "@jhjaggars"
+reviewers:
+ - "@enxebre, for CPO component framework and HCP architecture"
+ - "@sjenning, for etcd and control plane scalability"
+ - "@csrwng, for HCP lifecycle and API design"
+ - "@JoelSpeed"
+approvers:
+ - "@enxebre"
+api-approvers:
+ - "@enxebre"
+ - "@JoelSpeed"
+creation-date: 2026-04-17
+last-updated: 2026-04-30
+status: provisional
+tracking-link:
+ - https://issues.redhat.com/browse/OCPBUGS-TBD
+see-also: []
+replaces: []
+superseded-by: []
+---
+
+# Etcd Sharding by Kubernetes Resource Kind via Separate Control Plane Components
+
+## Summary
+
+This enhancement adds etcd sharding by Kubernetes resource kind to HyperShift,
+enabling hosted clusters to distribute resources across multiple independent etcd
+deployments for improved scalability and performance. Each etcd shard is registered
+as an independent `ControlPlaneComponent` within the control-plane-operator (CPO) component framework,
+inheriting all framework features automatically. Kube-apiserver (KAS) is configured with
+`--etcd-servers-overrides` to route resources to the appropriate shard.
+
+## Motivation
+
+HyperShift currently deploys a single etcd cluster per hosted control plane, which
+becomes a bottleneck for very large clusters (7500+ nodes). High-churn resources like
+Events and Leases cause performance degradation and impact critical cluster resources.
+[OpenAI demonstrated](https://openai.com/index/scaling-kubernetes-to-7500-nodes/) that sharding etcd by resource type enables scaling to 7,500 nodes
+with better performance and stability through reduced blast radius.
+
+Current limitations:
+
+- Single etcd cluster handles all Kubernetes resources regardless of churn rate
+- Events and Leases cause compaction and performance issues
+- No way to optimize storage or replica counts per resource type
+- No isolation between critical and ephemeral data
+
+### User Stories
+
+#### Story 1: Platform operator creates a sharded HostedControlPlane (HCP) for a large cluster
+
+As a platform operator, I want to create a HostedCluster with separate etcd shards for
+events and leases so that high-churn resources do not cause compaction issues or degrade
+performance for critical cluster state.
+
+#### Story 2: Platform operator configures volatile storage for events
+
+As a platform operator, I want to configure the events shard with `storage.type: EmptyDir`
+so that events are stored in memory for maximum write throughput without provisioning
+persistent storage for data that is inherently ephemeral.
+
+#### Story 3: SRE investigates a shard failure
+
+As an SRE, I want to identify which etcd shard is unhealthy so that I can take targeted
+remediation action without disrupting the entire control plane. I can inspect individual
+`ControlPlaneComponent` CRs per shard and correlate alerts by the `job` label in
+Prometheus (e.g., `job="etcd-events"`).
+
+#### Story 4: Existing cluster upgrades without disruption
+
+As a platform operator running a pre-sharding cluster, I want to upgrade the CPO to a
+sharding-capable version without any changes to my existing single-etcd deployment so
+that the upgrade is transparent and requires no action on my part.
+
+#### Story 5: Platform operator creates a cluster via CLI with sharding config
+
+As a platform operator, I want to specify etcd sharding configuration via a YAML file
+passed to `hypershift create cluster --etcd-sharding-config` so that I can declaratively
+define shard topology at cluster creation time.
+
+#### Story 6: Platform operator places etcd shards on dedicated storage nodes
+
+As a platform operator, I want to schedule the default etcd shard on management cluster
+nodes with NVMe SSDs and place the events shard on nodes with standard storage so that
+I can optimize disk I/O for critical cluster state without over-provisioning all
+management cluster nodes with high-performance storage.
+
+### Goals
+
+1. Add declarative etcd sharding configuration to the HostedCluster/HostedControlPlane API,
+ allowing users to define multiple etcd shards, map resource kinds to shards, and configure
+ per-shard settings.
+2. Implement sharded etcd using the CPO component framework by registering each shard as an independent
+ `ControlPlaneComponent`, inheriting all framework features (priority class, node
+ isolation, topology spread, etc.) without behavioral gaps.
+3. Extend the CPO component framework minimally to support etcd shard components without
+ breaking any existing single-instance components.
+4. Preserve backward compatibility: existing `StatefulSet/etcd` and
+ `ControlPlaneComponent/etcd` names must not change for single-shard deployments.
+5. Configure kube-apiserver with `--etcd-servers-overrides` to route resources to the
+ appropriate shard.
+6. Provide per-shard pod scheduling controls (`nodeSelector`, `tolerations`) so that
+ operators can place etcd shards on management cluster nodes with appropriate storage
+ or isolation characteristics.
+
+### Non-Goals
+
+1. Changing the `adapt` function signature for existing components.
+2. Introducing a general-purpose `CompositeComponent` abstraction for arbitrary
+ multi-instance components.
+3. Dynamic shard rebalancing or migration from non-sharded to sharded after creation.
+4. Mutating the shard list after cluster creation (adding, removing, or reordering shards).
+5. Auto-sharding based on resource usage patterns.
+6. Per-shard Grafana dashboards (can be added later).
+7. Per-shard backup orchestration (backup tooling backs up all PVC-backed
+ shards; per-shard opt-out can be added later if needed).
+8. Routing openshift-apiserver or oauth-apiserver managed resources to specific shards
+ (these aggregated API servers only support a single etcd endpoint via
+ `storageConfig.urls`, not per-resource overrides like KAS `--etcd-servers-overrides`).
+
+## Proposal
+
+Each etcd shard is registered as an independent `ControlPlaneComponent` using the same
+`NewStatefulSetComponent` builder that all other CPO components use. A single-shard HCP
+registers only the existing `etcd` component; a multi-shard HCP additionally registers
+`etcd-events`, `etcd-leases`, etc.
+
+The CPO component framework maps each registered component to exactly one workload
+object with its own `ControlPlaneComponent` CR, status tracking, and lifecycle management.
+This ensures every shard automatically inherits all framework features — priority class
+assignment, control plane node isolation, colocation affinity, multi-zone topology spread,
+config hash annotations, scale-to-zero support, restart propagation, and PodDisruptionBudget (PDB) semantics —
+without duplicating adaptation logic or diverging from framework behavior as it evolves.
+
+This introduces a new pattern of conditional component registration at startup. Existing
+components (including platform-specific cloud controller managers) are registered
+unconditionally and use predicates to skip reconciliation at runtime. Shard components
+are instead registered conditionally because the number of components varies per HCP
+spec. Shard immutability ensures the component list does not need to change after
+startup. On CPO restart, the shard list is re-read from the HCP spec and all
+components are re-registered, converging to the correct state idempotently.
+This approach was explicitly preferred by maintainers over alternatives that
+would manage multiple workloads within a single component.
+
+### Workflow Description
+
+**Platform operator** is a human user responsible for creating and managing hosted clusters.
+
+**SRE** is a human user responsible for monitoring and remediating hosted cluster issues.
+
+#### Managed Etcd Sharding
+
+```mermaid
+sequenceDiagram
+ participant PO as Platform Operator
+ participant HO as HyperShift Operator
+ participant CPO as Control Plane Operator
+ participant FW as CPO Framework
+ participant KAS as kube-apiserver
+ participant SRE as SRE
+
+ PO->>HO: Create HostedCluster with
etcd sharding config
+ HO->>HO: CEL validation
(resource format, uniqueness)
+ HO->>CPO: Deploy CPO with HCP spec
+ CPO->>CPO: Read EffectiveShards()
+ loop For each shard
+ CPO->>FW: Register ControlPlaneComponent
(etcd, etcd-events, etc.)
+ FW->>FW: Create StatefulSet, Service,
ServiceMonitor, PDB
+ end
+ CPO->>KAS: Configure --etcd-servers-overrides
+ KAS->>KAS: wait-for-etcd checks
all shard DNS
+ KAS->>KAS: Route API requests
to appropriate shards
+ SRE->>FW: Monitor per-shard
ControlPlaneComponent CRs
+ SRE->>SRE: Prometheus alerts
with per-shard job labels
+```
+
+1. The service consumer configures etcd sharding in the HostedCluster
+ `spec.etcd.managed.shards` field, specifying shard names, resource types,
+ data policies, and storage settings. For self-hosted (MCE) deployments, the
+ `hcp` CLI provides a convenience flag:
+ `hcp create cluster --etcd-sharding-config `.
+3. The HyperShift operator validates the shard configuration at admission time via CEL
+ rules (resource format, uniqueness, name constraints).
+4. The control-plane-operator (CPO) starts and reads the HCP spec. For each shard in
+ `EffectiveShards()`, it registers a CPO component (`etcd`, `etcd-events`,
+ `etcd-leases`, etc.).
+5. The framework reconciles each shard component independently — creating StatefulSets,
+ Services, ServiceMonitors, PDBs, and defrag RBAC resources with shard-specific names.
+6. The CPO configures kube-apiserver with `--etcd-servers-overrides` pointing non-default
+ resources to their shard endpoints, and updates `wait-for-etcd` to check all shards.
+7. KAS starts after all shard services are DNS-resolvable and routes API requests to the
+ appropriate etcd shard based on resource type.
+8. The SRE monitors per-shard health via individual `ControlPlaneComponent` CRs and
+ Prometheus metrics with distinct `job` labels per shard.
+
+#### Unmanaged Etcd Sharding
+
+For unmanaged etcd, HyperShift does not deploy or manage etcd infrastructure. The
+platform operator is responsible for operating the external etcd clusters. HyperShift's
+role is limited to configuring KAS to route resources to the correct external endpoints.
+
+1. The platform operator provisions and operates multiple external etcd clusters
+ (one per shard), each with its own endpoint and TLS credentials.
+2. The platform operator creates a HostedCluster with `spec.etcd.managementType:
+ Unmanaged` and populates `spec.etcd.unmanaged.shards` with the shard definitions,
+ including per-shard endpoints and TLS configuration.
+3. The HyperShift operator validates the shard configuration at admission time via CEL
+ rules (resource format, uniqueness, endpoint format).
+4. The CPO configures kube-apiserver:
+ - `--etcd-servers` points to the default shard's endpoint
+ - `--etcd-servers-overrides` contains entries for non-default shards
+ - Per-shard TLS credentials are mounted from the referenced secrets
+5. No `ControlPlaneComponent` CRs, StatefulSets, Services, ServiceMonitors, or PDBs
+ are created for unmanaged shards — monitoring and lifecycle are the operator's
+ responsibility.
+6. The `wait-for-etcd` init container is removed for unmanaged etcd (existing behavior,
+ unchanged by this enhancement).
+
+### API Extensions
+
+This enhancement adds the following types to the `hypershift.openshift.io/v1beta1` API:
+
+- **`ManagedEtcdShardSpec`** — per-shard configuration with fields: `Name`,
+ `ResourcePrefixes`, `Storage`, `Replicas`, `Scheduling`
+- **`ManagedEtcdShardStorageSpec`** — per-shard storage configuration with fields:
+ `Type` (PersistentVolume or EmptyDir) and optional `StorageClassName` override.
+ PVC size and etcd backend quota are inherited from the parent
+ `ManagedEtcdSpec.Storage`; EmptyDir `sizeLimit` is derived from the same PVC
+ size to match the etcd quota ceiling. This is a separate type from
+ `ManagedEtcdStorageSpec` because `EmptyDir` is only valid for sharded
+ configurations — the top-level `ManagedEtcdSpec.Storage` continues to support
+ only `PersistentVolume`
+- **`EtcdShardSchedulingSpec`** — per-shard pod placement with `NodeSelector` and
+ `Tolerations`
+- **`Shards []ManagedEtcdShardSpec`** — added to `ManagedEtcdSpec`
+- **`UnmanagedEtcdShardSpec`** — per-shard configuration for unmanaged etcd with fields:
+ `Name`, `ResourcePrefixes`, `Endpoint`, `TLS` (HyperShift does not manage storage or
+ backups for unmanaged etcd)
+- **`Shards []UnmanagedEtcdShardSpec`** — added to `UnmanagedEtcdSpec`
+
+All new sharding fields are gated behind the `EtcdSharding` feature gate, starting
+in `TechPreviewNoUpgrade`. The gate is defined in
+`api/hypershift/v1beta1/featuregates/featureGate-Hypershift-TechPreviewNoUpgrade.yaml`
+and the `Shards` field on both `ManagedEtcdSpec` and `UnmanagedEtcdSpec` is annotated
+with `+openshift:enable:FeatureGate=EtcdSharding`.
+
+No new CRDs, webhooks, or aggregated API servers are introduced. The existing
+`ControlPlaneComponent` CRD is used for per-shard health reporting.
+
+#### New Type Definitions
+
+```go
+// EtcdShardResource identifies a Kubernetes resource type to be routed to an
+// etcd shard. It is used to build the KAS --etcd-servers-overrides flag.
+// The combination of apiGroup and resource uniquely identifies a resource type.
+type EtcdShardResource struct {
+ // apiGroup is the API group of the resource (e.g., "coordination.k8s.io"
+ // for leases). An empty string designates the core API group (e.g.,
+ // events, pods, configmaps).
+ // When non-empty, must be at most 253 characters in length and consist
+ // of only lowercase alphanumeric characters, hyphens and periods. Each
+ // period separated segment must start and end with an alphanumeric
+ // character.
+ // +required
+ // +kubebuilder:validation:MinLength=0
+ // +kubebuilder:validation:MaxLength=253
+ // +kubebuilder:validation:XValidation:rule="self == '' || self.matches('^[a-z0-9]([a-z0-9-]*[a-z0-9])?([.][a-z0-9]([a-z0-9-]*[a-z0-9])?)*$')",message="apiGroup must be a valid DNS subdomain (lowercase alphanumeric, hyphens, dots)"
+ APIGroup *string `json:"apiGroup,omitempty"`
+
+ // resource is the plural resource name (e.g., "events", "leases",
+ // "configmaps"). Must be a valid DNS label (RFC 1123): lowercase
+ // alphanumeric characters or hyphens, starting and ending with an
+ // alphanumeric character, max 63 characters.
+ // +required
+ // +kubebuilder:validation:MinLength=1
+ // +kubebuilder:validation:MaxLength=63
+ // +kubebuilder:validation:XValidation:rule="self.matches('^[a-z0-9]([a-z0-9-]*[a-z0-9])?$')",message="resource must be a valid DNS label (lowercase alphanumeric, hyphens)"
+ Resource string `json:"resource,omitempty"`
+}
+
+// ManagedEtcdShardSpec defines the configuration for a single etcd shard
+// within a managed etcd deployment.
+// +kubebuilder:validation:XValidation:rule="has(oldSelf.storage) == has(self.storage)",message="storage cannot be added or removed after creation"
+type ManagedEtcdShardSpec struct {
+ // name is a unique identifier for this shard. It is used to derive
+ // resource names (e.g., StatefulSet "etcd-{name}", Service
+ // "etcd-client-{name}").
+ // Must be a valid DNS1123 label (lowercase alphanumeric with hyphens,
+ // starting and ending with an alphanumeric character), max 48 characters.
+ // Immutable once set.
+ // +required
+ // +kubebuilder:validation:MinLength=1
+ // +kubebuilder:validation:MaxLength=48
+ // +kubebuilder:validation:XValidation:rule="self.matches('^[a-z0-9]([-a-z0-9]*[a-z0-9])?$')",message="name must be a valid DNS1123 label: lowercase alphanumeric with hyphens, starting and ending with an alphanumeric character"
+ // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="name is immutable"
+ Name string `json:"name,omitempty"`
+
+ // resources is the list of Kubernetes resource types routed to this
+ // shard. Each entry identifies a resource by its API group and plural
+ // resource name. For example, events in the core group would be
+ // {resource: "events"}, and leases in the coordination group would be
+ // {apiGroup: "coordination.k8s.io", resource: "leases"}.
+ // Minimum 1, maximum 20 entries. Immutable once set.
+ // +required
+ // +kubebuilder:validation:MinItems=1
+ // +kubebuilder:validation:MaxItems=20
+ // +listType=map
+ // +listMapKey=apiGroup
+ // +listMapKey=resource
+ // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="resources are immutable"
+ Resources []EtcdShardResource `json:"resources,omitempty"`
+
+ // storage configures the storage backend for this shard.
+ // If not specified, the shard inherits PersistentVolume storage from
+ // the parent ManagedEtcdSpec.Storage. Immutable once set.
+ // +optional
+ // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="storage is immutable"
+ Storage ManagedEtcdShardStorageSpec `json:"storage,omitzero"`
+
+ // replicas is the number of etcd replicas for this shard. Must be 1 or 3.
+ // Immutable once set.
+ // +required
+ // +kubebuilder:validation:Enum=1;3
+ // +kubebuilder:validation:Minimum=1
+ // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="replicas is immutable"
+ Replicas int32 `json:"replicas"`
+
+ // scheduling configures per-shard pod placement constraints. These
+ // constraints are merged with the framework's control plane node
+ // isolation settings (nodeSelector, tolerations, topology spread).
+ // +optional
+ Scheduling EtcdShardSchedulingSpec `json:"scheduling,omitzero"`
+}
+
+// EtcdShardSchedulingSpec configures pod placement for a single etcd shard.
+// +kubebuilder:validation:MinProperties=1
+type EtcdShardSchedulingSpec struct {
+ // nodeSelector constrains this shard's pods to nodes matching the
+ // specified labels, in addition to the framework's control plane node
+ // selector. Keys and values must be valid Kubernetes label key/value
+ // pairs. Maximum 16 entries.
+ // +optional
+ // +kubebuilder:validation:MinProperties=1
+ // +kubebuilder:validation:MaxProperties=16
+ NodeSelector map[string]string `json:"nodeSelector,omitempty"`
+
+ // tolerations allows this shard's pods to schedule on nodes with
+ // matching taints, in addition to the framework's control plane
+ // tolerations. Maximum 16 entries.
+ // +optional
+ // +listType=atomic
+ // +kubebuilder:validation:MinItems=1
+ // +kubebuilder:validation:MaxItems=16
+ Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
+}
+
+// UnmanagedEtcdShardSpec defines the configuration for a single etcd shard
+// within an unmanaged (externally operated) etcd deployment.
+type UnmanagedEtcdShardSpec struct {
+ // name is a unique identifier for this shard.
+ // Must be a valid DNS1123 label (lowercase alphanumeric with hyphens,
+ // starting and ending with an alphanumeric character), max 48 characters.
+ // Immutable once set.
+ // +required
+ // +kubebuilder:validation:MinLength=1
+ // +kubebuilder:validation:MaxLength=48
+ // +kubebuilder:validation:XValidation:rule="self.matches('^[a-z0-9]([-a-z0-9]*[a-z0-9])?$')",message="name must be a valid DNS1123 label: lowercase alphanumeric with hyphens, starting and ending with an alphanumeric character"
+ // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="name is immutable"
+ Name string `json:"name,omitempty"`
+
+ // resources is the list of Kubernetes resource types routed to this
+ // shard. Uses the same format as ManagedEtcdShardSpec.Resources.
+ // Minimum 1, maximum 20 entries. Immutable once set.
+ // +required
+ // +kubebuilder:validation:MinItems=1
+ // +kubebuilder:validation:MaxItems=20
+ // +listType=map
+ // +listMapKey=apiGroup
+ // +listMapKey=resource
+ // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="resources are immutable"
+ Resources []EtcdShardResource `json:"resources,omitempty"`
+
+ // endpoint is the full etcd client endpoint URL for this shard.
+ // Must be a valid HTTPS URL, max 267 characters. Immutable once set.
+ // +required
+ // +kubebuilder:validation:MinLength=1
+ // +kubebuilder:validation:MaxLength=267
+ // +kubebuilder:validation:XValidation:rule="isURL(self) && url(self).getScheme() == 'https'",message="endpoint must be a valid HTTPS URL"
+ // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="endpoint is immutable"
+ Endpoint string `json:"endpoint,omitempty"`
+
+ // tls specifies TLS configuration for this shard's HTTPS endpoint.
+ // +required
+ // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="tls is immutable"
+ TLS EtcdTLSConfig `json:"tls"`
+}
+```
+
+#### Modifications to Existing Types
+
+The `Shards` field is added to both `ManagedEtcdSpec` and `UnmanagedEtcdSpec`:
+
+```go
+// +kubebuilder:validation:XValidation:rule="has(oldSelf.shards) == has(self.shards)",message="shards cannot be added or removed after creation"
+type ManagedEtcdSpec struct {
+ // storage specifies how etcd data is persisted.
+ // +required
+ Storage ManagedEtcdStorageSpec `json:"storage"`
+
+ // scheduling configures pod placement for the default etcd shard.
+ // These constraints are merged with the framework's control plane
+ // node isolation settings.
+ // +optional
+ // +openshift:enable:FeatureGate=EtcdSharding
+ Scheduling EtcdShardSchedulingSpec `json:"scheduling,omitzero"`
+
+ // shards defines additional etcd shards for resource-level routing.
+ // The existing storage and scheduling fields above configure
+ // the default shard (catch-all for all resources not explicitly
+ // routed). Entries in this list define non-default shards,
+ // each deployed as an independent StatefulSet and ControlPlaneComponent.
+ // Minimum 1, maximum 10 entries. Resources must not overlap across
+ // shards. Immutable after creation: shards cannot be added, removed,
+ // or reordered.
+ // +optional
+ // +openshift:enable:FeatureGate=EtcdSharding
+ // +listType=map
+ // +listMapKey=name
+ // +kubebuilder:validation:MinItems=1
+ // +kubebuilder:validation:MaxItems=10
+ // +kubebuilder:validation:XValidation:rule="self.all(s1, self.all(s2, s1.name == s2.name || !s1.resources.exists(r, s2.resources.exists(q, r.apiGroup == q.apiGroup && r.resource == q.resource))))",message="resources must not overlap across shards"
+ // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || self.size() == oldSelf.size()",message="shards cannot be added or removed after creation"
+ // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(cur, cur.name == old.name))",message="existing shards cannot be replaced"
+ Shards []ManagedEtcdShardSpec `json:"shards,omitempty"`
+}
+
+// +kubebuilder:validation:XValidation:rule="has(oldSelf.shards) == has(self.shards)",message="shards cannot be added or removed after creation"
+type UnmanagedEtcdSpec struct {
+ Endpoint string `json:"endpoint"`
+ TLS EtcdTLSConfig `json:"tls"`
+
+ // shards defines additional etcd shards for resource-level routing.
+ // The top-level endpoint and tls fields define the default shard
+ // (the catch-all for all resources not explicitly routed). Entries
+ // in this list define non-default shards, each with its own endpoint
+ // and TLS configuration.
+ // Minimum 1, maximum 10 entries. Resources must not overlap across
+ // shards. Immutable after creation: shards cannot be added, removed,
+ // or reordered.
+ // +optional
+ // +openshift:enable:FeatureGate=EtcdSharding
+ // +listType=map
+ // +listMapKey=name
+ // +kubebuilder:validation:MinItems=1
+ // +kubebuilder:validation:MaxItems=10
+ // +kubebuilder:validation:XValidation:rule="self.all(s1, self.all(s2, s1.name == s2.name || !s1.resources.exists(r, s2.resources.exists(q, r.apiGroup == q.apiGroup && r.resource == q.resource))))",message="resources must not overlap across shards"
+ // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || self.size() == oldSelf.size()",message="shards cannot be added or removed after creation"
+ // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(cur, cur.name == old.name))",message="existing shards cannot be replaced"
+ Shards []UnmanagedEtcdShardSpec `json:"shards,omitempty"`
+}
+```
+
+**Helper function:** `EffectiveShards(managed *ManagedEtcdSpec)` is a standalone
+function (not a method on the API type, per
+[OpenShift API conventions](https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md)
+which forbid functions on API types). It always synthesizes a default shard from the
+top-level `Storage` and `Scheduling` fields (with replicas from
+`controllerAvailabilityPolicy`), and appends any explicit non-default
+shards from the `Shards` list. When no
+shards are configured, the result is a single default shard — identical to today's
+behavior. The function is defined in `support/etcd/` or a similar helper package, not
+in the `api/` module.
+
+**Storage type determines the backing implementation.** The storage type and replica
+count are independent choices, giving operators flexibility to match each shard's
+configuration to its workload characteristics.
+
+**Backup policy is determined by storage type.** All PVC-backed shards are included
+in backup/restore procedures. EmptyDir shards have no PVCs and are not backed up.
+No per-shard opt-out is provided in the initial implementation; if needed, a typed
+API field can be added in a future enhancement.
+
+#### Common Configurations
+
+**PersistentVolume (default shard and PVC-backed non-default shards)**
+
+Data is written to persistent disk, survives pod and node restarts, and is included
+in backup/restore procedures for disaster recovery and cluster migration.
+
+*Use case — default shard:* The default shard (`/`) containing pods, deployments,
+secrets, configmaps, and all other core Kubernetes resources.
+
+*Use case — leases shard:* Lease data is high-churn and regenerable, but PVC storage
+avoids memory pressure and survives full-cluster restarts gracefully. Clients don't
+all race to re-acquire leases simultaneously after a full restart. While the data is
+not critical for DR, PVC-backed leases shards are still included in backup for
+simplicity.
+
+**EmptyDir (tmpfs)**
+
+Data is stored in memory only (tmpfs). Maximum write throughput, no PVC provisioning,
+no storage costs. Data survives container crashes and restarts within the same pod,
+but is lost on pod deletion, rescheduling, or node restart. With multiple replicas, a
+restarted member catches up from peers. Not backed up — there is no persistent state.
+
+*Use case:* Events shard. Events are inherently ephemeral, continuously generated by
+the system, and have built-in TTL expiration. Losing events on restart has no impact
+on cluster functionality.
+
+#### Configuration Matrix
+
+| Storage Type | Survives container/pod restart | Survives pod rescheduling or node restart | DR/Migration | Memory pressure |
+| --- | --- | --- | --- | --- |
+| PVC | Yes | Yes | Yes (backed up) | None (disk) |
+| EmptyDir | Yes (same pod) | No | No | Uses node RAM |
+
+The existing `ManagedEtcdStorageSpec` is **unchanged** by this enhancement — it
+continues to support only `PersistentVolume`. Per-shard storage uses a new
+`ManagedEtcdShardStorageSpec` type that adds `EmptyDir`:
+
+```go
+// ManagedEtcdShardStorageSpec configures storage for a single etcd shard.
+// This is a separate type from ManagedEtcdStorageSpec because EmptyDir
+// is only valid for sharded configurations.
+//
+// When type is PersistentVolume, the shard inherits PVC size and etcd
+// backend quota from the parent ManagedEtcdSpec.Storage — these are
+// tightly coupled and exposing them independently per shard risks
+// misconfiguration (e.g., a PVC smaller than the quota causes
+// filesystem-full errors before the clean quota alarm fires). The
+// storageClassName can be overridden per shard to target different I/O
+// tiers or storage pools (e.g., NVMe vs standard, or different local
+// volume pools on bare metal).
+//
+// When type is EmptyDir, the shard uses tmpfs-backed storage. The
+// sizeLimit is derived from the parent ManagedEtcdSpec.Storage PVC size
+// to match the etcd backend quota ceiling. Since tmpfs does not
+// pre-allocate memory, setting the limit equal to the PVC size has no
+// upfront cost — memory is consumed only as etcd writes data. This
+// avoids the same misconfiguration risk as undersized PVCs: a sizeLimit
+// lower than the etcd quota would cause filesystem-full errors (ENOSPC)
+// before etcd's clean quota alarm fires.
+// +kubebuilder:validation:XValidation:rule="self.type == 'PersistentVolume' ? true : !has(self.persistentVolume)",message="persistentVolume is forbidden when type is not PersistentVolume"
+type ManagedEtcdShardStorageSpec struct {
+ // type is the kind of storage implementation to use for this shard.
+ // PersistentVolume uses PVCs; EmptyDir uses ephemeral node storage (tmpfs).
+ // +required
+ // +unionDiscriminator
+ Type ManagedEtcdShardStorageType `json:"type,omitempty"`
+
+ // persistentVolume configures PVC-based storage for this shard.
+ // Only valid when type is PersistentVolume.
+ // +optional
+ // +kubebuilder:validation:MinProperties=1
+ PersistentVolume ManagedEtcdShardPersistentVolumeSpec `json:"persistentVolume,omitzero"`
+}
+
+// ManagedEtcdShardPersistentVolumeSpec configures PVC storage for an etcd shard.
+type ManagedEtcdShardPersistentVolumeSpec struct {
+ // storageClassName overrides the StorageClass for this shard's PVCs.
+ // If not specified, the parent ManagedEtcdSpec's storageClassName is used.
+ // Must be a valid DNS1123 subdomain (lowercase alphanumeric, hyphens, or
+ // dots, starting and ending with an alphanumeric character), max 253
+ // characters.
+ // +optional
+ // +kubebuilder:validation:MinLength=1
+ // +kubebuilder:validation:MaxLength=253
+ // +kubebuilder:validation:XValidation:rule="self.matches('^[a-z0-9]([a-z0-9.-]*[a-z0-9])?$')",message="storageClassName must be a valid DNS1123 subdomain"
+ StorageClassName string `json:"storageClassName,omitempty"`
+}
+```
+
+The existing `PersistentVolumeEtcdStorageSpec` (used by the top-level
+`ManagedEtcdSpec.Storage`) is unchanged. It continues to configure storage for
+the default etcd shard and is inherited by any non-default shard with
+`type: PersistentVolume`. For EmptyDir shards, the CPO derives the tmpfs
+`sizeLimit` from the parent `ManagedEtcdSpec.Storage.PersistentVolume.Size`
+(defaulting to 8Gi if unset) to match the etcd backend quota ceiling.
+
+When a non-default shard's `Storage` is unset or `type` is `PersistentVolume`,
+the shard inherits PVC size and etcd backend quota from the parent
+`ManagedEtcdSpec.Storage` — these are tightly coupled and exposing them
+independently per shard risks misconfiguration (e.g., a PVC smaller than the
+quota causes filesystem-full errors before the clean quota alarm fires). The
+`storageClassName` can be overridden per shard to target different I/O tiers
+or storage pools.
+Replicas on non-default shards are required (must be explicitly set to 1 or 3).
+Scheduling defaults to no special scheduling if unset. Neither inherits from the
+top-level fields, which apply only to the default shard.
+
+**Validation rules:**
+
+- Prevent duplicate resources across shards (no `(apiGroup, resource)` pair may appear
+ in more than one shard)
+- Validate resource entries:
+ - `apiGroup` must be a valid DNS subdomain when non-empty (lowercase alphanumeric,
+ hyphens, dots), max 253 chars. Empty string designates the core API group.
+ - `resource` must be a valid DNS1123 label (lowercase alphanumeric, hyphens,
+ starting and ending with alphanumeric), max 63 chars.
+ - Only built-in kube-apiserver resources can be overridden — CRDs are not supported
+ by KAS `--etcd-servers-overrides`
+ ([kubernetes/kubernetes#118858](https://github.com/kubernetes/kubernetes/issues/118858))
+ - The CPO constructs the `--etcd-servers-overrides` flag format
+ (`group/resource#endpoint`) from the structured fields — users do not need to
+ know the KAS flag syntax.
+- Validate shard names: DNS1123 label, lowercase alphanumeric with hyphens, max 48
+ chars (leaves room for the `etcd-` prefix within the 63-char StatefulSet name limit)
+- Enforce immutability via CEL rules:
+
+ **On the parent struct (`ManagedEtcdSpec`):** prevent adding or removing `shards` after creation:
+ ```go
+ // +kubebuilder:validation:XValidation:rule="has(oldSelf.shards) == has(self.shards)",
+ // message="shards cannot be added or removed after creation"
+ ```
+
+ **On `Shards` field (list-level):** prevent adding, removing, or replacing shards
+ after creation:
+ ```go
+ // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || self.size() == oldSelf.size()",
+ // message="shards cannot be added or removed after creation"
+ // +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(cur, cur.name == old.name))",
+ // message="existing shards cannot be replaced"
+ ```
+
+ The size check alone is not sufficient: with `+listType=map`, transition rules on
+ list items only fire for entries correlated by key. An entry removed from the list
+ has no correlated new entry, so its per-field `self == oldSelf` rules never fire.
+ Similarly, a newly added entry has no correlated old entry, so transition rules are
+ skipped. This means swapping one shard for another (remove `events`, add `foo`) would
+ pass the size check while bypassing all per-field immutability. The second rule closes
+ this gap by requiring every old shard name to still be present. Combined with the size
+ check, this makes the set of shard names fully immutable. Reordering is a no-op for
+ map-type lists since items are matched by key, not position.
+
+ **Per-field immutability on `ManagedEtcdShardSpec`:** each field is individually
+ immutable, following the pattern used by `GCPWorkloadIdentityConfig` fields. This
+ allows future API evolution — new optional fields can be added to the shard spec
+ without breaking the immutability check on existing fields:
+ ```go
+ type ManagedEtcdShardSpec struct {
+ // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="name is immutable"
+ Name string `json:"name,omitempty"`
+
+ // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="resources are immutable"
+ Resources []EtcdShardResource `json:"resources,omitempty"`
+
+ Storage ManagedEtcdShardStorageSpec `json:"storage,omitzero"`
+
+ // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="replicas is immutable"
+ Replicas int32 `json:"replicas"`
+
+ Scheduling EtcdShardSchedulingSpec `json:"scheduling,omitzero"`
+ }
+ ```
+
+ `storage` and `scheduling` are intentionally mutable — changing storage type
+ or pod placement does not require data migration. `name`, `resources`, and
+ `replicas` are immutable after creation.
+
+ The same per-field immutability pattern applies to `UnmanagedEtcdShardSpec` fields
+ (`Name`, `Resources`, `Endpoint`, `TLS`).
+
+- Validate replica count: must be 1 or 3 on `ManagedEtcdShardSpec.Replicas` (required)
+- Enforce `MaxItems=10` for both managed and unmanaged shards arrays
+- Use `+listType=map` and `+listMapKey=name` for the shards array
+- Validate scheduling: `nodeSelector` keys and values must be valid Kubernetes label
+ key/value pairs; `tolerations` must conform to `corev1.Toleration` schema
+- **No semantic validation of resource names.** Syntactically valid resource names that
+ do not match a real built-in Kubernetes resource (e.g., `event` instead of `events`)
+ are accepted by the CRD — the KAS override silently has no effect and the resource
+ is stored in the default shard. Semantic validation against a known resource list was
+ rejected due to maintenance burden (the list would need updating every Kubernetes
+ rebase) and version skew between the management cluster and hosted cluster.
+ Documentation should list commonly sharded resources and their correct names.
+
+**Unmanaged-specific validation rules:**
+
+The resource and shard name validations above apply identically to unmanaged
+shards. Additionally:
+
+- Each shard's `Endpoint` must be a valid HTTPS URL (validated via CEL `isURL` + scheme check), at most 267 characters
+- Each shard's `TLS.ClientSecret` must reference a valid secret name
+- The top-level `endpoint` becomes `--etcd-servers`; the shards list becomes
+ `--etcd-servers-overrides`
+- Immutability rules are identical: the shard list cannot be modified after creation
+
+**Example configuration:**
+
+```yaml
+spec:
+ etcd:
+ managementType: Managed
+ managed:
+ # Default shard — catches all resources not routed elsewhere
+ storage:
+ type: PersistentVolume
+ persistentVolume:
+ size: 8Gi
+ scheduling:
+ nodeSelector:
+ storage-tier: nvme
+
+ # Non-default shards — only overrides
+ shards:
+ # PVC — leases survive restarts, ok to regenerate after DR
+ # Inherits PVC size (8Gi) and quota from the parent storage config.
+ - name: leases
+ resources:
+ - apiGroup: coordination.k8s.io
+ resource: leases
+ storage:
+ type: PersistentVolume
+
+ # tmpfs — events are pure ephemeral, regenerated on restart
+ # sizeLimit is derived from the parent PVC size (8Gi) to match
+ # the etcd backend quota ceiling.
+ - name: events
+ resources:
+ - resource: events
+ storage:
+ type: EmptyDir
+```
+
+**Unmanaged example configuration:**
+
+```yaml
+spec:
+ etcd:
+ managementType: Unmanaged
+ unmanaged:
+ # The top-level endpoint/tls is the default shard (catch-all for "/")
+ endpoint: https://etcd-default.example.com:2379
+ tls:
+ clientSecret:
+ name: etcd-default-tls
+ # shards only contains non-default overrides
+ shards:
+ - name: events
+ resources:
+ - resource: events
+ endpoint: https://etcd-events.example.com:2379
+ tls:
+ clientSecret:
+ name: etcd-events-tls
+```
+
+### Topology Considerations
+
+#### Hypershift / Hosted Control Planes
+
+This enhancement is HyperShift-specific. All shard components (StatefulSets, Services,
+ServiceMonitors, PDBs) run in the management cluster within the HCP namespace. The guest
+cluster is unaware of etcd sharding — it only sees a single Kubernetes API endpoint.
+KAS routing via `--etcd-servers-overrides` is transparent to guest cluster workloads.
+
+#### Standalone Clusters
+
+This enhancement does not apply to standalone OpenShift clusters. Etcd sharding for
+standalone clusters would require changes to the cluster-etcd-operator, which is out
+of scope.
+
+#### Single-node Deployments or MicroShift
+
+Not applicable. HyperShift is not used in SNO or MicroShift deployments.
+
+#### OpenShift Kubernetes Engine
+
+No OKE-specific considerations. This enhancement operates entirely within the HyperShift
+control plane operator and does not depend on OKE-excluded features.
+
+#### Disconnected / Air-Gapped Environments
+
+No additional considerations. Shard components use the same etcd image already required
+by the control plane. No new images, external network calls, or registry dependencies
+are introduced.
+
+### Implementation Details/Notes/Constraints
+
+#### Conditional Component Registration
+
+Registration happens in the reconciler setup where `r.components` is populated. The
+existing `etcd` component continues to use its current `isManagedETCD` predicate for
+the default shard. Additional shard components are registered alongside it:
+
+```go
+// Default etcd component -- always registered (predicate handles managed/unmanaged)
+r.components = append(r.components, etcdv2.NewComponent())
+
+// Additional shard components -- registered for managed etcd only
+if hcp.Spec.Etcd.ManagementType == hyperv1.Managed && hcp.Spec.Etcd.Managed != nil {
+ for _, shard := range hcp.Spec.Etcd.Managed.Shards {
+ r.components = append(r.components, etcdv2.NewShardComponent(shard))
+ }
+}
+```
+
+Each shard component uses `WithAssetDir("etcd")` to load manifests from the shared
+`assets/etcd/` directory. Asset YAMLs use Go templates (`{{ .Name }}`) in place of
+hardcoded `etcd` in resource names, labels, and selectors. For the default `etcd`
+component, `{{ .Name }}` renders to `etcd` — unchanged behavior. For shards, it
+renders to the shard's component name (e.g., `etcd-events`, `etcd-leases`).
+
+The framework renders templates at load time via a `TemplatedProvider` that wraps the
+existing `WorkloadProvider`. Both workload and non-workload manifests pass through
+the same rendering path, so `update()`, `delete()`, and `reconcileComponentStatus()`
+all get correctly-named objects. This eliminates per-manifest rename adapt functions
+entirely — no `adaptClientServiceForShard`, `adaptDiscoveryServiceForShard`,
+`adaptPDBForShard`, etc. When a new manifest is added to `assets/etcd/`, it only
+needs `{{ .Name }}` in the right places — no Go code change required.
+
+```go
+func NewShardComponent(shard hyperv1.ManagedEtcdShardSpec) component.ControlPlaneComponent {
+ name := resourceNameForShard(ComponentName, shard.Name)
+ return component.NewStatefulSetComponent(name, &etcdShard{shard: shard}).
+ WithAssetDir(ComponentName).
+ WithAdaptFunction(func(ctx component.WorkloadContext, sts *appsv1.StatefulSet) error {
+ return adaptStatefulSetForShard(ctx, sts, shard)
+ }).
+ WithPredicate(isManagedETCD).
+ WithManifestAdapter("servicemonitor.yaml").
+ WithManifestAdapter("pdb.yaml",
+ component.AdaptPodDisruptionBudget(),
+ ).
+ WithManifestAdapter("defrag-role.yaml",
+ component.WithPredicate(defragControllerPredicate),
+ ).
+ WithManifestAdapter("defrag-rolebinding.yaml",
+ component.WithPredicate(defragControllerPredicate),
+ ).
+ WithManifestAdapter("defrag-serviceaccount.yaml",
+ component.WithPredicate(defragControllerPredicate),
+ ).
+ Build()
+}
+```
+
+The only adapt functions that remain are those with real etcd domain logic:
+`adaptStatefulSetForShard` (rewriting `spec.serviceName`, peer URLs, init container
+args, storage type switching, and scheduling merge). Defrag manifests retain their
+predicates but no longer need rename adapters — the template handles naming.
+
+**Defrag controller:** The defrag controller runs as a sidecar inside each etcd pod,
+connecting to `localhost:2379` and discovering cluster members via the etcd member
+list API. Each shard's StatefulSet gets its own defrag sidecar that operates only
+on that shard's members. No changes to the defrag controller are needed — only
+the RBAC resources (Role, RoleBinding, ServiceAccount) need shard-specific names,
+which the template rendering handles.
+
+#### Framework Extensions
+
+**Etcd component prefix check** — The four framework locations that currently use
+`name == "etcd"` checks are changed to `strings.HasPrefix(name, "etcd")` to cover
+all shard components (`etcd`, `etcd-events`, `etcd-leases`, etc.). Where the
+existing code uses set membership rather than direct comparison (e.g.,
+`checkDependencies`), the set lookup is replaced with prefix-based iteration:
+
+1. `checkDependencies()` — excludes etcd components from automatic KAS dependency
+2. `priorityClass()` — returns `hypershift-etcd` for etcd components
+3. `DefaultReplicas()` — returns 3 for HA etcd components
+4. `setDefaults()` — sets FSGroup for etcd components (required for PVC permissions)
+
+This requires no interface changes, no builder additions, and no boilerplate on
+existing components. The long-term fix is builder methods (`WithPriorityClass()`,
+`WithDefaultReplicas()`) that let components declare these settings at registration
+time, but the prefix check is the smallest change that works correctly for the
+initial implementation.
+
+#### Shard-Aware StatefulSet Adaptation
+
+The `statefulset.yaml` asset contains hardcoded references to `etcd-discovery` and
+`etcd-client` service names. Following the established framework pattern (as used by
+KAS `wait-for-etcd`), `adaptStatefulSetForShard()` overwrites these values:
+
+- **`sts.Spec.ServiceName`**: overwritten to `etcd-discovery-{shardName}`
+- **`ETCD_INITIAL_ADVERTISE_PEER_URLS`** and **`ETCD_ADVERTISE_CLIENT_URLS`**: upserted
+ via `util.UpsertEnvVar` with shard-specific discovery service DNS
+- **`reset-member`** and **`ensure-dns`** init container `Args`: overwritten wholesale
+ with shard-specific service references
+
+**Storage adaptation based on `Storage.Type`:** When `Storage.Type` is `EmptyDir`, the
+adapt function removes `volumeClaimTemplates` and replaces it with an inline `emptyDir`
+volume with `medium: Memory` and `sizeLimit` derived from the parent
+`ManagedEtcdSpec.Storage.PersistentVolume.Size` (defaulting to 8Gi). The container's
+memory limit is set to match the `sizeLimit` so the pod can use the full tmpfs
+allocation without being OOM-killed. When `Storage.Type` is `PersistentVolume`, the
+existing `volumeClaimTemplates` are preserved with size and quota inherited from the
+parent, and `storageClassName` optionally overridden from the shard's
+`Storage.PersistentVolume.StorageClassName`.
+
+**Scheduling adaptation from `Scheduling`:** When `Scheduling` is set, the adapt
+function merges the shard's `NodeSelector` labels into the pod template's existing
+`nodeSelector` (the framework's control plane node selector is always present) and
+appends the shard's `Tolerations` to the pod template's existing tolerations. The
+framework's topology spread constraints and colocation affinity are unaffected — they
+continue to apply to all shard pods. This merge strategy means per-shard scheduling
+*narrows* placement within the control plane node pool rather than replacing it.
+
+#### KAS Configuration
+
+The KAS configuration for `--etcd-servers` and `--etcd-servers-overrides` is the same
+for both managed and unmanaged etcd. The only difference is the source of the endpoint
+URLs and TLS credentials.
+
+**Managed etcd:**
+- `--etcd-servers` points to the default shard's in-cluster service URL
+ (e.g., `https://etcd-client.{namespace}.svc:2379`)
+- `--etcd-servers-overrides` contains entries for non-default shards using their
+ in-cluster service URLs
+- TLS credentials are generated by the PKI controller
+- The KAS deployment adapt function extends the `wait-for-etcd` init container to
+ check DNS resolution for all shard client services. The existing init container
+ runs `nslookup etcd-client.$(POD_NAMESPACE).svc` in a loop. The adapt function
+ has access to the shard list from the HCP spec and appends one `nslookup` check
+ per shard service name (e.g., `etcd-client-events`, `etcd-client-leases`) to the
+ same shell script
+
+**Unmanaged etcd:**
+- `--etcd-servers` points to the default shard's external endpoint from
+ `spec.etcd.unmanaged.endpoint` (single-shard) or the default shard's `Endpoint` field
+- `--etcd-servers-overrides` contains entries for non-default shards using their
+ respective `Endpoint` values
+- TLS credentials are mounted from the secrets referenced in each shard's `TLS` config
+- `wait-for-etcd` init container is not used (existing behavior for unmanaged etcd)
+
+**Common:**
+- Override format follows the
+ [KAS `--etcd-servers-overrides` format](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/):
+ `group/resource#servers`
+- Only built-in kube-apiserver resources can be overridden — CRDs are not supported
+ ([kubernetes/kubernetes#118858](https://github.com/kubernetes/kubernetes/issues/118858))
+
+#### Shard Rollout Ordering
+
+**Initial creation:** All shard StatefulSets are created in parallel by the framework,
+which reconciles each `ControlPlaneComponent` independently. The `wait-for-etcd` init
+container in the KAS deployment blocks KAS startup until all shard client services are
+DNS-resolvable. KAS does not start until every shard is available.
+
+**Upgrades (CPO or etcd image):** Each shard's StatefulSet performs its own rolling
+update independently. KAS is already running and tolerates brief per-pod unavailability
+during rolling restarts — this is the same behavior as today's single-etcd rolling
+update, just happening per-shard. Since each StatefulSet has its own PDB, the framework
+ensures at most one pod per shard is unavailable at a time. Multiple shards may roll
+simultaneously, but each shard maintains quorum independently.
+
+**KAS restart:** If KAS itself needs to restart (e.g., config change), the same
+`wait-for-etcd` init container gates startup on all shard DNS. KAS does not start with
+a partial set of shards.
+
+#### Monitoring and Alerting
+
+Each shard has its own ServiceMonitor named after the component (e.g., `etcd`,
+`etcd-events`). Prometheus derives the `job` label from the ServiceMonitor name by
+default, producing a distinct `job` label per shard (e.g., `job="etcd"`,
+`job="etcd-events"`) with no additional configuration.
+
+Existing SRE alerts use regex selectors like `job=~".*etcd.*"`, which automatically
+match all shard job labels. Per-shard alerting works with no changes to SRE alert
+definitions.
+
+The KAS recording rules in
+`control-plane-operator/controllers/hostedcontrolplane/v2/assets/kube-apiserver/prometheus-recording-rules.yaml`
+must be updated from `{job="etcd"}` to `{job=~"etcd.*"}` to capture metrics from all
+shards (five rules). This file is owned by the HyperShift team and ships as part of the
+same PR — no cross-team dependency.
+
+#### Resource Naming Conventions
+
+| Resource | Default Shard | Named Shard (e.g., "events") |
+| --- | --- | --- |
+| StatefulSet | `etcd` | `etcd-events` |
+| Client Service | `etcd-client` | `etcd-client-events` |
+| Discovery Service | `etcd-discovery` | `etcd-discovery-events` |
+| ServiceMonitor | `etcd` | `etcd-events` |
+| PDB | `etcd` | `etcd-events` |
+| ControlPlaneComponent CR | `etcd` | `etcd-events` |
+| PVCs | `data-etcd-{ordinal}` | `data-etcd-events-{ordinal}` |
+
+#### TLS Certificate Generation
+
+Each shard gets its own TLS certificates — there is no cert sharing between shards.
+They are independent components with different pods and services, so each shard has its
+own server cert, peer cert, and client cert secrets (e.g., `etcd-server-tls`,
+`etcd-events-server-tls`). The CPO's PKI controller generates per-shard certificates
+with Subject Alternative Names (SANs) matching the shard's service names (`etcd-client-{shard}`,
+`*.etcd-discovery-{shard}`). Certificate rotation affects only the rotated shard — no
+blast radius to other shards. Since shards are immutable, the set of certificates is
+fixed at creation time.
+
+#### Resource Ownership
+
+Shard resources (StatefulSets, Services, ServiceMonitors, PDBs, Secrets) follow the
+existing framework pattern: they are owned by the HCP via standard owner references.
+Garbage collection on HCP deletion is automatic.
+
+#### Management Cluster Resource Impact
+
+Each shard adds the following resources per HCP. The exact footprint varies based on
+storage type and replica count:
+
+| Resource | 3 replicas, PVC | 3 replicas, EmptyDir | 1 replica, PVC | 1 replica, EmptyDir |
+| --- | --- | --- | --- | --- |
+| Pods | 3 | 3 | 1 | 1 |
+| PVCs | 3 | 0 | 1 | 0 |
+| Services | 2 (client + discovery) | 2 (client + discovery) | 2 (client + discovery) | 2 (client + discovery) |
+| ServiceMonitor | 1 | 1 | 1 | 1 |
+| PDB | 1 | 1 | 1 | 1 |
+| TLS Secrets | ~3 (server, peer, client) | ~3 (server, peer, client) | ~3 (server, peer, client) | ~3 (server, peer, client) |
+
+**Note on 1-replica shards:** Single-replica shards have no quorum redundancy — any pod
+failure makes the shard completely unavailable until the pod restarts. Use `replicas: 1`
+only for ephemeral data (e.g., events with EmptyDir) where brief unavailability is
+acceptable.
+
+**Example:** A 3-shard configuration (default + events + leases) with the default and
+leases on PVC (3 replicas each) and events on EmptyDir (1 replica) produces 7 pods,
+6 PVCs, 6 Services, 3 ServiceMonitors, 3 PDBs, and ~9 TLS Secrets per HCP. At scale
+(hundreds of HCPs on a single management cluster), operators should account for this
+linear growth when sizing the management cluster.
+
+#### Instance-Level I/O Budget Considerations
+
+EBS volumes are network-attached and logically independent, but all volumes attached to
+a single EC2 instance share the instance's EBS bandwidth and IOPS ceiling. At high HCP
+density, the instance-level limit — not the per-volume limit — becomes the bottleneck.
+
+**Example: 50 HCPs on m5.xlarge nodes (3 AZs, 3 replicas per shard)**
+
+An m5.xlarge instance provides ~81 MBps sustained EBS throughput (burst to 593 MBps)
+and 18,750 EBS IOPS, shared across all attached volumes. With soft colocation affinity
+packing ~5 HCPs per node, a single node may host ~15 etcd pods across different HCPs
+and shard types.
+
+Events are the highest-churn Kubernetes resource — continuous creation plus TTL-driven
+garbage collection generates sustained write I/O that can represent 40–60% of total
+etcd write volume. Leases generate moderate, steady writes proportional to node count.
+The default shard (pods, secrets, configmaps) has bursty but generally lower sustained
+I/O.
+
+| Scenario | EBS-backed pods per node | Estimated IOPS demand | m5.xlarge headroom |
+| --- | --- | --- | --- |
+| All shards on PVC | ~15 | 5,000–25,000 | Exceeds 18,750 limit at peak |
+| Events on EmptyDir | ~10 | 750–3,500 | 5–19% utilization |
+
+Moving events to EmptyDir removes the single largest I/O contributor from disk
+entirely. etcd writes are small (~1–2 KB per event), so throughput is
+IOPS-dominated — even 10 PVC-backed pods at default+leases write rates consume only
+5–10 MBps, well within the sustained instance bandwidth.
+
+This benefit applies equally to bare metal management clusters. On bare metal, multiple
+PVC-backed etcd pods sharing a node often share the same physical disk(s) via a local
+volume provisioner, making disk I/O a shared, contended resource. Unlike cloud
+environments where per-volume I/O budgets are at least logically independent, bare
+metal I/O contention is direct — every fsync from every shard on the same disk
+competes for the same spindle or flash channel. EmptyDir events shards eliminate the
+highest-churn writer from disk entirely, reducing contention for the remaining
+PVC-backed shards (default, leases) and improving WAL fsync latency across all HCPs
+on that node.
+
+**Recommended storage configuration for dense management clusters:**
+
+| Shard | Storage | Rationale |
+| --- | --- | --- |
+| Default (`/`) | PVC | Critical cluster state, moderate I/O, requires DR |
+| Events (`events`) | EmptyDir | Highest I/O, inherently ephemeral, eliminates disk I/O bottleneck |
+| Leases (`coordination.k8s.io/leases`) | PVC | Low I/O, survives full-cluster restarts, regenerable after DR |
+
+For cloud management clusters requiring maximum density, use instance types with
+dedicated (not burst) EBS bandwidth (e.g., m5.2xlarge at 593 MBps sustained,
+m5.4xlarge at 1,187 MBps sustained) or instance types with local NVMe storage (m5d,
+i3en). For bare metal management clusters, ensure the local volume provisioner
+allocates distinct physical disks per PV rather than partitions of a shared disk, and
+consider per-shard `nodeSelector` to pin PVC-backed shards to nodes with
+high-performance storage.
+
+### Risks and Mitigations
+
+**Risk: Non-default shard failure causes partial API unavailability.**
+KAS returns errors for resources routed to an unavailable shard with no automatic
+fallback ([Kubernetes limitation](https://github.com/kubernetes/kubernetes/issues/98814)).
+See [Failure Modes](#failure-modes) for per-storage-type impact and recovery procedures.
+**Mitigation:** Service teams should only route high-churn, regenerable resources
+(Events, Leases) to non-default shards. The default shard handles all unmapped resources.
+
+**Risk: Shard configuration errors are silent at runtime.**
+If `resources` entries contain syntactically valid but semantically meaningless names
+(e.g., `event` instead of `events`), KAS accepts the override but it has no effect —
+the resource is silently stored in the default shard.
+**Mitigation:** Admission-time CEL validation catches structural format errors (invalid
+apiGroup/resource characters). Semantic validation against a known resource list was
+considered but rejected due to maintenance burden (static list requires updating every
+Kubernetes rebase) and version skew (management cluster VAP vs. hosted cluster KAS may
+disagree on valid resources). Operators can verify their configuration by checking
+per-shard Prometheus metrics — an events shard with no traffic indicates a likely
+misconfiguration. Documentation should list commonly sharded resources and their
+correct names.
+
+**Risk: Management cluster resource pressure with many shards.**
+Each shard creates a StatefulSet (with PVCs), two Services, a ServiceMonitor, and a PDB.
+**Mitigation:** `MaxItems=10` enforced at the CRD level bounds resource growth. Service
+teams can further restrict via `ValidatingAdmissionPolicy` appropriate for their
+environment.
+
+**Risk: Disk I/O saturation at high HCP density.**
+At high density (e.g., 50 HCPs on a management cluster), a single node may host 10–15
+etcd pods from multiple HCPs, and their combined I/O demand can exceed per-node disk
+limits. On cloud instances, EBS volumes are network-attached and share the instance's
+bandwidth and IOPS ceiling regardless of per-volume provisioned IOPS. On bare metal,
+PVC-backed pods sharing a node often share the same physical disk(s) via a local volume
+provisioner, creating direct I/O contention. In both cases, saturation manifests as WAL
+fsync latency spikes, slow proposals, and potential leader elections across all HCPs on
+the affected node.
+**Mitigation:** Configure events shards with `EmptyDir` storage to remove the
+highest-churn resource from the disk I/O budget. This reduces per-node IOPS demand
+by 40–60%, making dense configurations viable on standard instance types and reducing
+contention for PVC-backed shards on bare metal. For cloud management clusters
+requiring maximum density, use instance types with dedicated (not burst) EBS bandwidth
+(m5.2xlarge+) or local NVMe storage (m5d, i3en). For bare metal, ensure the local
+volume provisioner allocates distinct physical disks per PV. See
+[Instance-Level I/O Budget Considerations](#instance-level-io-budget-considerations).
+
+### Drawbacks
+
+1. **Conditional registration at startup.** Component list is not fully static. This is
+ a new pattern — existing components (including cloud provider controllers) are
+ registered unconditionally and use predicates at reconcile time. Shard immutability
+ ensures the component list does not need to change after startup.
+
+2. **Prefix check is convention-based.** The `strings.HasPrefix(name, "etcd")` check
+ assumes no future non-etcd component will have a name starting with "etcd". A
+ naming collision would be caught immediately (wrong priority class, unexpected
+ replica count), but the long-term fix is builder methods for explicit opt-in.
+
+3. **No single parent CR.** No aggregated `ControlPlaneComponent/etcd` CR reflecting all
+ shards' combined health. Operators must inspect individual shard CRs.
+
+4. **Shard list is fully immutable.** Operators cannot add or remove shards after cluster
+ creation. This is a deliberate constraint — shard changes would require data migration.
+
+## Alternatives (Not Implemented)
+
+### Alternative A: Standalone Sharding Package
+
+Introduce a new `etcd/` package outside the CPO component framework that hand-rolls per-shard
+`StatefulSet`, `Service`, `ServiceMonitor`, and `PodDisruptionBudget` objects.
+
+**Why rejected:** Bypasses the CPO component framework, requiring manual reimplementation of
+priority class, node isolation, colocation affinity, topology spread, config hash
+annotations, scale-to-zero, restart propagation, and PDB semantics.
+
+### Alternative B: Parameterized Components (Instance Multiplier)
+
+Extend `controlPlaneWorkload[T]` to carry a set of named instances and loop in
+`reconcileWorkload()`.
+
+**Why rejected:** Requires changing the `adapt` function signature — a package-wide
+breaking change affecting all ~40 registered components.
+
+### Alternative C: Full Template-Based Asset Loading
+
+Make asset loading fully instance-aware with per-shard template directories and
+template-driven reconcile loops.
+
+**Why rejected as a full alternative:** Framework `Reconcile()` still reconciles one
+workload. Framework features must still be reimplemented manually. However, a
+narrower form of template rendering *is* adopted in the chosen approach — Go
+templates in shared asset YAMLs handle resource naming (`{{ .Name }}`), eliminating
+per-manifest rename adapt functions while keeping the framework's existing reconcile
+loop and lifecycle management intact.
+
+### Alternative D: CompositeComponent
+
+Introduce `CompositeComponent` type that owns a parent CR and delegates to child
+`controlPlaneWorkload[T]` instances via a `ChildFactory`.
+
+**Why not chosen:** Maintainer consensus preferred separate components. Composite approach
+adds two-level CR hierarchy, factory-based child construction, and orphan cleanup —
+conceptual overhead that separate components avoids.
+
+### Alternative E: Single Component with Internal Loop
+
+Extend the existing `etcd` component's `Reconcile()` to loop over shards internally.
+
+**Why rejected:** Still requires the same framework prefix check for KAS
+dependency exclusion. Produces less maintainable, non-reusable code.
+
+## Open Questions
+
+1. ~~Should semantic validation of resource names against known
+ built-in Kubernetes resources be enforced at admission time, or
+ is format-only CEL validation sufficient for the initial
+ implementation?~~
+ **Resolved:** Format-only CEL validation at admission time. No semantic
+ validation against a known resource list — the maintenance burden of
+ keeping a static list accurate across Kubernetes rebases is too high,
+ and version skew between the management cluster (where the VAP runs)
+ and the hosted cluster (where KAS serves the resources) makes accuracy
+ unreliable. Operators can detect misconfigured resources by checking
+ per-shard Prometheus metrics — an events shard with no write activity
+ indicates a likely typo. Documentation should list commonly sharded
+ resources and their correct names.
+2. ~~Should there be a maximum `sizeLimit` enforced for `EmptyDir`
+ shards to prevent accidental memory exhaustion on management
+ cluster nodes?~~
+ **Resolved:** `sizeLimit` is not user-configurable. The CPO derives it
+ from the parent `ManagedEtcdSpec.Storage.PersistentVolume.Size`
+ (defaulting to 8Gi) to match the etcd backend quota ceiling. Since
+ tmpfs does not pre-allocate memory, matching the PVC size has no
+ upfront cost — memory is consumed only as etcd writes data. A
+ `sizeLimit` lower than the quota would cause filesystem-full errors
+ (ENOSPC) before etcd's clean quota alarm fires, which is a worse
+ failure mode. This is the same reasoning that led to not exposing
+ per-shard PVC size.
+3. ~~Is there a need for a cross-shard health aggregation mechanism
+ (e.g., a synthetic condition on the HCP) to simplify monitoring,
+ or are individual `ControlPlaneComponent` CRs sufficient?~~
+ **Resolved:** Individual `ControlPlaneComponent` CRs are sufficient.
+ Each shard reports `Available` and `Progressing` conditions, and
+ per-shard Prometheus metrics with distinct `job` labels enable
+ granular alerting. An aggregated HCP-level condition would require
+ defining severity thresholds across shards with very different
+ blast radii (events vs. leases vs. default), adding complexity
+ without clear benefit over per-shard signals.
+
+## Test Plan
+
+All tests for sharding functionality must carry the `[OCPFeatureGate:EtcdSharding]`
+label to ensure they only run on clusters where the feature gate is enabled.
+
+### Unit Tests
+
+- `support/controlplane-component/status_test.go`: verify `checkDependencies()` excludes
+ KAS for etcd-prefixed components; verify unmanaged-etcd logic is unaffected.
+- `support/controlplane-component/defaults_test.go`: verify `priorityClass()` and
+ `DefaultReplicas()` return etcd values for etcd-prefixed components.
+- `control-plane-operator/controllers/hostedcontrolplane/v2/etcd/`: existing
+ `NewStatefulSetComponentTest` harness exercised for each shard variant (`default`,
+ `events`, `leases`). Golden fixtures generated per shard name.
+
+### Integration Tests
+
+- HCP with single-shard etcd: verifies backward-compatible `StatefulSet/etcd` naming,
+ `ControlPlaneComponent/etcd` CR existence, and all framework features applied.
+- HCP with three-shard etcd: verifies three `StatefulSet`s, three independent
+ `ControlPlaneComponent` CRs, and no parent/child hierarchy.
+- Priority class `hypershift-etcd` applied to all shard pods.
+- Node isolation affinity/toleration applied to all shard pods.
+
+### E2E Tests
+
+- Existing etcd-related E2E tests must pass unmodified for single-shard HCPs.
+- Multi-shard E2E test creates cluster with 3-shard configuration (main, events, leases).
+- E2E test verifies resources land in correct shards via `--etcd-servers-overrides`.
+
+#### Failure Mode and Resilience Tests
+
+- **Non-default shard unavailability:** Scale the events shard StatefulSet to 0 replicas.
+ Verify that KAS returns errors for Event API requests while other resource types
+ (pods, configmaps) continue to operate normally via the default shard.
+- **KAS startup with unavailable shard:** Delete a non-default shard's client Service,
+ then trigger a KAS pod restart. Verify the `wait-for-etcd` init container blocks
+ KAS startup until the service is recreated and DNS-resolvable.
+- **EmptyDir shard data loss and recovery:** Delete all pods in an EmptyDir-backed shard.
+ Verify that pods restart with empty data, events begin recording again, and the
+ `ControlPlaneComponent` CR returns to `Available=True`.
+- **Upgrade from pre-sharding CPO:** Create a single-shard HCP, then upgrade the CPO
+ to the sharding-capable version. Verify the existing `StatefulSet/etcd` and
+ `ControlPlaneComponent/etcd` are unchanged, `EffectiveShards()` returns a single
+ default shard, and KAS continues operating without `--etcd-servers-overrides`.
+- **Per-shard scheduling verification:** Create a multi-shard HCP with per-shard
+ `nodeSelector` labels. Verify that shard pods are placed on nodes matching the
+ specified selectors and that the framework's control plane node isolation constraints
+ are preserved.
+
+### CI Lane Design
+
+**Presubmit job** (`pull-ci-openshift-hypershift-*-e2e-etcd-sharding`):
+- Runs on PRs to `openshift/hypershift` that touch `v2/etcd/`, `support/controlplane-component/`,
+ or etcd-related API types.
+- Creates a 3-shard HostedCluster (default + events + leases) on AWS.
+- Verifies events are routed to the events shard by creating Events via the guest
+ cluster API and confirming they are stored in the events shard's etcd (not the
+ default shard).
+- Runs the full `[OCPFeatureGate:EtcdSharding]` test suite.
+
+**Periodic jobs** (`periodic-ci-openshift-hypershift-*-e2e-etcd-sharding`):
+- Runs daily on AWS (both `TechPreviewNoUpgrade` and `Default` variants) to meet
+ the 7 runs/week threshold.
+- Same 3-shard cluster configuration as the presubmit job.
+- Includes failure mode and resilience tests (shard unavailability, upgrade).
+
+### Envtest (API Validation Tests)
+
+YAML-driven envtest cases focus on CEL validation rules that cannot be expressed by
+kubebuilder schema markers alone. Simple constraints (name length, pattern, enum values,
+endpoint format) are enforced by the CRD schema and do not need envtest coverage.
+
+`ManagedEtcdShardSpec` CEL validation:
+
+- Valid: single non-default shard with `{resource: "events"}`
+- Valid: two shards with distinct resources
+- Invalid: duplicate resources across shards (cross-shard CEL rule)
+- Invalid: adding shards to a previously unsharded cluster
+- Invalid: removing shards from a sharded cluster
+- Immutability: shards cannot be added or removed after creation (list size)
+- Immutability: shard `name`, `resources`, `replicas` cannot change after creation
+- Immutability: renaming a shard (changing the map key) is rejected
+- Mutability: shard `storage` and `scheduling` can be updated after creation
+
+`UnmanagedEtcdShardSpec` CEL validation:
+
+- Valid: single shard with `{resource: "events"}` and its own endpoint/TLS
+- Valid: two shards with distinct resources and separate endpoints/TLS
+- Invalid: duplicate resources across shards (cross-shard CEL rule)
+- Invalid: adding shards to a previously unsharded cluster
+- Immutability: shards cannot be added or removed after creation (list size)
+- Immutability: shard `name`, `resources`, `endpoint`, `tls`
+ cannot change after creation (per-field CEL rules)
+- Immutability: renaming a shard (changing the map key) is rejected
+
+## Graduation Criteria
+
+### Dev Preview -> Tech Preview
+
+- `EtcdSharding` feature gate defined in `TechPreviewNoUpgrade` feature set.
+- Framework prefix check (`strings.HasPrefix(name, "etcd")`) implemented and merged.
+- Etcd shard components registered via `NewShardComponent`.
+- All framework features verified on shard components by unit + integration tests.
+- Single-shard backward compatibility confirmed.
+- Per-shard scheduling (`nodeSelector`, `tolerations`) implemented and verified.
+- SLIs exposed as Prometheus metrics with per-shard `job` labels.
+- Symptoms-based alerts written for per-shard etcd health (leader changes,
+ WAL fsync latency, DB size).
+
+### Tech Preview -> GA
+
+- **Testing thresholds met** (per
+ [feature-zero-to-hero.md](../../dev-guide/feature-zero-to-hero.md)):
+ - At least 5 tests carrying the `[OCPFeatureGate:EtcdSharding]` label.
+ - All tests run at least 7 times per week via periodic Prow jobs.
+ - All tests run at least 14 times per supported platform.
+ - All tests pass at least 95% of the time.
+ - Testing in place no less than 14 days before branch cut.
+ - Tests run in both `TechPreviewNoUpgrade` and `Default` Prow job variants.
+- Multi-shard etcd validated in ROSA HCP production clusters.
+- Shard immutability validated across HCP upgrade scenarios.
+- Load testing completed: 3-shard HCP under sustained write pressure
+ (events + leases) with metrics demonstrating per-shard isolation.
+- User-facing documentation created in
+ [openshift-docs](https://github.com/openshift/openshift-docs/).
+
+### Removing a deprecated feature
+
+Not applicable. This enhancement does not deprecate or remove any existing feature.
+
+## Upgrade / Downgrade Strategy
+
+### Upgrade
+
+Upgrade is transparent. Existing unsharded HCPs continue to register only the `etcd`
+component via `EffectiveShards()` synthesizing a single default shard. The
+`StatefulSet/etcd` name and `ControlPlaneComponent/etcd` CR are unchanged. The etcd
+component is already a CPO component, so no new scheduling hints, priority
+classes, or topology rules are applied during upgrade.
+
+Sharding requires creating a new cluster with the shard list specified at creation time.
+There is no migration path from unsharded to sharded.
+
+### Downgrade
+
+CPO downgrade is not a supported operational procedure in HyperShift. The HyperShift
+team always rolls forward, even in the face of issues. A CPO rollback to a version
+that does not understand shards is not expected to occur in practice.
+
+If a rollback were to happen, unsharded HCPs would be unaffected. Sharded HCPs
+are incompatible with a pre-sharding CPO: non-default shard components would lose
+reconciliation (no controller watching them), their `ControlPlaneComponent` CR
+conditions would go stale, and if shard pods crash the old CPO would not restart
+them. Existing StatefulSets and Services would continue running only as long as
+the shard pods remain available. This is consistent with HyperShift's forward-only
+operational model.
+
+## Version Skew Strategy
+
+During upgrade, the new CPO reconciles the existing `StatefulSet/etcd` through the
+framework with no behavioral change. New shard StatefulSets are additive. The shard
+list is immutable, so no coordination is needed between old and new CPO versions — only
+one version runs at a time per HCP.
+
+**EUS upgrades:** The recording rules change (`{job="etcd"}` → `{job=~"etcd.*"}`) is
+backward compatible — the regex still matches the single `job="etcd"` label. An EUS
+upgrade that skips an intermediate version applies the change atomically with no
+intermediate state. No special handling is required.
+
+## Operational Aspects of API Extensions
+
+The `ManagedEtcdShardSpec` and `UnmanagedEtcdShardSpec` types extend existing CRDs
+(`HostedCluster` and `HostedControlPlane`). No new CRDs, webhooks, or aggregated API
+servers are introduced.
+
+**SLIs for shard health:**
+
+- Each shard's `ControlPlaneComponent` CR reports `Available` and `Progressing` conditions.
+- Prometheus metrics per shard: `etcd_mvcc_db_total_size_in_bytes`,
+ `etcd_disk_wal_fsync_duration_seconds`, `etcd_server_leader_changes_seen_total`,
+ each scoped by `job` label (`etcd`, `etcd-events`, etc.).
+
+**Impact on existing SLIs:**
+
+- No impact on existing single-shard HCPs — `EffectiveShards()` synthesizes a default
+ shard that produces identical behavior.
+- Multi-shard HCPs add N-1 additional StatefulSets and associated resources to the
+ management cluster namespace. Resource consumption scales linearly with shard count.
+
+**Failure modes:**
+
+- Non-default shard unavailability: KAS returns errors for resources routed to that shard.
+ No automatic fallback. Impact depends on storage type and backup configuration
+ (see Failure Modes section below).
+- All-shard unavailability: identical to current single-etcd failure — full API outage.
+
+## Support Procedures
+
+### Identifying an unhealthy shard
+
+**Symptoms:** API errors for specific resource types (e.g., events not recording, lease
+renewal failures) while other resource types work normally.
+
+**Detection:**
+
+```bash
+# List all etcd shard ControlPlaneComponent CRs
+oc get controlplanecomponent -n | grep etcd
+
+# Check conditions on a specific shard
+oc get controlplanecomponent etcd-events -n -o yaml
+
+# Check etcd pod health for a shard
+oc get pods -n -l app=etcd-events
+```
+
+**Prometheus alerts:** SRE alerts using `job=~".*etcd.*"` fire per-shard. The `job` label
+identifies the affected shard (e.g., `job="etcd-events"`).
+
+### Restarting a failed shard
+
+```bash
+# Delete pods to trigger StatefulSet restart
+oc delete pods -n -l app=etcd-events
+```
+
+For `EmptyDir` shards, data survives container restarts but is lost on pod
+rescheduling or node restart; it is regenerated by the system. For
+PVC-backed shards, data persists across restarts.
+
+### Disabling sharding
+
+Sharding cannot be disabled on a running cluster — the shard list is immutable. To
+revert to a single-shard configuration, create a new cluster without sharding and
+migrate workloads.
+
+## Failure Modes
+
+### Non-Default Shard Unavailability
+
+When a non-default etcd shard becomes unavailable, KAS returns errors for API requests
+targeting resources routed to that shard via `--etcd-servers-overrides`. There is no
+automatic fallback to the default shard — this is a
+[Kubernetes limitation](https://github.com/kubernetes/kubernetes/issues/98814).
+
+**Detection:** The shard's `ControlPlaneComponent` CR reports unhealthy conditions.
+SRE alerts fire per-shard via distinct `job` labels.
+
+**Impact depends on both storage type and which resources are routed to the shard:**
+
+| Storage | Resource type | Impact of total shard loss | Recovery |
+| --- | --- | --- | --- |
+| PVC | Default (`/`) | Critical — all unmapped resources inaccessible | Restore from backup |
+| PVC | Leases | High — leader election stops, controllers/scheduler halt until leases re-acquired | Restart StatefulSet; clients re-acquire leases (thundering herd) |
+| EmptyDir | Leases | High — same as PVC, but data must be fully rebuilt | Restart StatefulSet; longer re-acquisition window |
+| EmptyDir | Events | Low — events lost, system regenerates continuously | Restart StatefulSet |
+
+**Note on replica count:** With 3 replicas, a single pod loss does not cause shard
+unavailability — quorum is maintained by the remaining 2 replicas and the restarted
+member catches up automatically. With 1 replica, any pod loss makes the shard
+completely unavailable until the pod restarts.
+
+**Note on leases:** Lease shard loss — regardless of storage type — causes all
+leader election to stop, halting controllers, scheduler, and cloud-controller-manager.
+In HA (3 replicas), a single pod failure does not lose data since quorum is maintained.
+Total shard loss (all replicas) is the dangerous scenario. PVC storage reduces recovery
+time after total failure because lease data survives on disk, avoiding a thundering herd
+of concurrent lease re-acquisitions. EmptyDir is acceptable for leases in HA
+deployments where total simultaneous failure of all replicas is unlikely.
+
+### KAS Startup Before All Shards Ready
+
+The `wait-for-etcd` init container checks DNS resolution for all shard client services
+before KAS starts. KAS blocks until all shards are available.
+
+### Incorrect ResourcePrefixes Configuration
+
+Syntactically valid but semantically meaningless prefixes (nonexistent resources) are
+accepted by KAS but have no effect — resources are stored in the default shard.
+Admission-time CEL validation catches format errors only.
+
+## Assumptions
+
+1. All supported HyperShift versions run Kubernetes 1.20+ (required for
+ `--etcd-servers-overrides`)
+2. Current etcd version supports multiple independent clusters in the same namespace
+3. Each HostedCluster has a dedicated namespace preventing cross-cluster interference
+4. Storage class supports dynamic provisioning for multiple PVCs
+5. Default behavior (no shards configured) maintains current single-etcd architecture
+
+## Out of Scope (Future Enhancements)
+
+- Dynamic shard rebalancing (moving prefixes between shards post-creation)
+- Migration from non-sharded to sharded (must create new cluster)
+- Shard list mutation after cluster creation
+- Auto-sharding based on resource usage patterns
+- Shard merging and cross-shard transactions
+- Per-shard backup opt-out for PVC-backed shards
+- Routing openshift-apiserver/oauth-apiserver resources to specific shards
+- Framework-level builder methods for priority class and default replicas
+- More than 10 etcd shards per HCP (requires a CRD `MaxItems` increase)
+
+## Infrastructure Needed
+
+- CI access to HCP clusters with 2-3 etcd shards for multi-shard integration tests.
+- Metrics/alerts review for new shard `ControlPlaneComponent` CRs to ensure existing
+ dashboards display per-shard health correctly.