Skip to content

CNTRLPLANE-3524: feat: add etcd data re-encryption after encryption key rotation#8219

Open
muraee wants to merge 9 commits into
openshift:mainfrom
muraee:etcd-reencryption-on-key-rotation
Open

CNTRLPLANE-3524: feat: add etcd data re-encryption after encryption key rotation#8219
muraee wants to merge 9 commits into
openshift:mainfrom
muraee:etcd-reencryption-on-key-rotation

Conversation

@muraee
Copy link
Copy Markdown
Contributor

@muraee muraee commented Apr 13, 2026

Summary

  • Add a re-encryption controller in the HCCO that detects encryption key rotations and triggers StorageVersionMigration CRs to re-encrypt all existing etcd data with the new active key
  • Track progress through a new EtcdDataEncryptionUpToDate condition on HCP/HostedCluster
  • Vendor library-go's KubeStorageVersionMigrator and kube-storage-version-migrator informer/lister packages

Details

Problem

HyperShift supports encryption key rotation (setting a new active key + backup key), but has no mechanism to re-encrypt existing etcd data with the new key. Old data remains encrypted with the previous key indefinitely, violating compliance requirements (OCPSTRAT-2527, OCPSTRAT-2540).

Solution

Key change detection is performed entirely within the HCCO re-encryption controller — no CPO changes needed. The controller computes a SHA-256 fingerprint of the active key identity from the HCP spec on each reconcile and compares it against a hypershift.openshift.io/encryption-migrated-key-hash annotation on the kas-secret-encryption-config secret. When the hashes differ, re-encryption is triggered.

API (api/hypershift/v1beta1/hostedcluster_conditions.go):

  • EtcdDataEncryptionUpToDate condition type
  • Reasons: ReEncryptionInProgress, ReEncryptionCompleted, ReEncryptionFailed, WaitingForKASConvergence

HCCO (control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/):

  • New reencryption controller using library-go's KubeStorageVersionMigrator
  • computeActiveKeyFingerprint() — SHA-256 hash of active key identity (supports AWS KMS, Azure KMS, IBM Cloud KMS, AESCBC)
  • On first deployment, initializes migrated hash to active hash (assumes data is already encrypted with current key)
  • Watches HCP, encryption config secret, and KAS deployment on management cluster
  • Waits for KAS deployment convergence before starting migrations
  • Creates StorageVersionMigration CRs in the guest cluster for all encrypted resources (secrets, configmaps, routes, oauth tokens)
  • Retries failed migrations after 5-minute interval via PruneMigration
  • Sets EtcdDataEncryptionUpToDate condition on HCP with detailed progress messages

HyperShift Operator (hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go):

  • Bubbles up EtcdDataEncryptionUpToDate condition from HCP to HostedCluster (only when encryption is configured and condition exists)

Vendoring

  • library-go/pkg/operator/encryption/controllers/migrators/KubeStorageVersionMigrator, Migrator interface
  • kube-storage-version-migrator/pkg/clients/informer/ and lister/ — SVM informer/lister

Ref: OCPSTRAT-2527, OCPSTRAT-2540
Enhancement: openshift/enhancements#1969

Test plan

  • Unit tests for computeActiveKeyFingerprint() — all provider types, determinism, different keys produce different hashes, unsupported type returns error
  • Unit tests for re-encryption controller reconcile — no encryption, first deploy init, hashes match, KAS not converged, migrations in progress, all succeeded, failed with retry/prune, cache not synced
  • go build ./... passes
  • go test -race passes for all modified packages
  • E2E: AWS KMS key rotation with re-encryption verification
  • E2E: Azure KMS key rotation with re-encryption verification
  • E2E: AESCBC key rotation with re-encryption verification

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added etcd data encryption status tracking and re-encryption progress monitoring to hosted clusters.
    • Extended encryption key status fields across hosted control planes and multiple KMS providers.
    • Introduced encryption migration history tracking and lifecycle state management.
  • Chores

    • Updated storage version migrator dependency classification.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds etcd data encryption status tracking and key rotation lifecycle management to HyperShift. It introduces a new EtcdDataEncryptionUpToDate condition type with five associated re-encryption lifecycle reasons to report encryption state. The API layer extends HostedControlPlaneStatus and HostedClusterStatus with a SecretEncryption status field tracking active key fingerprints, provider identity, migration state, and rotation history. Provider-specific key status types are added for AWS, Azure, and IBM Cloud KMS along with AESCBC encryption, and deprecated backupKey fields are annotated. A new re-encryption controller reconciles HostedControlPlane objects to manage etcd data re-encryption when encryption keys rotate.

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ❓ Inconclusive Custom check specifies Ginkgo patterns; PR uses Go standard table-driven tests. Check's applicability unclear given framework mismatch. Assertion messages are weak (5%). Clarify scope: does check apply to Go standard tests? If yes: criterion 4 fails (assertion messages missing on 74/78 assertions). Add descriptive messages: g.Expect(err).ToNot(HaveOccurred(), "expected successful reconcile")
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in reencryption_test.go are stable and deterministic. No dynamic content like generated IDs, timestamps, or variable interpolation. Tests use static BDD-style descriptions.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests were added. The new test file (reencryption_test.go) uses standard Go unit testing, not Ginkgo framework. No e2e test files were modified. Check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests added. Only new test file is reencryption_test.go with standard Go unit tests, not e2e tests. SNO check not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed No scheduling constraints, affinity rules, or topology assumptions introduced. All changes are API types, controller logic, and dependencies. Existing operator deployment has no constraints.
Ote Binary Stdout Contract ✅ Passed No stdout violations found. All logging uses logr.Logger. No fmt.Print or direct klog calls in process-level code. Tests use standard Go testing, not Ginkgo suite setup.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests were added. Only unit tests in reencryption_test.go using Go testing package. Check is not applicable to this PR.
Title check ✅ Passed The title clearly and specifically describes the main change: adding etcd data re-encryption functionality triggered by encryption key rotation, which aligns with the substantial changes across the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/api Indicates the PR includes changes for the API area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Apr 13, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and enxebre April 13, 2026 13:43
@muraee muraee force-pushed the etcd-reencryption-on-key-rotation branch from d5e5097 to 0d54a40 Compare April 13, 2026 13:46
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go`:
- Around line 98-111: The code is writing the active key fingerprint and
unconditionally setting rekey-needed when the annotation was previously absent;
change the logic in the block that uses computeActiveKeyFingerprint,
EncryptionActiveKeyHashAnnotation and EncryptionRekeyNeededAnnotation so that
you always persist the fingerprint into secret.Annotations when fingerprint !=
"", but only set secret.Annotations[EncryptionRekeyNeededAnnotation] = "true" if
an existingHash was present (existingHash != "") and different from the new
fingerprint; do not mark rekey-needed when writing the hash for the first time.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 245-252: Reorder the finalization so the HCP status is updated to
show re-encryption completed before removing the rekey-needed annotation: call
setReEncryptionCondition(hcp, metav1.ConditionTrue,
hyperv1.ReEncryptionCompletedReason, ...) and persist that status update (as
done in the parent Reconcile flow) first, and only after the status update
succeeds call r.removeRekeyAnnotation(ctx, encryptionSecret); this ensures that
if the status update fails the reconcile will retry (via Reconcile's rekeyNeeded
logic) and that a later annotation-removal failure won't leave the condition
stuck in InProgress.

In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 897-904: The current logic only sets EtcdDataEncryptionUpToDate on
hcluster when the HCP condition exists, which leaves a stale condition when
encryption is disabled, hcp is nil, or the HCP removes the condition; update the
block around hcp and hcluster.Spec.SecretEncryption to handle the
missing/disabled case by removing the bubbled condition instead of doing
nothing: check if hcp is nil or hcluster.Spec.SecretEncryption is nil or
meta.FindStatusCondition(hcp.Status.Conditions,
string(hyperv1.EtcdDataEncryptionUpToDate)) returns nil, and in those cases call
meta.RemoveStatusCondition(&hcluster.Status.Conditions,
string(hyperv1.EtcdDataEncryptionUpToDate)); otherwise (when cond exists)
continue to set cond.ObservedGeneration = hcluster.Generation and
meta.SetStatusCondition as currently done.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 3f62b72e-be23-40c7-8620-e6a59a833dde

📥 Commits

Reviewing files that changed from the base of the PR and between 72647a4 and d5e5097.

⛔ Files ignored due to path filters (18)
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/errors.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/inprocess.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/inprocess_processor.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/kubestorageversionmigrator.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/metrics.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/factory.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/generic.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/internalinterfaces/factory_interfaces.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/interface.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/interface.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/storagestate.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/storageversionmigration.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/expansion_generated.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/storagestate.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/storageversionmigration.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (8)
  • api/hypershift/v1beta1/hostedcluster_conditions.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.go
  • control-plane-operator/hostedclusterconfigoperator/api/scheme.go
  • control-plane-operator/hostedclusterconfigoperator/cmd.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go

Comment thread control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go Outdated
Comment thread hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go (1)

103-111: ⚠️ Potential issue | 🟠 Major

Don't mark re-encryption needed on the first hash write.

When EncryptionActiveKeyHashAnnotation is absent (empty string), the condition existingHash != fingerprint evaluates to true, which sets encryption-rekey-needed=true. This will spuriously trigger re-encryption migrations for new clusters and for existing clusters upgrading to this version.

The hash should always be persisted, but rekey-needed should only be set when a previous hash existed and differs from the new one.

Suggested fix
 	if fingerprint != "" {
 		if secret.Annotations == nil {
 			secret.Annotations = map[string]string{}
 		}
 		existingHash := secret.Annotations[EncryptionActiveKeyHashAnnotation]
-		if existingHash != fingerprint {
-			secret.Annotations[EncryptionActiveKeyHashAnnotation] = fingerprint
-			secret.Annotations[EncryptionRekeyNeededAnnotation] = "true"
+		secret.Annotations[EncryptionActiveKeyHashAnnotation] = fingerprint
+		if existingHash != "" && existingHash != fingerprint {
+			secret.Annotations[EncryptionRekeyNeededAnnotation] = "true"
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go`
around lines 103 - 111, The current logic in secrtencryption.go sets
EncryptionRekeyNeededAnnotation when existingHash != fingerprint even if there
was no previous hash, causing spurious re-encryption; update the block around
secret.Annotations and EncryptionActiveKeyHashAnnotation so you always persist
the new fingerprint but only set EncryptionRekeyNeededAnnotation="true" if an
existing non-empty hash existed and is different (i.e., existingHash != "" &&
existingHash != fingerprint). Keep the nil-check for secret.Annotations and
ensure the new fingerprint is written to EncryptionActiveKeyHashAnnotation in
all cases.
control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go (1)

245-252: ⚠️ Potential issue | 🟡 Minor

Consider ordering: update status before removing annotation.

If the annotation removal at line 247 succeeds but the subsequent HCP status update in the parent Reconcile (line 138) fails, the next reconcile will find rekeyNeeded != "true" (line 167-168) and return early. This leaves the condition potentially stuck in its previous state (e.g., InProgress) rather than Completed.

To ensure consistency, consider splitting this into two reconcile cycles or restructuring to ensure status is persisted before annotation removal is committed.

Alternative approach: set condition before annotation removal
 	// All migrations succeeded — remove the rekey-needed annotation and set condition True.
 	log.Info("All re-encryption migrations completed successfully")
+	setReEncryptionCondition(hcp, metav1.ConditionTrue, hyperv1.ReEncryptionCompletedReason,
+		"All etcd data has been re-encrypted with the active encryption key")
+
+	// Note: If annotation removal fails after status update succeeds in parent Reconcile,
+	// the next reconcile will re-process and retry annotation removal (migrations already done).
 	if err := r.removeRekeyAnnotation(ctx, encryptionSecret); err != nil {
 		return reconcile.Result{}, fmt.Errorf("failed to remove rekey-needed annotation: %w", err)
 	}
-	setReEncryptionCondition(hcp, metav1.ConditionTrue, hyperv1.ReEncryptionCompletedReason,
-		"All etcd data has been re-encrypted with the active encryption key")
 	return reconcile.Result{}, nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`
around lines 245 - 252, Reorder the success path so the HCP status is updated to
ReEncryptionCompleted before removing the rekey-needed annotation: call
setReEncryptionCondition(hcp, metav1.ConditionTrue,
hyperv1.ReEncryptionCompletedReason, ...) and ensure the status update is
persisted (via the same status update code used in Reconcile) prior to calling
removeRekeyAnnotation(ctx, encryptionSecret); if persisting status can fail,
return the error and only remove the annotation after status update succeeds,
ensuring the Reconcile loop sees the completed condition even if annotation
removal later fails; references: setReEncryptionCondition,
removeRekeyAnnotation, Reconcile, rekeyNeeded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go`:
- Around line 103-111: The current logic in secrtencryption.go sets
EncryptionRekeyNeededAnnotation when existingHash != fingerprint even if there
was no previous hash, causing spurious re-encryption; update the block around
secret.Annotations and EncryptionActiveKeyHashAnnotation so you always persist
the new fingerprint but only set EncryptionRekeyNeededAnnotation="true" if an
existing non-empty hash existed and is different (i.e., existingHash != "" &&
existingHash != fingerprint). Keep the nil-check for secret.Annotations and
ensure the new fingerprint is written to EncryptionActiveKeyHashAnnotation in
all cases.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 245-252: Reorder the success path so the HCP status is updated to
ReEncryptionCompleted before removing the rekey-needed annotation: call
setReEncryptionCondition(hcp, metav1.ConditionTrue,
hyperv1.ReEncryptionCompletedReason, ...) and ensure the status update is
persisted (via the same status update code used in Reconcile) prior to calling
removeRekeyAnnotation(ctx, encryptionSecret); if persisting status can fail,
return the error and only remove the annotation after status update succeeds,
ensuring the Reconcile loop sees the completed condition even if annotation
removal later fails; references: setReEncryptionCondition,
removeRekeyAnnotation, Reconcile, rekeyNeeded.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 969834bc-33b1-4893-9f05-17b72afb37c2

📥 Commits

Reviewing files that changed from the base of the PR and between d5e5097 and 0d54a40.

⛔ Files ignored due to path filters (18)
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/errors.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/inprocess.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/inprocess_processor.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/kubestorageversionmigrator.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/metrics.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/library-go/pkg/operator/encryption/controllers/migrators/types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/factory.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/generic.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/internalinterfaces/factory_interfaces.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/interface.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/interface.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/storagestate.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/informer/migration/v1alpha1/storageversionmigration.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/expansion_generated.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/storagestate.go is excluded by !vendor/**, !**/vendor/**
  • vendor/sigs.k8s.io/kube-storage-version-migrator/pkg/clients/lister/migration/v1alpha1/storageversionmigration.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (8)
  • api/hypershift/v1beta1/hostedcluster_conditions.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.go
  • control-plane-operator/hostedclusterconfigoperator/api/scheme.go
  • control-plane-operator/hostedclusterconfigoperator/cmd.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
✅ Files skipped from review due to trivial changes (1)
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • control-plane-operator/hostedclusterconfigoperator/cmd.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • api/hypershift/v1beta1/hostedcluster_conditions.go

@muraee muraee force-pushed the etcd-reencryption-on-key-rotation branch from 0d54a40 to e29501b Compare April 13, 2026 16:35
@openshift-ci openshift-ci Bot added the area/documentation Indicates the PR includes changes for documentation label Apr 13, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go (1)

245-251: ⚠️ Potential issue | 🟠 Major

Persist ReEncryptionCompleted before clearing encryption-rekey-needed.

removeRekeyAnnotation() runs before the outer Reconcile() call writes hcp.Status. If the secret update succeeds but that later status update fails, the next pass sees rekey-needed absent and exits early, leaving EtcdDataEncryptionUpToDate stuck at its previous value. Move annotation removal to a second phase after the completed condition has been successfully persisted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`
around lines 245 - 251, The code currently removes the encryption-rekey-needed
annotation (removeRekeyAnnotation) before the controller persists the updated
HCP status (setReEncryptionCondition), which can cause the requeue logic to skip
remaining work if the status update later fails; change the order so you first
set and persist the ReEncryptionCompleted condition on the HostedControlPlane
(call setReEncryptionCondition and ensure the status update is saved via the
status writer used in Reconcile), verify that the status update succeeded, and
only then call removeRekeyAnnotation to delete the encryption-rekey-needed
annotation; use the existing functions removeRekeyAnnotation,
setReEncryptionCondition and the same status update flow in Reconcile to
implement this two-phase commit.
hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go (1)

897-904: ⚠️ Potential issue | 🟠 Major

Remove the bubbled condition when HCP no longer reports it.

This only copies EtcdDataEncryptionUpToDate when the HCP currently has it. If encryption is disabled, the HCP is gone, or the reencryption controller removes the HCP condition, the old HostedCluster value sticks around and becomes stale.

Suggested fix
-	// Bubble up the EtcdDataEncryptionUpToDate condition from HCP when encryption is configured.
-	if hcp != nil && hcluster.Spec.SecretEncryption != nil {
-		cond := meta.FindStatusCondition(hcp.Status.Conditions, string(hyperv1.EtcdDataEncryptionUpToDate))
-		if cond != nil {
-			cond.ObservedGeneration = hcluster.Generation
-			meta.SetStatusCondition(&hcluster.Status.Conditions, *cond)
-		}
-	}
+	// Bubble up the EtcdDataEncryptionUpToDate condition from HCP when encryption is configured.
+	if hcluster.Spec.SecretEncryption == nil || hcp == nil {
+		meta.RemoveStatusCondition(&hcluster.Status.Conditions, string(hyperv1.EtcdDataEncryptionUpToDate))
+	} else if cond := meta.FindStatusCondition(hcp.Status.Conditions, string(hyperv1.EtcdDataEncryptionUpToDate)); cond != nil {
+		cond.ObservedGeneration = hcluster.Generation
+		meta.SetStatusCondition(&hcluster.Status.Conditions, *cond)
+	} else {
+		meta.RemoveStatusCondition(&hcluster.Status.Conditions, string(hyperv1.EtcdDataEncryptionUpToDate))
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`
around lines 897 - 904, The current bubble-up only sets
EtcdDataEncryptionUpToDate when HCP has the condition, but never removes it when
HCP no longer reports it; update the block in hostedcluster_controller.go that
references hcp, hcluster, meta.FindStatusCondition and meta.SetStatusCondition
to also handle the nil case: if hcp is nil, hcluster.Spec.SecretEncryption is
nil, or meta.FindStatusCondition(...) returns nil, call
meta.RemoveStatusCondition(&hcluster.Status.Conditions,
string(hyperv1.EtcdDataEncryptionUpToDate)) to delete the stale condition;
otherwise keep the existing behavior of updating ObservedGeneration and calling
meta.SetStatusCondition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 127-130: The Get calls that fetch the HostedControlPlane (hcp :=
resourcemanifests.HostedControlPlane(...) followed by r.cpClient.Get(...)) and
the similar Gets for the encryption config Secret and KAS Deployment should
treat apierrors.IsNotFound(err) as a transient condition instead of returning a
hard error; update the error handling in the r.cpClient.Get(...) blocks
(including the HCP fetch and the other Get calls around 158-176) to check
apierrors.IsNotFound(err) and return a non-error reconcile result (e.g.,
reconcile.Result{} or a short requeue with nil error) so the controller
idles/publishes waiting state instead of entering error backoff.
- Around line 165-166: The code uses
encryptionSecret.Annotations[kas.EncryptionActiveKeyHashAnnotation] (and the
same pattern at the other occurrence around lines ~200–201) as the writeKey for
EnsureMigration which is a deterministic key fingerprint and can collide when
rotating back to a previously used key; change this to a monotonic rotation
token annotation (e.g., kas.EncryptionRotationTokenAnnotation) that is
incremented (or replaced with a monotonic nonce) each time a rotation is
initiated and persisted on encryptionSecret, and update the code that reads the
annotation (the places referencing
encryptionSecret.Annotations[kas.EncryptionActiveKeyHashAnnotation]) to read the
rotation token instead and pass that token as the writeKey into EnsureMigration
so each rotation event is unique even if the underlying key value repeats.

---

Duplicate comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 245-251: The code currently removes the encryption-rekey-needed
annotation (removeRekeyAnnotation) before the controller persists the updated
HCP status (setReEncryptionCondition), which can cause the requeue logic to skip
remaining work if the status update later fails; change the order so you first
set and persist the ReEncryptionCompleted condition on the HostedControlPlane
(call setReEncryptionCondition and ensure the status update is saved via the
status writer used in Reconcile), verify that the status update succeeded, and
only then call removeRekeyAnnotation to delete the encryption-rekey-needed
annotation; use the existing functions removeRekeyAnnotation,
setReEncryptionCondition and the same status update flow in Reconcile to
implement this two-phase commit.

In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go`:
- Around line 897-904: The current bubble-up only sets
EtcdDataEncryptionUpToDate when HCP has the condition, but never removes it when
HCP no longer reports it; update the block in hostedcluster_controller.go that
references hcp, hcluster, meta.FindStatusCondition and meta.SetStatusCondition
to also handle the nil case: if hcp is nil, hcluster.Spec.SecretEncryption is
nil, or meta.FindStatusCondition(...) returns nil, call
meta.RemoveStatusCondition(&hcluster.Status.Conditions,
string(hyperv1.EtcdDataEncryptionUpToDate)) to delete the stale condition;
otherwise keep the existing behavior of updating ObservedGeneration and calling
meta.SetStatusCondition.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: fe17a0ce-bc5e-4d0c-87ad-e9027725f712

📥 Commits

Reviewing files that changed from the base of the PR and between 0d54a40 and e29501b.

⛔ Files ignored due to path filters (4)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_kas_secret_encryption_config_secret.yaml is excluded by !**/testdata/**
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (9)
  • api/hypershift/v1beta1/hostedcluster_conditions.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.go
  • control-plane-operator/hostedclusterconfigoperator/api/scheme.go
  • control-plane-operator/hostedclusterconfigoperator/cmd.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.go
  • go.mod
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
✅ Files skipped from review due to trivial changes (1)
  • control-plane-operator/hostedclusterconfigoperator/api/scheme.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • control-plane-operator/hostedclusterconfigoperator/cmd.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.go
  • api/hypershift/v1beta1/hostedcluster_conditions.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 58.40108% with 307 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.58%. Comparing base (36dfb1b) to head (2db12b1).
⚠️ Report is 36 commits behind head on main.

⚠️ Current head 2db12b1 differs from pull request most recent head c816d88

Please upload reports for the commit c816d88 to get more accurate results.

Files with missing lines Patch % Lines
...goperator/controllers/reencryption/reencryption.go 68.16% 98 Missing and 22 partials ⚠️
...erconfigoperator/controllers/reencryption/setup.go 0.00% 58 Missing ⚠️
...rator/controllers/hostedcontrolplane/v2/kas/kms.go 52.57% 33 Missing and 13 partials ⚠️
...lers/hostedcontrolplane/v2/kas/secretencryption.go 58.06% 27 Missing and 12 partials ⚠️
...lane/v2/kube_storage_version_migrator/component.go 0.00% 16 Missing ⚠️
...ontrollers/hostedcontrolplane/v2/kas/deployment.go 0.00% 15 Missing ⚠️
...or/controllers/hostedcontrolplane/v2/kas/aescbc.go 66.66% 2 Missing and 2 partials ⚠️
...r/controllers/hostedcontrolplane/v2/kas/kms/aws.go 81.25% 2 Missing and 1 partial ⚠️
...controllers/hostedcontrolplane/v2/kas/kms/azure.go 86.36% 2 Missing and 1 partial ⚠️
...trollers/hostedcontrolplane/v2/kas/kms/ibmcloud.go 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8219      +/-   ##
==========================================
+ Coverage   40.40%   40.58%   +0.17%     
==========================================
  Files         755      762       +7     
  Lines       93235    94126     +891     
==========================================
+ Hits        37675    38202     +527     
- Misses      52858    53166     +308     
- Partials     2702     2758      +56     
Files with missing lines Coverage Δ
...ostedcontrolplane/hostedcontrolplane_controller.go 45.04% <100.00%> (+0.02%) ⬆️
...ontrollers/hostedcontrolplane/v2/cvo/deployment.go 41.17% <100.00%> (+0.58%) ⬆️
...r/controllers/hostedcontrolplane/v2/kas/kms/kms.go 100.00% <ø> (ø)
...-plane-operator/hostedclusterconfigoperator/cmd.go 0.00% <ø> (ø)
...configoperator/controllers/reencryption/metrics.go 100.00% <100.00%> (ø)
...trollers/hostedcluster/hostedcluster_controller.go 43.71% <ø> (+0.02%) ⬆️
hypershift-operator/main.go 0.00% <ø> (ø)
support/config/kms.go 0.00% <ø> (ø)
support/secretencryption/encryptionconfig.go 80.76% <ø> (ø)
support/secretencryption/fingerprint.go 53.84% <ø> (ø)
... and 11 more
Flag Coverage Δ
cmd-support 34.56% <ø> (+0.12%) ⬆️
cpo-hostedcontrolplane 42.15% <53.42%> (+0.38%) ⬆️
cpo-other 41.08% <61.38%> (+0.76%) ⬆️
hypershift-operator 50.59% <ø> (-0.14%) ⬇️
other 31.54% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@muraee muraee force-pushed the etcd-reencryption-on-key-rotation branch from e29501b to e1b396f Compare April 13, 2026 19:07
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go (2)

254-263: ⚠️ Potential issue | 🟡 Minor

Completed status can still be lost on the failure path.

Line 261 persists annotation removal before the caller persists the updated HCP status. If removeRekeyAnnotation() succeeds and the later Status().Update() fails, the next reconcile hits rekeyNeeded != "true" and exits early, leaving EtcdDataEncryptionUpToDate stale instead of ReEncryptionCompleted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`
around lines 254 - 263, The code removes the rekey-needed annotation before the
updated HCP status is persisted, risking loss of the Completed condition if the
subsequent status update fails; change the order so you persist the updated HCP
status with the ReEncryptionCompleted condition (call the HCP status update via
r.Client.Status().Update or the existing status update helper) first, check that
it succeeded, and only then call removeRekeyAnnotation(encryptionSecret);
reference setReEncryptionCondition, the HCP Status().Update call, and
removeRekeyAnnotation to locate and reorder the operations.

169-171: ⚠️ Potential issue | 🟠 Major

Use a rotation token here, not the active-key fingerprint.

activeKeyHash identifies the current key, not the rotation event. If a cluster rotates A → B → A, this passes the same writeKey back into EnsureMigration, so an older completed migration for key A can satisfy the check even though the latest rotation still needs objects rewritten from B to A.

Also applies to: 208-210

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`
around lines 169 - 171, The code is using the active-key fingerprint
(encryptionSecret.Annotations[kas.EncryptionActiveKeyHashAnnotation]) to drive
EnsureMigration, which breaks for A→B→A rotations; instead read and use the
rotation token annotation (e.g.,
encryptionSecret.Annotations[kas.EncryptionRotationTokenAnnotation]) as the
writeKey passed to EnsureMigration so each rotation event is unique; update the
usage in the block that reads rekeyNeeded and activeKeyHash and the duplicate
occurrence around the EnsureMigration call (the lines referencing
kas.EncryptionActiveKeyHashAnnotation and the EnsureMigration/writeKey variable)
to use the rotation-token annotation and pass that token through.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go`:
- Around line 188-199: The fingerprinting currently includes the Secret name
(secretEncryption.AESCBC.ActiveKey.Name) which causes renames/copies to change
the fingerprint; change the input construction in the AESCBC case to only
fingerprint the key bytes: compute the SHA256 of
activeKeySecret.Data[hyperv1.AESCBCKeySecretKey] and set input to include only
the algorithm and the hash (e.g. "aescbc/<sha256>"), removing any use of
ActiveKey.Name; ensure you reference activeKeySecret and
hyperv1.AESCBCKeySecretKey when locating the key bytes and preserve existing
error handling around cpContext.Client.Get.

---

Duplicate comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 254-263: The code removes the rekey-needed annotation before the
updated HCP status is persisted, risking loss of the Completed condition if the
subsequent status update fails; change the order so you persist the updated HCP
status with the ReEncryptionCompleted condition (call the HCP status update via
r.Client.Status().Update or the existing status update helper) first, check that
it succeeded, and only then call removeRekeyAnnotation(encryptionSecret);
reference setReEncryptionCondition, the HCP Status().Update call, and
removeRekeyAnnotation to locate and reorder the operations.
- Around line 169-171: The code is using the active-key fingerprint
(encryptionSecret.Annotations[kas.EncryptionActiveKeyHashAnnotation]) to drive
EnsureMigration, which breaks for A→B→A rotations; instead read and use the
rotation token annotation (e.g.,
encryptionSecret.Annotations[kas.EncryptionRotationTokenAnnotation]) as the
writeKey passed to EnsureMigration so each rotation event is unique; update the
usage in the block that reads rekeyNeeded and activeKeyHash and the duplicate
occurrence around the EnsureMigration call (the lines referencing
kas.EncryptionActiveKeyHashAnnotation and the EnsureMigration/writeKey variable)
to use the rotation-token annotation and pass that token through.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: b08bffff-7931-4567-990b-d445f0cabf23

📥 Commits

Reviewing files that changed from the base of the PR and between e29501b and e1b396f.

⛔ Files ignored due to path filters (4)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_kas_secret_encryption_config_secret.yaml is excluded by !**/testdata/**
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (9)
  • api/hypershift/v1beta1/hostedcluster_conditions.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.go
  • control-plane-operator/hostedclusterconfigoperator/api/scheme.go
  • control-plane-operator/hostedclusterconfigoperator/cmd.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.go
  • go.mod
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
✅ Files skipped from review due to trivial changes (3)
  • control-plane-operator/hostedclusterconfigoperator/api/scheme.go
  • go.mod
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption_test.go
  • api/hypershift/v1beta1/hostedcluster_conditions.go

Comment thread control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go Outdated
@muraee muraee force-pushed the etcd-reencryption-on-key-rotation branch from e1b396f to d9f1f6a Compare April 14, 2026 10:04
@devguyio
Copy link
Copy Markdown
Contributor

devguyio commented Apr 14, 2026

I now have all the information needed for a complete root cause analysis. Here is the report:


Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Commit fab20778d3:
1: CT1 Title does not start with one of fix, feat, chore, docs, style, refactor, perf, test, revert, ci, build:
   "vendor: add library-go migrators and kube-storage-version-migrator informer/lister"

Root Cause

The first commit in the PR (`fab20778d3`) has a title that **violates the Conventional Commits specification** enforced by the repository's `.gitlint` configuration.

The commit title is:

vendor: add library-go migrators and kube-storage-version-migrator informer/lister

The .gitlint configuration at the repository root requires titles to follow Conventional Commits format, with the allowed type prefixes being:

types = fix,feat,chore,docs,style,refactor,perf,test,revert,ci,build

The prefix vendor is not in the allowed list. The gitlint contrib-title-conventional-commits rule (CT1) rejects any commit title that does not begin with one of these types followed by a colon.

The second commit (d5e509743d) passes validation — its title feat: add etcd data re-encryption after encryption key rotation correctly uses the feat prefix.

Recommendations

These are some recommendations based on the failure analysis 1. **Reword the first commit's title** to use an allowed Conventional Commits prefix. The most appropriate prefix for a vendoring commit is `chore` or `build`: ``` chore: vendor library-go migrators and kube-storage-version-migrator informer/lister ``` or: ``` build: vendor library-go migrators and kube-storage-version-migrator informer/lister ```
  1. Interactive rebase to amend the commit message:
    git rebase -i HEAD~2
    # Change 'pick' to 'reword' for the first commit
    # Update the title to use chore: or build: prefix
    git push --force-with-lease

Evidence

This is some collected evidence that helped reach the mentioned root cause
Evidence Detail
Failing commit fab20778d318dbac81809b5f18434bbfacd3716e
Commit title vendor: add library-go migrators and kube-storage-version-migrator informer/lister
Gitlint rule violated CT1 (contrib-title-conventional-commits)
Invalid prefix used vendor
Allowed prefixes fix, feat, chore, docs, style, refactor, perf, test, revert, ci, build
Passing commit d5e509743d — title correctly uses feat: prefix

@muraee muraee force-pushed the etcd-reencryption-on-key-rotation branch from d9f1f6a to 968dac4 Compare April 14, 2026 10:14
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go (1)

135-136: ⚠️ Potential issue | 🟡 Minor

Handle deleted HostedControlPlane as a no-op.

If the watched HCP is already gone, this returns an error and the controller backs off during normal teardown. Treat apierrors.IsNotFound(err) here the same way as the Secret/Deployment lookups and just exit cleanly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`
around lines 135 - 136, When fetching the HostedControlPlane (hcp) with
r.cpClient.Get in the Reconcile handler, treat a not-found error as a benign
delete and exit cleanly instead of returning an error; update the error handling
around r.cpClient.Get(ctx, client.ObjectKeyFromObject(hcp), hcp) to check
apierrors.IsNotFound(err) and return reconcile.Result{}, nil for that case,
while preserving the existing fmt.Errorf path for other errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 185-200: The code treats a missing
EncryptionMigratedKeyHashAnnotation as "first deployment" and initializes it to
activeKeyHash (via setMigratedKeyHash), which incorrectly skips re-encryption
when kas-secret-encryption-config is recreated during a pending rotation; modify
the logic in the reconcile path that reads EncryptionMigratedKeyHashAnnotation
and compares it to activeKeyHash so that both the "initialize migrated key hash"
branch and the migratedKeyHash == activeKeyHash fast-path are only taken when
the rekey-needed signal (hypershift.openshift.io/encryption-rekey-needed) is not
present/true; i.e., check the hypershift.openshift.io/encryption-rekey-needed
annotation before calling setMigratedKeyHash or returning early on equal hashes
so pending rekey flows continue to perform re-encryption.
- Around line 363-367: The code uses
activeKeySecret.Data[hyperv1.AESCBCKeySecretKey] without validating it and then
hashes potentially empty bytes; update the logic around the c.Get(...) result in
reencryption.go to check that activeKeySecret.Data contains a non-empty entry
for hyperv1.AESCBCKeySecretKey before computing input (the fingerprint), and if
the key is missing or empty return a clear error (or handle it upstream) instead
of generating a fingerprint for empty bytes; adjust any callers that rely on
input/EncryptionMigratedKeyHashAnnotation to expect this error path.

---

Duplicate comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go`:
- Around line 135-136: When fetching the HostedControlPlane (hcp) with
r.cpClient.Get in the Reconcile handler, treat a not-found error as a benign
delete and exit cleanly instead of returning an error; update the error handling
around r.cpClient.Get(ctx, client.ObjectKeyFromObject(hcp), hcp) to check
apierrors.IsNotFound(err) and return reconcile.Result{}, nil for that case,
while preserving the existing fmt.Errorf path for other errors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 68ffe0d4-c58d-4278-b70b-957f7c8b917b

📥 Commits

Reviewing files that changed from the base of the PR and between e1b396f and d9f1f6a.

⛔ Files ignored due to path filters (4)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/kube-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_kas_secret_encryption_config_secret.yaml is excluded by !**/testdata/**
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
  • docs/content/reference/api.md is excluded by !docs/content/reference/api.md
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/hostedcluster_conditions.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (8)
  • api/hypershift/v1beta1/hostedcluster_conditions.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go
  • control-plane-operator/hostedclusterconfigoperator/api/scheme.go
  • control-plane-operator/hostedclusterconfigoperator/cmd.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/reencryption_test.go
  • go.mod
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
✅ Files skipped from review due to trivial changes (3)
  • control-plane-operator/hostedclusterconfigoperator/api/scheme.go
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/kas/secretencryption.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • go.mod

@muraee
Copy link
Copy Markdown
Contributor Author

muraee commented Apr 14, 2026

Re: @coderabbitai's comment on AESCBC key validation:

Done. Added validation to return empty fingerprint when key data is missing or empty.


AI-assisted response via Claude Code

@muraee
Copy link
Copy Markdown
Contributor Author

muraee commented Apr 14, 2026

Re: @coderabbitai's comment on missing migrated-hash during rotation:

The encryption-rekey-needed annotation was intentionally removed in this revision. The controller now detects re-encryption need by comparing the active key fingerprint (computed from HCP spec) against encryption-migrated-key-hash on the secret — no boolean signal involved.

The scenario described (secret recreated during pending rotation) requires manual deletion of an operator-managed secret, which would also disrupt KAS itself. In that edge case, the controller initializes the migrated hash to the current active key, which is correct — the CPO will recreate the secret with the current encryption config, and KAS will use the current key going forward. If re-encryption for the previous key was incomplete, the next key rotation will trigger a full re-encryption.


AI-assisted response via Claude Code

@muraee muraee force-pushed the etcd-reencryption-on-key-rotation branch from 968dac4 to 2b0d02c Compare April 14, 2026 10:42
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 14, 2026

Tip

For best results, initiate chat on the files or code changes.

@muraee: Thanks for addressing this! Adding the guard to return an empty fingerprint when the AESCBC key data is missing or empty is the right approach — it prevents computing a potentially incorrect fingerprint from incomplete data and avoids triggering unnecessary re-encryption cycles.

@muraee muraee force-pushed the etcd-reencryption-on-key-rotation branch from 670e29c to 2cd3d4e Compare May 21, 2026 11:31
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8219 May 21, 2026 11:33 Inactive
@muraee muraee force-pushed the etcd-reencryption-on-key-rotation branch from 2cd3d4e to 3526885 Compare May 21, 2026 11:56
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 21, 2026
muraee added 6 commits May 21, 2026 13:57
…oDate condition

Add API types for tracking etcd data re-encryption after key rotation:
- SecretEncryptionStatus with ActiveKey, TargetKey, and History fields
- SecretEncryptionKeyStatus union type (Azure, AWS, IBMCloud, AESCBC)
- Per-provider key status types (AzureKMSKeyStatus, AWSKMSKeyStatus, etc.)
- EncryptionMigrationHistory and EncryptionMigrationState types
- EtcdDataEncryptionUpToDate condition with 7 reason constants
- SecretEncryption field added to HostedClusterStatus and HCPStatus
- Deprecated backupKey fields on AWSKMSSpec, AzureKMSSpec, AESCBCSpec

Signed-off-by: Mulham Raee <mulham.raee@gmail.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Auto-generated output from make update after API type changes.
Includes vendored library-go migrators and kube-storage-version-migrator
informer/lister packages required by the HCCO re-encryption controller.

Signed-off-by: Mulham Raee <mulham.raee@gmail.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
…-up, and rotation guard VAP

- Add support/secretencryption/ package with fingerprint computation,
  key status construction, and EncryptionConfiguration parsing
- Copy EtcdDataEncryptionUpToDate condition from HCP to HostedCluster
  using the copy-if-present pattern
- Copy SecretEncryption status from HCP to HostedCluster
- Deploy ValidatingAdmissionPolicy that blocks active key changes while
  re-encryption is in progress (EtcdDataEncryptionUpToDate=False)

Signed-off-by: Mulham Raee <mulham.raee@gmail.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
…out and deploy migrator

Two-stage KAS encryption key rollout:
- Rewrite adaptSecretEncryptionConfig to derive write/read key assignment
  from HCP status and live EncryptionConfiguration
- ReadOnlyDeploy: old key writes, new key read-only
- WritePromote: new key writes once in config and KAS converged
- Extract AWSKMSProviderName, AzureKMSProviderName, AESCBCKeyName as
  shared functions for provider name computation
- Add --encryption-provider-config-automatic-reload=false for Azure KMS

Control plane kube-storage-version-migrator:
- New CPOv2 component with platform predicate (skip IBM/PowerVS)
- Add data-plane migrator deployments to CVO resourcesToRemove

Signed-off-by: Mulham Raee <mulham.raee@gmail.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Add re-encryption controller to the HCCO that drives the key rotation
lifecycle by deriving the current phase from observable state:

- Detect key changes by comparing spec vs status fingerprints
- Derive phase from EncryptionConfiguration contents and KAS convergence
- Create StorageVersionMigration CRs via KubeStorageVersionMigrator
- Manage targetKey, activeKey, and history in HCP status
- Set EtcdDataEncryptionUpToDate condition with phase-specific reasons
- Expose Prometheus metrics for rotation state, duration, and failures

Signed-off-by: Mulham Raee <mulham.raee@gmail.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
Signed-off-by: Mulham Raee <mulham.raee@gmail.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
@muraee muraee force-pushed the etcd-reencryption-on-key-rotation branch from 3526885 to f5ba6bd Compare May 21, 2026 12:05
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 21, 2026
@github-actions github-actions Bot temporarily deployed to docs-preview/pr-8219 May 21, 2026 12:09 Inactive
Signed-off-by: Mulham Raee <mulham.raee@gmail.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
@hypershift-jira-solve-ci
Copy link
Copy Markdown

I have all the evidence. The lint target runs both api-lint and the main golangci-lint. The api-lint step failed with 1 issue (exit code 1), and even though the main lint step passed (0 issues), the Makefile takes the maximum exit code (exit $(( api_rc > main_rc ? api_rc : main_rc ))), resulting in exit code 2 overall.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

hypershift/v1beta1/ibmcloud.go:64:2: requiredfields: field IBMCloudKMSKeyStatus.KeyVersion has a valid zero value (0) and should be a pointer. (kubeapilinter)
	KeyVersion int32 `json:"keyVersion,omitempty"`
	^
1 issues:
* kubeapilinter: 1
make[1]: *** [Makefile:112: api-lint] Error 1

Summary

The api-lint step (golangci-lint with kubeapilinter plugin) flagged the newly added IBMCloudKMSKeyStatus.KeyVersion field at api/hypershift/v1beta1/ibmcloud.go:64. The field is declared as int32 with a +required kubebuilder marker, but int32 has a valid zero value (0), which means the API server cannot distinguish between "user explicitly set 0" and "user omitted the field." The kubeapilinter requiredfields rule requires that such fields be declared as a pointer type (*int32) so that nil can represent an unset value. The main golangci-lint pass (non-API linters) reported 0 issues, but the Makefile lint target returns the maximum exit code of both passes, so the overall job failed with exit code 2.

Root Cause

The PR adds a new struct IBMCloudKMSKeyStatus in api/hypershift/v1beta1/ibmcloud.go with the field:

// keyVersion is the key version number.
// +required
// +kubebuilder:validation:Minimum=0
KeyVersion int32 `json:"keyVersion,omitempty"`

The kubeapilinter requiredfields linter enforces that any field marked +required whose Go type has a meaningful zero value (like int32 where 0 is valid) must be declared as a pointer (*int32). This is a Kubernetes API convention: for non-pointer numeric types, the API server cannot distinguish between a user setting the value to 0 and the user omitting the field entirely (since Go initializes int32 to 0). Without a pointer, deserializing an omitted field silently produces 0, violating the +required contract.

The fix is to change the field type from int32 to *int32. Note that the existing IBMCloudKMSKeyEntry.KeyVersion field (line 43) uses int and is also marked +required — it wasn't flagged only because it's not new code (the api-lint target uses --new-from-rev to only lint new/changed lines).

Recommendations
  1. Change KeyVersion to a pointer type in IBMCloudKMSKeyStatus:

    KeyVersion *int32 `json:"keyVersion,omitempty"`
  2. Update deepcopy-gen output — since IBMCloudKMSKeyStatus is marked +k8s:deepcopy-gen=true, regenerate deep copy methods to handle the pointer field properly (run make generate).

  3. Update all code referencing IBMCloudKMSKeyStatus.KeyVersion to dereference the pointer safely (nil checks). Check control-plane-operator/controllers/hostedcontrolplane/v2/kas/kms/ibmcloud.go and the generated apply-configuration file client/applyconfiguration/hypershift/v1beta1/ibmcloudkmskeystatus.go.

  4. Update the vendored copy at vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/ibmcloud.go to match.

  5. Consider the same fix for IBMCloudKMSKeyEntry.KeyVersion (line 43) which has the identical pattern (int + +required) — it wasn't flagged because --new-from-rev only checks new code, but it has the same API correctness issue.

Evidence
Evidence Detail
Lint error hypershift/v1beta1/ibmcloud.go:64:2: requiredfields: field IBMCloudKMSKeyStatus.KeyVersion has a valid zero value (0) and should be a pointer. (kubeapilinter)
Failing target make[1]: *** [Makefile:112: api-lint] Error 1make: *** [Makefile:120: lint] Error 2
Flagged field KeyVersion int32 \json:"keyVersion,omitempty"`(line 64 ofapi/hypershift/v1beta1/ibmcloud.go`)
Linter rule requiredfields from kubeapilinter plugin — requires pointer types for +required fields with valid Go zero values
New code from PR IBMCloudKMSKeyStatus struct added entirely by this PR (+35 lines in ibmcloud.go)
Main lint result 0 issues (passed) — failure is exclusively from api-lint
golangci-lint version 2.11.4 with go1.25.7

Comment thread api/hypershift/v1beta1/hostedcluster_types.go Outdated
Comment thread api/hypershift/v1beta1/ibmcloud.go Outdated
Comment thread api/hypershift/v1beta1/hostedcluster_types.go
Comment thread api/hypershift/v1beta1/aws.go Outdated
Comment thread api/hypershift/v1beta1/ibmcloud.go Outdated
Comment thread api/hypershift/v1beta1/hostedcluster_types.go
muraee and others added 2 commits May 26, 2026 13:01
Replace the duplicated *KeyStatus types (AWSKMSKeyStatus,
AzureKMSKeyStatus, IBMCloudKMSKeyStatus) with the existing spec types
(AWSKMSKeyEntry, AzureKMSKey, IBMCloudKMSKeyEntry) in
SecretEncryptionKeyStatus. This follows the Kubernetes convention of
reusing spec types in status when the data shape is the same.

Also adds omitempty to the inner fields of the three spec types,
fixing a pre-existing API convention violation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…omitzero

Use metav1.Time (non-pointer) with omitzero instead of *metav1.Time
with omitempty. The IsZero() method on metav1.Time makes omitzero
correctly omit the field when unset, removing the need for a pointer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 26, 2026

@muraee: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@muraee muraee changed the title feat: add etcd data re-encryption after encryption key rotation CNTRLPLANE-3524: feat: add etcd data re-encryption after encryption key rotation May 28, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 28, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 28, 2026

@muraee: This pull request references CNTRLPLANE-3524 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Add a re-encryption controller in the HCCO that detects encryption key rotations and triggers StorageVersionMigration CRs to re-encrypt all existing etcd data with the new active key
  • Track progress through a new EtcdDataEncryptionUpToDate condition on HCP/HostedCluster
  • Vendor library-go's KubeStorageVersionMigrator and kube-storage-version-migrator informer/lister packages

Details

Problem

HyperShift supports encryption key rotation (setting a new active key + backup key), but has no mechanism to re-encrypt existing etcd data with the new key. Old data remains encrypted with the previous key indefinitely, violating compliance requirements (OCPSTRAT-2527, OCPSTRAT-2540).

Solution

Key change detection is performed entirely within the HCCO re-encryption controller — no CPO changes needed. The controller computes a SHA-256 fingerprint of the active key identity from the HCP spec on each reconcile and compares it against a hypershift.openshift.io/encryption-migrated-key-hash annotation on the kas-secret-encryption-config secret. When the hashes differ, re-encryption is triggered.

API (api/hypershift/v1beta1/hostedcluster_conditions.go):

  • EtcdDataEncryptionUpToDate condition type
  • Reasons: ReEncryptionInProgress, ReEncryptionCompleted, ReEncryptionFailed, WaitingForKASConvergence

HCCO (control-plane-operator/hostedclusterconfigoperator/controllers/reencryption/):

  • New reencryption controller using library-go's KubeStorageVersionMigrator
  • computeActiveKeyFingerprint() — SHA-256 hash of active key identity (supports AWS KMS, Azure KMS, IBM Cloud KMS, AESCBC)
  • On first deployment, initializes migrated hash to active hash (assumes data is already encrypted with current key)
  • Watches HCP, encryption config secret, and KAS deployment on management cluster
  • Waits for KAS deployment convergence before starting migrations
  • Creates StorageVersionMigration CRs in the guest cluster for all encrypted resources (secrets, configmaps, routes, oauth tokens)
  • Retries failed migrations after 5-minute interval via PruneMigration
  • Sets EtcdDataEncryptionUpToDate condition on HCP with detailed progress messages

HyperShift Operator (hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go):

  • Bubbles up EtcdDataEncryptionUpToDate condition from HCP to HostedCluster (only when encryption is configured and condition exists)

Vendoring

  • library-go/pkg/operator/encryption/controllers/migrators/KubeStorageVersionMigrator, Migrator interface
  • kube-storage-version-migrator/pkg/clients/informer/ and lister/ — SVM informer/lister

Ref: OCPSTRAT-2527, OCPSTRAT-2540
Enhancement: openshift/enhancements#1969

Test plan

  • Unit tests for computeActiveKeyFingerprint() — all provider types, determinism, different keys produce different hashes, unsupported type returns error
  • Unit tests for re-encryption controller reconcile — no encryption, first deploy init, hashes match, KAS not converged, migrations in progress, all succeeded, failed with retry/prune, cache not synced
  • go build ./... passes
  • go test -race passes for all modified packages
  • E2E: AWS KMS key rotation with re-encryption verification
  • E2E: Azure KMS key rotation with re-encryption verification
  • E2E: AESCBC key rotation with re-encryption verification

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

  • Added etcd data encryption status tracking and re-encryption progress monitoring to hosted clusters.

  • Extended encryption key status fields across hosted control planes and multiple KMS providers.

  • Introduced encryption migration history tracking and lifecycle state management.

  • Chores

  • Updated storage version migrator dependency classification.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/ibmcloud PR/issue for IBMCloud (IBMCloudPlatform) platform jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants