CNTRLPLANE-3237: In-place field update for KMS mode regardless of the convergence#2241
CNTRLPLANE-3237: In-place field update for KMS mode regardless of the convergence#2241ardaguclu wants to merge 1 commit into
Conversation
|
@ardaguclu: This pull request references CNTRLPLANE-3237 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR implements in-place updates of KMS plugin configuration to allow changes like image updates without creating new encryption keys. The key controller detects when the latest KMS key's plugin config differs from the current APIServer KMS config and updates the secret in-place unless the change would trigger a migration. The state controller then propagates these updates to the encryption-config secret during transitioning states. End-to-end tests validate the complete flow. ChangesKMS Plugin Configuration In-Place Updates
🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 9 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ardaguclu The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/e2e-encryption/encryption_test.go (1)
646-648: ⚡ Quick winMake the “no new key” assertion relative to the pre-update count.
Line 648 hardcodes
13, which couples this check to earlier scenario history. Capture the key count before the in-place patch and compare before/after so this assertion only validates the intended behavior.Proposed fix
+ beforeKeys, err := kubeClient.CoreV1().Secrets("openshift-config-managed").List(ctx, metav1.ListOptions{LabelSelector: keySecretsLabel}) + require.NoError(t, err) + t.Logf("In-place KMS plugin config update: change image (non-migration field)") _, err = fakeApiServerClient.Patch(ctx, "cluster", types.MergePatchType, []byte(`{"spec":{"encryption":{"type":"KMS","kms":{"type":"Vault","vault":{"kmsPluginImage":"registry.example.com/kms-plugin@sha256:1111111111111111111111111111111111111111111111111111111111111111","vaultAddress":"https://vault.example.com","authentication":{"type":"AppRole","appRole":{"secret":{"name":"vault-approle-secret"}}},"transitKey":"test-transit-key"}}}}}`), metav1.PatchOptions{}) require.NoError(t, err) @@ currentKeys, err := kubeClient.CoreV1().Secrets("openshift-config-managed").List(ctx, metav1.ListOptions{LabelSelector: keySecretsLabel}) require.NoError(t, err) - require.Equal(t, 13, len(currentKeys.Items), "in-place update should not create new keys") + require.Equal(t, len(beforeKeys.Items), len(currentKeys.Items), "in-place update should not create new keys")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e-encryption/encryption_test.go` around lines 646 - 648, Before performing the in-place patch, save the current secret count (e.g., call kubeClient.CoreV1().Secrets("openshift-config-managed").List(...) and store len(items) into a variable like previousKeyCount), then after the patch replace the hardcoded assertion require.Equal(t, 13, len(currentKeys.Items), ...) with require.Equal(t, previousKeyCount, len(currentKeys.Items), ...) so the test verifies no new keys were created relative to the pre-update count; update references to currentKeys and the List call accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/operator/encryption/controllers/state_controller.go`:
- Around line 219-227: The current guard only compares Encryption.Resources and
can allow a rebuiltConfig that has lost KMS plugin entries or secret data to be
applied; before calling applyEncryptionConfigSecret, add an explicit check
comparing existingConfig.Encryption.KMSPlugins (and/or the presence of expected
secret data) to rebuiltConfig.Encryption.KMSPlugins and bail out if existing had
plugins or secret references that are absent or empty in rebuiltConfig. In other
words, in the same block that uses equality.Semantic.DeepEqual for
Encryption.Resources, also ensure the KMSPlugins slice/map and any required
secret references in rebuiltConfig match existingConfig (or at minimum that
rebuiltConfig does not drop KMSPlugins that exist in current state) and only
call applyEncryptionConfigSecret when both Resources and KMSPlugins integrity
checks pass.
In `@test/e2e-encryption/encryption_test.go`:
- Around line 657-659: The current predicate passed to
waitForConfigEventuallyCond (the anonymous func at the call site) only checks
strings.Contains(got, "identity") which can match fallback mentions; replace it
with a stronger check that proves the KMS write position actually transitioned —
for example, verify the config contains a clear primary/active marker for
identity (e.g. a line like "kms: identity" or "primary: identity" using an exact
substring or regex match) and also assert the previous KMS no longer has write
privileges (ensure there is no "write: true" under the old KMS entry). Update
the anonymous function used in waitForConfigEventuallyCond to return true only
when both conditions (identity as the active KMS and absence of the old KMS
write flag) are met.
---
Nitpick comments:
In `@test/e2e-encryption/encryption_test.go`:
- Around line 646-648: Before performing the in-place patch, save the current
secret count (e.g., call
kubeClient.CoreV1().Secrets("openshift-config-managed").List(...) and store
len(items) into a variable like previousKeyCount), then after the patch replace
the hardcoded assertion require.Equal(t, 13, len(currentKeys.Items), ...) with
require.Equal(t, previousKeyCount, len(currentKeys.Items), ...) so the test
verifies no new keys were created relative to the pre-update count; update
references to currentKeys and the List call accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8a91a0b7-e17e-455b-9e3a-d539cda57d10
📒 Files selected for processing (7)
pkg/operator/encryption/controllers/key_controller.gopkg/operator/encryption/controllers/key_controller_test.gopkg/operator/encryption/controllers/state_controller.gopkg/operator/encryption/controllers/state_controller_test.gopkg/operator/encryption/encryptiondata/config.gopkg/operator/encryption/secrets/secrets.gotest/e2e-encryption/encryption_test.go
|
/retest |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/operator/encryption/controllers/state_controller.go`:
- Around line 226-235: The current check in the state reconciliation (the block
comparing existingConfig.KMSPlugins and rebuiltConfig.KMSPlugins) only ensures
keys exist but doesn't guard KMSPluginsSecretData; update the verification to
also ensure that for every keyID present in existingConfig.KMSPlugins the
corresponding entry exists in rebuiltConfig.KMSPluginsSecretData (and is
non-nil/contains expected data), and return an error if any secret-data entry is
missing or has been removed to prevent in-place propagation from silently
dropping credentials (reference existingConfig.KMSPlugins,
rebuiltConfig.KMSPlugins, and KMSPluginsSecretData in your change).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d4754b08-3a50-4bc5-b0e7-01cf2619e50e
📒 Files selected for processing (7)
pkg/operator/encryption/controllers/key_controller.gopkg/operator/encryption/controllers/key_controller_test.gopkg/operator/encryption/controllers/state_controller.gopkg/operator/encryption/controllers/state_controller_test.gopkg/operator/encryption/encryptiondata/config.gopkg/operator/encryption/secrets/secrets.gotest/e2e-encryption/encryption_test.go
|
/uncc @dgrisonnet |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/operator/encryption/controllers/key_controller.go`:
- Around line 284-300: secrets.ToKeyStates filters out invalid secrets before
sorting, so using backedKeys[0] can return an older parsed key when the newest
raw secret is malformed; change the logic in Key controller to first pick the
newest raw secret from keySecrets (by key ID / creation ordering), then parse
only that secret (using the same parsing used by ToKeyStates) and return an
error immediately if parsing fails; once you have the parsed newest secret, use
that (check Mode == state.KMS and call HasKMSPlugin) instead of using
backedKeys[0], ensuring failures surface fast and you don’t accidentally update
the wrong object.
In `@test/e2e-encryption/encryption_test.go`:
- Around line 623-626: The polling loop currently swallows errors from
encoding.DecodeKMSPluginConfig(pluginData) by returning (false, nil); change it
to return the decode error so the test fails fast—e.g., return false, err (or
wrap the error with context) instead of nil—so DecodeKMSPluginConfig failures
aren’t hidden during polling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7a7d77d2-8e39-474d-8f3c-ad7cc26eda9a
📒 Files selected for processing (7)
pkg/operator/encryption/controllers/key_controller.gopkg/operator/encryption/controllers/key_controller_test.gopkg/operator/encryption/controllers/state_controller.gopkg/operator/encryption/controllers/state_controller_test.gopkg/operator/encryption/encryptiondata/config.gopkg/operator/encryption/secrets/secrets.gotest/e2e-encryption/encryption_test.go
There was a problem hiding this comment.
🧹 Nitpick comments (2)
test/e2e-encryption/encryption_test.go (1)
288-320: ⚡ Quick winConsider failing fast on decode errors during polling.
At lines 294-296 and 310-313, decode errors are silently swallowed by returning
(false, nil), which continues polling. While this may be intentional for transient states, it can hide real regressions and only surface as a timeout failure, making debugging harder.Given that line 645 in the in-place update test now correctly returns the decode error (addressing past review feedback), consider applying the same pattern here for consistency.
Proposed fix
cfg, err = encryptiondata.FromSecret(encryptionConfigSecret) if err != nil { - return false, nil + return false, fmt.Errorf("failed to decode encryption config: %w", err) } for keyID := range expectedKeyIDs { ... keyPluginConfig, err := encoding.DecodeKMSPluginConfig(keyPluginData) if err != nil { - return false, nil + return false, fmt.Errorf("failed to decode kms-plugin-config for key %s: %w", keyID, err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e-encryption/encryption_test.go` around lines 288 - 320, The polling callback in wait.PollUntilContextTimeout is swallowing decode errors by returning (false, nil); change those to return (false, err) so decode failures surface immediately. Specifically, when calling encryptiondata.FromSecret (currently returning (false, nil) on err) and encoding.DecodeKMSPluginConfig (also returning (false, nil) on err), propagate the actual error instead of nil; leave transient-not-ready conditions that should continue polling as (false, nil), but return the real err for decode failures so the test fails fast.pkg/operator/encryption/controllers/key_controller.go (1)
347-360: 💤 Low valueConsider explicit handling of unknown KMS provider types.
Currently,
kmsMigrationRequiredreturnsfalsefor any KMS provider type other than Vault. If a new provider type is added in the future without updating this function, all field changes would be treated as in-place-safe, potentially causing issues. Consider returningtruefor unknown types as a safe default, or adding a comment documenting that this function must be updated when new providers are added.Given this is Tech Preview with only Vault supported, this is acceptable for now.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/operator/encryption/controllers/key_controller.go` around lines 347 - 360, The function kmsMigrationRequired currently treats any non-Vault provider as in-place-safe, which is unsafe if new providers are added; update kmsMigrationRequired (which takes latest, current configv1.KMSPluginConfig) to explicitly handle provider types (e.g., switch on latest.Type / current.Type and keep the Vault-specific checks) and ensure the default/unknown-case returns true (or add a clear TODO comment that any new KMS provider must be added here) so that unknown provider type changes force a migration rather than silently allowing in-place changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/operator/encryption/controllers/key_controller.go`:
- Around line 347-360: The function kmsMigrationRequired currently treats any
non-Vault provider as in-place-safe, which is unsafe if new providers are added;
update kmsMigrationRequired (which takes latest, current
configv1.KMSPluginConfig) to explicitly handle provider types (e.g., switch on
latest.Type / current.Type and keep the Vault-specific checks) and ensure the
default/unknown-case returns true (or add a clear TODO comment that any new KMS
provider must be added here) so that unknown provider type changes force a
migration rather than silently allowing in-place changes.
In `@test/e2e-encryption/encryption_test.go`:
- Around line 288-320: The polling callback in wait.PollUntilContextTimeout is
swallowing decode errors by returning (false, nil); change those to return
(false, err) so decode failures surface immediately. Specifically, when calling
encryptiondata.FromSecret (currently returning (false, nil) on err) and
encoding.DecodeKMSPluginConfig (also returning (false, nil) on err), propagate
the actual error instead of nil; leave transient-not-ready conditions that
should continue polling as (false, nil), but return the real err for decode
failures so the test fails fast.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1920114c-47a2-4cb2-955c-7cc718c87c37
📒 Files selected for processing (5)
pkg/operator/encryption/controllers/key_controller.gopkg/operator/encryption/controllers/key_controller_test.gopkg/operator/encryption/controllers/state_controller.gopkg/operator/encryption/controllers/state_controller_test.gotest/e2e-encryption/encryption_test.go
231db91 to
fa1c366
Compare
When a KMS sidecar revision is stuck (e.g. wrong container image), the cluster-admin needs to update APIServer.spec.encryption.kms with corrected values. Previously, all encryption controllers blocked on revision convergence, so the fix could never be applied. This change allows the key_controller to update the latest key secret's KMS plugin config in-place for non-migration fields (image, TLS, AppRole) regardless of convergence state. If there is a change in migration-triggering fields, we should wait until convergence happens. State controller carries in-place fields updates only when there is no change in the encryption-config data key.
|
@ardaguclu: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
KAS is not degrading after increasing waittime also it only stuck in previous revision |
|
|
@gangwgr that is great news, so this means that we have verified that the changes in this PR work as expected. Just one question: I saw kube-apiserver is in |
|
|
@ardaguclu tested with again with new cluster, in that it is recovering from stuck state |
When a KMS sidecar revision is stuck (e.g. wrong container image), the cluster-admin needs to update APIServer.spec.encryption.kms with corrected values. Previously, all encryption controllers blocked on revision convergence, so the fix could never be applied.
This change allows the key_controller to update the latest key secret's KMS plugin config in-place for non-migration fields (image, TLS, AppRole) regardless of convergence state. If there is a change in migration-triggering fields, we should wait until convergence happens.
State controller carries in-place fields updates only when there is no change in the encryption-config data key.
Summary by CodeRabbit
Bug Fixes
Refactor
Tests