-
Notifications
You must be signed in to change notification settings - Fork 214
RHOAIENG-37741: Don't delete non-owned resources #2809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: stable-2.x
Are you sure you want to change the base?
Conversation
https://issues.redhat.com/browse/RHOAIENG-37741 Add a data-driven unit test that demonstrates the bug where cleanUpTemplatedResources deletes KnativeServing resources that were not created by the Kserve controller. The test covers both scenarios from RHOAIENG-37741: Scenario 1: ServiceMesh Removed, Serving Unmanaged - User wants to use existing KnativeServing but not manage it - Expected: External KnativeServing should be left alone - Actual: It gets deleted (bug) Scenario 2: ServiceMesh Removed, Serving Removed - User wants RawDeployment mode without Serverless - Expected: External KnativeServing should be left alone - Actual: It gets deleted (bug) The bug occurs in cleanUpTemplatedResources (line 294 in kserve_controller_actions.go) where the code deletes resources based on name and namespace alone, without checking if the actual cluster resource has the ownership label (platform.opendatahub.io/part-of: kserve). The code iterates through rr.Resources (which come from templates and have the platform.opendatahub.io/dependency: serverless label), and for each matching resource, it deletes any cluster resource with the same name/namespace regardless of whether that cluster resource was actually created by the Kserve controller. This causes the Kserve controller to delete externally-managed KnativeServing resources, which disrupts users who have Serverless installed for other purposes. The test creates a KnativeServing CR without the kserve ownership label, simulating an external resource, then verifies it gets deleted during cleanup (demonstrating the bug). This test will need to be updated once the bug is fixed to expect the resource NOT to be deleted. Assisted-By: Claude <[email protected]>
https://issues.redhat.com/browse/RHOAIENG-37741 Add a test demonstrating the bug in the second deletion loop (lines 311-326) of cleanUpTemplatedResources when authorino is NOT installed. This test complements the existing data-driven test by covering a different code path. While the first test covers deletion when ServiceMesh.ManagementState == Removed (line 288-306), this test covers deletion when authorino is not installed (line 311-326). The bug is identical in both code paths: resources are deleted based on name/namespace alone without checking for Kserve OwnerReferences. Test setup: - Creates external EnvoyFilter "activator-host-header" in istio-system - WITHOUT Kserve OwnerReference (simulates user-created resource) - Does NOT create authorino-operator Subscription (triggers line 311) - ServiceMesh: Managed (avoids first deletion loop) Expected result: - EnvoyFilter gets deleted (demonstrates bug) - Should only delete if resource has Kserve OwnerReference Both deletion loops need the same fix: fetch cluster resource first and check for Kserve OwnerReference before deleting. Assisted-By: Claude <[email protected]>
https://issues.redhat.com/browse/RHOAIENG-37741 This commit fixes a bug in cleanUpTemplatedResources where the Kserve controller would delete cluster resources based on name/namespace alone, without checking if those resources were actually owned by the Kserve controller. The bug manifested in two deletion loops: 1. Lines 288-333: When ServiceMesh.ManagementState is set to Removed, the controller deletes resources with the serverless or servicemesh dependency labels. 2. Lines 338-384: When authorino is not installed, the controller deletes resources with the servicemesh dependency label. In both cases, the code would iterate through rr.Resources (which contains template resources with dependency labels) and delete any cluster resource matching the name/namespace, regardless of whether that cluster resource was created by the Kserve controller. This caused problems in two scenarios from RHOAIENG-37741: - Scenario 1: User sets ServiceMesh to Removed and Serving to Unmanaged, wanting to use an existing KnativeServing CR they manage themselves. The Kserve controller would delete their KnativeServing. - Scenario 2: User sets ServiceMesh to Removed and Serving to Removed, wanting RawDeployment mode. If they had servicemesh resources for other purposes, the Kserve controller would delete them. The fix: 1. Fetch the cluster resource to get its current state 2. Check if it has a Kserve OwnerReference using isKserveOwnerRef() 3. Only delete if the OwnerReference exists 4. Skip resources not owned by the Kserve controller This respects OwnerReferences as the authoritative ownership mechanism and prevents the controller from deleting resources it doesn't own. Added two tests to validate the fix: - TestCleanUpTemplatedResources_DeletesResourcesWithoutKserveLabel: Data-driven test covering both Jira scenarios with KnativeServing - TestCleanUpTemplatedResources_DeletesResourcesWithoutKserveLabel_NoAuthorino: Tests the second deletion loop with EnvoyFilter resources Assisted-By: Claude <[email protected]>
This adds unit tests demonstrating the current behavior of getAndRemoveOwnerReferences, including a test case that shows the function fails when a CRD is not installed on the cluster. Related to: https://issues.redhat.com/browse/RHOAIENG-37741 Assisted-By: Claude <[email protected]>
When removing owner references from unmanaged resources, the function now gracefully handles the case where the resource's CRD is not installed on the cluster by ignoring meta.NoKindMatchError errors. This allows KServe to be enabled on clusters where Service Mesh 2.x is configured as Removed/Unmanaged and the OSSM CRDs are not installed, without causing reconciliation failures. Fixes: https://issues.redhat.com/browse/RHOAIENG-37741 Assisted-By: Claude <[email protected]>
WalkthroughRefactored resource deletion logic in the Kserve controller by extracting direct deletion calls into a centralized helper function Changes
Sequence DiagramsequenceDiagram
participant ctrl as cleanUpTemplatedResources
participant helper as deleteResourceIfOwnedByKserve
participant client as Kubernetes Client
participant meta as meta.NoKindMatchError
ctrl->>helper: Call with resource, GVK
helper->>client: Get resource by GVK
alt Resource Not Found
client-->>helper: NotFound error
helper-->>ctrl: Return (silent)
else CRD Not Installed
client-->>meta: NoKindMatchError
meta-->>helper: Return error
helper-->>ctrl: Return (silent)
else Resource Found
client-->>helper: Unstructured resource
helper->>helper: Check OwnerReferences for Kserve
alt Kserve Owner Present
helper->>client: Delete (foreground propagation)
client-->>helper: Success
helper->>helper: Log deletion
helper-->>ctrl: Return nil
else No Kserve Owner
helper-->>ctrl: Return nil (not owned)
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
https://issues.redhat.com/browse/RHOAIENG-37741 This commit addresses all golangci-lint issues in the kserve controller: - Extract deleteResourceIfOwnedByKserve helper function to eliminate duplicate code blocks in cleanUpTemplatedResources (fixes dupl) - Fix import alias from k8serrors to k8serr for consistency (fixes importas) - Reduce cyclomatic complexity by extracting helper function (fixes gocyclo) - Consolidate duplicate test functions into data-driven test (fixes dupl in tests) - Run make fmt to fix import formatting (fixes gci) The helper function in kserve_support.go encapsulates the common pattern of fetching a cluster resource, checking for Kserve OwnerReferences, and safely deleting only resources owned by the Kserve controller. Test refactoring merged TestCleanUpTemplatedResources_DeletesResourcesWithoutKserveLabel and TestCleanUpTemplatedResources_DeletesResourcesWithoutKserveLabel_NoAuthorino into a single data-driven test covering both code paths. Assisted-By: Claude <[email protected]>
|
@coderabbitai help |
ChatThere are 3 ways to chat with CodeRabbit:
CodeRabbit commands
Other keywords and placeholders
CodeRabbit configuration file (
|
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
internal/controller/components/kserve/kserve_support.go (1)
313-355: Consider adding debug logging when skipping non-owned resources.The ownership check correctly prevents deletion of user-created resources. However, adding debug-level logging when skipping deletion (line 340) would improve observability and help diagnose unexpected behavior during troubleshooting.
Apply this diff to add debug logging:
if !hasKserveOwner { + logger.V(1).Info("Skipping deletion of non-Kserve-owned resource", + "kind", clusterRes.GetKind(), + "name", clusterRes.GetName(), + "namespace", clusterRes.GetNamespace()) return nil // Skip resources not owned by Kserve controller }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
internal/controller/components/kserve/kserve_controller_actions.go(2 hunks)internal/controller/components/kserve/kserve_controller_actions_test.go(2 hunks)internal/controller/components/kserve/kserve_support.go(5 hunks)internal/controller/components/kserve/kserve_support_test.go(1 hunks)
🔇 Additional comments (6)
internal/controller/components/kserve/kserve_support.go (2)
8-8: LGTM: Import additions support the new ownership-aware deletion logic.The new imports enable centralized error handling (k8serr alias), CRD-missing detection (meta), structured logging (logr), and error wrapping (odherrors).
Also applies to: 13-13, 16-17, 30-30
140-140: LGTM: Consistent error handling with k8serr alias and NoKindMatchError checks.The updated error checks properly handle both missing resources and missing CRDs, aligning with the centralized error handling pattern.
Also applies to: 276-280
internal/controller/components/kserve/kserve_support_test.go (1)
1-131: LGTM: Comprehensive test coverage for owner reference handling.The tests effectively cover success scenarios, missing resources, and missing CRDs using appropriate fake client patterns and interceptors. The use of NoKindMatchError simulation (lines 100-112) is particularly well-designed.
internal/controller/components/kserve/kserve_controller_actions.go (1)
293-295: LGTM: Deletion logic correctly centralized via helper function.The refactoring replaces direct deletion calls with
deleteResourceIfOwnedByKserve, which provides consistent ownership checking, error handling, and logging. This reduces code duplication and ensures uniform deletion behavior across both code paths.Also applies to: 305-307
internal/controller/components/kserve/kserve_controller_actions_test.go (2)
12-13: LGTM: Import additions support dynamic test resource construction.The schema and client imports enable the test to dynamically construct and verify unstructured resources with proper GVK metadata.
485-624: LGTM: Excellent test coverage for ownership-based deletion protection.The test thoroughly validates the core fix across multiple scenarios:
- Different combinations of ManagementState (Managed/Unmanaged/Removed)
- Multiple code paths in cleanUpTemplatedResources
- Both with and without Authorino
The test design is particularly strong:
- Creates external resources without Kserve OwnerReferences (lines 541-549)
- Verifies resources persist after cleanup (lines 619-621)
- Covers the bug described in PR objectives: preventing accidental deletion of user-created resources
|
/retest |
2 similar comments
|
/retest |
|
/retest |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: asanzgom The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
/hold until approved for patch release. |
|
/retest |
1 similar comment
|
/retest |
|
@grdryn: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Description
Jira: https://issues.redhat.com/browse/RHOAIENG-37741#
There are two separate bugs mentioned in the issue and fixed here:
See commit messages and unit tests for more details.
How Has This Been Tested?
Unit tests added. Separate commits show behaviour before and after the fix.
Screenshot or short clip
Merge criteria
E2E test suite update requirement
When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.
To opt-out of this requirement:
E2E update requirement opt-out justificationsection belowE2E update requirement opt-out justification
Summary by CodeRabbit
Bug Fixes
Refactor
Tests