@grdryn grdryn commented Nov 5, 2025

Description

Jira: https://issues.redhat.com/browse/RHOAIENG-37741#

There are two separate bugs mentioned in the issue and fixed here:

  • The operator shouldn't return an error when ensuring that an unmanaged resource has no owner ref if the resource's kind (and therefore the resource itself) doesn't exist on the cluster.
  • The operator should delete only resources that it created or owns, and not, for example, an existing KnativeServing CR that someone else created and that happens to have the same name.

See commit messages and unit tests for more details.

How Has This Been Tested?

Unit tests added. Separate commits show behaviour before and after the fix.

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work
  • The developer has run the integration test pipeline and verified that it passed successfully

E2E test suite update requirement

When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.

To opt-out of this requirement:

  1. Please inspect the opt-out guidelines, to determine if the nature of the PR changes allows for skipping this requirement
  2. If opt-out is applicable, provide justification in the dedicated E2E update requirement opt-out justification section below
  3. Check the checkbox below:
  • Skip requirement to update E2E test suite for this PR
  4. Submit/save these changes to the PR description. This will automatically trigger the check.

E2E update requirement opt-out justification

Summary by CodeRabbit

  • Bug Fixes

    • Improved resource cleanup to prevent accidental deletion of external resources not managed by KServe.
  • Refactor

    • Consolidated resource deletion logic into a centralized helper for more consistent error handling.
  • Tests

    • Added comprehensive test coverage for resource cleanup and owner reference handling.

https://issues.redhat.com/browse/RHOAIENG-37741

Add a data-driven unit test that demonstrates the bug where
cleanUpTemplatedResources deletes KnativeServing resources that were
not created by the Kserve controller.

The test covers both scenarios from RHOAIENG-37741:

Scenario 1: ServiceMesh Removed, Serving Unmanaged
- User wants to use existing KnativeServing but not manage it
- Expected: External KnativeServing should be left alone
- Actual: It gets deleted (bug)

Scenario 2: ServiceMesh Removed, Serving Removed
- User wants RawDeployment mode without Serverless
- Expected: External KnativeServing should be left alone
- Actual: It gets deleted (bug)

The bug occurs in cleanUpTemplatedResources (line 294 in
kserve_controller_actions.go) where the code deletes resources based on
name and namespace alone, without checking if the actual cluster
resource has the ownership label (platform.opendatahub.io/part-of:
kserve).

The code iterates through rr.Resources (which come from templates and
have the platform.opendatahub.io/dependency: serverless label), and for
each matching resource, it deletes any cluster resource with the same
name/namespace regardless of whether that cluster resource was actually
created by the Kserve controller.

This causes the Kserve controller to delete externally-managed
KnativeServing resources, which disrupts users who have Serverless
installed for other purposes.

The test creates a KnativeServing CR without the kserve ownership label,
simulating an external resource, then verifies it gets deleted during
cleanup (demonstrating the bug). This test will need to be updated once
the bug is fixed to expect the resource NOT to be deleted.

Assisted-By: Claude <[email protected]>

https://issues.redhat.com/browse/RHOAIENG-37741

Add a test demonstrating the bug in the second deletion loop (lines
311-326) of cleanUpTemplatedResources when authorino is NOT installed.

This test complements the existing data-driven test by covering a
different code path. While the first test covers deletion when
ServiceMesh.ManagementState == Removed (line 288-306), this test covers
deletion when authorino is not installed (line 311-326).

The bug is identical in both code paths: resources are deleted based on
name/namespace alone without checking for Kserve OwnerReferences.

Test setup:
- Creates external EnvoyFilter "activator-host-header" in istio-system
- WITHOUT Kserve OwnerReference (simulates user-created resource)
- Does NOT create authorino-operator Subscription (triggers line 311)
- ServiceMesh: Managed (avoids first deletion loop)

Expected result:
- EnvoyFilter gets deleted (demonstrates bug)
- Should only delete if resource has Kserve OwnerReference

Both deletion loops need the same fix: fetch cluster resource first and
check for Kserve OwnerReference before deleting.

Assisted-By: Claude <[email protected]>

https://issues.redhat.com/browse/RHOAIENG-37741

This commit fixes a bug in cleanUpTemplatedResources where the Kserve
controller would delete cluster resources based on name/namespace alone,
without checking if those resources were actually owned by the Kserve
controller.

The bug manifested in two deletion loops:

1. Lines 288-333: When ServiceMesh.ManagementState is set to Removed,
   the controller deletes resources with the serverless or servicemesh
   dependency labels.

2. Lines 338-384: When authorino is not installed, the controller
   deletes resources with the servicemesh dependency label.

In both cases, the code would iterate through rr.Resources (which
contains template resources with dependency labels) and delete any
cluster resource matching the name/namespace, regardless of whether
that cluster resource was created by the Kserve controller.

This caused problems in two scenarios from RHOAIENG-37741:
- Scenario 1: User sets ServiceMesh to Removed and Serving to Unmanaged,
  wanting to use an existing KnativeServing CR they manage themselves.
  The Kserve controller would delete their KnativeServing.
- Scenario 2: User sets ServiceMesh to Removed and Serving to Removed,
  wanting RawDeployment mode. If they had servicemesh resources for
  other purposes, the Kserve controller would delete them.

The fix:
1. Fetch the cluster resource to get its current state
2. Check if it has a Kserve OwnerReference using isKserveOwnerRef()
3. Only delete if the OwnerReference exists
4. Skip resources not owned by the Kserve controller

This respects OwnerReferences as the authoritative ownership mechanism
and prevents the controller from deleting resources it doesn't own.
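
The four steps of the fix can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `fakeCluster`, `resource`, `ownerRef`, `ownedByKserve`, and `deleteIfOwned` are hypothetical stand-ins for the Kubernetes client, the unstructured object, `metav1.OwnerReference`, `isKserveOwnerRef()`, and the real deletion path, and the owner kind `"Kserve"` and all resource names are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// ownerRef and resource are simplified stand-ins for metav1.OwnerReference
// and a live cluster object; they are not the real Kubernetes types.
type ownerRef struct{ Kind, Name string }

type resource struct {
	Kind, Namespace, Name string
	Owners                []ownerRef
}

var errNotFound = errors.New("not found")

// fakeCluster is a hypothetical in-memory substitute for the cluster client.
type fakeCluster map[string]resource

func key(kind, ns, name string) string { return kind + "/" + ns + "/" + name }

func (c fakeCluster) get(kind, ns, name string) (resource, error) {
	r, ok := c[key(kind, ns, name)]
	if !ok {
		return resource{}, errNotFound
	}
	return r, nil
}

// ownedByKserve is an assumed shape for the ownership check: true when any
// owner reference on the live object has kind "Kserve".
func ownedByKserve(r resource) bool {
	for _, o := range r.Owners {
		if o.Kind == "Kserve" {
			return true
		}
	}
	return false
}

// deleteIfOwned reports whether it deleted the resource.
func deleteIfOwned(c fakeCluster, kind, ns, name string) (bool, error) {
	live, err := c.get(kind, ns, name) // step 1: fetch the current state
	if errors.Is(err, errNotFound) {
		return false, nil // nothing on the cluster, nothing to delete
	}
	if err != nil {
		return false, err
	}
	if !ownedByKserve(live) { // steps 2 and 4: skip resources we do not own
		return false, nil
	}
	delete(c, key(kind, ns, name)) // step 3: delete only owned resources
	return true, nil
}

func main() {
	c := fakeCluster{
		key("KnativeServing", "example-ns", "external"): {
			Kind: "KnativeServing", Namespace: "example-ns", Name: "external",
		},
		key("EnvoyFilter", "example-ns", "owned"): {
			Kind: "EnvoyFilter", Namespace: "example-ns", Name: "owned",
			Owners: []ownerRef{{Kind: "Kserve", Name: "example"}},
		},
	}
	d1, _ := deleteIfOwned(c, "KnativeServing", "example-ns", "external")
	d2, _ := deleteIfOwned(c, "EnvoyFilter", "example-ns", "owned")
	fmt.Println("external deleted:", d1, "owned deleted:", d2, "remaining:", len(c))
}
```

The point of the sketch is that the decision is made on the live object's owner references, not on the name and namespace of the template resource.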

Added two tests to validate the fix:
- TestCleanUpTemplatedResources_DeletesResourcesWithoutKserveLabel:
  Data-driven test covering both Jira scenarios with KnativeServing
- TestCleanUpTemplatedResources_DeletesResourcesWithoutKserveLabel_NoAuthorino:
  Tests the second deletion loop with EnvoyFilter resources

Assisted-By: Claude <[email protected]>

This adds unit tests demonstrating the current behavior of
getAndRemoveOwnerReferences, including a test case that shows
the function fails when a CRD is not installed on the cluster.

Related to: https://issues.redhat.com/browse/RHOAIENG-37741

Assisted-By: Claude <[email protected]>

When removing owner references from unmanaged resources, the function
now gracefully handles the case where the resource's CRD is not
installed on the cluster by ignoring meta.NoKindMatchError errors.

This allows KServe to be enabled on clusters where Service Mesh 2.x
is configured as Removed/Unmanaged and the OSSM CRDs are not installed,
without causing reconciliation failures.

Fixes: https://issues.redhat.com/browse/RHOAIENG-37741

Assisted-By: Claude <[email protected]>
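
The error handling described in this commit can be sketched with the standard library alone. In the sketch, `noKindMatchError` is a hypothetical stand-in for `meta.NoKindMatchError`, and `lookupMissingKind`, `ignoreMissingKind`, `errOther`, and the kind name `ExampleKind` are illustrative names that do not appear in the PR. Real code would more likely rely on the helpers in `k8s.io/apimachinery/pkg/api/meta`.

```go
package main

import (
	"errors"
	"fmt"
)

// noKindMatchError is a stand-in for meta.NoKindMatchError: the API server
// has no registered kind, typically because the CRD is not installed.
type noKindMatchError struct{ Kind string }

func (e *noKindMatchError) Error() string { return "no matches for kind " + e.Kind }

// errOther represents any unrelated failure that must still be reported.
var errOther = errors.New("connection refused")

// lookupMissingKind simulates a client Get against a kind whose CRD is absent.
// The error is wrapped, as client errors often are.
func lookupMissingKind(kind string) error {
	return fmt.Errorf("get %s: %w", kind, &noKindMatchError{Kind: kind})
}

// ignoreMissingKind returns nil when err, or anything it wraps, says the kind
// is unknown. A resource of an unknown kind cannot exist, so there is no owner
// reference to remove. Every other error is passed through unchanged.
func ignoreMissingKind(err error) error {
	var nk *noKindMatchError
	if errors.As(err, &nk) {
		return nil
	}
	return err
}

func main() {
	fmt.Println(ignoreMissingKind(lookupMissingKind("ExampleKind")))
	fmt.Println(ignoreMissingKind(errOther))
}
```

Using `errors.As` rather than a direct type assertion keeps the check working when the error has been wrapped on its way up the call stack.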

coderabbitai bot commented Nov 5, 2025

Walkthrough

Refactored resource deletion logic in the Kserve controller by extracting direct deletion calls into a centralized helper function deleteResourceIfOwnedByKserve that verifies Kserve ownership before deletion and gracefully handles missing resources and CRDs.

Changes

Cohort / File(s) | Summary
Controller Actions Refactoring: internal/controller/components/kserve/kserve_controller_actions.go | Replaced inline deletion logic with calls to centralized deleteResourceIfOwnedByKserve helper. Removed direct error handling for NotFound and NoKindMatchError, and eliminated per-deletion log statements. Dropped import of k8s.io/apimachinery/pkg/api/meta.
Support Package - Centralized Helper: internal/controller/components/kserve/kserve_support.go | Introduced new helper function deleteResourceIfOwnedByKserve that fetches resources by GVK, verifies Kserve ownership via OwnerReferences, and performs deletion with foreground propagation. Added error aliases and imports for logr, meta, and error packages. Updated existing functions to use k8serr.IsNotFound() and meta.NoKindMatchError.
Test Suite Expansion: internal/controller/components/kserve/kserve_controller_actions_test.go | Added TestCleanUpTemplatedResources_DoesNotDeleteExternalResources to verify external resources lacking Kserve OwnerReferences are preserved during cleanup. Expanded imports for runtime/schema and controller-runtime client support.
Support Function Tests: internal/controller/components/kserve/kserve_support_test.go | New test file with three tests for getAndRemoveOwnerReferences: success path validates owner reference and label removal; error path verifies graceful handling of non-existent resources and missing CRDs using fake client interceptors.

Sequence Diagram

sequenceDiagram
    participant ctrl as cleanUpTemplatedResources
    participant helper as deleteResourceIfOwnedByKserve
    participant client as Kubernetes Client
    participant meta as meta.NoKindMatchError

    ctrl->>helper: Call with resource, GVK
    helper->>client: Get resource by GVK
    alt Resource Not Found
        client-->>helper: NotFound error
        helper-->>ctrl: Return (silent)
    else CRD Not Installed
        client-->>meta: NoKindMatchError
        meta-->>helper: Return error
        helper-->>ctrl: Return (silent)
    else Resource Found
        client-->>helper: Unstructured resource
        helper->>helper: Check OwnerReferences for Kserve
        alt Kserve Owner Present
            helper->>client: Delete (foreground propagation)
            client-->>helper: Success
            helper->>helper: Log deletion
            helper-->>ctrl: Return nil
        else No Kserve Owner
            helper-->>ctrl: Return nil (not owned)
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review focus areas:
    • Ownership verification logic in deleteResourceIfOwnedByKserve for correctness and security implications
    • Error handling paths: verify NoKindMatchError and NotFound are silently handled appropriately
    • Foreground propagation policy choice and cascading deletion behavior
    • Test coverage completeness for external resource retention scenarios and error edge cases
    • Consistency of error wrapping with StopError across refactored code paths

Poem

A helper emerges from the tangled code,
Ownership checked on the deletion road,
No more scattered deletion calls,
Centralized cleanup within these walls,
The Kserve garden, now tended with care! 🐰✨

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly and concisely summarizes the main change: preventing deletion of resources not owned by the operator, which is the core objective across the modified files.

https://issues.redhat.com/browse/RHOAIENG-37741

This commit addresses all golangci-lint issues in the kserve controller:

- Extract deleteResourceIfOwnedByKserve helper function to eliminate
  duplicate code blocks in cleanUpTemplatedResources (fixes dupl)
- Fix import alias from k8serrors to k8serr for consistency (fixes importas)
- Reduce cyclomatic complexity by extracting helper function (fixes gocyclo)
- Consolidate duplicate test functions into data-driven test (fixes dupl in tests)
- Run make fmt to fix import formatting (fixes gci)

The helper function in kserve_support.go encapsulates the common pattern
of fetching a cluster resource, checking for Kserve OwnerReferences, and
safely deleting only resources owned by the Kserve controller.

Test refactoring merged TestCleanUpTemplatedResources_DeletesResourcesWithoutKserveLabel
and TestCleanUpTemplatedResources_DeletesResourcesWithoutKserveLabel_NoAuthorino
into a single data-driven test covering both code paths.

Assisted-By: Claude <[email protected]>
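
The consolidation into one data-driven test follows the usual Go table pattern, sketched below as a small runnable program. Everything in it is hypothetical: `shouldDelete` is a simplified stand-in for the ownership rule, and the case names and owner kinds are made up for illustration. In a real `_test.go` file the loop body would typically run inside `t.Run(tc.name, ...)`.

```go
package main

import "fmt"

// shouldDelete is a simplified stand-in for the ownership rule: delete only
// when one of the owner kinds is "Kserve" (an assumed kind name).
func shouldDelete(ownerKinds []string) bool {
	for _, k := range ownerKinds {
		if k == "Kserve" {
			return true
		}
	}
	return false
}

// testCase is one row of the table: inputs plus the expected outcome.
type testCase struct {
	name       string
	ownerKinds []string
	wantDelete bool
}

// cases lists the scenarios once, so adding a code path means adding a row
// rather than copying a whole test function.
func cases() []testCase {
	return []testCase{
		{name: "external KnativeServing without owners", ownerKinds: nil, wantDelete: false},
		{name: "external EnvoyFilter with another owner", ownerKinds: []string{"Deployment"}, wantDelete: false},
		{name: "resource owned by Kserve", ownerKinds: []string{"Kserve"}, wantDelete: true},
	}
}

// failures runs every row through the same assertion and returns the names
// of the rows whose outcome differs from the expectation.
func failures(cs []testCase) []string {
	var failed []string
	for _, tc := range cs {
		if got := shouldDelete(tc.ownerKinds); got != tc.wantDelete {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", failures(cases()))
}
```

Keeping the setup and assertion in one loop is also what removes the duplication that the dupl linter flags.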

grdryn commented Nov 5, 2025

@coderabbitai help


coderabbitai bot commented Nov 5, 2025

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit configuration file (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.


grdryn commented Nov 5, 2025

@coderabbitai full review


coderabbitai bot commented Nov 5, 2025

✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
internal/controller/components/kserve/kserve_support.go (1)

313-355: Consider adding debug logging when skipping non-owned resources.

The ownership check correctly prevents deletion of user-created resources. However, adding debug-level logging when skipping deletion (line 340) would improve observability and help diagnose unexpected behavior during troubleshooting.

Apply this diff to add debug logging:

 	if !hasKserveOwner {
+		logger.V(1).Info("Skipping deletion of non-Kserve-owned resource", 
+			"kind", clusterRes.GetKind(), 
+			"name", clusterRes.GetName(), 
+			"namespace", clusterRes.GetNamespace())
 		return nil // Skip resources not owned by Kserve controller
 	}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bdd0f76 and 630122e.

📒 Files selected for processing (4)
  • internal/controller/components/kserve/kserve_controller_actions.go (2 hunks)
  • internal/controller/components/kserve/kserve_controller_actions_test.go (2 hunks)
  • internal/controller/components/kserve/kserve_support.go (5 hunks)
  • internal/controller/components/kserve/kserve_support_test.go (1 hunks)
🔇 Additional comments (6)
internal/controller/components/kserve/kserve_support.go (2)

8-8: LGTM: Import additions support the new ownership-aware deletion logic.

The new imports enable centralized error handling (k8serr alias), CRD-missing detection (meta), structured logging (logr), and error wrapping (odherrors).

Also applies to: 13-13, 16-17, 30-30


140-140: LGTM: Consistent error handling with k8serr alias and NoKindMatchError checks.

The updated error checks properly handle both missing resources and missing CRDs, aligning with the centralized error handling pattern.

Also applies to: 276-280

internal/controller/components/kserve/kserve_support_test.go (1)

1-131: LGTM: Comprehensive test coverage for owner reference handling.

The tests effectively cover success scenarios, missing resources, and missing CRDs using appropriate fake client patterns and interceptors. The use of NoKindMatchError simulation (lines 100-112) is particularly well-designed.

internal/controller/components/kserve/kserve_controller_actions.go (1)

293-295: LGTM: Deletion logic correctly centralized via helper function.

The refactoring replaces direct deletion calls with deleteResourceIfOwnedByKserve, which provides consistent ownership checking, error handling, and logging. This reduces code duplication and ensures uniform deletion behavior across both code paths.

Also applies to: 305-307

internal/controller/components/kserve/kserve_controller_actions_test.go (2)

12-13: LGTM: Import additions support dynamic test resource construction.

The schema and client imports enable the test to dynamically construct and verify unstructured resources with proper GVK metadata.


485-624: LGTM: Excellent test coverage for ownership-based deletion protection.

The test thoroughly validates the core fix across multiple scenarios:

  • Different combinations of ManagementState (Managed/Unmanaged/Removed)
  • Multiple code paths in cleanUpTemplatedResources
  • Both with and without Authorino

The test design is particularly strong:

  • Creates external resources without Kserve OwnerReferences (lines 541-549)
  • Verifies resources persist after cleanup (lines 619-621)
  • Covers the bug described in PR objectives: preventing accidental deletion of user-created resources


grdryn commented Nov 5, 2025

/retest


grdryn commented Nov 6, 2025

/retest


grdryn commented Nov 6, 2025

/retest

@grdryn grdryn changed the title from "WIP: RHOAIENG-37741: Don't delete non-owned resources" to "RHOAIENG-37741: Don't delete non-owned resources" Nov 6, 2025

openshift-ci bot commented Nov 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asanzgom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Nov 6, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD bdd0f76 and 2 for PR HEAD 630122e in total


grdryn commented Nov 6, 2025

/retest


grdryn commented Nov 6, 2025

/hold until approved for patch release.


grdryn commented Nov 7, 2025

/retest


grdryn commented Nov 7, 2025

/retest


openshift-ci bot commented Nov 7, 2025

@grdryn: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/opendatahub-operator-e2e | 630122e | link | true | /test opendatahub-operator-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

