
Conversation


@zdtsw zdtsw commented Oct 16, 2025

  • remove functions and calls used in the v2 upgrade but no longer needed for v3
  1. jupyterhub resource cleanup
  2. rbac for modelreg
  3. envoyfilter file for serving
  4. waterson resource
  5. patch odhdashboardconfig for trusty enablement
  • move some functions into the resources pkg so they can be reused later
  • create support functions for hwprofile migration

Description

How Has This Been Tested?

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.
  • The developer has run the integration test pipeline and verified that it passed successfully.

E2E test suite update requirement

Changes to the operator code must, by default, be accompanied by corresponding extensions and/or updates to the E2E test suite.

To opt-out of this requirement:

  1. Inspect the opt-out guidelines to determine whether the nature of the PR changes allows skipping this requirement.
  2. If opt-out is applicable, provide justification in the dedicated E2E update requirement opt-out justification section below.
  3. Check the checkbox below:
  • Skip requirement to update E2E test suite for this PR
  4. Submit/save these changes to the PR description. This will automatically trigger the check.

E2E update requirement opt-out justification

cleanup code, no e2e

Summary by CodeRabbit

  • Bug Fixes

    • Improved reliability of deployed-release detection during initialization.
  • New Features

    • Idempotent hardware profile creation to avoid duplicate errors.
    • Batch deletion utilities for targeted cluster resources and a utility to clear owner references.
  • Refactor

    • Streamlined hardware-profile creation and annotation flows with centralized naming and visibility settings.
    • Removed legacy cleanup paths and consolidated upgrade-related configuration.


coderabbitai bot commented Oct 16, 2025

Walkthrough

Moved deployed-release lookup into pkg/cluster; added idempotent HardwareProfile creation in pkg/cluster; added resource-deletion and owner-reference utilities in pkg/resources; removed legacy cleanup/resource-spec logic from pkg/upgrade and introduced hardware-profile and feature constants; cmd/main now calls cluster.GetDeployedRelease.

Changes

Cohort / File(s) Change Summary
CLI / Startup
cmd/main.go
Replaced call to upgrade.GetDeployedRelease with cluster.GetDeployedRelease.
Cluster utilities
pkg/cluster/cluster_config.go, pkg/cluster/resources.go
Added GetDeployedRelease(ctx, cli) to locate deployed release by checking DSCInitialization then DataScienceCluster. Added CreateHardwareProfile(ctx, cli, hwp) which creates a HardwareProfile idempotently (no-op on AlreadyExists).
Resource utilities
pkg/resources/resources.go
Added public ResourceSpec type and functions: DeleteResources, DeleteOneResource, and UnsetOwnerReferences; uses list/filter/delete semantics and multierror aggregation with controller-runtime logging.
Upgrade package refactor
pkg/upgrade/upgrade.go
Removed large deprecated cleanup and ResourceSpec machinery. Introduced constants (odhDashboardConfigName, customServing, hardwareProfileAnnotation, featureVisibility). Renamed CreateCustomServingHardwareProfile → createCustomServingHardwareProfile and switched HardwareProfile creation to cluster.CreateHardwareProfile.
Upgrade utilities
pkg/upgrade/upgrade_utils.go
Replaced hard-coded names with constants; added createHardwareProfileAnnotations(...); updated hardware-profile generation to use annotations helper and cluster.CreateHardwareProfile; updated logging and feature-visibility constants.
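The idempotent-create behavior described above for cluster.CreateHardwareProfile (treat AlreadyExists as success, wrap every other error) can be sketched independently of the controller-runtime client. The map-backed fakeStore below is a hypothetical stand-in for the Kubernetes API server, not part of the PR:

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the condition k8serr.IsAlreadyExists detects.
var errAlreadyExists = errors.New("already exists")

// fakeStore is a hypothetical stand-in for the Kubernetes API server.
type fakeStore map[string]bool

func (s fakeStore) Create(name string) error {
	if s[name] {
		return errAlreadyExists
	}
	s[name] = true
	return nil
}

// createIdempotent mirrors the shape of cluster.CreateHardwareProfile:
// a create call whose "already exists" failure is treated as success.
func createIdempotent(s fakeStore, name string) error {
	if err := s.Create(name); err != nil {
		if !errors.Is(err, errAlreadyExists) {
			return fmt.Errorf("failed to create HardwareProfile %q: %w", name, err)
		}
	}
	return nil
}

func main() {
	s := fakeStore{}
	fmt.Println(createIdempotent(s, "custom-serving")) // <nil>: created
	fmt.Println(createIdempotent(s, "custom-serving")) // <nil>: no-op on repeat
}
```

This is why callers such as the upgrade path can invoke the helper unconditionally on every startup without checking for a prior profile first.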

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CLI as cmd/main
  participant ClusterPkg as pkg/cluster
  participant Upgrade as pkg/upgrade
  participant K8s as Kubernetes API

  CLI->>ClusterPkg: GetDeployedRelease(ctx)
  alt DSCI exists
    ClusterPkg->>K8s: GET DSCInitialization
    K8s-->>ClusterPkg: DSCI with Status.Release
    ClusterPkg-->>CLI: Release
  else DSCI missing, DSC exists
    ClusterPkg->>K8s: GET DataScienceCluster
    K8s-->>ClusterPkg: DSC with Status.Release
    ClusterPkg-->>CLI: Release
  else none exist
    ClusterPkg-->>CLI: empty Release / not-found
  end

  Upgrade->>ClusterPkg: CreateHardwareProfile(ctx,hwp)
  ClusterPkg->>K8s: CREATE HardwareProfile
  alt already exists
    K8s-->>ClusterPkg: AlreadyExists
    ClusterPkg-->>Upgrade: nil (idempotent)
  else created
    K8s-->>ClusterPkg: Created
    ClusterPkg-->>Upgrade: created
  end
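The DSCI → DSC fallback in the diagram can be sketched as plain control flow. The release type and the two getter parameters are hypothetical stand-ins for common.Release, cluster.GetDSCI, and cluster.GetDSC:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the condition k8serr.IsNotFound detects.
var errNotFound = errors.New("not found")

// release is a stand-in for the operator's common.Release type.
type release struct{ Version string }

// getDeployedRelease sketches the lookup order: prefer the release recorded
// on DSCInitialization, fall back to DataScienceCluster, and return an empty
// release without error on a clean install (neither CR exists).
func getDeployedRelease(getDSCI, getDSC func() (release, error)) (release, error) {
	if r, err := getDSCI(); err == nil {
		return r, nil
	} else if !errors.Is(err, errNotFound) {
		return release{}, err
	}
	if r, err := getDSC(); err == nil {
		return r, nil
	} else if !errors.Is(err, errNotFound) {
		return release{}, err
	}
	return release{}, nil // clean install: empty release, no error
}

func main() {
	missing := func() (release, error) { return release{}, errNotFound }
	fromDSC := func() (release, error) { return release{Version: "2.16.0"}, nil }
	r, _ := getDeployedRelease(missing, fromDSC)
	fmt.Println(r.Version) // 2.16.0
}
```

Note the "clean install" branch returns a zero-value release rather than an error, which is why callers must handle an empty Release explicitly.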

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Inspect GetDeployedRelease: ensure empty Release behavior is handled by callers.
  • Verify CreateHardwareProfile error wrapping and idempotency semantics.
  • Review new ResourceSpec filtering logic, multierror aggregation, and logging.
  • Check upgrade constants and renamed function call sites for consistent usage.

Poem

🐰 I hopped through code with care and cheer,
Found releases moved and profiles near.
I planted annotations, tidy and bright,
Cleared old crumbs to make things right.
A small rabbit's refactor — quick and light.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 67.86% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: cleanup of deprecated upgrade logic and movement of functions to other packages.


@zdtsw zdtsw requested review from davidebianchi, lburgazzoli and ugiordan and removed request for resoluteCoder and valdar October 16, 2025 09:16

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
pkg/cluster/resources.go (1)

160-167: Harden CreateHardwareProfile: add nil guard and retry/backoff (webhook/startup ready).

Prevents nil deref on hwp and aligns creation with existing CreateWithRetry behavior.

-func CreateHardwareProfile(ctx context.Context, cli client.Client, hwp *infrav1.HardwareProfile) error {
-	if err := cli.Create(ctx, hwp); err != nil {
-		if !k8serr.IsAlreadyExists(err) {
-			return fmt.Errorf("failed to create HardwareProfile '%s/%s': %w", hwp.Namespace, hwp.Name, err)
-		}
-	}
-	return nil
-}
+func CreateHardwareProfile(ctx context.Context, cli client.Client, hwp *infrav1.HardwareProfile) error {
+	if hwp == nil {
+		return errors.New("nil HardwareProfile")
+	}
+	if err := CreateWithRetry(ctx, cli, hwp); err != nil {
+		return fmt.Errorf("failed to create HardwareProfile '%s/%s': %w", hwp.Namespace, hwp.Name, err)
+	}
+	return nil
+}
cmd/main.go (1)

261-263: Don’t drop GetDeployedRelease error; log and continue with zero value.

Improves debuggability without changing behavior.

-	// get old release version before we create default DSCI CR
-	oldReleaseVersion, _ := cluster.GetDeployedRelease(ctx, setupClient)
+	// get old release version before we create default DSCI CR
+	oldReleaseVersion, err := cluster.GetDeployedRelease(ctx, setupClient)
+	if err != nil {
+		setupLog.Info("unable to get deployed release; defaulting to empty", "err", err)
+	}
pkg/upgrade/upgrade.go (2)

766-775: Preserve standardized HWP annotations; avoid AP overwriting them.

Currently AP annotations override “feature-visibility”, “modified-date”, etc. Prefer copying only non-conflicting keys.

-	// Copy existing annotations from AP
-	if apAnnotations := ap.GetAnnotations(); apAnnotations != nil {
-		for k, v := range apAnnotations {
-			annotations[k] = v
-		}
-	}
+	// Copy existing annotations from AP without overriding standardized keys
+	if apAnnotations := ap.GetAnnotations(); apAnnotations != nil {
+		for k, v := range apAnnotations {
+			if _, exists := annotations[k]; !exists {
+				annotations[k] = v
+			}
+		}
+	}

941-961: Use UTC for modified-date annotation.

Ensures consistent, timezone-agnostic timestamps.

-		"opendatahub.io/modified-date":                time.Now().Format(time.RFC3339),
+		"opendatahub.io/modified-date":                time.Now().UTC().Format(time.RFC3339),
pkg/resources/resources.go (1)

681-717: Make DeleteOneResource resilient: skip missing paths, ignore NotFound, aggregate errors.

Avoid aborting the whole batch when a single item lacks the field or is concurrently deleted.

-func DeleteOneResource(ctx context.Context, c client.Client, res ResourceSpec) error {
-	log := logf.FromContext(ctx)
-	list := &unstructured.UnstructuredList{}
-	list.SetGroupVersionKind(res.Gvk)
-
-	err := c.List(ctx, list, client.InNamespace(res.Namespace))
-	if err != nil {
-		if meta.IsNoMatchError(err) {
-			log.Info("CRD not found, will not delete", "gvk", res.Gvk.String())
-			return nil
-		}
-		return fmt.Errorf("failed to list %s: %w", res.Gvk.Kind, err)
-	}
-
-	for _, item := range list.Items {
-		v, ok, err := unstructured.NestedString(item.Object, res.Path...)
-		if err != nil {
-			return fmt.Errorf("failed to get field %v for %s %s/%s: %w", res.Path, res.Gvk.Kind, res.Namespace, item.GetName(), err)
-		}
-
-		if !ok {
-			return fmt.Errorf("unexisting path to delete: %v", res.Path)
-		}
-
-		for _, toDelete := range res.Values {
-			if v == toDelete {
-				err = c.Delete(ctx, &item)
-				if err != nil {
-					return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err)
-				}
-				log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace)
-			}
-		}
-	}
-
-	return nil
-}
+func DeleteOneResource(ctx context.Context, c client.Client, res ResourceSpec) error {
+	log := logf.FromContext(ctx)
+	list := &unstructured.UnstructuredList{}
+	list.SetGroupVersionKind(res.Gvk)
+
+	if err := c.List(ctx, list, client.InNamespace(res.Namespace)); err != nil {
+		if meta.IsNoMatchError(err) {
+			log.Info("CRD not found, will not delete", "gvk", res.Gvk.String())
+			return nil
+		}
+		return fmt.Errorf("failed to list %s: %w", res.Gvk.Kind, err)
+	}
+
+	var merr *multierror.Error
+	for i := range list.Items {
+		item := list.Items[i]
+		v, ok, err := unstructured.NestedString(item.Object, res.Path...)
+		if err != nil {
+			merr = multierror.Append(merr, fmt.Errorf("failed to get field %v for %s %s/%s: %w", res.Path, res.Gvk.Kind, res.Namespace, item.GetName(), err))
+			continue
+		}
+		if !ok {
+			// field absent on this item; nothing to match
+			continue
+		}
+		for _, toDelete := range res.Values {
+			if v == toDelete {
+				if err := c.Delete(ctx, &item); err != nil {
+					if k8serr.IsNotFound(err) {
+						break
+					}
+					merr = multierror.Append(merr, fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err))
+					break
+				}
+				log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace)
+				break
+			}
+		}
+	}
+	return merr.ErrorOrNil()
+}

Based on learnings

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b427ed and bdc97a2.

📒 Files selected for processing (5)
  • cmd/main.go (1 hunks)
  • pkg/cluster/cluster_config.go (1 hunks)
  • pkg/cluster/resources.go (1 hunks)
  • pkg/resources/resources.go (3 hunks)
  • pkg/upgrade/upgrade.go (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
  • CreateHardwareProfile (339-384)
pkg/cluster/gvk/gvk.go (2)
  • HardwareProfile (88-92)
  • Namespace (28-32)
cmd/main.go (1)
pkg/cluster/cluster_config.go (1)
  • GetDeployedRelease (107-131)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (1)
  • CreateHardwareProfile (160-167)
pkg/cluster/cluster_config.go (2)
api/common/types.go (1)
  • Release (159-162)
pkg/cluster/resources.go (2)
  • GetDSCI (115-135)
  • GetDSC (92-112)
pkg/resources/resources.go (2)
pkg/cluster/gvk/gvk.go (1)
  • Namespace (28-32)
pkg/resources/resources_types.go (1)
  • UnstructuredList (11-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build/push catalog image
  • GitHub Check: golangci-lint
  • GitHub Check: Run tests and collect coverage on internal and pkg
🔇 Additional comments (5)
pkg/cluster/cluster_config.go (1)

93-131: GetDeployedRelease flow looks correct and robust.

NotFound handling and fallback to DSC are appropriate; empty Release on clean installs is clear.

pkg/upgrade/upgrade.go (1)

414-416: Good reuse of cluster.CreateHardwareProfile.

Centralizing creation improves consistency; pairs well if CreateHardwareProfile uses retry/backoff.

Ensure cluster.CreateHardwareProfile includes retry/backoff (see suggested change in pkg/cluster/resources.go).

Also applies to: 477-480

pkg/resources/resources.go (3)

33-42: ResourceSpec API looks good.

Clear contract for GVK, namespace, path, values.


659-669: Batch deletion with multierror is appropriate.

Append + ErrorOrNil used correctly. Based on learnings.


731-739: UnsetOwnerReferences behavior is fine.

Simple orphaning via Update; no changes needed.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/resources/resources.go (1)

706-714: Add break after successful deletion.

After deleting an item when a FilterValue matches, the loop continues checking remaining FilterValues. While functionally harmless (the deleted item won't match again), this is inefficient. Add a break statement after the delete succeeds to exit the inner loop early.

Apply this diff:

 		for _, targetValue := range res.FilterValues {
 			if v == targetValue {
 				err = c.Delete(ctx, &item)
 				if err != nil {
 					return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err)
 				}
 				log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace)
+				break
 			}
 		}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bdc97a2 and 7abab30.

📒 Files selected for processing (5)
  • cmd/main.go (1 hunks)
  • pkg/cluster/cluster_config.go (1 hunks)
  • pkg/cluster/resources.go (1 hunks)
  • pkg/resources/resources.go (3 hunks)
  • pkg/upgrade/upgrade.go (8 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/main.go
  • pkg/cluster/cluster_config.go
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
  • CreateHardwareProfile (339-384)
pkg/cluster/gvk/gvk.go (1)
  • HardwareProfile (88-92)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (1)
  • CreateHardwareProfile (160-167)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint
🔇 Additional comments (7)
pkg/cluster/resources.go (1)

150-167: LGTM: Idempotent HardwareProfile creation.

The function correctly implements idempotent resource creation by treating "AlreadyExists" errors as success. Error messages provide clear context including namespace and name.

pkg/upgrade/upgrade.go (3)

51-64: LGTM: Well-structured ContainerSize type.

The nested struct definition clearly maps to the OdhDashboardConfig structure, making it straightforward to parse container size specifications.


414-415: LGTM: Centralized HardwareProfile creation.

The refactoring to use cluster.CreateHardwareProfile centralizes creation logic and provides consistent error handling. The error messages include helpful context (profile type, namespace, container size name).

Also applies to: 477-479


942-961: LGTM: Centralized annotation generation.

The helper function ensures consistent annotation structure across all HardwareProfile creation paths, reducing duplication and potential inconsistencies.

pkg/resources/resources.go (3)

33-43: LGTM: Well-documented ResourceSpec type.

The struct is clearly defined with helpful comments explaining the purpose of each field, particularly the FieldPath and FilterValues fields.


650-669: LGTM: Proper error aggregation.

The function correctly uses multierror to collect and return all errors encountered during batch deletion, following the pattern established in the learnings.


720-740: LGTM: Clear orphaning utility.

The function correctly removes owner references and updates the resource, with appropriate error context.


codecov bot commented Oct 16, 2025

Codecov Report

❌ Patch coverage is 31.18280% with 64 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.74%. Comparing base (0ea2f33) to head (17f6efb).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
pkg/resources/resources.go 0.00% 35 Missing ⚠️
pkg/cluster/cluster_config.go 0.00% 18 Missing ⚠️
pkg/cluster/resources.go 0.00% 5 Missing ⚠️
pkg/upgrade/upgrade.go 78.57% 2 Missing and 1 partial ⚠️
pkg/upgrade/upgrade_utils.go 85.71% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2695      +/-   ##
==========================================
+ Coverage   49.65%   49.74%   +0.09%     
==========================================
  Files         144      144              
  Lines       10562    10388     -174     
==========================================
- Hits         5245     5168      -77     
+ Misses       4758     4666      -92     
+ Partials      559      554       -5     

☔ View full report in Codecov by Sentry.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/upgrade/upgrade_utils.go (1)

110-136: Do not treat missing manifest as an error.

When the OdhDashboardConfig manifest isn’t present, return (nil, false, nil) so callers can gracefully skip. Currently, the os.Stat error propagates and aborts migration.

 func loadOdhDashboardConfigFromManifests(ctx context.Context) (*unstructured.Unstructured, bool, error) {
@@
-	_, err := os.Stat(manifestPath)
-	if err == nil {
+	if _, err := os.Stat(manifestPath); err == nil {
 		// ... read & parse manifest ...
 		// Verify it's an OdhDashboardConfig
 		if obj.GetKind() == "OdhDashboardConfig" {
 			log.Info("Successfully loaded OdhDashboardConfig from manifest", "path", manifestPath)
 			return &obj, true, nil
 		}
 	}
-	return nil, false, err
+	// Manifest not found (or not a matching kind) is not an error: signal “not found”.
+	return nil, false, nil
 }
♻️ Duplicate comments (2)
pkg/resources/resources.go (2)

683-718: Make deletion robust: skip missing field, ignore 404s, conditional namespace; fix typo.

  • Don’t fail the whole batch when FieldPath is absent on an item; skip it.
  • Treat NotFound on delete as benign (idempotent cleanup).
  • Apply InNamespace only when non-empty to avoid issues with cluster-scoped kinds.
  • Fix the “unexisting” typo in the error message.

Apply this diff:

@@
-	err := c.List(ctx, list, client.InNamespace(res.Namespace))
+	var opts []client.ListOption
+	if res.Namespace != "" {
+		opts = append(opts, client.InNamespace(res.Namespace))
+	}
+	err := c.List(ctx, list, opts...)
@@
-		if !ok {
-			return fmt.Errorf("unexisting path to delete: %v", res.FieldPath)
-		}
+		if !ok {
+			log.V(1).Info("Field path not found; skipping item",
+				"fieldPath", res.FieldPath, "name", item.GetName(),
+				"gvk", res.Gvk.String(), "namespace", res.Namespace)
+			continue
+		}
@@
-				err = c.Delete(ctx, &item)
-				if err != nil {
-					return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err)
-				}
+				if err := c.Delete(ctx, &item); err != nil {
+					if k8serr.IsNotFound(err) {
+						log.V(1).Info("Object already deleted", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace)
+						continue
+					}
+					return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err)
+				}
 				log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace)

702-704: Fix wording: “nonexistent field path”.

Update the error text for correctness.

-			return fmt.Errorf("unexisting path to delete: %v", res.FieldPath)
+			return fmt.Errorf("nonexistent field path: %v", res.FieldPath)
🧹 Nitpick comments (3)
pkg/resources/resources.go (1)

35-43: Nit: prefer conventional initialism “GVK”.

Rename field Gvk → GVK for consistency with Go initialism conventions. Low priority.

pkg/upgrade/upgrade_utils.go (2)

159-185: Minor: simplify max tracking; avoid self-shadowing.

Use existing variables directly to improve readability.

-	if LimitMemory := LimitMem; !LimitMemory.IsZero() && LimitMemory.Cmp(maxMemory) > 0 {
-		maxMemory = LimitMemory
-	}
-	if LimitCpu := LimitCpu; !LimitCpu.IsZero() && LimitCpu.Cmp(maxCpu) > 0 {
-		maxCpu = LimitCpu
-	}
+	if !LimitMem.IsZero() && LimitMem.Cmp(maxMemory) > 0 {
+		maxMemory = LimitMem
+	}
+	if !LimitCpu.IsZero() && LimitCpu.Cmp(maxCpu) > 0 {
+		maxCpu = LimitCpu
+	}

215-239: Optional: reorder returns for clarity (or return a struct).

Returning (ReqMem, ReqCpu, LimitMem, LimitCpu) is non-intuitive. Consider returning in CPU-first order and updating the call site, or define a small struct.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7abab30 and 49f4da7.

📒 Files selected for processing (6)
  • cmd/main.go (1 hunks)
  • pkg/cluster/cluster_config.go (1 hunks)
  • pkg/cluster/resources.go (1 hunks)
  • pkg/resources/resources.go (3 hunks)
  • pkg/upgrade/upgrade.go (7 hunks)
  • pkg/upgrade/upgrade_utils.go (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/cluster/resources.go
  • cmd/main.go
🧰 Additional context used
🧬 Code graph analysis (4)
pkg/cluster/cluster_config.go (2)
api/common/types.go (1)
  • Release (159-162)
pkg/cluster/resources.go (2)
  • GetDSCI (115-135)
  • GetDSC (92-112)
pkg/upgrade/upgrade_utils.go (1)
pkg/cluster/resources.go (1)
  • CreateHardwareProfile (160-167)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (2)
  • GetHardwareProfile (138-148)
  • CreateHardwareProfile (160-167)
pkg/resources/resources.go (2)
pkg/cluster/gvk/gvk.go (1)
  • Namespace (28-32)
pkg/resources/resources_types.go (1)
  • UnstructuredList (11-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint
🔇 Additional comments (3)
pkg/cluster/cluster_config.go (1)

93-131: LGTM: clean, correct fallback order (DSCI → DSC) and safe not-found handling.

pkg/upgrade/upgrade.go (2)

508-556: LGTM: idempotent creation of custom-serving HardwareProfile with standardized annotations.


327-349: LGTM: fetch OdhDashboardConfig once; ensure custom-serving HWP exists before ISVC migration.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/upgrade/upgrade_utils.go (1)

110-136: Treat missing manifest as “not found,” not an error.

os.Stat failure on absent file currently bubbles up and aborts migration. Return (nil, false, nil) for not-exist; only propagate other errors.

 	manifestPath := deploy.DefaultManifestPath + odhDashboardConfigPath
-	_, err := os.Stat(manifestPath)
-	if err == nil {
+	_, err := os.Stat(manifestPath)
+	if err == nil {
 		// ... load and return ...
-	}
-	return nil, false, err
+	}
+	if os.IsNotExist(err) {
+		return nil, false, nil
+	}
+	return nil, false, err
♻️ Duplicate comments (1)
pkg/resources/resources.go (1)

683-715: Make deletion robust: don’t fail on missing field, log with actual namespace, and break after delete.

Current behavior aborts the entire sweep if a single item lacks the field, and may attempt multiple deletes if FilterValues has duplicates.

-	for _, item := range list.Items {
-		v, ok, err := unstructured.NestedString(item.Object, res.FieldPath...)
+	for _, item := range list.Items {
+		v, ok, err := unstructured.NestedString(item.Object, res.FieldPath...)
 		if err != nil {
-			return fmt.Errorf("failed to get field %v for %s %s/%s: %w", res.FieldPath, res.Gvk.Kind, res.Namespace, item.GetName(), err)
+			return fmt.Errorf("failed to get field %v for %s %s/%s: %w",
+				res.FieldPath, res.Gvk.Kind, item.GetNamespace(), item.GetName(), err)
 		}
 
-		if !ok {
-			return fmt.Errorf("unexisting path to delete: %v", res.FieldPath)
-		}
+		if !ok {
+			log.Info("nonexistent field path; skipping item",
+				"fieldPath", res.FieldPath, "gvk", res.Gvk.String(),
+				"namespace", item.GetNamespace(), "name", item.GetName())
+			continue
+		}
 
 		for _, targetValue := range res.FilterValues {
 			if v == targetValue {
-				err = c.Delete(ctx, &item)
-				if err != nil {
-					return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err)
-				}
-				log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace)
+				if err := c.Delete(ctx, &item); err != nil {
+					return fmt.Errorf("failed to delete %s %s/%s: %w",
+						res.Gvk.Kind, item.GetNamespace(), item.GetName(), err)
+				}
+				log.Info("Deleted object",
+					"gvk", res.Gvk.String(),
+					"namespace", item.GetNamespace(),
+					"name", item.GetName())
+				break
 			}
 		}
 	}

Also fixes the previously reported wording.
Based on past review comments

🧹 Nitpick comments (5)
pkg/cluster/resources.go (1)

150-167: Harden creation against webhook timing; reuse CreateWithRetry.

Use the existing exponential backoff helper to tolerate transient 500s from webhooks and keep idempotency.

 func CreateHardwareProfile(ctx context.Context, cli client.Client, hwp *infrav1.HardwareProfile) error {
-	if err := cli.Create(ctx, hwp); err != nil {
-		if !k8serr.IsAlreadyExists(err) {
-			return fmt.Errorf("failed to create HardwareProfile '%s/%s': %w", hwp.Namespace, hwp.Name, err)
-		}
-	}
-	return nil
+	if err := CreateWithRetry(ctx, cli, hwp); err != nil {
+		return fmt.Errorf("failed to create HardwareProfile '%s/%s': %w", hwp.Namespace, hwp.Name, err)
+	}
+	return nil
 }
pkg/resources/resources.go (2)

33-43: ResourceSpec is clear; minor naming nit.

Consider renaming field Gvk → GVK to follow Go initialism conventions. Not blocking.


660-669: Avoid shadowing stdlib errors and skip appending nils.

Rename the accumulator for readability and to avoid confusion with the imported errors package.

-func DeleteResources(ctx context.Context, c client.Client, resources []ResourceSpec) error {
-	var errors *multierror.Error
+func DeleteResources(ctx context.Context, c client.Client, resources []ResourceSpec) error {
+	var agg *multierror.Error
 	for _, res := range resources {
-		err := DeleteOneResource(ctx, c, res)
-		errors = multierror.Append(errors, err)
+		if err := DeleteOneResource(ctx, c, res); err != nil {
+			agg = multierror.Append(agg, err)
+		}
 	}
-	return errors.ErrorOrNil()
+	return agg.ErrorOrNil()
 }
pkg/upgrade/upgrade_utils.go (2)

671-674: Use the customServing constant instead of a string literal.

Prevents drift if the name changes once.

-	if strings.HasPrefix(hwpName, containerSizeHWPPrefix) || (hwpName == "custom-serving") {
+	if strings.HasPrefix(hwpName, containerSizeHWPPrefix) || hwpName == customServing {
 		annotations[hardwareProfileNamespaceAnnotation] = namespace
 	}

384-388: Don’t allow AP annotations to override managed HWP annotations.

Protect keys we set (visibility, modified-date, display-name, description, disabled) when copying from AP.

-	// Copy existing annotations from AP
-	if apAnnotations := ap.GetAnnotations(); apAnnotations != nil {
-		for k, v := range apAnnotations {
-			annotations[k] = v
-		}
-	}
+	// Copy non-managed annotations from AP
+	if apAnnotations := ap.GetAnnotations(); apAnnotations != nil {
+		reserved := map[string]struct{}{
+			hardwareProfileVisibilityAnnotation:   {},
+			hardwareProfileModifiedDateAnnotation: {},
+			hardwareProfileDisplayNameAnnotation:  {},
+			hardwareProfileDescriptionAnnotation:  {},
+			hardwareProfileDisabledAnnotation:     {},
+		}
+		for k, v := range apAnnotations {
+			if _, keep := reserved[k]; keep {
+				continue
+			}
+			annotations[k] = v
+		}
+	}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ce08d45 and 0ee1be6.

📒 Files selected for processing (6)
  • cmd/main.go (1 hunks)
  • pkg/cluster/cluster_config.go (1 hunks)
  • pkg/cluster/resources.go (1 hunks)
  • pkg/resources/resources.go (3 hunks)
  • pkg/upgrade/upgrade.go (7 hunks)
  • pkg/upgrade/upgrade_utils.go (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/main.go
🧰 Additional context used
🧬 Code graph analysis (5)
pkg/resources/resources.go (1)
pkg/cluster/gvk/gvk.go (1)
  • Namespace (28-32)
pkg/upgrade/upgrade_utils.go (1)
pkg/cluster/resources.go (1)
  • CreateHardwareProfile (160-167)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
  • CreateHardwareProfile (339-384)
pkg/cluster/gvk/gvk.go (1)
  • HardwareProfile (88-92)
pkg/cluster/cluster_config.go (2)
api/common/types.go (1)
  • Release (159-162)
pkg/cluster/resources.go (2)
  • GetDSCI (115-135)
  • GetDSC (92-112)
pkg/upgrade/upgrade.go (2)
pkg/cluster/resources.go (2)
  • GetHardwareProfile (138-148)
  • CreateHardwareProfile (160-167)
tests/e2e/helper_test.go (1)
  • CreateHardwareProfile (339-384)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint
🔇 Additional comments (3)
pkg/cluster/cluster_config.go (1)

93-131: Release detection logic looks correct; fallback order is sensible.

DSCI → DSC fallback with not-found handling is clean. No action required.

pkg/upgrade/upgrade_utils.go (1)

98-101: Centralized HWP creation path looks good.

Error context is clear and uses the new cluster.CreateHardwareProfile.

pkg/upgrade/upgrade.go (1)

508-556: Custom-serving HWP creation flow is clean and idempotent.

Existence check → annotated create with centralized helper → clear logging. Looks good.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
pkg/resources/resources.go (1)

702-704: Fix typo in error message.

The error message uses "unexisting path" which is grammatically incorrect. As noted in a previous review, this should be "nonexistent field path" or similar.

Apply this diff:

-		if !ok {
-			return fmt.Errorf("unexisting path to delete: %v", res.FieldPath)
-		}
+		if !ok {
+			return fmt.Errorf("nonexistent field path: %v", res.FieldPath)
+		}
pkg/upgrade/upgrade.go (1)

354-354: Fix typo in comment.

"HardwarrfeProfiles" should be "HardwareProfiles". This was flagged in a previous review.

Apply this diff:

-// as described in RHOAIENG-33158. This creates 2 HardwarrfeProfiles for each AP (notebooks and serving).
+// as described in RHOAIENG-33158. This creates 2 HardwareProfiles for each AP (notebooks and serving).
🧹 Nitpick comments (2)
pkg/resources/resources.go (1)

33-43: Consider clarifying field naming per previous review feedback.

The FieldPath and FilterValues naming works but could be clearer. A past reviewer suggested considering names like FilterFieldPath and FilterFieldValues to make the filtering intent more explicit.

pkg/upgrade/upgrade.go (1)

632-632: Consider using string formatting for the log message.

The string concatenation works but could be cleaner with fmt.Sprintf or by adding the hardwareProfile value as a separate log field (which is already done).

Optional alternative:

-				log.Info("Set HardwareProfile annotation for InferenceService with "+customServing+" HardwareProfile", "isvc", isvc.GetName(), "hardwareProfile", hwpName)
+				log.Info("Set HardwareProfile annotation for InferenceService with custom-serving HardwareProfile", "isvc", isvc.GetName(), "hardwareProfile", hwpName)

Or simply:

-				log.Info("Set HardwareProfile annotation for InferenceService with "+customServing+" HardwareProfile", "isvc", isvc.GetName(), "hardwareProfile", hwpName)
+				log.Info("Set default HardwareProfile annotation for InferenceService", "isvc", isvc.GetName(), "hardwareProfile", hwpName)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ee1be6 and 8fac651.

📒 Files selected for processing (9)
  • Makefile (1 hunks)
  • bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (3 hunks)
  • cmd/main.go (1 hunks)
  • config/manifests/bases/opendatahub-operator.clusterserviceversion.yaml (3 hunks)
  • pkg/cluster/cluster_config.go (1 hunks)
  • pkg/cluster/resources.go (1 hunks)
  • pkg/resources/resources.go (3 hunks)
  • pkg/upgrade/upgrade.go (7 hunks)
  • pkg/upgrade/upgrade_utils.go (10 hunks)
✅ Files skipped from review due to trivial changes (1)
  • Makefile
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/cluster/cluster_config.go
  • pkg/upgrade/upgrade_utils.go
🧰 Additional context used
🧬 Code graph analysis (4)
cmd/main.go (1)
pkg/cluster/cluster_config.go (1)
  • GetDeployedRelease (107-131)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
  • CreateHardwareProfile (339-384)
pkg/cluster/gvk/gvk.go (1)
  • HardwareProfile (88-92)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (2)
  • GetHardwareProfile (138-148)
  • CreateHardwareProfile (160-167)
pkg/resources/resources.go (2)
pkg/cluster/gvk/gvk.go (1)
  • Namespace (28-32)
pkg/resources/resources_types.go (1)
  • UnstructuredList (11-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint
🔇 Additional comments (9)
config/manifests/bases/opendatahub-operator.clusterserviceversion.yaml (1)

9-9: Version bumps are consistent across manifest identifiers.

All three version references (containerImage, metadata.name, and spec.version) have been updated uniformly from 3.0.0 to 3.1.0, maintaining consistency for the operator distribution manifest.

Also applies to: 20-20, 175-175

bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (1)

85-85: Bundle manifest version bumps align with base manifest.

Version identifiers in the bundle manifest are updated consistently with the base manifest: containerImage, metadata.name, and spec.version all reflect the 3.1.0 release version uniformly.

Also applies to: 98-98, 1537-1537

pkg/cluster/resources.go (1)

150-167: LGTM!

The function correctly implements idempotent HardwareProfile creation with appropriate error handling and context.

pkg/resources/resources.go (2)

650-669: LGTM!

The error aggregation pattern correctly follows go-multierror best practices, using a pointer and calling ErrorOrNil() to return nil when no errors occurred.


720-740: LGTM!

The function correctly handles owner reference removal with appropriate early-return optimization when no owner references exist.

cmd/main.go (1)

262-262: LGTM!

The change correctly uses the relocated cluster.GetDeployedRelease function, aligning with the PR's objective to move functions to the cluster package for reuse.

pkg/upgrade/upgrade.go (3)

35-54: LGTM!

The new constants appropriately centralize magic strings and improve code maintainability with clear, consistent naming.


347-347: LGTM!

The function call correctly uses the renamed createCustomServingHardwareProfile (now package-private), which is appropriate since it's only used within the upgrade package.


508-556: LGTM!

The function correctly uses the centralized cluster.CreateHardwareProfile and follows proper idempotent creation patterns. The annotation handling appropriately sets the managed flag to "false" for the custom-serving profile.

@zdtsw zdtsw force-pushed the chore_386 branch 2 times, most recently from 38ec9b9 to 17a9c00 Compare November 3, 2025 08:34
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
pkg/resources/resources.go (1)

703-703: Fix typo in error message.

The error message contains "unexisting path" which should be "nonexistent path" or "non-existing path" for grammatical correctness.

Apply this diff:

-		return fmt.Errorf("unexisting path to delete: %v", res.FieldPath)
+		return fmt.Errorf("nonexistent field path: %v", res.FieldPath)
pkg/upgrade/upgrade.go (1)

354-354: Fix typo in comment.

"HardwarrfeProfiles" → "HardwareProfiles"

Apply this diff:

-// as described in RHOAIENG-33158. This creates 2 HardwarrfeProfiles for each AP (notebooks and serving).
+// as described in RHOAIENG-33158. This creates 2 HardwareProfiles for each AP (notebooks and serving).
🧹 Nitpick comments (2)
pkg/resources/resources.go (2)

706-714: Consider breaking after successful deletion.

After deleting a resource (line 708), the code continues checking remaining FilterValues. Since the item is already deleted, checking additional filter values is unnecessary.

Apply this diff to improve efficiency:

 		for _, targetValue := range res.FilterValues {
 			if v == targetValue {
 				err = c.Delete(ctx, &item)
 				if err != nil {
 					return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err)
 				}
 				log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace)
+				break
 			}
 		}

720-740: Consider improving error message grammar.

The error message on line 736 reads "error unset ownerreference" which is grammatically awkward.

Apply this diff:

-			return fmt.Errorf("error unset ownerreference for CR %s : %w", instanceName, err)
+			return fmt.Errorf("failed to unset owner references for CR %s: %w", instanceName, err)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38ec9b9 and 17a9c00.

📒 Files selected for processing (6)
  • cmd/main.go (1 hunks)
  • pkg/cluster/cluster_config.go (1 hunks)
  • pkg/cluster/resources.go (1 hunks)
  • pkg/resources/resources.go (3 hunks)
  • pkg/upgrade/upgrade.go (7 hunks)
  • pkg/upgrade/upgrade_utils.go (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/main.go
🧰 Additional context used
🧬 Code graph analysis (5)
pkg/cluster/cluster_config.go (2)
api/common/types.go (2)
  • Release (159-162)
  • Status (97-105)
pkg/cluster/resources.go (2)
  • GetDSCI (115-135)
  • GetDSC (92-112)
pkg/resources/resources.go (2)
pkg/cluster/gvk/gvk.go (1)
  • Namespace (28-32)
pkg/resources/resources_types.go (1)
  • UnstructuredList (11-11)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (2)
  • GetHardwareProfile (164-174)
  • CreateHardwareProfile (186-193)
pkg/upgrade/upgrade_utils.go (1)
pkg/cluster/resources.go (1)
  • CreateHardwareProfile (186-193)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
  • CreateHardwareProfile (339-384)
pkg/cluster/gvk/gvk.go (1)
  • HardwareProfile (88-92)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint
🔇 Additional comments (12)
pkg/resources/resources.go (4)

14-14: LGTM!

The new imports are appropriate for the added functionality: go-multierror for error aggregation, api/meta for API error handling, and controller-runtime logging.

Also applies to: 19-19, 28-28


33-43: LGTM!

The ResourceSpec struct is well-documented with clear field descriptions and examples. The structure is appropriate for identifying and filtering Kubernetes resources.


650-669: LGTM!

The DeleteResources function correctly aggregates errors from multiple deletion attempts using go-multierror, ensuring that all deletions are attempted even if some fail.


687-694: Good error handling for missing CRDs.

The function gracefully handles NoMatchError when a CRD is not found, logging the condition and continuing. This prevents failures when deleting resources for optional components.

pkg/cluster/cluster_config.go (1)

93-131: LGTM! Well-structured function with proper error handling.

The implementation correctly handles the fallback logic from DSCI to DSC and returns an empty Release when neither exists, which aligns with the "clean installation or both CRs deleted" scenario mentioned in the comment.

pkg/cluster/resources.go (1)

176-193: LGTM! Clean idempotent creation pattern.

The function correctly handles the "already exists" case and provides clear error messages with namespace and name context.

pkg/upgrade/upgrade_utils.go (4)

94-103: LGTM! Proper delegation to cluster package.

The error message provides comprehensive context for debugging, including the HardwareProfile name, container size name, profile type, and namespace.


341-359: LGTM! Consistent use of centralized creation logic.

The function signature improvement and delegation to cluster.CreateHardwareProfile provide better consistency across the codebase.


545-550: LGTM! Constants improve maintainability.

Using featureVisibilityModelServing and featureVisibilityWorkbench constants instead of hard-coded strings reduces the risk of typos and makes updates easier.


680-699: LGTM! Excellent consolidation of annotation logic.

This helper function ensures consistency across all HardwareProfile creation paths and reduces code duplication.

pkg/upgrade/upgrade.go (2)

35-55: LGTM! Excellent centralization of configuration.

The introduction of these constants significantly improves maintainability and reduces the risk of typos in string literals throughout the codebase.


508-556: LGTM! Clean refactor with proper use of helpers and constants.

The function correctly leverages the centralized createHardwareProfileAnnotations helper and cluster.CreateHardwareProfile for consistency. The renaming from exported to unexported is appropriate since this is an internal upgrade utility.

@davidebianchi
Member

What about add unit tests to moved functions?

@zdtsw
Member Author

zdtsw commented Nov 3, 2025

What about add unit tests to moved functions?

You are right, I should add unit tests for the functions moved out of the upgrade package

@zdtsw zdtsw marked this pull request as draft November 3, 2025 11:59
@zdtsw zdtsw changed the title update: cleanup unnecessary upgrade logic + move functions [WIP]update: cleanup unnecessary upgrade logic + move functions Nov 3, 2025
@zdtsw zdtsw changed the title [WIP]update: cleanup unnecessary upgrade logic + move functions update: cleanup unnecessary upgrade logic + move functions Nov 16, 2025
- remove functions and calls not needed for 3.2 (should be removed in
  3.0 already)
  jupyterhub resource cleanup
  rbac for modelreg
  envoyfilter patch for serving
  watson resource docs
  patch odhdashboardconfig for trusty enablement
- move and cleanup function for AP to HWP migration

Signed-off-by: Wen Zhou <[email protected]>
@zdtsw zdtsw marked this pull request as ready for review November 16, 2025 10:05
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
pkg/cluster/cluster_config.go (1)

101-139: GetDeployedRelease fallback logic looks correct; consider adding unit tests.

The function cleanly prefers DSCI over DSC, treats NotFound as a fallback condition, and surfaces other errors unchanged, which matches the intended behavior and will be useful outside the upgrade path. Given this is now a central helper, it would be good to back it with unit tests covering: DSCI-only, DSC-only, both-missing, and non-NotFound error cases.

pkg/resources/resources.go (2)

14-43: ResourceSpec deletion utilities look sound; a couple of robustness tweaks to consider.

The ResourceSpec model and the DeleteResources/DeleteOneResource helpers read clearly and the use of multierror plus meta.IsNoMatchError makes sense for best-effort cleanup. Two minor robustness suggestions:

  • In DeleteOneResource, a missing FieldPath currently returns an error on the first object. If these specs are ever reused in contexts where the field might be absent on some instances, you may want to log and continue instead of failing the whole spec.
  • For potential cluster-scoped resources, you might consider only applying client.InNamespace(res.Namespace) when res.Namespace is non-empty to avoid relying on how the client treats an empty namespace for cluster-scoped GVKs.

These are non-blocking; the current behavior is acceptable for strictly-validated upgrade specs.
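The namespace suggestion can be sketched with a stand-in option type; `listOption` and `listOptsFor` are illustrative, not controller-runtime's API, where the real code would append `client.InNamespace(ns)` conditionally:

```go
package main

import "fmt"

// listOption is a stand-in for client.ListOption.
type listOption string

// listOptsFor adds a namespace selector only when a namespace is set,
// so cluster-scoped GVKs are listed without one.
func listOptsFor(ns string) []listOption {
	opts := []listOption{}
	if ns != "" {
		opts = append(opts, listOption("InNamespace="+ns))
	}
	return opts
}

func main() {
	fmt.Println(len(listOptsFor("")))          // 0: cluster-scoped listing
	fmt.Println(listOptsFor("opendatahub")[0]) // InNamespace=opendatahub
}
```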

Also applies to: 650-718


720-740: UnsetOwnerReferences is correct but could be slightly safer.

The helper does what it says, but you might want to:

  • Guard against a nil odhObject to avoid a possible panic.
  • Check len(odhObject.GetOwnerReferences()) == 0 instead of just != nil so you don’t send an unnecessary Update when the slice is empty.

Behavior today is fine for controlled call sites; this would just harden it a bit.
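A minimal sketch of the two guards, with a stand-in type instead of `client.Object` (the type and helper names are illustrative):

```go
package main

import "fmt"

// ownerRefHolder is a minimal stand-in for a client.Object carrying
// owner references.
type ownerRefHolder struct{ refs []string }

func (o *ownerRefHolder) GetOwnerReferences() []string { return o.refs }

// needsUnset reports whether an Update is actually required: guard a
// nil object against panics, and skip the call when the slice is
// already empty rather than merely non-nil.
func needsUnset(o *ownerRefHolder) bool {
	if o == nil {
		return false
	}
	return len(o.GetOwnerReferences()) > 0
}

func main() {
	fmt.Println(needsUnset(nil))                                    // false
	fmt.Println(needsUnset(&ownerRefHolder{}))                      // false: empty slice, no Update
	fmt.Println(needsUnset(&ownerRefHolder{refs: []string{"dsc"}})) // true
}
```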

pkg/cluster/resources.go (1)

176-193: Centralized HardwareProfile creation helper looks good; consider minor polish.

Wrapping cli.Create and swallowing IsAlreadyExists gives you a nice idempotent entry point and keeps callers simple. Two small, optional refinements:

  • You could use return client.IgnoreAlreadyExists(cli.Create(ctx, hwp)) to express the intent a bit more idiomatically.
  • If you expect this to be reused outside one-off migrations, consider whether you want the same retry behavior as CreateWithRetry (e.g., for webhook startup races).

Current implementation is perfectly serviceable for upgrade/migration flows.

pkg/upgrade/upgrade_utils.go (1)

381-382: Centralizing HardwareProfile annotations is nice; consider making the timestamp UTC/injectable.

createHardwareProfileAnnotations and getFeatureVisibility nicely standardize visibility, display name, description, disabled flag, and modified date across all HardwareProfile creation paths, and their reuse in both HWP generators and createCustomServingHardwareProfile reduces duplication and typo risk. One minor improvement: using time.Now().UTC().Format(time.RFC3339) (or injecting a clock for tests) would make hardwareProfileModifiedDateAnnotation timezone-independent and easier to assert in unit tests.

Also applies to: 463-465, 492-493, 525-526, 545-550, 680-699

pkg/upgrade/upgrade.go (2)

35-55: New constants for defaults, annotations, and customServing improve clarity; double‑check defaultMin vs defaultResourceLimits.*

Centralizing odhDashboardConfigName, customServing, the various hardwareProfile* annotation keys, feature-visibility JSON strings, and containerSizeHWPPrefix significantly reduces duplication and makes the later usages (e.g., in AttachHardwareProfileToInferenceServices’ default hwpName and log message) easier to follow. One thing worth re‑validating: defaultMinMemory is "1Mi" while defaultResourceLimits["minMemory"] is "8Gi"—if these are meant to represent the same conceptual default, you may want to align them; if they’re intentionally different for “no containerSizes” vs “computed from sizes” paths, a brief comment could help future readers.

Also applies to: 611-633


508-556: createCustomServingHardwareProfile logic is sound; only a small error-message nit.

The function correctly:

  • Checks for the existing HardwareProfile via cluster.GetHardwareProfile, treating only non-NotFound errors as fatal.
  • Uses createHardwareProfileAnnotations + hardwareProfileManagedAnnotation = "false" to initialize metadata.
  • Delegates creation to cluster.CreateHardwareProfile for idempotency and logs success.

If you touch this again, you might consider tweaking the error format string to something like:

return fmt.Errorf("failed to check HardwareProfile CR %q: %w", customServing, customServingError)

to read a bit more naturally.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17a9c00 and 17f6efb.

📒 Files selected for processing (6)
  • cmd/main.go (1 hunks)
  • pkg/cluster/cluster_config.go (1 hunks)
  • pkg/cluster/resources.go (1 hunks)
  • pkg/resources/resources.go (3 hunks)
  • pkg/upgrade/upgrade.go (7 hunks)
  • pkg/upgrade/upgrade_utils.go (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/main.go
🧰 Additional context used
🧬 Code graph analysis (5)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
  • CreateHardwareProfile (345-390)
pkg/cluster/gvk/gvk.go (1)
  • HardwareProfile (88-92)
pkg/resources/resources.go (2)
pkg/cluster/gvk/gvk.go (1)
  • Namespace (28-32)
pkg/resources/resources_types.go (1)
  • UnstructuredList (11-11)
pkg/upgrade/upgrade_utils.go (1)
pkg/cluster/resources.go (1)
  • CreateHardwareProfile (186-193)
pkg/cluster/cluster_config.go (2)
api/common/types.go (1)
  • Release (159-162)
pkg/cluster/resources.go (2)
  • GetDSCI (115-135)
  • GetDSC (92-112)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (2)
  • GetHardwareProfile (164-174)
  • CreateHardwareProfile (186-193)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build/push catalog image
  • GitHub Check: golangci-lint
  • GitHub Check: Run tests and collect coverage on internal and pkg
🔇 Additional comments (3)
pkg/upgrade/upgrade_utils.go (2)

60-92: Using odhDashboardConfigName constant in getOdhDashboardConfig improves consistency.

Switching the cluster lookup to use the shared odhDashboardConfigName constant keeps name and manifest path in sync and avoids hard‑coding. No issues from a behavior standpoint.


94-103: Routing hardware-profile creation through cluster.CreateHardwareProfile is a good consolidation.

Both createHardwareProfileFromContainerSize and createHardwareProfileFromAcceleratorProfile now defer creation to cluster.CreateHardwareProfile, so AlreadyExists is handled uniformly and call sites only need to wrap non-idempotent failures. This keeps upgrade logic thinner and makes future behavior changes (e.g., adding retries) localized to one place.

Also applies to: 341-359

pkg/upgrade/upgrade.go (1)

317-349: MigrateToInfraHardwareProfiles orchestration still looks correct with the new custom-serving helper.

The flow—fetch OdhDashboardConfig once, then migrate accelerator profiles, container sizes, notebooks, and finally inference services—remains intact, and inserting createCustomServingHardwareProfile just before attaching annotations to InferenceServices ensures the “custom-serving” HWP is created (or verified) before you annotate. Continuing even when one step fails, while aggregating errors via multierror, matches the existing best-effort pattern.

@carlkyrillos
Member

@zdtsw Will any of the functions that are removed from the upgrade package impact upgrades from 2.25 to 3.2 or 3.3? AFAIK we aren't enforcing that users upgrade from 2.25 to 3.0 before 3.2/3.3 so I want to make sure we aren't breaking the upgrade path from 2.25 to 3.3.

@zdtsw
Member Author

zdtsw commented Nov 18, 2025

@zdtsw Will any of the functions that are removed from the upgrade package impact upgrades from 2.25 to 3.2 or 3.3? AFAIK we aren't enforcing that users upgrade from 2.25 to 3.0 before 3.2/3.3 so I want to make sure we aren't breaking the upgrade path from 2.25 to 3.3.

I do not think so; the logic for v2 to v3 stays in #2698, to be removed after 3.3.
The changes in this PR are for the internal v2.x -> v2.y upgrade.

@carlkyrillos
Member

@zdtsw Will any of the functions that are removed from the upgrade package impact upgrades from 2.25 to 3.2 or 3.3? AFAIK we aren't enforcing that users upgrade from 2.25 to 3.0 before 3.2/3.3 so I want to make sure we aren't breaking the upgrade path from 2.25 to 3.3.

I do not think so; the logic for v2 to v3 stays in #2698, to be removed after 3.3. The changes in this PR are for the internal v2.x -> v2.y upgrade.

Okay in that case I think this is ready to merge.

@openshift-ci

openshift-ci bot commented Nov 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carlkyrillos

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 315e014 into opendatahub-io:main Nov 18, 2025
20 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in ODH Platform Planning Nov 18, 2025
@zdtsw
Member Author

zdtsw commented Nov 19, 2025

/cherrypick rhoai

@openshift-cherrypick-robot

@zdtsw: #2695 failed to apply on top of branch "rhoai":

Applying: update: cleanup unnecessary upgrade logic
Using index info to reconstruct a base tree...
M	cmd/main.go
M	pkg/cluster/cluster_config.go
M	pkg/upgrade/upgrade.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/upgrade/upgrade.go
CONFLICT (content): Merge conflict in pkg/upgrade/upgrade.go
Auto-merging pkg/cluster/cluster_config.go
Auto-merging cmd/main.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 update: cleanup unnecessary upgrade logic

In response to this:

/cherrypick rhoai

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

zdtsw added a commit to zdtsw-forking/opendatahub-operator that referenced this pull request Nov 19, 2025
- remove functions and calls not needed for 3.2 (should be removed in
  3.0 already)
  jupyterhub resource cleanup
  rbac for modelreg
  envoyfilter patch for serving
  watson resource docs
  patch odhdashboardconfig for trusty enablement
- move and cleanup function for AP to HWP migration

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit 315e014)
@zdtsw
Member Author

zdtsw commented Nov 19, 2025

follow up #2881

openshift-merge-bot bot pushed a commit that referenced this pull request Nov 19, 2025
- remove functions and calls not needed for 3.2 (should be removed in
  3.0 already)
  jupyterhub resource cleanup
  rbac for modelreg
  envoyfilter patch for serving
  watson resource docs
  patch odhdashboardconfig for trusty enablement
- move and cleanup function for AP to HWP migration


(cherry picked from commit 315e014)

Signed-off-by: Wen Zhou <[email protected]>
openshift-merge-bot bot added a commit that referenced this pull request Nov 27, 2025
* Remove servicemesh not installed check. (#2443)

After adding the gatewayconfig CRD and controller, ODH and RHOAI will
always have servicemesh installed automatically by the openshift ingress
controller. It will not be possible to get into a state without OSSM 3.x
installed.

* remove redundant e2e test case (#2454)

* Change naming convention in mod arch image for dashboard (#2449)

* fix: fix catalog image related_image env variable name as per devops config (#2452)

Signed-off-by: Dhiraj Bokde <[email protected]>

* update: sample of DSCI and DSC + README (#2447)

- since we have introduce GW API creation, we should update our sample
- ossm v3 is installed out of the box, so we should no longer keep ossm v2 set to "Managed" as we advocated
- set to Unmanaged rather than Removed, to avoid accidental deletion
- this whole section should be taken away soon in the 3.0 release

Signed-off-by: Wen Zhou <[email protected]>

* fix: connection API on ISVC if type is changed + only create serviceaccount if type is s3 (#2433)

* fix: connection API on ISVC if type is changed

- we previously only handle when type is removed then we cleanup injection
- now we handle update as well
- refactor some old code
- handle serviceaccount creation and injection for s3 only

Signed-off-by: Wen Zhou <[email protected]>

* fix: lint

Signed-off-by: Wen Zhou <[email protected]>

* update: code review

- correct log and skip Patch if no cleanup is needed

Signed-off-by: Wen Zhou <[email protected]>

* update: fix case when there was no annotation

- if the ISVC exists but has no annotation: the user might have manually set certain parts, e.g. .spec.predictor.model.storageUri;
 on update with the annotation, the old storageUri etc. should be removed before new injection to avoid conflicts

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* feat(RHOAIENG-30795): Add configuration support to TrustyAI DSC component (#2225)

* feat(RHOAIENG-30795): Add configuration support to TrustyAI DSC component

* fix: Add missing docs

* fix: Fix linting errors

* fix: Remove redundant labels and pointer to boolean fields

Since they have default values, no need to use pointers.

* fix: Add TrustyAI CM to operator's resources. Remove unneeded annotations.

* fix: Change tests to account for AddResources

* fix: Linting of trustyai_test.go

* Merge test for default CM and enabled component

* Fix linting (remove dupl)

* chore: make bundle

* fix: revert REPLACE_IMAGE

* chore: skip HWP migration if dashboard HWP CRD missing (#2470)

* feat: add support to pass Oauth proxy image for downstream (#2466)

* add override for new workbenches (#2465)

* prometheus test should check monitoring namespace not apps namespace (#2453)

* feat: add new VAP to block create/update operation on dashboard's HWProfile CR and AcceleratorProfile CR (#2378)

* feat: add new VAP to block create/update operation on dashboard's HWProfile CR and AcceleratorProfile CR

- since this will only work on OCP 4.19+, there is no need to check the OCP version, as VAP/VAPB is enabled in the cluster
- in the case the cluster already has dashboard's AP/HWP CRD:
  	DSCI should deploy and own 4 new resources: 2 VAP and 2 VAPB to block any attempt to create a dashboard AcceleratorProfile/HardwareProfile CR or update an existing AcceleratorProfile/HardwareProfile CR
	these 4 resources will be re-created if the user tries to delete them
	when DSCI is deleted, these 4 resources should be removed from the cluster

Signed-off-by: Wen Zhou <[email protected]>

* fix: testscases

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* update: remove logic set default value in odhdashboardconfig in upgrade (#2469)

Signed-off-by: Wen Zhou <[email protected]>

* feat: optimize E2E test framework performance and error handling (#2418)

* feat: add comprehensive E2E testing framework with improved resource management

- Add DeleteResources function for efficient bulk resource deletion with options
- Implement EnsureResourceDeletedThenRecreated with UID-based verification
- Add WithRemoveFinalizersOnDelete option for automatic finalizer handling
- Support WithWaitForRecreation flag for controller-managed resources
- Replace ExpectedErr with AcceptableErrMatcher using proper Kubernetes error codes
- Fix tryRemoveFinalizers to handle success, NotFound, and NoKindMatchError gracefully
- Add controllerCacheRefreshDelay to prevent cache staleness issues
- Consolidate cleanup functions and remove redundant code patterns
- Add ResourceQuota GVK definition and improve bulk deletion documentation

This comprehensive framework eliminates race conditions in controller-managed
resource testing and provides consistent, reliable patterns for deletion-recreation
scenarios with enhanced safety and performance optimizations.

* feat: enhance core component E2E tests and ServiceMesh organization

- Add comprehensive component testing with enhanced determinism
- Improve test organization and maintainability across core test suites
- Move resilience tests after functional tests for cleaner failure attribution
- Fix KnativeServing test synchronization with deletion/recreation flags
- Add proper scoping to cleanupListResources with safety checks
- Optimize ServiceMesh test organization with consolidated validation patterns
- Extract common validation helpers following "with" pattern conventions
- Consolidate repetitive operator configuration patterns using shared constants
- Add reusable helpers for monitoring resource cleanup and DSCI validation
- Move shared operator namespace constants to centralized helper file
- Rename validation functions for consistency (buildCapabilityConditions → withCapabilityConditions)
- Reorganize helper functions placement for better code structure
- Optimize E2E performance with better resource management
- Apply framework patterns across component test suites
- Refactor UninstallOperator to use ResourceOpts pattern for better consistency
- Replace hard sleeps with WithWaitForDeletion in ServiceMesh operator uninstallation
- Remove custom uninstallOperatorWithChannel function in favor of standardized approach
- Eliminate duplicate validation patterns with DRY principle application
- Refactor Gateway tests and centralize GVKs

This enhances the core testing infrastructure with improved coverage,
resilience patterns, ServiceMesh test organization, and maintainable
validation helpers that reduce code duplication while following
established framework conventions.

# Conflicts:
#	tests/e2e/servicemesh_test.go

# Conflicts:
#	tests/e2e/creation_test.go
#	tests/e2e/dashboard_test.go
#	tests/e2e/monitoring_test.go

* Removing the ServiceMesh controller and some ServiceMesh/Authorino e2e tests, as they will no longer be present in 3.0.

* Commenting out gateway test cases for now, as the feature is still in the early stages of implementation.

* feat: add support for llmisvc to use kueue (#2293)

* feat: add support for llmisvc for kueue

- rename old kueue one for isvc so we can have both isvc and llmisvc

Signed-off-by: Wen Zhou <[email protected]>

* fix: code review

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* feat: add GH Action to enforce e2e test suite update for code-related PRs (#2426)

fix: split workflow into two to allow the bot to comment on fork PRs

reuse existing error comments and add automatic error comment cleanup, to prevent PR noise

add option to provide skip justification in PR description

re-use same error comment for both error cases

remove functionality to add justification via PR comment, to simplify the workflow and improve user experience

* addressed comments in PR 2404 (#2444)

* Use custom manifests without devFlags (#2427)

* Use custom manifests without devFlags

Jira: https://issues.redhat.com/browse/RHOAIENG-31642

This gives instructions for an alternative approach to using custom
manifests with the operator, which don't require the use of "devFlags"
in the DSC.

The intention is that this will allow us to remove the devFlags
functionality for 3.0.

This is a primitive approach for now. I expect that folks will start
to build better development patterns rather than needing to modify the
patch file for their component, etc. They can even have their own
approaches in their own repositories, and this may just be an example.

* Don't re-chown files in Dockerfile

They're already chowned as expected from the COPY instruction on L60.

* feat: add comprehensive debug utilities for E2E test failures and timeouts (#2486)

* feat: add comprehensive debug utilities for E2E test failures and timeouts

- Add debug_utils_test.go with global panic and failure handlers
- Integrate HandleGlobalPanic() and HandleTestFailure() at multiple test levels
- SetGlobalDebugClient() automatically configured in NewTestContext()
- Provides detailed diagnostics on ANY test failure (panics, timeouts, assertions):
  * Cluster state and node health with resource allocation percentages
  * Namespace resources (deployments, pods, container states)
  * Operator deployment status and pod logs
  * DSCI/DSC status and unhealthy conditions
  * Recent events from key namespaces (last 5 minutes)
  * Resource quota violations

Addresses debugging challenges for deployment recreation failures,
resource exhaustion, and other cluster-level issues in E2E tests.
Activates automatically without requiring manual intervention.

* Added security redaction for secrets, tokens, and credentials in logs

* fix: limit e2e update requirement check GH action only to the main branch (#2497)

chore: improve documentation and error messages

* update: temp. disable e2e test on dashboard (#2499)

- this should be reverted once dashboard is fixed

Signed-off-by: Wen Zhou <[email protected]>

* removing the py311 images (#2484)

Co-authored-by: Wen Zhou <[email protected]>

* RHOAIENG-29717 | feat: Deploy thanos querier frontend for metrics acces (#2448)

Co-authored-by: den-rgb <[email protected]>

* chore: e2e update requirement GH action: edit regex for justification section title, ensure guidelines are consistent (#2509)

* fix: prevent secret data leakage in operator error logs (#2471)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* test: increase resource check timeout, fix debug (#2506)

* Remove codeflare operator, monitoring and tests (#2468)

* feat(RHOAIENG-34034): remove codeflare operator, monitoring and tests

* test(RHOAIENG-34034): add e2e test to verify codeflare component is not removed

* fix(RHOAIENG-34034): improve test and add test flag in README

* test(RHOAIENG-34034): use consistently check

* test: improve test readability and reusability

* fix: watch on Dashboard's AProfile/HWProfile by DSCI (#2512)

- add a watch on these two CRDs on create/update, so the VAP/VAPB get
  created
- this is needed for the dashboard test case once it is enabled and
  still ships these two CRDs
- we can probably remove this logic along with the dashboard tests on
  VAP/VAPB once the dashboard cleans up the CRDs from its manifests

Signed-off-by: Wen Zhou <[email protected]>

* chore: log on webhook for notebook with HWProfile (#2491)

- different for not able to find HWProfile VS client Get() failed
- update e2e tests to add new error code

Signed-off-by: Wen Zhou <[email protected]>

* feat:  auto detect auth, kube auth proxy, envoy filter (#2490)

* fix gateway controller namespace handling, add status sync, and improve test patterns

* feat: detect auth, kubeauthproxy, envoyfilter

* added better support for oidc, fixed kubeproxy/envoyfilter issue

* used apply instead of addresources for secret

* fixed cookie mismatch for oauth/oidc

* Initial effort at migrating hardwareprofiles from v1alpha1 to v1 (#2398)

Rebased, included v1alpha1 hardwareprofile changes



added deprecated markers to v1alpha1 hardwareprofiles



Rebased on latest, incremental Makefile version bump, regenerated bundle



v1alpha1 HardwareProfiles must be registered to scheme



minor comment update to reflect infrav1 hardwareprofiles



adding hardwareprofile test cases



# Conflicts:
#	bundle/manifests/opendatahub-operator.clusterserviceversion.yaml

Signed-off-by: Max Whittingham <[email protected]>

* chore: remove unused pkg/feature and related integration tests (#2521)

* chore: remove unused pkg/feature and related integration tests

* chore: remove unused unit-tests1.yaml workflow

* collector validation should check metrics resources or storage (#2515)

* refactor: remove duplicated reconciler crd existence check (#2483)

* chore: uplift version + fix bundle + fmt + fix monitoring tests (#2526)

Signed-off-by: Ugo Giordano <[email protected]>

* update: add cleanup when --fail-fast is called we still get leftover cleanup (#2478)

Signed-off-by: Wen Zhou <[email protected]>

* implemented destination rule for TLS to upstream services (#2536)

* feat: add default HWProfile CR "default-profile" (#2461)

* feat: add default HWProfile CR "default-profile"

- remove the sample one we added before; with this PR it will create one in the cluster
- the user will be able to change this default-profile CR; the operator won't reconcile it
- if the user changes the DSCI or the DSCI gets reconciled, it should not reset the
  default-profile CR's user-modified values.
- if the user deletes this default-profile CR, the operator will create a new one with default values
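The ownership rules above boil down to "create if missing, never overwrite": the operator only acts when the CR is absent. A sketch under that assumption (field names are illustrative):

```python
DEFAULT_PROFILE = {"metadata": {"name": "default-profile"},
                   "spec": {"identifiers": []}}

def ensure_default_profile(existing):
    """Create the default-profile CR only when it is absent; later
    reconciles must not reset user modifications."""
    if existing is None:              # deleted by the user, or first run
        return dict(DEFAULT_PROFILE)  # (re)create with default values
    return existing                   # user-owned content stays untouched
```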

Signed-off-by: Wen Zhou <[email protected]>

* update: code review

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* feat: switch isvc connection annotation from type to protocol (#2532)

* feat: switch isvc connection annotation from type to protocol

* fix: addressed PR comments

* Update unit-tests2.yaml to codecov-actions v5.5.1 (#1) (#2542)

Update unit-tests2.yaml to codecov-actions v5.5.1

* fix: version inconsistency (#2544)

Signed-off-by: Wen Zhou <[email protected]>

* add support for overriding kserve-llm-d (#2537)

* feat: add oauth proxy parametrization to dashboard (#2547)

* feat: add support for llmisvc to use HWProfile (#2424)

* feat: add support for llmisvc to use HWProfile

- if HWProfile has .spec.identifiers, add them into .spec.template.containers (llmisvc)
- if HWProfile has .spec.scheduling
  - if it is type: kueue, set label "kueue.x-k8s.io/queue-name" with the value of .spec.scheduling.kueue.localqueue
  - if it is type: node and has .spec.scheduling.node
      add .spec.scheduling.node.nodeSelector into .spec.template.nodeSelector (llmisvc)
      add .spec.scheduling.node.tolerations into .spec.template.tolerations (llmisvc)
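The mapping rules above can be sketched as a single transformation function. The field paths follow the commit message; the dict-based types, the default "main" container name, and the exact label placement on the template are assumptions for illustration, not the operator's actual implementation:

```python
import copy

def apply_hwprofile(llmisvc: dict, hwp: dict) -> dict:
    """Map a HardwareProfile spec onto an LLMInferenceService spec."""
    out = copy.deepcopy(llmisvc)
    tmpl = out.setdefault("spec", {}).setdefault("template", {})
    if hwp.get("identifiers"):
        # ensure at least one container exists, then inject the identifiers
        containers = tmpl.setdefault("containers", [{"name": "main"}])
        containers[0]["identifiers"] = hwp["identifiers"]
    sched = hwp.get("scheduling", {})
    if sched.get("type") == "kueue":
        tmpl.setdefault("labels", {})["kueue.x-k8s.io/queue-name"] = \
            sched["kueue"]["localqueue"]
    elif sched.get("type") == "node" and "node" in sched:
        for field in ("nodeSelector", "tolerations"):
            if field in sched["node"]:
                tmpl[field] = sched["node"][field]
    return out
```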

Signed-off-by: Wen Zhou <[email protected]>

* fix: add support for llmisvc with min config

- if the user only sets spec: {} when creating the llmisvc, we should:
  1. create a container named as "main"
  2. inject identifier from HWProfile

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* Support override for runtime images (#2464)

* support override for runtime images

* add override for vllm-spyre image

* fix: prevent cache false hits for resources stuck in deletion (#2527)

* fix: bypass cache for objects with deletionTimestamp to prevent stuck resources

- Add cache bypass logic for objects with deletionTimestamp in deploy action
- Implement proactive cache cleanup in ShouldSkip method
- Add Delete method to Cache for cleaning up stale entries
- Extract cache logic into isCachedAndShouldSkip helper method
- Add comprehensive unit tests for cache cleanup verification
- Apply line-of-sight principle for better code readability

Fixes cache bug where resources.Hash() removes deletionTimestamp causing
identical hashes for objects with/without deletion timestamps. This led
to false cache hits preventing recreation of any Kubernetes resource type
(deployments, services, configmaps, secrets, CRDs, etc.) during E2E tests.

Root cause: pkg/controller/predicates/resources/resources.go:295
Impact: E2E test failures when resources stuck in deletion while still in cache
Solution: Objects with deletionTimestamp always bypass cache and trigger cleanup
Scope: Affects all Kubernetes resources deployed via deploy action
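The root cause described above is easy to reproduce in miniature: if the hash normalizes away deletionTimestamp, a terminating object collides with its live twin in the cache. A sketch of the bug and the bypass-plus-cleanup fix; the hashing scheme and dict-based cache are illustrative, not the actual predicate code:

```python
import hashlib
import json

def manifest_hash(obj: dict) -> str:
    # Mimics the buggy normalization: deletionTimestamp is stripped
    # before hashing, so terminating and live objects hash identically.
    norm = {k: v for k, v in obj.items() if k != "deletionTimestamp"}
    return hashlib.sha256(json.dumps(norm, sort_keys=True).encode()).hexdigest()

def should_skip(obj: dict, cache: dict) -> bool:
    """Fixed skip check: objects being deleted always bypass the cache
    and proactively evict their stale entry so recreation is not blocked."""
    key = manifest_hash(obj)
    if obj.get("deletionTimestamp"):
        cache.pop(key, None)   # cleanup stale entry
        return False
    return key in cache
```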

* refactor: simplify cache test suite and improve terminating object handling

- Skip deployment for terminating objects instead of attempting and failing
- Refactor ShouldSkip to handle terminating objects before cache logic
- Rename ProcessCacheEntry to CleanupIfTerminating for clarity
- Make Cache.Delete method idempotent
- Convert TestDeployWithCacheAction to clean table-driven approach
- Remove unused test helper functions (saved ~150+ lines)
- Simplify test structure with direct function calls in table
- Apply micro-optimization to avoid DeepCopy on skip path

All tests pass with improved maintainability and performance.

* Update action_deploy_cache.go

Removed redundant check when deleting a key from the cache

* fix(RHOAIENG-34533): resolve TrustyAI DSC validation error when patching DSC eval flags (#2525)

* fix(RHOAIENG-34533): resolve TrustyAI DSC validation error when patching DSC eval flags

This change resolves RHOAIENG-34533 by converting TrustyAI evaluation configuration
fields from boolean to enum string values.

Changes:
- Convert PermitCodeExecution and PermitOnline from bool to string with enum validation (allow/deny)
- Add proper default handling and constants for evaluation permissions
- Update CRDs, bundle manifests, and test expectations
- Maintain backward compatibility in ConfigMap generation
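The bool-to-enum conversion above can be sketched as a small normalization helper. The function name and the deny default are assumptions for illustration (the real fields are PermitCodeExecution and PermitOnline with allow/deny enum validation):

```python
ALLOW, DENY = "allow", "deny"

def normalize_permission(value, default=DENY):
    """Map a legacy boolean eval flag to the new allow/deny enum,
    passing through already-converted values and applying a default
    (assumed deny here) when unset."""
    if isinstance(value, bool):          # legacy boolean field
        return ALLOW if value else DENY
    if value in (ALLOW, DENY):           # already the new enum
        return value
    if value is None:                    # unset: fall back to the default
        return default
    raise ValueError(f"invalid permission value: {value!r}")
```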

* add updated API

* Simplify CM value assignment, revert image name to v3.0.0

* test: revert "temp. disable e2e test on dashboard" from PR2507 and fix VAP/VAPB e2e tests (#2561)

* Revert "update: temp. disable e2e test on dashboard (#2499)"

This reverts commit 9e847d546cb75c2bb14bd32ef54df0f1fc1c5894.

* fix: VAP/VAPB test case

- move from DSCI to Dashboard

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* feat: add oauth proxy parametrization to dashboard (#2554)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* RHOAIENG-34055: add ray sanity check for codeflare removal (#2514)

* feat(RHOAIENG-34055): add ray sanity check for codeflare removal

* feat(RHOAIENG-34055): create a common sanitycheck action

* feat: add GHA to auto build and push e2e tests image after each merge to main (#2543)

* checkin (#2555)

* remove useless kueue-batch-user-rolebinding (#2571)

Removed the config/kueue-configs/ directory and all associated references
as the kueue-batch-user-rolebinding functionality is no longer needed.

* OCP console link to odh should match gateway address (#2569)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>
Co-authored-by: James Tanner <[email protected]>

* fix: add llmisvc kueue test case (#2573)

* feat: add support for Connection API in LLMInferenceservice

- add create/update/removal case
- create $llmisvc-sa serviceaccount and inject it into .spec.template.serviceAccountName
- secret should support both cases: data.URI and data.http-hosts
  according to kserve, for both isvc and llmisvc
- simplify logging by removing the duplicated error; the caller prints the error
- only remove the connection-added imagePullSecret from the list when the
  annotation is removed
- add support for changing only the connection-path value; .spec.model.uri
  should be updated too
- when updating: if the path is manually set in the spec, use it; if the
  annotation is set, use the annotation; if neither is set and a path
  already exists (maybe created by GitOps or the dashboard), keep it
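The path-resolution precedence in the last bullet can be captured in a few lines; the function name and argument shapes are illustrative, not the webhook's actual signature:

```python
def resolve_path(spec_path, annotation_path, existing_path):
    """Precedence: explicit spec value wins, then the connection
    annotation, then any path already present on the object
    (e.g. one set by GitOps or the dashboard)."""
    if spec_path:
        return spec_path
    if annotation_path:
        return annotation_path
    return existing_path
```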

Signed-off-by: Wen Zhou <[email protected]>

* update: code review

- fix typo
- update handleSA() to deal with the service name more precisely
- clean up funcs and consts that are no longer in use

Signed-off-by: Wen Zhou <[email protected]>

* update: code review

- remove duplicated function for serviceaccount
- add check on both annotations

Signed-off-by: Wen Zhou <[email protected]>

* update: code review

- serviceAccountName should only be removed if it has the same value
  derived from the secret's name
- serviceAccountName should only be injected if it did not already have
  a value
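The two guard rules above (never overwrite a user-set account, never strip one the operator did not derive) can be sketched as a pair of helpers. The secret-derived naming convention shown here is an assumption for illustration:

```python
def sa_name_for(secret_name: str) -> str:
    # Assumed naming convention: derive the ServiceAccount name
    # from the connection secret's name.
    return f"{secret_name}-sa"

def inject_sa(template: dict, secret_name: str) -> dict:
    """Only set serviceAccountName if the user has not set one already."""
    template.setdefault("serviceAccountName", sa_name_for(secret_name))
    return template

def remove_sa(template: dict, secret_name: str) -> dict:
    """Only remove serviceAccountName if it matches the derived value;
    never strip a user-provided account."""
    if template.get("serviceAccountName") == sa_name_for(secret_name):
        del template["serviceAccountName"]
    return template
```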

Signed-off-by: Wen Zhou <[email protected]>

* RHOAIENG-31870: Add DSC and DSCI v2 versions. (#2505)

RHOAIENG-35095: Webhook dsc, dsci and kueue integration_test.go unit-tests disabled
Added hack/buildLocal.sh : a script to build locally and deploy to crc

* cleanup unused referencegrant gvk code (#2576)

* Add clusterrole for allowedgroups (#2494)

* add clusterrole for allowedgroups

fix unit test

remove unused role for allowed group

* fix comment and error message

* Bug fix for gateway OIDC mode (#2591)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>
Co-authored-by: James Tanner <[email protected]>

* Removed monitoring namespace from flags, and documented setting it via env vars. Viper now sets the default vars for the monitoring namespace instead. (#2538)

Signed-off-by: Max Whittingham <[email protected]>

* Update golangci-lint to v2.5.0 (#2582)

* chore: fix api version (#2594)

Signed-off-by: Wen Zhou <[email protected]>

* fix: retrieve all files in a PR (#2595)

* feat(modelregistry): add benchmark data image environment variable mapping (#2498)

Add IMAGES_BENCHMARK_DATA environment variable mapping to support
benchmark data injection in model registry deployments. This enables
the model-registry-operator to configure benchmark data images via
init containers for benchmark visualization and analysis features.

Related to: https://github.com/opendatahub-io/model-registry-operator/pull/324

Follows the same pattern established in previous PRs:
- https://github.com/opendatahub-io/opendatahub-operator/pull/2429
- https://github.com/opendatahub-io/opendatahub-operator/pull/2452

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Chris Hambridge <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: Wen Zhou <[email protected]>

* RHOAIENG-33891: DSC v1 should not be created or updated with Kueue component state set to Managed (#2602)

* Add kube-linter to checks Kubernetes manifests against various best practices, with a focus on production readiness and security (#2605)

See:
- https://docs.kubelinter.io/
- https://github.com/stackrox/kube-linter

* Add webhook to prevent creation of dashboard's HardwareProfile and AcceleratorProfile (#2599)

* chore: use generic Getter[T] instead of StringGetter (#2608)

Update StringGetter to use Go generics, allowing the type to work with
any return type instead of being limited to strings.

* chore: Update GitHub workflows and issue labels and fix bug (#2607)

- Add v3 label to issue_label_bot.yaml
- Rename workflow yaml: test for e2e, release for ODH release,
- Remove 'incubation' branch from linter.yaml triggers
- Update email
- Update API docs config to exclude all List types

Signed-off-by: Wen Zhou <[email protected]>

* RHOAIENG-34045: cleanup codeflare manifests, remove codeflare from DSC v2 (#2596)

* feat(RHOAIENG-34045): cleanup codeflare manifests, remove codeflare from DSC v2

* feat(RHOAIENG-34045): remove CodeFlare from api docs and remove finalizers update permissions

* update: add new members (#2612)

Signed-off-by: Wen Zhou <[email protected]>

* feat(e2e): improve test framework resilience and fix monitoring controller cleanup (#2550)

* feat(e2e): add explicit resource lifecycle methods and improve test framework

- Add EventuallyResourceCreated, EventuallyResourceUpdated, EventuallyResourcePatched
- Add Patch function in testf for explicit patch-only behavior
- Replace CreateOrUpdate with CreateOrPatch for better concurrency
- Fix resourceVersion conflicts and test timing issues
- Add comprehensive testf test coverage and helper functions
- Modernize kueue_test.go with jq.Match validation
- Reduce test code duplication and improve error messages

* refactor(monitoring): remove redundant management state checks and unify actions

- Remove isMonitoringManaged() function and all management state checks from actions
- Remove MonitoringNotManagedMessage constant (no longer needed)
- Combine MonitoringStack and ThanosQuerier into single deployMonitoringStackWithQuerier action
- Combine Tempo and Instrumentation into single deployTracingStack action
- Extract validateRequiredCRDs helper to reduce CRD validation duplication
- Add CRDRequirement struct for consistent condition handling
- Rename setCRDNotFoundCondition to setConditionFalse for clarity

The monitoring actions now focus purely on configuration validation and CRD availability checks. The optimization also reduces the monitoring controller from 5 deployment actions to 3
and eliminates condition duplication.

Breaking changes:
- Remove deployThanosQuerier, deployTempo, deployInstrumentation individual actions
- Add deployMonitoringStackWithQuerier and deployTracingStack unified actions

* test: temporarily disable RBAC and ServiceAccount deletion recovery tests

These tests are experiencing timing issues with external dependencies
and token refresh cycles. Temporarily disabling to unblock CI while
we investigate proper solutions for testing these resource types.

TODO: Re-enable after investigating timing issues.

* chore: remove placeholder CRD and skip kubebuilder for dashboard actions (#2609)

Remove empty CRD template file and add kubebuilder:skipmake directive to
dashboard_controller_actions.go to prevent unintended CRD generation.

* Use commit sha in manifest file (#2540)

* feat(RHOAIENG-34447): add support to commit sha component target and add workflow to update it

* feat: upgrade commit sha

* chore(kube-lint): add CEL-based linter to prevent system group bindings in ClusterRoleBinding (#2617)

Add custom kube-linter check to detect ClusterRoleBindings that target
system groups (e.g., system:authenticated, system:unauthenticated).

* feat: update IMAGES_BENCHMARK_DATA to new image (#2622)

Signed-off-by: Alessio Pragliola <[email protected]>

* add e2e tests for reconciliation resilience (#2364)

* Remove ModelMesh component and infrastructure from OpenDataHub Operator (#2565)

* feat(RHOAIENG-34026): remove modelmeshserving manifest download and component registration

- Remove modelmeshserving from get_all_manifests.sh
- Remove modelmeshserving import from cmd/main.go

# Conflicts:
#	get_all_manifests.sh

* feat(RHOAIENG-34026): comprehensive ModelMesh removal for RHOAI 3.0

This commit removes ModelMesh functionality while preserving v1 API compatibility:

Core Functionality Removal:
- Remove ModelMeshServing component controller and implementation
- Remove ModelMeshServing monitoring configurations
- Update ModelController to ignore ModelMeshServing in DSC v1 (always treats as Removed)
- Update ModelController tests to reflect new RHOAI 3.0 behavior

Cleanup and Infrastructure:
- Remove ModelMeshServing RBAC editor/viewer roles
- Remove ModelMeshServing E2E tests
- Remove ModelMesh Prometheus unit tests
- Remove ModelMeshServing from sample DSC configuration
- Remove ModelMesh from component integration documentation
- Remove KServe ModelMesh CRD conflict handling logic
- Remove unused imports and monitoring embeds

API Strategy:
- Keep v1 DSC API intact with ModelMeshServing field for upgrade compatibility
- ModelController ignores ModelMeshServing settings (always Removed in RHOAI 3.0)
- Remove actual component functionality while maintaining API backward compatibility

Testing Updates:
- Update ModelController tests to reflect new behavior
- Tests now validate that ModelMeshServing no longer enables ModelController
- All tests pass with new RHOAI 3.0 logic

Build verification:
- All code compiles successfully
- Manifest generation works correctly
- ModelController unit tests pass with new behavior

# Conflicts:
#	internal/controller/components/modelcontroller/modelcontroller.go
#	internal/controller/components/modelmeshserving/modelmeshserving.go
#	internal/controller/components/modelmeshserving/modelmeshserving_test.go

* Remove ModelMeshServing component and API references

This commit removes ModelMeshServing component from the opendatahub-operator
codebase as part of RHOAI 3.0 migration to KServe-only model serving.

Changes include:

## API and Controller Changes:
- Keep ModelMeshServing API types for v1 DSC backward compatibility
- Set ModelMeshServing ManagementState to 'Removed' in ModelController
- Remove ModelMeshServing logic from modelcontroller_actions.go
- Add deprecation markers to ModelMeshServing types

## E2E Test Updates:
- Remove ModelMeshServing test cases from controller_test.go
- Update modelcontroller_test.go to focus on KServe functionality
- Remove ModelMeshServing from resilience_test.go and helper_test.go
- Add partial upgrade test for ModelMeshServing resource preservation

## Monitoring and Configuration Cleanup:
- Remove ModelMeshServing Prometheus scrape jobs and SLO rules
- Remove ModelMeshServing alerting rules from prometheus-configs.yaml
- Remove ModelMeshServing from monitoring controller mappings

## Sample and Documentation Updates:
- Remove modelmeshserving from DSC sample configurations
- Fix incorrect app.kubernetes.io/managed-by labels in samples (RHOAIENG-32728)
- Remove ModelMeshServing references from README.md and DESIGN.md
- Remove ModelMeshServing component from upgrade logic

## Bundle and Manifest Updates:
- Update CSV to remove ModelMeshServing CRD references
- Add ModelMeshServing CRD to bundle for compatibility

Addresses: RHOAIENG-34026, RHOAIENG-32728
Partial: RHOAIENG-34036 (upgrade testing)

# Conflicts:
#	docs/api-overview.md

* feat: Remove ModelMeshServing component from RHOAI 3.0

- Remove ModelMeshServing CRD generation from kustomization.yaml
- Delete ModelMeshServing CRD file (components.platform.opendatahub.io_modelmeshservings.yaml)
- Clean up bundle manifests: remove ModelMeshServing references from CSV and CRDs
- Update API documentation config to exclude ModelMeshServing types
- Remove ModelMeshServing from v2 DataScienceCluster sample
- Clean up catalog.yaml: remove ModelMeshServing GVK, CRD description, and keywords
- Preserve v1 DataScienceCluster compatibility for smooth upgrades
- Maintain RBAC permissions and e2e tests for upgrade safety
- Update TrustyAI controller to remove obsolete model-mesh label check
- Simplify ModelController types and remove ModelMeshServing dependencies
- Update conversion logic between v1 and v2 DataScienceCluster APIs
- Clean up documentation and design files

* Update the DSPO commit to leverage kube-rbac-proxy (#2636)

See:
https://github.com/opendatahub-io/data-science-pipelines-operator/pull/920
https://issues.redhat.com/browse/RHOAIENG-34577

Signed-off-by: mprahl <[email protected]>

* chore: update manifest commit SHAs (#2637)

Co-authored-by: zdtsw <[email protected]>

* fix: resolve RBAC errors and improve resilience test reliability (#2643)

- Fix pod deletion RBAC issue by using individual DeleteResource instead of bulk deletion
- Restore ModelRegistry component to quota tests after removing stuck finalizer
- Optimize jq expressions to handle deployment name patterns and readyReplicas checks
- Add pod restart after RBAC restoration for faster test recovery

Resolves "server does not allow this method" error and makes tests more reliable.

* RHOAIENG-29717 | fix: Adding missing test case (#2510)

Co-authored-by: den-rgb <[email protected]>

* Add a RELATED_IMAGE hook for kube-auth-proxy (#2632)

Co-authored-by: jctanner <[email protected]>

* removing wrong labels from samples in DSCI/DSC v2 (#2650)

* RHOAIENG-33158 | feat: enhance hardware profile management by adding custom-serving profile (#2578)

* RHOAIENG-33158 | feat: enhance hardware profile management by adding custom-serving profile

- Introduced a new HardwareProfile named 'custom-serving' in the YAML configuration.
- Updated the DSCInitializationReconciler to manage both default and custom hardware profiles.
- Modified the logic to check for the existence of the custom hardware profile during reconciliation.

* chore: Fixed calling renamed function by referring to the new name

* chore: reworked default and custom hardwareprofile CRs deployment

- split the yamls into separate files to reduce confusion and ease identification
- reworked the logic to deploy only after checking both the custom and default hwp errors.

---------

Co-authored-by: Wen Zhou <[email protected]>

* feat: update workflow to point to correct branch (#2652)

* chore: update manifest commit SHAs (#2646)

Co-authored-by: openshift-merge-bot <[email protected]>

* RHOAIENG-30940: Remove devFlags support (#2588)

* feat(RHOAIENG-30940): remove devFlags support

* fix(RHOAIENG-34045): small typo

* fix: lint

* test: add e2e test on HWProfile v1alpha1 and v1 (#2635)

* test: add e2e test on HWProfile v1alpha1 and v1

- create CR on v1, test it can be read by v1 and v1alpha1
- create CR on v1alpha, test it can be read by v1 and v1alpha1
* update: prolong timeout
* update: move HWProfile test to a dedicated suite

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* RHOAIENG-33158 | feat: Implement migration of HardwareProfiles from AcceleratorProfiles and container sizes (#2529)

* RHOAIENG-33158 | feat: Implement migration of HardwareProfiles from AcceleratorProfiles and container sizes

- Added functions to migrate AcceleratorProfiles to HardwareProfiles, creating separate profiles for notebooks and serving.
- Implemented migration for container sizes, generating HardwareProfiles based on specified resource limits.
- Created a special HardwareProfile for InferenceServices without associated AcceleratorProfiles or container sizes.

* RHOAIENG-33158 | chore: addressed PR comments

- Removed custom-serving HWP out of upgrade.go
- now trying to load odhDashboardConfig from manifests if it's not available in the cluster
- other refactors and logs

* chore: Renamed funcs to better describe the functionality

- Adjusted comments to explain the code better.

* RHOAIENG-33158 | chore: addressed PR comments

- making sure hwp names are all lowercase and without spaces
- changed string literal references to string constants
- returning early if the application namespace is empty
- migration will proceed only if the old major version is 2 and the current major version is 3
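The last two rules above (name sanitization and the version gate) can be sketched in a few lines; the regex-based sanitizer and helper names are assumptions for illustration, not the actual upgrade code:

```python
import re

def hwp_name(source_name: str) -> str:
    """Make a DNS-safe HardwareProfile name: lowercase, with spaces and
    other disallowed character runs collapsed to hyphens."""
    return re.sub(r"[^a-z0-9.-]+", "-", source_name.lower()).strip("-")

def should_migrate(old_version: str, new_version: str) -> bool:
    """Run the migration only on a 2.x -> 3.x upgrade."""
    return old_version.split(".")[0] == "2" and new_version.split(".")[0] == "3"
```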

* fix: correct GatewayConfig validation error handling for OIDC configuration (#2654)

* chore: update manifest commit SHAs (#2655)

Co-authored-by: openshift-merge-bot <[email protected]>

* chore: remove unused const (#2661)

- rename GatewayKind to GatewayConfigKind

Signed-off-by: Wen Zhou <[email protected]>

* chore: fix previous HWProfile test (#2659)

- revert timeout
- update docs

Signed-off-by: Wen Zhou <[email protected]>

* CLI to retry flaky test in job (#2611)

* feat(RHOAIENG-34877): add test cli to retry e2e tests during job execution to avoid retest

* fix: improve retry skip filter and add github actions to run tests on CLI

* (feat): Rename DatasciencePipelines to AIPipelines (#2589)

* (feat): Rename DatasciencePipelines to AIPipelines

Signed-off-by: Ajay Jaganathan <[email protected]>

# Conflicts:
#	api/datasciencecluster/v1/datasciencecluster_conversion.go
#	api/datasciencecluster/v2/datasciencecluster_types.go
#	api/datasciencecluster/v2/zz_generated.deepcopy.go
#	bundle/manifests/datasciencecluster.opendatahub.io_datascienceclusters.yaml
#	bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
#	config/crd/bases/datasciencecluster.opendatahub.io_datascienceclusters.yaml
#	docs/api-overview.md
#	pkg/upgrade/upgrade.go
#	tests/e2e/datasciencepipelines_test.go

* add logic for converting installed components

Signed-off-by: Ajay Jaganathan <[email protected]>

---------

Signed-off-by: Ajay Jaganathan <[email protected]>

* Fix session issue with kube-auth-proxy (#2625)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* update: remove DSCI and DSC v1 sample from CSV (#2667)

Signed-off-by: Wen Zhou <[email protected]>

* RHOAIENG-33892: Kueue controller should report an error in case a DSC has Managed state set in Kueue component. (#2645)

RHOAIENG-33894: Remove all the remaining logic/tests to handle Managed Kueue component state.

* Change api group for hardware profiles in admingroup role (#2631)

* Change api group for hardware profiles in admingroup role

* Apply suggestion from @zdtsw

* Update internal/controller/services/auth/resources/admingroup-role.tmpl.yaml

---------

Co-authored-by: Wen Zhou <[email protected]>

* Enable CLI access via OpenShift OAuth bearer tokens (#2666)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>
Co-authored-by: James Tanner <[email protected]>

* Fix multiple run of upgrade tests (#2665)

* test: fix multiple run of upgrade tests

* test: fix v2tov3upgrade tests and test cli

* test: add test on duplicated test names in cli

* feat: add gen-ai image to the dashboard (#2673)

* fixed bug where dup bearer token was added via cli access (#2684)

* Remove Serverless Mode, Service Mesh, and Authorino Infrastructure from OpenDataHub Operator (#2560)

* remove KServe Serverless mode, removed ServiceMesh, Serverless and Authorino infra

add v2tov3 upgrade test

remove ServiceMesh spec from DSCI v2 api

remove servicemesh manifests

remove servicemesh spec from dsci v1 sample, regenerate bundle

disable webhook test

retain only minimal servicemesh api, update docs generation ignore rules

fix NoMatchError checking

remove redundant istio/serverless predicates and status condition variables

removed OSSMv2 pre-condition check from kserve controller

remove serving spec and defaultDeploymentMode field from KServe spec, re-generate manifests and bundle

remove leftover codeflare mention in CSV manifest

* Clean up types, status and doc strings

Signed-off-by: Christopher Sams <[email protected]>

* re-enable webhook test, re-generate manifests and bundle

* add ServiceMesh to removedCRDToCreate list, fix linter issues

---------

Signed-off-by: Christopher Sams <[email protected]>
Co-authored-by: Christopher Sams <[email protected]>
Co-authored-by: Ugo Giordano <[email protected]>

* chore: update manifest commit SHAs (#2668)

Co-authored-by: openshift-merge-bot <[email protected]>

* set the default mode for kube-auth-proxy (#2677)

* fix: optimize failed test rerun (#2686)

* RHOAIENG-34261 | Enable structured YAML for metrics exporters and fix UX inconsistency between metrics and traces (#2487)

Co-authored-by: Dayakar Maruboena <[email protected]>

* feat(RHOAIENG-35049): add support to pass RAGAS KFP image for downstream (#2566)

Co-authored-by: Wen Zhou <[email protected]>

* fix: remove test on old dashboard AProfile and HWProfile (#2696)

Signed-off-by: Wen Zhou <[email protected]>

* RHOAIENG-33159 | Replace accelerator and container size annotations to hardwareprofile annotations (#2675)

* RHOAIENG-33159 | Replace accelerator and container size annotations to hardwareprofile annotations

* Update:

- remove custom-serving from config, dsci should only create default one
- dynamically create custom-serving HWProfile CR in upgrade.go
- move support functions into upgrade_utils.go
- remove duplicated calls in the loops
- simplify hwprofile name concat
- remove unused gvk
- change logging info.
- remove old test on custom-serving

Signed-off-by: Wen Zhou <[email protected]>

* fix: lint

Signed-off-by: Wen Zhou <[email protected]>

* update: code review change

- fix missing permission on notebooks
- fix namespace lookup for all namespaces
- fix missing annotation for hwprofile namespace set

Signed-off-by: Wen Zhou <[email protected]>

* update: reduce unnecessary calls to get annotations

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>
Co-authored-by: Wen Zhou <[email protected]>

* added kube auth proxy metrics address (#2693)

* fix: annotation for custom-serving HWProfile

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit 719aa32ce6d0fe24b62aa72aa8e2888e02e07c13)

* update notebooks manifest hash for gateway support (#2705)

* fix: Correct RAGAS KDP (#2702)

Signed-off-by: Rui Vieira <[email protected]>

* Make cookie expiration and refresh interval configurable (#2706)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* chore: update docs + api comments (#2712)

Signed-off-by: Wen Zhou <[email protected]>

* test: add default kueue configuration test (#2704)

* fix: fixed quotes for HWP visibility annotation (#2717)

* added validation for clientid/issuerurl (#2708)

* Remove installedComponents field from DataScienceCluster v2 API (#2719)

- Remove InstalledComponents field from v2 DataScienceClusterStatus
- Keep InstalledComponents field in v1 for backward compatibility
- Update conversion logic to construct v1 InstalledComponents from v2 component ManagementState
- Add constructInstalledComponentsFromV2Status() function to build InstalledComponents map
- Add getV1ComponentName() helper for component name mapping between versions
- Update conversion tests with new construction logic
- Fix integration tests to properly test v2→v1 conversion behavior
- Remove InstalledComponents assignments from component handlers and templates

The InstalledComponents field in v2 was redundant since component status
already tracks management state. This change simplifies the v2 API while
maintaining full backward compatibility for v1 users through automatic
conversion during webhook processing.

RHOAIENG-36418

* test: fix e2e resilience test to work also for rhoai (#2714)

* chore: update manifest commit SHAs (#2694)

Co-authored-by: openshift-merge-bot <[email protected]>

* chore: update manifest commit SHAs (#2725)

Co-authored-by: openshift-merge-bot <[email protected]>

* fix: wrong image for kube-auth-proxy (#2716)

Signed-off-by: Wen Zhou <[email protected]>

* fix: improve test skip management to correctly manage unordered test run (#2721)

* chore: update manifest commit SHAs (#2726)

* chore: update manifest commit SHAs

* Update get_all_manifests.sh

Co-authored-by: Davide Bianchi <[email protected]>

---------

Co-authored-by: openshift-merge-bot <[email protected]>
Co-authored-by: Wen Zhou <[email protected]>
Co-authored-by: Davide Bianchi <[email protected]>

* update: remove unused oauth-proxy variable (#2727)

Signed-off-by: Wen Zhou <[email protected]>

* override image references from RHAIIS (#2734)

* chore: update manifest commit SHAs (#2735)

Co-authored-by: openshift-merge-bot <[email protected]>

* GatewayConfig does not reconcile when DSCInitialization is created/updated/deleted (#2737)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* Use variable for manifest folder in get_all_manifests.sh (#2729)

* fix: e2e test on main

* fix: move ray sha to fix e2e tests. Improve get_all_manifests.sh

* feat: remove rm of old manifests

* feat: use annotation on kube-auth deployment when secret is changed (#2740)

- calculate a hash of the secret data and set it as an annotation to trigger a
  deployment change => pod restart with the new secret value
- clean up rbac
- fix lint to use expectedODHDomain

Signed-off-by: Wen Zhou <[email protected]>

* Update: fix typos in code + update Oauthclient name (#2743)

* chore:  fix typos

Signed-off-by: Wen Zhou <[email protected]>

* update: change OAuthClient name to be platform agnostic

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* chore: update manifest commit SHAs (#2742)

Co-authored-by: openshift-merge-bot <[email protected]>

* fix: correct Unmanaged ManagementState description (#2746)

Update the Unmanaged ManagementState description to accurately reflect
the operator's behavior. The operator does not deploy or manage the
component's lifecycle when in Unmanaged state, but may still create
supporting configuration resources.

This fixes the incorrect description that stated the operator is
'actively managing the component', which contradicted the expected
semantics of an Unmanaged state.

Fixes: RHOAIENG-32465

* feat: remove custom mark OdhDasboardConfig as unmanaged internally code (#2672)

* chore: update manifest commit SHAs (#2752)

Co-authored-by: openshift-merge-bot <[email protected]>

* feat: add SNO-aware OpenTelemetry collector replica defaults (#2738)

Set CollectorReplicas default to 1 for single-node clusters and 2 for multi-node clusters

* cleaning up caikit-tgis-image (#2755)

* update: cleanup kueue controller image + remove support on Managed status (#2757)

* chore: cleanup kueue controller image

---------

Signed-off-by: Wen Zhou <[email protected]>

* refactor: keep gateway function in correct file (#2745)

- gateway_auth_actions.go should only host the functions createKubeAuthProxyInfrastructure and createEnvoyFilter
- gateway_controller_actions.go is only for the functions createGatewayInfrastructure, createDestinationRule, and syncGatewayConfigStatus
- all other functions should live in gateway_support.go
- gateway_util_test.go is for test support functions

Signed-off-by: Wen Zhou <[email protected]>

* chore: update manifest commit SHAs (#2761)

Co-authored-by: openshift-merge-bot <[email protected]>

* fix: remove appwrapper from kueue supported framework configuration (#2764)

* chore: update bundle manifests for SNO-aware collector replicas (#2765)

Update generated bundle manifests via make bundle

* fix: Remove DSCI requirement for service reconciliation (#2750)

* fix: Remove DSCI requirement for service reconciliation

Services can now reconcile without DSCInitialization present:
- Reconciler treats missing DSCI as acceptable (not an error)
- Hash function handles nil DSCI with sentinel value
- ApplicationNamespace helper implements on-demand DSCI fetch
- Auth and Monitoring services use helpers for safe DSCI access
- Templates handle nil DSCI gracefully

This allows services to start before platform initialization and
prevents them from getting stuck in failed state when DSCI doesn't
exist yet.

RHOAIENG-18035

* refactor: remove DSCI from ReconciliationRequest

Make DSCI optional in the reconciliation framework by removing it from
ReconciliationRequest. Actions now fetch DSCI on-demand using helper
functions ApplicationNamespace() and MonitoringNamespace().

Changes:
- Remove DSCI field from ReconciliationRequest struct
- Update ApplicationNamespace/MonitoringNamespace helpers to fetch DSCI
- Update template rendering to pass AppNamespace to templates
- Update all actions and tests to work without rr.DSCI
- Convert monitoring tests to use Gomega assertions

This allows services to reconcile independently of DSCI existence.

* Remove cleanup FeatureTrackers action from KServe controller

* feat: remove secretgenerator controller (#2749)

* feat: remove secretgenerator controller

- we no longer need that controller since the OAuthClient will be created by
  the Gateway service
- move secret-related functions into pkg

Signed-off-by: Wen Zhou <[email protected]>

* update: rewrite unit-test

- remove over testing
- only match two functions in secret.go

Signed-off-by: Wen Zhou <[email protected]>

* update: unit-test to use gomega

Signed-off-by: Wen Zhou <[email protected]>

* update: code review comments with more checks

Signed-off-by: Wen Zhou <[email protected]>

* update: remove docs content not valid any more

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* chore: update manifest commit SHAs (#2768)

Co-authored-by: openshift-merge-bot <[email protected]>
Co-authored-by: Wen Zhou <[email protected]>

* update: to support olmv1 we need to replace operatorcondition with (#2766)

subscription

- with this, we lose the ability to check the dependent operator version
- need a new predicate on subscription on .spec.name (metadata.name won't
  be useful)

Signed-off-by: Wen Zhou <[email protected]>

* update: enable seccomprofile to runtimedefault (#2770)

- we are on ocp 4.19 which support this function

Signed-off-by: Wen Zhou <[email protected]>

* fix: on managed cluster, we cannot set kueue to Managed, API won't work (#2774)

Signed-off-by: Ugo Giordano <[email protected]>

* test: add parallel components and services test execution (#2771)

* chore: update manifest commit SHAs (#2773)

Co-authored-by: openshift-merge-bot <[email protected]>

* Revert "update: to support olmv1 we need to replace operatorcondition with (#…" (#2779)

This reverts commit 6fb1439c98c4a29424bedfdeb8f145603ef14657.

* chore: uplift version from 3.0 to 3.2 (#2780)

- update comments; some features we need to keep until the first stable v3
  release

Signed-off-by: Wen Zhou <[email protected]>

* chore: update manifest commit SHAs (#2781)

Co-authored-by: openshift-merge-bot <[email protected]>

* docs: add e2e testing tips and FAQ section to README (#2682)

* docs: add e2e testing tips and FAQ section to README

Add comprehensive e2e testing guidance including setup
instructions, common workflows, and troubleshooting examples.
Includes minimum/recommended configurations, selective test
execution patterns, and component-specific testing commands.

* Clarify that env vars are space-separated

* chore: update manifest commit SHAs (#2787)

Co-authored-by: openshift-merge-bot <[email protected]>

* chore: cleanup comments and remove unused status for ArgoCD (#2788)

Signed-off-by: Wen Zhou <[email protected]>

* set defaults based on metric log data from perf test (#2791)

* chore: update manifest commit SHAs (#2794)

Co-authored-by: openshift-merge-bot <[email protected]>

* Watch ingress certificate secrets to automatically sync gateway cert on rotation (#2772)

* RHOAIENG-34484 | build: Remove generated files from build (#2329)

* build: Remove generated files from build

Instead generate them as needed. In order to allow the bundle to be
built from the existing `bundle.Dockerfile` mechanism, I introduced
some logic to generate it as a multi-stage dockerfile, where the first
stage runs `make bundle`.

Testing
-------

1. Build the bundle from main (`make bundle-build`), make note of
   the image hash
2. Build the bundle from this branch (`make bundle-build`), make note of
   the image hash
3. Mount both images (`podman image mount $hash1 ; podman image mount
   $hash2`)
4. compare the directories. I use `meld` for this.

Note the only difference is in timestamp.

* Set controller image in manager.yaml to REPLACE_IMAGE

* Also remove config/crd/external from source

* Remove generated webhook manifest

* Fix typo in .gitignore

* Pass version information as build args

* Handle empty env vars for build args

* Remove generated gatewayconfigs and servicemeshes crds

* fix(build): Address PR feedback

Add documentation comments explaining:
- OPERATOR_VERSION arg avoids clash with VERSION var
- tests directory needed for package references
- .dockerignore retains tests for bundle build

Revert some changes to the manager image overriding, instead reworking
the variable interpretation.

Use SED_COMMAND

* (feat): Upgrade automation (#2782)

- Update release automation to support manifest shas
- Remove duplication by using common utils

* chore: update manifest commit SHAs (#2802)

Co-authored-by: openshift-merge-bot <[email protected]>

* fix: generate manifest before run unit test, fetch all needed external CRD (#2804)

* chore: clean up more images which are not used for ai hub and ai pipeline (#2795)

* chore: clean up more images which are not used for ai hub and ai
pipeline

Signed-off-by: Wen Zhou <[email protected]>

* Update internal/controller/components/modelregistry/modelregistry_support.go

Co-authored-by: Jon Burdo <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>
Co-authored-by: Jon Burdo <[email protected]>

* chore: remove unnecessary deployment mode configuration from KServe component (#2777)

* Remove unnecessary DeployConfig struct from KServe component

Since RawDeployment is the only supported deployment mode and it's
hardcoded in the ConfigMap, we don't need the DeployConfig struct
and getDeployConfig function.

This cleanup is a follow-up to commit 8e20eeef which removed the
deployment mode field from the API.

Changes:
- Remove DeployConfig struct and getDeployConfig function from config.go
- Simplify updateInferenceCM to directly set hardcoded deployment mode
- Update tests to parse JSON directly instead of using removed struct

* Remove defaultDeploymentMode field from inference ConfigMap

Since RawDeployment is the only supported mode, we no longer need to
explicitly set the defaultDeploymentMode field in the inference ConfigMap.
KServe will use RawDeployment by default.

Changes:
- Remove DeployConfigName constant (no longer needed)
- Stop setting defaultDeploymentMode in updateInferenceCM
- Remove deploy config validation from tests
- Remove deploy config from test ConfigMap creation

* chore: optimize SNO detection using Infrastructure ControlPlaneTopology (#2784)

* chore: update manifest commit SHAs (#2810)

Co-authored-by: openshift-merge-bot <[email protected]>

* chore: update manifest commit SHAs (#2818)

Co-authored-by: openshift-merge-bot <[email protected]>

* feat: improve skip logic with parallel groups (#2819)

* chore: fix default HWP description (#2821)

* Skip oauth2_proxy cookie in upstream request (#2805)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* Make the subdomain configurable in gateway API (#2790)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* chore: update manifest commit SHAs (#2824)

Co-authored-by: openshift-merge-bot <[email protected]>

* feat: make it more flexible for the application namespace to be used for (#2814)

Gateway


(cherry picked from commit 01c7d6f6ecb0f49b3c906720982fe152e10a5ef3)

Signed-off-by: Wen Zhou <[email protected]>

* restrict httproutes to the application namespace (#2807)

* restrict httproutes to the application namespace

* fix unit tests

* refactor: use cluster.GetApplicationNamespace() instead of custom getPlatformNamespace()

* Call cluster.GetApplicationNamespace() directly instead of passing as parameter

* linter fixes

* adjust ext_authz timeout to 5s (#2823)

* adjusted timeout to 5s

* added env var and gateway config for auth timeout

* chore: update manifest commit SHAs (#2830)

Co-authored-by: openshift-merge-bot <[email protected]>

* feat: use bot to update manifest sha (#2838)

* docs: update pre-req for onboarding (#2835)

* docs: update pre-req for onboarding

Signed-off-by: Wen Zhou <[email protected]>

* update: update prom rules for new stack

* update: code review
- typo
- remove channel for cert-manager

* update: wording and links

* update: accurate info.

---------

Signed-off-by: Wen Zhou <[email protected]>

* feat: add env vars for configuring application and workbenches namespaces for e2e test suite (#2837)

extend TestContext with the new field for workbenches namespace

update CreateDSC() to support setting workbenches namespace via env var

update README, set defaults for app and wb namespaces

add variables for monitoring namespace

update e2e test instructions for custom app namespace

* chore(modelregistry): add PostgreSQL 16 image mapping (#2812)

* chore(modelregistry): add PostgreSQL 16 image mapping

Add IMAGES_POSTGRES to RELATED_IMAGE_RHEL9_POSTGRES16_IMAGE mapping
in model registry image replacement configuration to support PostgreSQL
16 container image deployment.

Related: https://github.com/opendatahub-io/model-registry-operator/pull/374

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Chris Hambridge <[email protected]>

* Update internal/controller/components/modelregistry/modelregistry_support.go

---------

Signed-off-by: Chris Hambridge <[email protected]>
Co-authored-by: Claude <[email protected]>

* chore: update manifest commit SHAs (#2843)

Co-authored-by: openshift-merge-bot <[email protected]>

* build: update image placeholder strategy (#2842)

Jira: [RHOAIENG-38505](https://issues.redhat.com/browse/RHOAIENG-38505)

Change the kustomize image reference from a hardcoded "controller"
name to a "REPLACE_IMAGE" placeholder.

The previous strategy didn't work properly for CI, because the
[substitution logic](https://github.com/openshift/ci-tools/blob/2401b722f08b32d2a6926e1d918ef010ef37992c/pkg/steps/bundle_source.go#L90)
modifies YAML files in the repo.

* docs: update README for workbenchnamespace (#2845)

Signed-off-by: Wen Zhou <[email protected]>

* update: add check on dashboard's acceleratorprofile (#2846)

- if the CRD does not exist (e.g. from 3.0), there is no need to own it -- no
  error in the Operator

Signed-off-by: Wen Zhou <[email protected]>

* feat: add warning when create/update connection API secret on S3 type but no AWS_S3_BUCKET or with "" as value (#2732)

* feat: add warning when create/update connection API secret on S3 type
- if the secret does not have AWS_S3_BUCKET we show a warning but still
  allow it
- warning if AWS_S3_BUCKET exists but has "" as value
---------

Signed-off-by: Wen Zhou <[email protected]>

* chore: update manifest commit SHAs (#2850)

Co-authored-by: openshift-merge-bot <[email protected]>

* (fix): Adapt release automation to the new make file changes (#2851)

* feat: add e2e test case for verifying workbench namespace configuration (#2852)

* Add comprehensive E2E tests for Gateway with OpenShift OAuth authentication (#2753)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* RHOAIENG-25593 | feat: adding perses instance (#2642)

Co-authored-by: den-rgb <[email protected]>

* RHOAIENG-34502 | feat: Enable rhoai build from main branch (#2220)

* feat: Enable rhoai build from main branch

This required significant changes to the Makefile and a few different strategies:

- conditionally build different versions of some structs, where there is an irreconcilable difference between `main` and `rhoai` branches (using build tags)
- maintain a separate overlay of manifests and separate bundle, tracking `rhoai` specific changes where necessary.

Renamed directories:
- `bundle` -> `odh-bundle`
- `config` -> `odh-config`

New directories:
- `rhoai-bundle`: contains the RHOAI bundle
- `rhoai-config`: contains the RHOAI manifests

With these changes most Make targets now accept the `ODH_PLATFORM_TYPE` parameter, and operate in either an odh-mode by default, or a rhoai mode if overridden to any value other than `OpenDataHub`.

`get_all_manifests.sh` now has a different mode when passed `ODH_PLATFORM_TYPE` other than `OpenDataHub`, where it looks at $VERSION and infers the downstream git reference to use. (It is most easily invoked via `make get-manifests ODH_PLATFORM_TYPE=rhoai`).

This adds RHOAI-specific Dockerfiles for the operator and the bundle.

See the difference between the rhoai versions and odh versions by using a diff tool, such as `meld` or `diff -u`.

You can compare the resulting bundle for differences by checking out the rhoai branch, and comparing `bundle.rhoai` to `bundle` in the `rhoai` branch.

There are a number of small differences related to changes that haven't been made to the `rhoai` branch.

* Update rhoai contents

* build: update CLEANFILES to use explicit paths

Replace CONFIG_DIR variable with explicit odh-config and
rhoai-config paths in CLEANFILES to ensure proper cleanup of
generated files for both configurations.

* refactor(api): consolidate build tag type definitions

Extract platform-specific type definitions from monolithic files
into separate build-tagged files. Move ODH and RHOAI variant type
definitions into dedicated .odh.go and .rhoai.go files while
keeping shared types in base files.

Affected components:
- ModelRegistry (components/v1alpha1)
- Workbenches (components/v1alpha1)
- DSCInitialization (dscinitialization/v1, v2)
- Monitoring (services/v1alpha1)

This refactoring improves code organization and maintainability
by separating platform-specific implementations using Go build
tags rather than duplicating entire files.

* docs: fix incorrect reference in MonitoringCommonSpec description

Update MonitoringCommonSpec documentation to correctly reference
"Monitoring" instead of "Dashboard" as the shared desired state.
Remove trailing blank lines at end of file.

* fix: Remove stale DEFAULT_REF reference

* build(makefile): update CLEANFILES to use specific bundle directories

Replace generic BUNDLE_DIR variable with explicit bundle directory
names (rhoai-bundle and odh-bundle) in CLEANFILES to improve clarity
and ensure both bundle variants are properly cleaned.

* chore(api): update copyright year to 2025

* docs: Restore extraneous lines in api docs

Locally, I use a pre-commit hook to clean up ends-of-files, but this is
incompatible with our CI check.

* Fix VERSION for odh/rhoai and make them independent

* build: update image placeholder strategy

Jira: [RHOAIENG-38505](https://issues.redhat.com/browse/RHOAIENG-38505)

Change the kustomize image reference from a hardcoded "controller"
name to a "REPLACE_IMAGE" placeholder.

The previous strategy didn't work properly for CI, because the
[substitution logic](https://github.com/openshift/ci-tools/blob/2401b722f08b32d2a6926e1d918ef010ef37992c/pkg/steps/bundle_source.go#L90)
modifies YAML files in the repo.

* Update placeholder for rhoai manager image

* upgrade Go to 1.24 and update rhoai Dockerfile

- Upgrade GOLANG_VERSION from 1.23 to 1.24
- Set ODH_PLATFORM_TYPE=rhoai in manifest script
- Replace kueue-configs with hardwareprofiles config
- Install tar for component dev workflow compatibility
- Remove redundant chown in favor of COPY --chown

* build(makefile): use platform-specific Dockerfile for image builds

Add DOCKERFILE_FILENAME variable to dynamically select between
Dockerfile (ODH) and rhoai.Dockerfile (RHOAI) based on
ODH_PLATFORM_TYPE. This ensures image-build target uses the correct
Dockerfile for each platform variant.

* docs(readme,testing): document RHOAI build mode and update test guidelines

Update README with instructions for building operator and bundles
in RHOAI mode using ODH_PLATFORM_TYPE=rhoai flag. Remove bundle
manifests from mandatory integration testing requirements and fix
minor formatting inconsistencies.

* Move connectionApi

* push the e2e test image with the tag main instead of adding the commit hash (#2856)

* update: move Prom Rule tests into component folder (#2858)

- test on the PrometheusRule CR not the old Prom config files
- remove triage which is only for SRE
- rename rule files to match component name
- remove deadmansnitch which is only for SRE

Signed-off-by: Wen Zhou <[email protected]>

* add some enhancements to the e2e push gha to have a version for odh (main branch) and rhoai X.Y (rhoai-x.y branch) (#2860)

* RHOAIENG-25595 | Add Perses-Tempo datasource integration (#2553)

Co-authored-by: Dayakar Maruboena <[email protected]>

* Remove load restrictions none for kustomize builds (#2863)

* feat: remove LoadRestrictionsNone from kustomize

* feat: move rhoai-config into odh-config/rhoai overlay

* rebase with master

* fix: generate rhoai manifest and webhook name

* Update manifest sha in get_all_manifests.sh correctly (#2870)

* fix: update manifest sha in get_all_manifests.sh correctly

* fix: revert namespace name in makefile

* feat: update rhoai manifests to rhoai-3.2 tag

* docs: note bundle is still needed for rhoai branch (#2801)

* docs: note bundle is still needed for rhoai branch

* Fix typo

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix: dsc odh samples (#2869)

* update: cleanup unnecessary upgrade logic (#2695)

- remove functions and calls not needed for 3.2 (should have been removed
  in 3.0 already)
  jupyterhub resource cleanup
  rbac for modelreg
  envoyfilter patch for serving
  watson resource docs
  patch odhdashboardconfig for trusty enablement
- move and cleanup function for AP to HWP migration

Signed-off-by: Wen Zhou <[email protected]>

* chore: update manifest commit SHAs (#2875)

Co-authored-by: openshift-merge-bot <148852131+openshift-merg…