-
Notifications
You must be signed in to change notification settings - Fork 214
update: cleanup unnecessary upgrade logic + move functions #2695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update: cleanup unnecessary upgrade logic + move functions #2695
Conversation
WalkthroughMoved deployed-release lookup into pkg/cluster; added idempotent HardwareProfile creation in pkg/cluster; added resource-deletion and owner-reference utilities in pkg/resources; removed legacy cleanup/resource-spec logic from pkg/upgrade and introduced hardware-profile and feature constants; cmd/main now calls cluster.GetDeployedRelease. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant CLI as cmd/main
participant ClusterPkg as pkg/cluster
participant Upgrade as pkg/upgrade
participant K8s as Kubernetes API
CLI->>ClusterPkg: GetDeployedRelease(ctx)
alt DSCI exists
ClusterPkg->>K8s: GET DSCInitialization
K8s-->>ClusterPkg: DSCI with Status.Release
ClusterPkg-->>CLI: Release
else DSCI missing, DSC exists
ClusterPkg->>K8s: GET DataScienceCluster
K8s-->>ClusterPkg: DSC with Status.Release
ClusterPkg-->>CLI: Release
else none exist
ClusterPkg-->>CLI: empty Release / not-found
end
Upgrade->>ClusterPkg: CreateHardwareProfile(ctx,hwp)
ClusterPkg->>K8s: CREATE HardwareProfile
alt already exists
K8s-->>ClusterPkg: AlreadyExists
ClusterPkg-->>Upgrade: nil (idempotent)
else created
K8s-->>ClusterPkg: Created
ClusterPkg-->>Upgrade: created
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
pkg/cluster/resources.go (1)
160-167: Harden CreateHardwareProfile: add nil guard and retry/backoff (webhook/startup ready).Prevents nil deref on hwp and aligns creation with existing CreateWithRetry behavior.
-func CreateHardwareProfile(ctx context.Context, cli client.Client, hwp *infrav1.HardwareProfile) error { - if err := cli.Create(ctx, hwp); err != nil { - if !k8serr.IsAlreadyExists(err) { - return fmt.Errorf("failed to create HardwareProfile '%s/%s': %w", hwp.Namespace, hwp.Name, err) - } - } - return nil -} +func CreateHardwareProfile(ctx context.Context, cli client.Client, hwp *infrav1.HardwareProfile) error { + if hwp == nil { + return errors.New("nil HardwareProfile") + } + if err := CreateWithRetry(ctx, cli, hwp); err != nil { + return fmt.Errorf("failed to create HardwareProfile '%s/%s': %w", hwp.Namespace, hwp.Name, err) + } + return nil +}cmd/main.go (1)
261-263: Don’t drop GetDeployedRelease error; log and continue with zero value.Improves debuggability without changing behavior.
- // get old release version before we create default DSCI CR - oldReleaseVersion, _ := cluster.GetDeployedRelease(ctx, setupClient) + // get old release version before we create default DSCI CR + oldReleaseVersion, err := cluster.GetDeployedRelease(ctx, setupClient) + if err != nil { + setupLog.Info("unable to get deployed release; defaulting to empty", "err", err) + }pkg/upgrade/upgrade.go (2)
766-775: Preserve standardized HWP annotations; avoid AP overwriting them.Currently AP annotations override “feature-visibility”, “modified-date”, etc. Prefer copying only non-conflicting keys.
- // Copy existing annotations from AP - if apAnnotations := ap.GetAnnotations(); apAnnotations != nil { - for k, v := range apAnnotations { - annotations[k] = v - } - } + // Copy existing annotations from AP without overriding standardized keys + if apAnnotations := ap.GetAnnotations(); apAnnotations != nil { + for k, v := range apAnnotations { + if _, exists := annotations[k]; !exists { + annotations[k] = v + } + } + }
941-961: Use UTC for modified-date annotation.Ensures consistent, timezone-agnostic timestamps.
- "opendatahub.io/modified-date": time.Now().Format(time.RFC3339), + "opendatahub.io/modified-date": time.Now().UTC().Format(time.RFC3339),pkg/resources/resources.go (1)
681-717: Make DeleteOneResource resilient: skip missing paths, ignore NotFound, aggregate errors.Avoid aborting the whole batch when a single item lacks the field or is concurrently deleted.
-func DeleteOneResource(ctx context.Context, c client.Client, res ResourceSpec) error { - log := logf.FromContext(ctx) - list := &unstructured.UnstructuredList{} - list.SetGroupVersionKind(res.Gvk) - - err := c.List(ctx, list, client.InNamespace(res.Namespace)) - if err != nil { - if meta.IsNoMatchError(err) { - log.Info("CRD not found, will not delete", "gvk", res.Gvk.String()) - return nil - } - return fmt.Errorf("failed to list %s: %w", res.Gvk.Kind, err) - } - - for _, item := range list.Items { - v, ok, err := unstructured.NestedString(item.Object, res.Path...) - if err != nil { - return fmt.Errorf("failed to get field %v for %s %s/%s: %w", res.Path, res.Gvk.Kind, res.Namespace, item.GetName(), err) - } - - if !ok { - return fmt.Errorf("unexisting path to delete: %v", res.Path) - } - - for _, toDelete := range res.Values { - if v == toDelete { - err = c.Delete(ctx, &item) - if err != nil { - return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err) - } - log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace) - } - } - } - - return nil -} +func DeleteOneResource(ctx context.Context, c client.Client, res ResourceSpec) error { + log := logf.FromContext(ctx) + list := &unstructured.UnstructuredList{} + list.SetGroupVersionKind(res.Gvk) + + if err := c.List(ctx, list, client.InNamespace(res.Namespace)); err != nil { + if meta.IsNoMatchError(err) { + log.Info("CRD not found, will not delete", "gvk", res.Gvk.String()) + return nil + } + return fmt.Errorf("failed to list %s: %w", res.Gvk.Kind, err) + } + + var merr *multierror.Error + for i := range list.Items { + item := list.Items[i] + v, ok, err := unstructured.NestedString(item.Object, res.Path...) + if err != nil { + merr = multierror.Append(merr, fmt.Errorf("failed to get field %v for %s %s/%s: %w", res.Path, res.Gvk.Kind, res.Namespace, item.GetName(), err)) + continue + } + if !ok { + // field absent on this item; nothing to match + continue + } + for _, toDelete := range res.Values { + if v == toDelete { + if err := c.Delete(ctx, &item); err != nil { + if k8serr.IsNotFound(err) { + break + } + merr = multierror.Append(merr, fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err)) + break + } + log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace) + break + } + } + } + return merr.ErrorOrNil() +}Based on learnings
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
cmd/main.go(1 hunks)pkg/cluster/cluster_config.go(1 hunks)pkg/cluster/resources.go(1 hunks)pkg/resources/resources.go(3 hunks)pkg/upgrade/upgrade.go(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
CreateHardwareProfile(339-384)pkg/cluster/gvk/gvk.go (2)
HardwareProfile(88-92)Namespace(28-32)
cmd/main.go (1)
pkg/cluster/cluster_config.go (1)
GetDeployedRelease(107-131)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (1)
CreateHardwareProfile(160-167)
pkg/cluster/cluster_config.go (2)
api/common/types.go (1)
Release(159-162)pkg/cluster/resources.go (2)
GetDSCI(115-135)GetDSC(92-112)
pkg/resources/resources.go (2)
pkg/cluster/gvk/gvk.go (1)
Namespace(28-32)pkg/resources/resources_types.go (1)
UnstructuredList(11-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build/push catalog image
- GitHub Check: golangci-lint
- GitHub Check: Run tests and collect coverage on internal and pkg
🔇 Additional comments (5)
pkg/cluster/cluster_config.go (1)
93-131: GetDeployedRelease flow looks correct and robust.NotFound handling and fallback to DSC are appropriate; empty Release on clean installs is clear.
pkg/upgrade/upgrade.go (1)
414-416: Good reuse of cluster.CreateHardwareProfile.Centralizing creation improves consistency; pairs well if CreateHardwareProfile uses retry/backoff.
Ensure cluster.CreateHardwareProfile includes retry/backoff (see suggested change in pkg/cluster/resources.go).
Also applies to: 477-480
pkg/resources/resources.go (3)
33-42: ResourceSpec API looks good.Clear contract for GVK, namespace, path, values.
659-669: Batch deletion with multierror is appropriate.Append + ErrorOrNil used correctly. Based on learnings.
731-739: UnsetOwnerReferences behavior is fine.Simple orphaning via Update; no changes needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/resources/resources.go (1)
706-714: Add break after successful deletion.After deleting an item when a FilterValue matches, the loop continues checking remaining FilterValues. While functionally harmless (the deleted item won't match again), this is inefficient. Add a
breakstatement after the delete succeeds to exit the inner loop early.Apply this diff:
for _, targetValue := range res.FilterValues { if v == targetValue { err = c.Delete(ctx, &item) if err != nil { return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err) } log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace) + break } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
cmd/main.go(1 hunks)pkg/cluster/cluster_config.go(1 hunks)pkg/cluster/resources.go(1 hunks)pkg/resources/resources.go(3 hunks)pkg/upgrade/upgrade.go(8 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cmd/main.go
- pkg/cluster/cluster_config.go
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
CreateHardwareProfile(339-384)pkg/cluster/gvk/gvk.go (1)
HardwareProfile(88-92)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (1)
CreateHardwareProfile(160-167)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build/push catalog image
- GitHub Check: Run tests and collect coverage on internal and pkg
- GitHub Check: golangci-lint
🔇 Additional comments (7)
pkg/cluster/resources.go (1)
150-167: LGTM: Idempotent HardwareProfile creation.The function correctly implements idempotent resource creation by treating "AlreadyExists" errors as success. Error messages provide clear context including namespace and name.
pkg/upgrade/upgrade.go (3)
51-64: LGTM: Well-structured ContainerSize type.The nested struct definition clearly maps to the OdhDashboardConfig structure, making it straightforward to parse container size specifications.
414-415: LGTM: Centralized HardwareProfile creation.The refactoring to use
cluster.CreateHardwareProfilecentralizes creation logic and provides consistent error handling. The error messages include helpful context (profile type, namespace, container size name).Also applies to: 477-479
942-961: LGTM: Centralized annotation generation.The helper function ensures consistent annotation structure across all HardwareProfile creation paths, reducing duplication and potential inconsistencies.
pkg/resources/resources.go (3)
33-43: LGTM: Well-documented ResourceSpec type.The struct is clearly defined with helpful comments explaining the purpose of each field, particularly the FieldPath and FilterValues fields.
650-669: LGTM: Proper error aggregation.The function correctly uses
multierrorto collect and return all errors encountered during batch deletion, following the pattern established in the learnings.
720-740: LGTM: Clear orphaning utility.The function correctly removes owner references and updates the resource, with appropriate error context.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2695 +/- ##
==========================================
+ Coverage 49.65% 49.74% +0.09%
==========================================
Files 144 144
Lines 10562 10388 -174
==========================================
- Hits 5245 5168 -77
+ Misses 4758 4666 -92
+ Partials 559 554 -5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/upgrade/upgrade_utils.go (1)
110-136: Do not treat missing manifest as an error.When the OdhDashboardConfig manifest isn’t present, return (nil, false, nil) so callers can gracefully skip. Currently, the os.Stat error propagates and aborts migration.
func loadOdhDashboardConfigFromManifests(ctx context.Context) (*unstructured.Unstructured, bool, error) { @@ - _, err := os.Stat(manifestPath) - if err == nil { + if _, err := os.Stat(manifestPath); err == nil { // ... read & parse manifest ... // Verify it's an OdhDashboardConfig if obj.GetKind() == "OdhDashboardConfig" { log.Info("Successfully loaded OdhDashboardConfig from manifest", "path", manifestPath) return &obj, true, nil } } - return nil, false, err + // Manifest not found (or not a matching kind) is not an error: signal “not found”. + return nil, false, nil }
♻️ Duplicate comments (2)
pkg/resources/resources.go (2)
683-718: Make deletion robust: skip missing field, ignore 404s, conditional namespace; fix typo.
- Don’t fail the whole batch when FieldPath is absent on an item; skip it.
- Treat NotFound on delete as benign (idempotent cleanup).
- Apply InNamespace only when non-empty to avoid issues with cluster-scoped kinds.
- Fix the “unexisting” typo in the error message.
Apply this diff:
@@ - err := c.List(ctx, list, client.InNamespace(res.Namespace)) + var opts []client.ListOption + if res.Namespace != "" { + opts = append(opts, client.InNamespace(res.Namespace)) + } + err := c.List(ctx, list, opts...) @@ - if !ok { - return fmt.Errorf("unexisting path to delete: %v", res.FieldPath) - } + if !ok { + log.V(1).Info("Field path not found; skipping item", + "fieldPath", res.FieldPath, "name", item.GetName(), + "gvk", res.Gvk.String(), "namespace", res.Namespace) + continue + } @@ - err = c.Delete(ctx, &item) - if err != nil { - return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err) - } + if err := c.Delete(ctx, &item); err != nil { + if k8serr.IsNotFound(err) { + log.V(1).Info("Object already deleted", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace) + continue + } + return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err) + } log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace)
702-704: Fix wording: “nonexistent field path”.Update the error text for correctness.
- return fmt.Errorf("unexisting path to delete: %v", res.FieldPath) + return fmt.Errorf("nonexistent field path: %v", res.FieldPath)
🧹 Nitpick comments (3)
pkg/resources/resources.go (1)
35-43: Nit: prefer conventional initialism “GVK”.Rename field Gvk → GVK for consistency with Go initialism conventions. Low priority.
pkg/upgrade/upgrade_utils.go (2)
159-185: Minor: simplify max tracking; avoid self-shadowing.Use existing variables directly to improve readability.
- if LimitMemory := LimitMem; !LimitMemory.IsZero() && LimitMemory.Cmp(maxMemory) > 0 { - maxMemory = LimitMemory - } - if LimitCpu := LimitCpu; !LimitCpu.IsZero() && LimitCpu.Cmp(maxCpu) > 0 { - maxCpu = LimitCpu - } + if !LimitMem.IsZero() && LimitMem.Cmp(maxMemory) > 0 { + maxMemory = LimitMem + } + if !LimitCpu.IsZero() && LimitCpu.Cmp(maxCpu) > 0 { + maxCpu = LimitCpu + }
215-239: Optional: reorder returns for clarity (or return a struct).Returning (ReqMem, ReqCpu, LimitMem, LimitCpu) is non-intuitive. Consider returning in CPU-first order and updating the call site, or define a small struct.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
cmd/main.go(1 hunks)pkg/cluster/cluster_config.go(1 hunks)pkg/cluster/resources.go(1 hunks)pkg/resources/resources.go(3 hunks)pkg/upgrade/upgrade.go(7 hunks)pkg/upgrade/upgrade_utils.go(10 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/cluster/resources.go
- cmd/main.go
🧰 Additional context used
🧬 Code graph analysis (4)
pkg/cluster/cluster_config.go (2)
api/common/types.go (1)
Release(159-162)pkg/cluster/resources.go (2)
GetDSCI(115-135)GetDSC(92-112)
pkg/upgrade/upgrade_utils.go (1)
pkg/cluster/resources.go (1)
CreateHardwareProfile(160-167)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (2)
GetHardwareProfile(138-148)CreateHardwareProfile(160-167)
pkg/resources/resources.go (2)
pkg/cluster/gvk/gvk.go (1)
Namespace(28-32)pkg/resources/resources_types.go (1)
UnstructuredList(11-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build/push catalog image
- GitHub Check: Run tests and collect coverage on internal and pkg
- GitHub Check: golangci-lint
🔇 Additional comments (3)
pkg/cluster/cluster_config.go (1)
93-131: LGTM: clean, correct fallback order (DSCI → DSC) and safe not-found handling.pkg/upgrade/upgrade.go (2)
508-556: LGTM: idempotent creation of custom-serving HardwareProfile with standardized annotations.
327-349: LGTM: fetch OdhDashboardConfig once; ensure custom-serving HWP exists before ISVC migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/upgrade/upgrade_utils.go (1)
110-136: Treat missing manifest as “not found,” not an error.os.Stat failure on absent file currently bubbles up and aborts migration. Return (nil, false, nil) for not-exist; only propagate other errors.
manifestPath := deploy.DefaultManifestPath + odhDashboardConfigPath - _, err := os.Stat(manifestPath) - if err == nil { + _, err := os.Stat(manifestPath) + if err == nil { // ... load and return ... - } - return nil, false, err + } + if os.IsNotExist(err) { + return nil, false, nil + } + return nil, false, err
♻️ Duplicate comments (1)
pkg/resources/resources.go (1)
683-715: Make deletion robust: don’t fail on missing field, log with actual namespace, and break after delete.Current behavior aborts the entire sweep if a single item lacks the field, and may attempt multiple deletes if FilterValues has duplicates.
- for _, item := range list.Items { - v, ok, err := unstructured.NestedString(item.Object, res.FieldPath...) + for _, item := range list.Items { + v, ok, err := unstructured.NestedString(item.Object, res.FieldPath...) if err != nil { - return fmt.Errorf("failed to get field %v for %s %s/%s: %w", res.FieldPath, res.Gvk.Kind, res.Namespace, item.GetName(), err) + return fmt.Errorf("failed to get field %v for %s %s/%s: %w", + res.FieldPath, res.Gvk.Kind, item.GetNamespace(), item.GetName(), err) } - if !ok { - return fmt.Errorf("unexisting path to delete: %v", res.FieldPath) - } + if !ok { + log.Info("nonexistent field path; skipping item", + "fieldPath", res.FieldPath, "gvk", res.Gvk.String(), + "namespace", item.GetNamespace(), "name", item.GetName()) + continue + } for _, targetValue := range res.FilterValues { if v == targetValue { - err = c.Delete(ctx, &item) - if err != nil { - return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err) - } - log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace) + if err := c.Delete(ctx, &item); err != nil { + return fmt.Errorf("failed to delete %s %s/%s: %w", + res.Gvk.Kind, item.GetNamespace(), item.GetName(), err) + } + log.Info("Deleted object", + "gvk", res.Gvk.String(), + "namespace", item.GetNamespace(), + "name", item.GetName()) + break } } }Also fixes the previously reported wording.
Based on past review comments
🧹 Nitpick comments (5)
pkg/cluster/resources.go (1)
150-167: Harden creation against webhook timing; reuse CreateWithRetry.Use the existing exponential backoff helper to tolerate transient 500s from webhooks and keep idempotency.
func CreateHardwareProfile(ctx context.Context, cli client.Client, hwp *infrav1.HardwareProfile) error { - if err := cli.Create(ctx, hwp); err != nil { - if !k8serr.IsAlreadyExists(err) { - return fmt.Errorf("failed to create HardwareProfile '%s/%s': %w", hwp.Namespace, hwp.Name, err) - } - } - return nil + if err := CreateWithRetry(ctx, cli, hwp); err != nil { + return fmt.Errorf("failed to create HardwareProfile '%s/%s': %w", hwp.Namespace, hwp.Name, err) + } + return nil }pkg/resources/resources.go (2)
33-43: ResourceSpec is clear; minor naming nit.Consider renaming field Gvk → GVK to follow Go initialism conventions. Not blocking.
660-669: Avoid shadowing stdliberrorsand skip appending nils.Rename the accumulator for readability and to avoid confusion with the imported
errorspackage.-func DeleteResources(ctx context.Context, c client.Client, resources []ResourceSpec) error { - var errors *multierror.Error +func DeleteResources(ctx context.Context, c client.Client, resources []ResourceSpec) error { + var agg *multierror.Error for _, res := range resources { - err := DeleteOneResource(ctx, c, res) - errors = multierror.Append(errors, err) + if err := DeleteOneResource(ctx, c, res); err != nil { + agg = multierror.Append(agg, err) + } } - return errors.ErrorOrNil() + return agg.ErrorOrNil() }pkg/upgrade/upgrade_utils.go (2)
671-674: Use thecustomServingconstant instead of a string literal.Prevents drift if the name changes once.
- if strings.HasPrefix(hwpName, containerSizeHWPPrefix) || (hwpName == "custom-serving") { + if strings.HasPrefix(hwpName, containerSizeHWPPrefix) || hwpName == customServing { annotations[hardwareProfileNamespaceAnnotation] = namespace }
384-388: Don’t allow AP annotations to override managed HWP annotations.Protect keys we set (visibility, modified-date, display-name, description, disabled) when copying from AP.
- // Copy existing annotations from AP - if apAnnotations := ap.GetAnnotations(); apAnnotations != nil { - for k, v := range apAnnotations { - annotations[k] = v - } - } + // Copy non-managed annotations from AP + if apAnnotations := ap.GetAnnotations(); apAnnotations != nil { + reserved := map[string]struct{}{ + hardwareProfileVisibilityAnnotation: {}, + hardwareProfileModifiedDateAnnotation: {}, + hardwareProfileDisplayNameAnnotation: {}, + hardwareProfileDescriptionAnnotation: {}, + hardwareProfileDisabledAnnotation: {}, + } + for k, v := range apAnnotations { + if _, keep := reserved[k]; keep { + continue + } + annotations[k] = v + } + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
cmd/main.go(1 hunks)pkg/cluster/cluster_config.go(1 hunks)pkg/cluster/resources.go(1 hunks)pkg/resources/resources.go(3 hunks)pkg/upgrade/upgrade.go(7 hunks)pkg/upgrade/upgrade_utils.go(10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/main.go
🧰 Additional context used
🧬 Code graph analysis (5)
pkg/resources/resources.go (1)
pkg/cluster/gvk/gvk.go (1)
Namespace(28-32)
pkg/upgrade/upgrade_utils.go (1)
pkg/cluster/resources.go (1)
CreateHardwareProfile(160-167)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
CreateHardwareProfile(339-384)pkg/cluster/gvk/gvk.go (1)
HardwareProfile(88-92)
pkg/cluster/cluster_config.go (2)
api/common/types.go (1)
Release(159-162)pkg/cluster/resources.go (2)
GetDSCI(115-135)GetDSC(92-112)
pkg/upgrade/upgrade.go (2)
pkg/cluster/resources.go (2)
GetHardwareProfile(138-148)CreateHardwareProfile(160-167)tests/e2e/helper_test.go (1)
CreateHardwareProfile(339-384)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build/push catalog image
- GitHub Check: Run tests and collect coverage on internal and pkg
- GitHub Check: golangci-lint
🔇 Additional comments (3)
pkg/cluster/cluster_config.go (1)
93-131: Release detection logic looks correct; fallback order is sensible.DSCI → DSC fallback with not-found handling is clean. No action required.
pkg/upgrade/upgrade_utils.go (1)
98-101: Centralized HWP creation path looks good.Error context is clear and uses the new cluster.CreateHardwareProfile.
pkg/upgrade/upgrade.go (1)
508-556: Custom-serving HWP creation flow is clean and idempotent.Existence check → annotated create with centralized helper → clear logging. Looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
pkg/resources/resources.go (1)
702-704: Fix typo in error message.The error message uses "unexisting path" which is grammatically incorrect. As noted in a previous review, this should be "nonexistent field path" or similar.
Apply this diff:
- if !ok { - return fmt.Errorf("unexisting path to delete: %v", res.FieldPath) - } + if !ok { + return fmt.Errorf("nonexistent field path: %v", res.FieldPath) + }pkg/upgrade/upgrade.go (1)
354-354: Fix typo in comment."HardwarrfeProfiles" should be "HardwareProfiles". This was flagged in a previous review.
Apply this diff:
-// as described in RHOAIENG-33158. This creates 2 HardwarrfeProfiles for each AP (notebooks and serving). +// as described in RHOAIENG-33158. This creates 2 HardwareProfiles for each AP (notebooks and serving).
🧹 Nitpick comments (2)
pkg/resources/resources.go (1)
33-43: Consider clarifying field naming per previous review feedback.The
FieldPathandFilterValuesnaming works but could be clearer. A past reviewer suggested considering names likeFilterFieldPathandFilterFieldValuesto make the filtering intent more explicit.pkg/upgrade/upgrade.go (1)
632-632: Consider using string formatting for the log message.The string concatenation works but could be cleaner with fmt.Sprintf or by adding the hardwareProfile value as a separate log field (which is already done).
Optional alternative:
- log.Info("Set HardwareProfile annotation for InferenceService with "+customServing+" HardwareProfile", "isvc", isvc.GetName(), "hardwareProfile", hwpName) + log.Info("Set HardwareProfile annotation for InferenceService with custom-serving HardwareProfile", "isvc", isvc.GetName(), "hardwareProfile", hwpName)Or simply:
- log.Info("Set HardwareProfile annotation for InferenceService with "+customServing+" HardwareProfile", "isvc", isvc.GetName(), "hardwareProfile", hwpName) + log.Info("Set default HardwareProfile annotation for InferenceService", "isvc", isvc.GetName(), "hardwareProfile", hwpName)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
Makefile(1 hunks)bundle/manifests/opendatahub-operator.clusterserviceversion.yaml(3 hunks)cmd/main.go(1 hunks)config/manifests/bases/opendatahub-operator.clusterserviceversion.yaml(3 hunks)pkg/cluster/cluster_config.go(1 hunks)pkg/cluster/resources.go(1 hunks)pkg/resources/resources.go(3 hunks)pkg/upgrade/upgrade.go(7 hunks)pkg/upgrade/upgrade_utils.go(10 hunks)
✅ Files skipped from review due to trivial changes (1)
- Makefile
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/cluster/cluster_config.go
- pkg/upgrade/upgrade_utils.go
🧰 Additional context used
🧬 Code graph analysis (4)
cmd/main.go (1)
pkg/cluster/cluster_config.go (1)
GetDeployedRelease(107-131)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
CreateHardwareProfile(339-384)pkg/cluster/gvk/gvk.go (1)
HardwareProfile(88-92)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (2)
GetHardwareProfile(138-148)CreateHardwareProfile(160-167)
pkg/resources/resources.go (2)
pkg/cluster/gvk/gvk.go (1)
Namespace(28-32)pkg/resources/resources_types.go (1)
UnstructuredList(11-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build/push catalog image
- GitHub Check: Run tests and collect coverage on internal and pkg
- GitHub Check: golangci-lint
🔇 Additional comments (9)
config/manifests/bases/opendatahub-operator.clusterserviceversion.yaml (1)
9-9: Version bumps are consistent across manifest identifiers.All three version references (containerImage, metadata.name, and spec.version) have been updated uniformly from 3.0.0 to 3.1.0, maintaining consistency for the operator distribution manifest.
Also applies to: 20-20, 175-175
bundle/manifests/opendatahub-operator.clusterserviceversion.yaml (1)
85-85: Bundle manifest version bumps align with base manifest.Version identifiers in the bundle manifest are updated consistently with the base manifest: containerImage, metadata.name, and spec.version all reflect the 3.1.0 release version uniformly.
Also applies to: 98-98, 1537-1537
pkg/cluster/resources.go (1)
150-167: LGTM!The function correctly implements idempotent HardwareProfile creation with appropriate error handling and context.
pkg/resources/resources.go (2)
650-669: LGTM!The error aggregation pattern correctly follows go-multierror best practices, using a pointer and calling ErrorOrNil() to return nil when no errors occurred.
720-740: LGTM!The function correctly handles owner reference removal with appropriate early-return optimization when no owner references exist.
cmd/main.go (1)
262-262: LGTM!The change correctly uses the relocated
cluster.GetDeployedReleasefunction, aligning with the PR's objective to move functions to the cluster package for reuse.pkg/upgrade/upgrade.go (3)
35-54: LGTM!The new constants appropriately centralize magic strings and improve code maintainability with clear, consistent naming.
347-347: LGTM!The function call correctly uses the renamed
createCustomServingHardwareProfile(now package-private), which is appropriate since it's only used within the upgrade package.
508-556: LGTM!The function correctly uses the centralized
cluster.CreateHardwareProfileand follows proper idempotent creation patterns. The annotation handling appropriately sets the managed flag to "false" for the custom-serving profile.
38ec9b9 to
17a9c00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
pkg/resources/resources.go (1)
703-703: Fix typo in error message.The error message contains "unexisting path" which should be "nonexistent path" or "non-existing path" for grammatical correctness.
Apply this diff:
- return fmt.Errorf("unexisting path to delete: %v", res.FieldPath) + return fmt.Errorf("nonexistent field path: %v", res.FieldPath)pkg/upgrade/upgrade.go (1)
354-354: Fix typo in comment."HardwarrfeProfiles" → "HardwareProfiles"
Apply this diff:
-// as described in RHOAIENG-33158. This creates 2 HardwarrfeProfiles for each AP (notebooks and serving). +// as described in RHOAIENG-33158. This creates 2 HardwareProfiles for each AP (notebooks and serving).
🧹 Nitpick comments (2)
pkg/resources/resources.go (2)
706-714: Consider breaking after successful deletion.After deleting a resource (line 708), the code continues checking remaining
FilterValues. Since the item is already deleted, checking additional filter values is unnecessary.Apply this diff to improve efficiency:
for _, targetValue := range res.FilterValues { if v == targetValue { err = c.Delete(ctx, &item) if err != nil { return fmt.Errorf("failed to delete %s %s/%s: %w", res.Gvk.Kind, res.Namespace, item.GetName(), err) } log.Info("Deleted object", "name", item.GetName(), "gvk", res.Gvk.String(), "namespace", res.Namespace) + break } }
720-740: Consider improving error message grammar.The error message on line 736 reads "error unset ownerreference" which is grammatically awkward.
Apply this diff:
- return fmt.Errorf("error unset ownerreference for CR %s : %w", instanceName, err) + return fmt.Errorf("failed to unset owner references for CR %s: %w", instanceName, err)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
cmd/main.go(1 hunks)pkg/cluster/cluster_config.go(1 hunks)pkg/cluster/resources.go(1 hunks)pkg/resources/resources.go(3 hunks)pkg/upgrade/upgrade.go(7 hunks)pkg/upgrade/upgrade_utils.go(10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/main.go
🧰 Additional context used
🧬 Code graph analysis (5)
pkg/cluster/cluster_config.go (2)
api/common/types.go (2)
Release(159-162)Status(97-105)pkg/cluster/resources.go (2)
GetDSCI(115-135)GetDSC(92-112)
pkg/resources/resources.go (2)
pkg/cluster/gvk/gvk.go (1)
Namespace(28-32)pkg/resources/resources_types.go (1)
UnstructuredList(11-11)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (2)
GetHardwareProfile(164-174)CreateHardwareProfile(186-193)
pkg/upgrade/upgrade_utils.go (1)
pkg/cluster/resources.go (1)
CreateHardwareProfile(186-193)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
CreateHardwareProfile(339-384)pkg/cluster/gvk/gvk.go (1)
HardwareProfile(88-92)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build/push catalog image
- GitHub Check: Run tests and collect coverage on internal and pkg
- GitHub Check: golangci-lint
🔇 Additional comments (12)
pkg/resources/resources.go (4)
14-14: LGTM!The new imports are appropriate for the added functionality:
go-multierrorfor error aggregation,api/metafor API error handling, and controller-runtime logging.Also applies to: 19-19, 28-28
33-43: LGTM!The
ResourceSpecstruct is well-documented with clear field descriptions and examples. The structure is appropriate for identifying and filtering Kubernetes resources.
650-669: LGTM!The
DeleteResourcesfunction correctly aggregates errors from multiple deletion attempts usinggo-multierror, ensuring that all deletions are attempted even if some fail.
687-694: Good error handling for missing CRDs.The function gracefully handles
NoMatchErrorwhen a CRD is not found, logging the condition and continuing. This prevents failures when deleting resources for optional components.pkg/cluster/cluster_config.go (1)
93-131: LGTM! Well-structured function with proper error handling.The implementation correctly handles the fallback logic from DSCI to DSC and returns an empty Release when neither exists, which aligns with the "clean installation or both CRs deleted" scenario mentioned in the comment.
pkg/cluster/resources.go (1)
176-193: LGTM! Clean idempotent creation pattern.The function correctly handles the "already exists" case and provides clear error messages with namespace and name context.
pkg/upgrade/upgrade_utils.go (4)
94-103: LGTM! Proper delegation to cluster package.The error message provides comprehensive context for debugging, including the HardwareProfile name, container size name, profile type, and namespace.
341-359: LGTM! Consistent use of centralized creation logic.The function signature improvement and delegation to
cluster.CreateHardwareProfileprovide better consistency across the codebase.
545-550: LGTM! Constants improve maintainability.Using
featureVisibilityModelServingandfeatureVisibilityWorkbenchconstants instead of hard-coded strings reduces the risk of typos and makes updates easier.
680-699: LGTM! Excellent consolidation of annotation logic.This helper function ensures consistency across all HardwareProfile creation paths and reduces code duplication.
pkg/upgrade/upgrade.go (2)
35-55: LGTM! Excellent centralization of configuration.The introduction of these constants significantly improves maintainability and reduces the risk of typos in string literals throughout the codebase.
508-556: LGTM! Clean refactor with proper use of helpers and constants.The function correctly leverages the centralized
createHardwareProfileAnnotationshelper andcluster.CreateHardwareProfilefor consistency. The renaming from exported to unexported is appropriate since this is an internal upgrade utility.
|
What about add unit tests to moved functions? |
you are right. i should add unit test for functions moved out of upgrade package |
- remove functions and calls not needed for 3.2 (should be removed in 3.0 already) jupyterhub resource cleanup rbac for modelreg envoyfilter patch for serving watson resource docs patch odhdashboardconfig for trusty enablement - move and cleanup function for AP to HWP migration Signed-off-by: Wen Zhou <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (7)
pkg/cluster/cluster_config.go (1)
101-139: GetDeployedRelease fallback logic looks correct; consider adding unit tests.The function cleanly prefers DSCI over DSC, treats NotFound as a fallback condition, and surfaces other errors unchanged, which matches the intended behavior and will be useful outside the upgrade path. Given this is now a central helper, it would be good to back it with unit tests covering: DSCI-only, DSC-only, both-missing, and non-NotFound error cases.
pkg/resources/resources.go (2)
14-43: ResourceSpec deletion utilities look sound; a couple of robustness tweaks to consider.The
ResourceSpecmodel and theDeleteResources/DeleteOneResourcehelpers read clearly and the use ofmultierrorplusmeta.IsNoMatchErrormakes sense for best-effort cleanup. Two minor robustness suggestions:
- In
DeleteOneResource, a missingFieldPathcurrently returns an error on the first object. If these specs are ever reused in contexts where the field might be absent on some instances, you may want to log andcontinueinstead of failing the whole spec.- For potential cluster-scoped resources, you might consider only applying
client.InNamespace(res.Namespace)whenres.Namespaceis non-empty to avoid relying on how the client treats an empty namespace for cluster-scoped GVKs.These are non-blocking; the current behavior is acceptable for strictly-validated upgrade specs.
Also applies to: 650-718
720-740: UnsetOwnerReferences is correct but could be slightly safer.The helper does what it says, but you might want to:
- Guard against a nil
odhObjectto avoid a possible panic.- Check
len(odhObject.GetOwnerReferences()) == 0instead of just!= nilso you don’t send an unnecessary Update when the slice is empty.Behavior today is fine for controlled call sites; this would just harden it a bit.
pkg/cluster/resources.go (1)
176-193: Centralized HardwareProfile creation helper looks good; consider minor polish.Wrapping
cli.Createand swallowingIsAlreadyExistsgives you a nice idempotent entry point and keeps callers simple. Two small, optional refinements:
- You could use
return client.IgnoreAlreadyExists(cli.Create(ctx, hwp))to express the intent a bit more idiomatically.- If you expect this to be reused outside one-off migrations, consider whether you want the same retry behavior as
CreateWithRetry(e.g., for webhook startup races).Current implementation is perfectly serviceable for upgrade/migration flows.
pkg/upgrade/upgrade_utils.go (1)
381-382: Centralizing HardwareProfile annotations is nice; consider making the timestamp UTC/injectable.
createHardwareProfileAnnotationsandgetFeatureVisibilitynicely standardize visibility, display name, description, disabled flag, and modified date across all HardwareProfile creation paths, and their reuse in both HWP generators andcreateCustomServingHardwareProfilereduces duplication and typo risk. One minor improvement: usingtime.Now().UTC().Format(time.RFC3339)(or injecting a clock for tests) would makehardwareProfileModifiedDateAnnotationtimezone-independent and easier to assert in unit tests.Also applies to: 463-465, 492-493, 525-526, 545-550, 680-699
pkg/upgrade/upgrade.go (2)
35-55: New constants for defaults, annotations, and customServing improve clarity; double‑check defaultMin vs defaultResourceLimits.*Centralizing
odhDashboardConfigName,customServing, the varioushardwareProfile*annotation keys, feature-visibility JSON strings, andcontainerSizeHWPPrefixsignificantly reduces duplication and makes the later usages (e.g., inAttachHardwareProfileToInferenceServices’ defaulthwpNameand log message) easier to follow. One thing worth re‑validating:defaultMinMemoryis"1Mi"whiledefaultResourceLimits["minMemory"]is"8Gi"—if these are meant to represent the same conceptual default, you may want to align them; if they’re intentionally different for “no containerSizes” vs “computed from sizes” paths, a brief comment could help future readers.Also applies to: 611-633
508-556: createCustomServingHardwareProfile logic is sound; only a small error-message nit.The function correctly:
- Checks for the existing HardwareProfile via
cluster.GetHardwareProfile, treating only non-NotFound errors as fatal.- Uses
createHardwareProfileAnnotations+hardwareProfileManagedAnnotation = "false"to initialize metadata.- Delegates creation to
cluster.CreateHardwareProfilefor idempotency and logs success.If you touch this again, you might consider tweaking the error format string to something like:
return fmt.Errorf("failed to check HardwareProfile CR %q: %w", customServing, customServingError)to read a bit more naturally.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
cmd/main.go(1 hunks)pkg/cluster/cluster_config.go(1 hunks)pkg/cluster/resources.go(1 hunks)pkg/resources/resources.go(3 hunks)pkg/upgrade/upgrade.go(7 hunks)pkg/upgrade/upgrade_utils.go(10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/main.go
🧰 Additional context used
🧬 Code graph analysis (5)
pkg/cluster/resources.go (2)
tests/e2e/helper_test.go (1)
CreateHardwareProfile(345-390)pkg/cluster/gvk/gvk.go (1)
HardwareProfile(88-92)
pkg/resources/resources.go (2)
pkg/cluster/gvk/gvk.go (1)
Namespace(28-32)pkg/resources/resources_types.go (1)
UnstructuredList(11-11)
pkg/upgrade/upgrade_utils.go (1)
pkg/cluster/resources.go (1)
CreateHardwareProfile(186-193)
pkg/cluster/cluster_config.go (2)
api/common/types.go (1)
Release(159-162)pkg/cluster/resources.go (2)
GetDSCI(115-135)GetDSC(92-112)
pkg/upgrade/upgrade.go (1)
pkg/cluster/resources.go (2)
GetHardwareProfile(164-174)CreateHardwareProfile(186-193)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build/push catalog image
- GitHub Check: golangci-lint
- GitHub Check: Run tests and collect coverage on internal and pkg
🔇 Additional comments (3)
pkg/upgrade/upgrade_utils.go (2)
60-92: Using odhDashboardConfigName constant in getOdhDashboardConfig improves consistency.Switching the cluster lookup to use the shared
odhDashboardConfigNameconstant keeps name and manifest path in sync and avoids hard‑coding. No issues from a behavior standpoint.
94-103: Routing hardware-profile creation through cluster.CreateHardwareProfile is a good consolidation.Both
createHardwareProfileFromContainerSizeandcreateHardwareProfileFromAcceleratorProfilenow defer creation tocluster.CreateHardwareProfile, soAlreadyExistsis handled uniformly and call sites only need to wrap non-idempotent failures. This keeps upgrade logic thinner and makes future behavior changes (e.g., adding retries) localized to one place.Also applies to: 341-359
pkg/upgrade/upgrade.go (1)
317-349: MigrateToInfraHardwareProfiles orchestration still looks correct with the new custom-serving helper.The flow—fetch OdhDashboardConfig once, then migrate accelerator profiles, container sizes, notebooks, and finally inference services—remains intact, and inserting
createCustomServingHardwareProfilejust before attaching annotations to InferenceServices ensures the “custom-serving” HWP is created (or verified) before you annotate. Continuing even when one step fails, while aggregating errors viamultierror, matches the existing best-effort pattern.
|
@zdtsw Will any of the functions that are removed from the upgrade package impact upgrades from 2.25 to 3.2 or 3.3? AFAIK we aren't enforcing that users upgrade from 2.25 to 3.0 before 3.2/3.3 so I want to make sure we aren't breaking the upgrade path from 2.25 to 3.3. |
i do not think so , the logic for v2 to v3 are staying in #2698 to be removed after 3.3 |
Okay in that case I think this is ready to merge. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: carlkyrillos The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
315e014
into
opendatahub-io:main
|
/cherrypick rhoai |
|
@zdtsw: #2695 failed to apply on top of branch "rhoai": In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
- remove functions and calls not needed for 3.2 (should be removed in 3.0 already) jupyterhub resource cleanup rbac for modelreg envoyfilter patch for serving watson resource docs patch odhdashboardconfig for trusty enablement - move and cleanup function for AP to HWP migration Signed-off-by: Wen Zhou <[email protected]> (cherry picked from commit 315e014)
|
follow up #2881 |
- remove functions and calls not needed for 3.2 (should be removed in 3.0 already) jupyterhub resource cleanup rbac for modelreg envoyfilter patch for serving watson resource docs patch odhdashboardconfig for trusty enablement - move and cleanup function for AP to HWP migration (cherry picked from commit 315e014) Signed-off-by: Wen Zhou <[email protected]>
* Remove servicemesh not installed check. (#2443) After adding the gatewayconfig CRD and controller, ODH and RHOAI will always have servicemesh installed automatically by the openshift ingress controller. It will not be possible to get into a state without OSSM 3.x installed. * remove redundant e2e test case (#2454) * Change naming convention in mod arch image for dashboard (#2449) * fix: fix catalog image related_image env variable name as per devops config (#2452) Signed-off-by: Dhiraj Bokde <[email protected]> * update: sample of DSCI and DSC + README (#2447) - since we have introduce GW API creation, we should update our sample - ossm v3 is installed out of the box, we should not ossm v2 still "Managed" as we advocate - set to Unmanaged than Removed, to avoid accidentally deletion - this whole setion should be taken away soon in 3.0 release Signed-off-by: Wen Zhou <[email protected]> * fix: connection API on ISVC if type is changed + only create serviceaccount if type is s3 (#2433) * fix: connection API on ISVC if type is changed - we previously only handle when type is removed then we cleanup injection - now we handle update as well - refactor some old code - handle serviceaccount creation and injection for s3 only Signed-off-by: Wen Zhou <[email protected]> * fix: lint Signed-off-by: Wen Zhou <[email protected]> * update: code review - correct log and skip Patch if no cleanup is needed Signed-off-by: Wen Zhou <[email protected]> * update: fix case when there was no annotation - if isvc exists but no annoatation: user might manually set certain part, e.g .spec.predictor.model.storageUri on update with annotation, old storageUri etc should be removed before new injection to avoid confliect Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * feat(RHOAIENG-30795): Add configuration support to TrustyAI DSC component (#2225) * feat(RHOAIENG-30795): Add configuration support to TrustyAI DSC component * fix: Add missing docs * fix: Fix linting errors * fix: Remove redundant labels and pointer to boolean fields Since they have default values, no need to use pointers. * fix: Add TrustyAI CM to operator's resources. Remove unneeded annotations. * fix: Change tests to account for AddResources * fix: Linting of trustyai_test.go * Merge test for default CM and enabled component * Fix linting (remove dupl) * chore: make bundle * fix: revert REPLACE_IMAGE * chore: skip HWP migration if dashboard HWP CRD missing (#2470) * feat: add support to pass Oauth proxy image for downstream (#2466) * add override for new workbenches (#2465) * prometheus test should check monitoring namespace not apps namespace (#2453) * feat: add new VAP to block create/update operation on dashboard's HWProfile CR and AcceleratorProfile CR (#2378) * feat: add new VAP to block create/update operation on dashboard's HWProfile CR and AcceleratorProfile CR - since this will only work on OCP 4.19+, no need check OCP versoin, as VAP/VAPB is enabled in the cluster - in the case if the cluster already have dashboard's AP/HWP CRD DSCI should deploy and own 4 new resources: 2 VAP and 2 VAPB to block if any attmpt to create dashboard AcceleratorProfile/HardwareProfile CR or update existing AcceleratorProfile/HardwareProfile CR these 4 resoruces will be re-created if user try to delete them when DSCI is deleted, these 4 resources should be removed from cluster Signed-off-by: Wen Zhou <[email protected]> * fix: testscases Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * update: remove logic set default value in odhdashboardconfig in upgrade (#2469) Signed-off-by: Wen Zhou <[email protected]> * feat: optimize E2E test framework performance and error handling (#2418) * feat: add comprehensive E2E testing framework with improved resource management - Add DeleteResources function for efficient bulk resource deletion with options - Implement EnsureResourceDeletedThenRecreated with UID-based verification - Add WithRemoveFinalizersOnDelete option for automatic finalizer handling - Support WithWaitForRecreation flag for controller-managed resources - Replace ExpectedErr with AcceptableErrMatcher using proper Kubernetes error codes - Fix tryRemoveFinalizers to handle success, NotFound, and NoKindMatchError gracefully - Add controllerCacheRefreshDelay to prevent cache staleness issues - Consolidate cleanup functions and remove redundant code patterns - Add ResourceQuota GVK definition and improve bulk deletion documentation This comprehensive framework eliminates race conditions in controller-managed resource testing and provides consistent, reliable patterns for deletion-recreation scenarios with enhanced safety and performance optimizations. * feat: enhance core component E2E tests and ServiceMesh organization - Add comprehensive component testing with enhanced determinism - Improve test organization and maintainability across core test suites - Move resilience tests after functional tests for cleaner failure attribution - Fix KnativeServing test synchronization with deletion/recreation flags - Add proper scoping to cleanupListResources with safety checks - Optimize ServiceMesh test organization with consolidated validation patterns - Extract common validation helpers following "with" pattern conventions - Consolidate repetitive operator configuration patterns using shared constants - Add reusable helpers for monitoring resource cleanup and DSCI validation - Move shared operator namespace constants to centralized helper file - Rename validation functions for consistency (buildCapabilityConditions → withCapabilityConditions) - Reorganize helper functions placement for better code structure - Optimize E2E performance with better resource management - Apply framework patterns across component test suites - Refactor UninstallOperator to use ResourceOpts pattern for better consistency - Replace hard sleeps with WithWaitForDeletion in ServiceMesh operator uninstallation - Remove custom uninstallOperatorWithChannel function in favor of standardized approach - Eliminate duplicate validation patterns with DRY principle application - Refactor Gateway tests and centralize GVKs This enhances the core testing infrastructure with improved coverage, resilience patterns, ServiceMesh test organization, and maintainable validation helpers that reduce code duplication while following established framework conventions. # Conflicts: # tests/e2e/servicemesh_test.go # Conflicts: # tests/e2e/creation_test.go # tests/e2e/dashboard_test.go # tests/e2e/monitoring_test.go * Removing the ServiceMesh controller and some ServiceMesh/Authorino e2e tests, as they will no longer be present in 3.0. * Commenting out gateway test cases for now, as the feature is still in the early stages of implementation. * feat: add support for llmisvc to use kueue (#2293) * feat: add support for llmisvc for kueue - rename old kueue one for isvc so we can have both isvc and llmisvc Signed-off-by: Wen Zhou <[email protected]> * fix: code review Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * feat: add GH Action to enforce e2e test suite update for code-related PRs (#2426) fix: split workflow into two to allow the bot to comment on fork PRs reuse existing error comments and add automatic error comment cleanup, to prevent PR noise add option to provide skip justification in PR descprition re-use same error comment for both error cases remove functionality to add justification via PR comment, to simplify the workflow and improve user experience * addressed comments in PR 2404 (#2444) * Use custom manifests without devFlags (#2427) * Use custom manifests without devFlags Jira: https://issues.redhat.com/browse/RHOAIENG-31642 This gives instructions for an alternative approach to using custom manifests with the operator, which don't require the use of "devFlags" in the DSC. The intention is that this will allow us to remove the devFlags functionality for 3.0. This is a primitive approach for now. I expect that folks will start to build better development patterns rather than needing to modify the patch file for their component, etc. They can even have their own approaches in their own repositories, and this may just be an example. * Don't re-chown files in Dockerfile They're already chowned as expected from the COPY instruction on L60. * feat: add comprehensive debug utilities for E2E test failures and timeouts (#2486) * feat: add comprehensive debug utilities for E2E test failures and timeouts - Add debug_utils_test.go with global panic and failure handlers - Integrate HandleGlobalPanic() and HandleTestFailure() at multiple test levels - SetGlobalDebugClient() automatically configured in NewTestContext() - Provides detailed diagnostics on ANY test failure (panics, timeouts, assertions): * Cluster state and node health with resource allocation percentages * Namespace resources (deployments, pods, container states) * Operator deployment status and pod logs * DSCI/DSC status and unhealthy conditions * Recent events from key namespaces (last 5 minutes) * Resource quota violations Addresses debugging challenges for deployment recreation failures, resource exhaustion, and other cluster-level issues in E2E tests. Activates automatically without requiring manual intervention. * Added security redaction for secrets, tokens, and credentials in logs * fix: limit e2e update requirement check GH action only to the main branch (#2497) chore: improve documentation and error messages * update: temp. disable e2e test on dashboard (#2499) - this should be reverted once dashboard is fixed Signed-off-by: Wen Zhou <[email protected]> * removing the py311 images (#2484) Co-authored-by: Wen Zhou <[email protected]> * RHOAIENG-29717 | feat: Deploy thanos querier frontend for metrics acces (#2448) Co-authored-by: den-rgb <[email protected]> * chore: e2e update requirement GH action: edit regex for justification section title, ensure guidelines are consistent (#2509) * fix: prevent secret data leakage in operator error logs (#2471) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * test: increase resource check timeout, fix debug (#2506) * Remove codeflare operator, monitoring and tests (#2468) * feat(RHOAIENG-34034): remove codeflare operator, monitoring and tests * test(RHOAIENG-34034): add e2e test to verify codeflare component is not removed * fix(RHOAIENG-34034): improve test and add test flag in README * test(RHOAIENG-34034): use consistently check * test: improve test readibility and reusability * fix: watch on Dashboard's AProfile/HWProfile by DSCI (#2512) - add watch on these two CRD on create/update, so VAP/VAPB need to be created - this is needed for the testcase on dashboard once it is enabled and still ship these two CRD - we probably can remove this logic alongwith the dashbaord tests on VAP/VAPB once dashboard cleanup CRD from their manifests Signed-off-by: Wen Zhou <[email protected]> * chore: log on webhook for notebook with HWProfile (#2491) - different for not able to find HWProfile VS client Get() failed - update e2e tests to add new error code Signed-off-by: Wen Zhou <[email protected]> * feat: auto detect auth, kube auth proxy, envoy filter (#2490) * fix gateway controller namespace handling, add status sync, and improve test patterns * feat: detect auth, kubeauthproxy, envoyfilter * added better support for oidc, fixed kubeproxy/envoyfilter issue * used apply instead of addresources for secret * fixed cookie mismatch for oauth/oidc * Initial effort at migrating hardwareprofiles from v1alpha1 to v1 (#2398) Rebased, included v1alpha1 hardwareprofile changes added deprecated markers to v1alpha1 hardwareprofiles Rebased on latest, incremental Makefile version bump, regenerated bundle v1alpha1 HardwareProfiles must be registered to scheme minor comment update to reflect infrav1 hardwareprofiles adding hardwareprofile test cases # Conflicts: # bundle/manifests/opendatahub-operator.clusterserviceversion.yaml Signed-off-by: Max Whittingham <[email protected]> * chore: remove unused pkg/feature and related integration tests (#2521) * chore: remove unused pkg/feature and related integration tests * chore: remove unused unit-tests1.yaml workflow * collector validation should check metrics resources or storage (#2515) * refactor: remove duplicated reconciler crd existence check (#2483) * chore: uplift version + fix bundle + fmt + fix monitoring tests (#2526) Signed-off-by: Ugo Giordano <[email protected]> * update: add cleanup when --fail-fast is called we still get leftover cleanup (#2478) Signed-off-by: Wen Zhou <[email protected]> * implemented destination rule for TLS to upstream services (#2536) * feat: add default HWProfile CR "default-profile" (#2461) * feat: add default HWProfile CR "default-profile" - remove the sample one we added before, with this PR it will create one in the cluster - user will be able to change this default-profile CR, operator wont reconcile - if user change DSCI or DSCI gets reconciled, it should not reset default-profile CR with user modified value. - if user delete this default-profile CR, operator will create a new one with default values Signed-off-by: Wen Zhou <[email protected]> * update: code review Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * feat: switch isvc connection annotation from type to protocol (#2532) * feat: switch isvc connection annotation from type to protocol * fix: addressed PR comments * Update unit-tests2.yaml to codecov-actions v5.5.1 (#1) (#2542) Update unit-tests2.yaml to codecov-actions v5.5.1 * fix: version inconsistency (#2544) Signed-off-by: Wen Zhou <[email protected]> * add support for overriding kserve-llm-d (#2537) * feat: add oauth proxy parametrization to dashboard (#2547) * feat: add support for llmisvc to use HWProfile (#2424) * feat: add support for llmisvc to use HWProfile - if HWprofile has .spec.identifiers, add this into .spec.tempalte.containers(llmisvc) - if HWprofile has .spec.scheduling - if it is type: kueue, set label "kueue.x-k8s.io/queue-name" with value .spec.scheduling.kueue.localqueue - if it is type: node and has .spec.scheduling.node .spec.scheduling.node.nodeSelector add into .spec.template.nodeSelector(llmisvc) .spec.scheduling.node.tolerations add into .spec.template.tolerations(llmisvc) Signed-off-by: Wen Zhou <[email protected]> * fix: add support for llmisvc with min config - if user only set spec: {} when create llmisvc, we should: 1. create a container named as "main" 2. inject identifier from HWProfile Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * Support override for runtime images (#2464) * support override for runtime images * add override for vllm-spyre image * fix: prevent cache false hits for resources stuck in deletion (#2527) * fix: bypass cache for objects with deletionTimestamp to prevent stuck resources - Add cache bypass logic for objects with deletionTimestamp in deploy action - Implement proactive cache cleanup in ShouldSkip method - Add Delete method to Cache for cleaning up stale entries - Extract cache logic into isCachedAndShouldSkip helper method - Add comprehensive unit tests for cache cleanup verification - Apply line-of-sight principle for better code readability Fixes cache bug where resources.Hash() removes deletionTimestamp causing identical hashes for objects with/without deletion timestamps. This led to false cache hits preventing recreation of any Kubernetes resource type (deployments, services, configmaps, secrets, CRDs, etc.) during E2E tests. Root cause: pkg/controller/predicates/resources/resources.go:295 Impact: E2E test failures when resources stuck in deletion while still in cache Solution: Objects with deletionTimestamp always bypass cache and trigger cleanup Scope: Affects all Kubernetes resources deployed via deploy action * refactor: simplify cache test suite and improve terminating object handling - Skip deployment for terminating objects instead of attempting and failing - Refactor ShouldSkip to handle terminating objects before cache logic - Rename ProcessCacheEntry to CleanupIfTerminating for clarity - Make Cache.Delete method idempotent - Convert TestDeployWithCacheAction to clean table-driven approach - Remove unused test helper functions (saved ~150+ lines) - Simplify test structure with direct function calls in table - Apply micro-optimization to avoid DeepCopy on skip path All tests pass with improved maintainability and performance. * Update action_deploy_cache.go Removed redundant check when deleting a key from the cache * fix(RHOAIENG-34533): resolve TrustyAI DSC validation error when patching DSC eval flags (#2525) * fix(RHOAIENG-34533): resolve TrustyAI DSC validation error when patching DSC eval flags This change resolves RHOAIENG-34533 by converting TrustyAI evaluation configuration fields from boolean to enum string values. Changes: - Convert PermitCodeExecution and PermitOnline from bool to string with enum validation (allow/deny) - Add proper default handling and constants for evaluation permissions - Update CRDs, bundle manifests, and test expectations - Maintain backward compatibility in ConfigMap generation * add updated API * Simplify CM value assignment, revert image name to v3.0.0 * test: revert "temp. disable e2e test on dashboard" from PR2507 and fix VAP/VAPB e2e tests (#2561) * Revert "update: temp. disable e2e test on dashboard (#2499)" This reverts commit 9e847d546cb75c2bb14bd32ef54df0f1fc1c5894. * fix: VAP/VAPB testcase - move from DSCI to Dashbaord Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * feat: add oauth proxy parametrization to dashboard (#2554) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * RHOAIENG-34055: add ray sanity check for codeflare removal (#2514) * feat(RHOAIENG-34055): add ray sanity check for codeflare removal * feat(RHOAIENG-34055): create a common sanitycheck action * feat: add GHA to auto build and push e2e tests image after each merge to main (#2543) * checkin (#2555) * remove useless kueue-batch-user-rolebinding (#2571) Removed the config/kueue-configs/ directory and all associated references as the kueue-batch-user-rolebinding functionality is no longer needed. * OCP console link to odh should match gateway address (#2569) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> Co-authored-by: James Tanner <[email protected]> * fix: add llmisvc kueue test case (#2573) * feat: add support for Connection API in LLMInferenceservice - add create/update/removal case - create $llmisvc-sa serviceaccount and inject into .spec.template.serviceAccountName - serect should support both cases: data.URI and data.http-hosts according to kserve for both isvc and llmisvc - simplify logging by remove duplicated error and caller print error - only remove connectoin added imagepullsecret from list if remove annotation - add support for only change conneciton-path value and .spec.mode.uri should updated too - when update, if manually set path in spec, use it; if annotation is set, use annotation; if neither set, and previously already has path (maybe was created by gitops or dashboard) keep it Signed-off-by: Wen Zhou <[email protected]> * update: code review - fix typo - update handleSA() to deal with servicename more precisely - cleanup func and const not in use any more Signed-off-by: Wen Zhou <[email protected]> * update: code review - remove duplicated function for serviceaccount - add check on both annotations Signed-off-by: Wen Zhou <[email protected]> * update: code review - serviceAccoutName should only be removed if it is not having the same value derived from serect's name - serviceAccountName should only be injected if it did not have a value already Signed-off-by: Wen Zhou <[email protected]> * RHOAIENG-31870: Add DSC and DSCI v2 versions. (#2505) RHOAIENG-35095: Webhook dsc, dsci and kueue integration_test.go unit-tests disabled Added hack/buildLocal.sh : a script to build locally and deploy to crc * cleanup unused referencegrant gvk code (#2576) * Add clusterrole for allowedgroups (#2494) * add clusterrole for allowedgroups fix unit test remove unused role for allowed group * fix comment and error message * Bug fix for gateway ODIC mode (#2591) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> Co-authored-by: James Tanner <[email protected]> * Removed monitoring namespace from flags, and documentation to set via env vars. Viper now sets the defaults vars for monitoring namespace instead. (#2538) Signed-off-by: Max Whittingham <[email protected]> * Update golangci-lint to v2.5.0 (#2582) * chore: fix api version (#2594) Signed-off-by: Wen Zhou <[email protected]> * fix: retrieve all files in a PR (#2595) * feat(modelregistry): add benchmark data image environment variable mapping (#2498) Add IMAGES_BENCHMARK_DATA environment variable mapping to support benchmark data injection in model registry deployments. This enables the model-registry-operator to configure benchmark data images via init containers for benchmark visualization and analysis features. Related to: https://github.com/opendatahub-io/model-registry-operator/pull/324 Follows the same pattern established in previous PRs: - https://github.com/opendatahub-io/opendatahub-operator/pull/2429 - https://github.com/opendatahub-io/opendatahub-operator/pull/2452 🤖 Generated with [Claude Code](https://claude.ai/code) Signed-off-by: Chris Hambridge <[email protected]> Co-authored-by: Claude <[email protected]> Co-authored-by: Wen Zhou <[email protected]> * RHOAIENG-33891: DSC v1 should not be created or updated with Kueue component state set to Managed (#2602) * Add kube-linter to checks Kubernetes manifests against various best practices, with a focus on production readiness and security (#2605) See: - https://docs.kubelinter.io/ - https://github.com/stackrox/kube-linter * Add webhook to prevent creation of dashbaord's HardwareProfile and AcceleratorProfile (#2599) * chore: use generic Getter[T] instead of StringGetter (#2608) Update StringGetter to use Go generics, allowing the type to work with any return type instead of being limited to strings. * chore: Update GitHub workflows and issue labels and fix bug (#2607) - Add v3 label to issue_label_bot.yaml - Rename workflow yaml: test for e2e, release for ODH release, - Remove 'incubation' branch from linter.yaml triggers - Update email - Update API docs config to exclude all List types Signed-off-by: Wen Zhou <[email protected]> * RHOAIENG-34045: cleanup codeflare manifests, remove codeflare from DSC v2 (#2596) * feat(RHOAIENG-34045): cleanup codeflare manifests, remove codeflare from DSC v2 * feat(RHOAIENG-34045): remove CodeFlare from api docs and remove finalizers update permissions * update: add new members (#2612) Signed-off-by: Wen Zhou <[email protected]> * feat(e2e): improve test framework resilience and fix monitoring controller cleanup (#2550) * feat(e2e): add explicit resource lifecycle methods and improve test framework - Add EventuallyResourceCreated, EventuallyResourceUpdated, EventuallyResourcePatched - Add Patch function in testf for explicit patch-only behavior - Replace CreateOrUpdate with CreateOrPatch for better concurrency - Fix resourceVersion conflicts and test timing issues - Add comprehensive testf test coverage and helper functions - Modernize kueue_test.go with jq.Match validation - Reduce test code duplication and improve error messages * refactor(monitoring): remove redundant management state checks and unify actions - Remove isMonitoringManaged() function and all management state checks from actions - Remove MonitoringNotManagedMessage constant (no longer needed) - Combine MonitoringStack and ThanosQuerier into single deployMonitoringStackWithQuerier action - Combine Tempo and Instrumentation into single deployTracingStack action - Extract validateRequiredCRDs helper to reduce CRD validation duplication - Add CRDRequirement struct for consistent condition handling - Rename setCRDNotFoundCondition to setConditionFalse for clarity The monitoring actions now focus purely on configuration validation and CRD availability checks. The optimization also reduces the monitoring controller from 5 deployment actions to 3 and eliminates condition duplication. Breaking changes: - Remove deployThanosQuerier, deployTempo, deployInstrumentation individual actions - Add deployMonitoringStackWithQuerier and deployTracingStack unified actions * test: temporarily disable RBAC and ServiceAccount deletion recovery tests These tests are experiencing timing issues with external dependencies and token refresh cycles. Temporarily disabling to unblock CI while we investigate proper solutions for testing these resource types. TODO: Re-enable after investigating timing issues. * chore: remove placeholder CRD and skip kubebuilder for dashboard actions (#2609) Remove empty CRD template file and add kubebuilder:skipmake directive to dashboard_controller_actions.go to prevent unintended CRD generation. * Use commit sha in manifest file (#2540) * feat(RHOAIENG-34447): add support to commit sha component target and add workflow to update it * feat: upgrade commit sha * chore(kube-lint): add CEL-based linter to prevent system group bindings in ClusterRoleBinding (#2617) Add custom kube-linter check to detect ClusterRoleBindings that target system groups (e.g., system:authenticated, system:unauthenticated). * feat: update IMAGES_BENCHMARK_DATA to new image (#2622) Signed-off-by: Alessio Pragliola <[email protected]> * add e2e tests for reconciliation resilience (#2364) * Remove ModelMesh component and infrastructure from OpenDataHub Operator (#2565) * feat(RHOAIENG-34026): remove modelmeshserving manifest download and component registration - Remove modelmeshserving from get_all_manifests.sh - Remove modelmeshserving import from cmd/main.go # Conflicts: # get_all_manifests.sh * feat(RHOAIENG-34026): comprehensive ModelMesh removal for RHOAI 3.0 This commit removes ModelMesh functionality while preserving v1 API compatibility: Core Functionality Removal: - Remove ModelMeshServing component controller and implementation - Remove ModelMeshServing monitoring configurations - Update ModelController to ignore ModelMeshServing in DSC v1 (always treats as Removed) - Update ModelController tests to reflect new RHOAI 3.0 behavior Cleanup and Infrastructure: - Remove ModelMeshServing RBAC editor/viewer roles - Remove ModelMeshServing E2E tests - Remove ModelMesh Prometheus unit tests - Remove ModelMeshServing from sample DSC configuration - Remove ModelMesh from component integration documentation - Remove KServe ModelMesh CRD conflict handling logic - Remove unused imports and monitoring embeds API Strategy: - Keep v1 DSC API intact with ModelMeshServing field for upgrade compatibility - ModelController ignores ModelMeshServing settings (always Removed in RHOAI 3.0) - Remove actual component functionality while maintaining API backward compatibility Testing Updates: - Update ModelController tests to reflect new behavior - Tests now validate that ModelMeshServing no longer enables ModelController - All tests pass with new RHOAI 3.0 logic Build verification: - All code compiles successfully - Manifest generation works correctly - ModelController unit tests pass with new behavior # Conflicts: # internal/controller/components/modelcontroller/modelcontroller.go # internal/controller/components/modelmeshserving/modelmeshserving.go # internal/controller/components/modelmeshserving/modelmeshserving_test.go * Remove ModelMeshServing component and API references This commit removes ModelMeshServing component from the opendatahub-operator codebase as part of RHOAI 3.0 migration to KServe-only model serving. Changes include: ## API and Controller Changes: - Keep ModelMeshServing API types for v1 DSC backward compatibility - Set ModelMeshServing ManagementState to 'Removed' in ModelController - Remove ModelMeshServing logic from modelcontroller_actions.go - Add deprecation markers to ModelMeshServing types ## E2E Test Updates: - Remove ModelMeshServing test cases from controller_test.go - Update modelcontroller_test.go to focus on KServe functionality - Remove ModelMeshServing from resilience_test.go and helper_test.go - Add partial upgrade test for ModelMeshServing resource preservation ## Monitoring and Configuration Cleanup: - Remove ModelMeshServing Prometheus scrape jobs and SLO rules - Remove ModelMeshServing alerting rules from prometheus-configs.yaml - Remove ModelMeshServing from monitoring controller mappings ## Sample and Documentation Updates: - Remove modelmeshserving from DSC sample configurations - Fix incorrect app.kubernetes.io/managed-by labels in samples (RHOAIENG-32728) - Remove ModelMeshServing references from README.md and DESIGN.md - Remove ModelMeshServing component from upgrade logic ## Bundle and Manifest Updates: - Update CSV to remove ModelMeshServing CRD references - Add ModelMeshServing CRD to bundle for compatibility Addresses: RHOAIENG-34026, RHOAIENG-32728 Partial: RHOAIENG-34036 (upgrade testing) # Conflicts: # docs/api-overview.md * feat: Remove ModelMeshServing component from RHOAI 3.0 - Remove ModelMeshServing CRD generation from kustomization.yaml - Delete ModelMeshServing CRD file (components.platform.opendatahub.io_modelmeshservings.yaml) - Clean up bundle manifests: remove ModelMeshServing references from CSV and CRDs - Update API documentation config to exclude ModelMeshServing types - Remove ModelMeshServing from v2 DataScienceCluster sample - Clean up catalog.yaml: remove ModelMeshServing GVK, CRD description, and keywords - Preserve v1 DataScienceCluster compatibility for smooth upgrades - Maintain RBAC permissions and e2e tests for upgrade safety - Update TrustyAI controller to remove obsolete model-mesh label check - Simplify ModelController types and remove ModelMeshServing dependencies - Update conversion logic between v1 and v2 DataScienceCluster APIs - Clean up documentation and design files * Update the DSPO commit to leverage kube-rbac-proxy (#2636) See: https://github.com/opendatahub-io/data-science-pipelines-operator/pull/920 https://issues.redhat.com/browse/RHOAIENG-34577 Signed-off-by: mprahl <[email protected]> * chore: update manifest commit SHAs (#2637) Co-authored-by: zdtsw <[email protected]> * fix: resolve RBAC errors and improve resilience test reliability (#2643) - Fix pod deletion RBAC issue by using individual DeleteResource instead of bulk deletion - Restore ModelRegistry component to quota tests after removing stuck finalizer - Optimize jq expressions to handle deployment name patterns and readyReplicas checks - Add pod restart after RBAC restoration for faster test recovery Resolves "server does not allow this method" error and makes tests more reliable. * RHOAIENG-29717 | fix: Adding missing test case (#2510) Co-authored-by: den-rgb <[email protected]> * Add a RELATED_IMAGE hook for kube-auth-proxy (#2632) Co-authored-by: jctanner <[email protected]> * removing wrong labels from samples in DSCI/DSC v2 (#2650) * RHOAIENG-33158 | feat: enhance hardware profile management by adding custom-serving profile (#2578) * RHOAIENG-33158 | feat: enhance hardware profile management by adding custom-serving profile - Introduced a new HardwareProfile named 'custom-serving' in the YAML configuration. - Updated the DSCInitializationReconciler to manage both default and custom hardware profiles. - Modified the logic to check for the existence of the custom hardware profile during reconciliation. * chore: Fixed calling renamed function by refering new name * chore: reworked default and custom hardwareprofile CRs deployment - split the yamls to seperate files to reduce confusion and easier identification - reworked the logic to check and deploy after checking both custom and default hwp errors. --------- Co-authored-by: Wen Zhou <[email protected]> * feat: update workflow to point to correct branch (#2652) * chore: update manifest commit SHAs (#2646) Co-authored-by: openshift-merge-bot <[email protected]> * RHOAIENG-30940: Remove devFlags support (#2588) * feat(RHOAIENG-30940): remove devFlags support * fix(RHOAIENG-34045): small typo * fix: lint * test: add e2e test on HWProfile v1alpha1 and v1 (#2635) * test: add e2e test on HWProfile v1alpha1 and v1 - create CR on v1, test it can be read by v1 and v1alpha1 - create CR on v1alpha, test it can be read by v1 and v1alpha1 * update: prolong timeout * update: move HWProfile test to a dedicated suite Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * RHOAIENG-33158 | feat: Implement migration of HardwareProfiles from AcceleratorProfiles and container sizes (#2529) * RHOAIENG-33158 | feat: Implement migration of HardwareProfiles from AcceleratorProfiles and container sizes - Added functions to migrate AcceleratorProfiles to HardwareProfiles, creating separate profiles for notebooks and serving. - Implemented migration for container sizes, generating HardwareProfiles based on specified resource limits. - Created a special HardwareProfile for InferenceServices without associated AcceleratorProfiles or container sizes. * RHOAIENG-33158 | chore: addressed PR comments - Removed custom-serving HWP out of upgrade.go - now trying to load odhDashboardConfig from manifests if its not available in cluster - other refactors and logs * chore: Renamed funcs to better describe the functionality - Adjusted comments to explain the code better. * RHOAIENG-33158 | chore: addressed PR comments - making sure hwp names are all lowercase and without spaces - changed string literal references to string constants - returning early if application namespace is empty - migration will proceed only if old major version is 2 and current version is 3 * fix: correct GatewayConfig validation error handling for OIDC configuration (#2654) * chore: update manifest commit SHAs (#2655) Co-authored-by: openshift-merge-bot <[email protected]> * chore: remove unused const (#2661) - rename GatewayKind to GatewayConfigKind Signed-off-by: Wen Zhou <[email protected]> * chore: fix previous HWProfile test (#2659) - revert timeout - update docs Signed-off-by: Wen Zhou <[email protected]> * CLI to retry flaky test in job (#2611) * feat(RHOAIENG-34877): add test cli to retry e2e tests during job execution to avoid retest * fix: improve retry skip filter and add github actions to run tests on CLI * (feat): Rename DatasciencePipelines to AIPipelines (#2589) * (feat): Rename DatasciencePipelines to AIPipelines Signed-off-by: Ajay Jaganathan <[email protected]> # Conflicts: # api/datasciencecluster/v1/datasciencecluster_conversion.go # api/datasciencecluster/v2/datasciencecluster_types.go # api/datasciencecluster/v2/zz_generated.deepcopy.go # bundle/manifests/datasciencecluster.opendatahub.io_datascienceclusters.yaml # bundle/manifests/opendatahub-operator.clusterserviceversion.yaml # config/crd/bases/datasciencecluster.opendatahub.io_datascienceclusters.yaml # docs/api-overview.md # pkg/upgrade/upgrade.go # tests/e2e/datasciencepipelines_test.go * add logic for converting installed components Signed-off-by: Ajay Jaganathan <[email protected]> --------- Signed-off-by: Ajay Jaganathan <[email protected]> * Fix session issue with kube-auth-proxy (#2625) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * update: remove DSCI and DSC v1 sample from CSV (#2667) Signed-off-by: Wen Zhou <[email protected]> * RHOAIENG-33892: Kueue controller should report an error in case a DSC has Managed state set in Kueue component. (#2645) RHOAIENG-33894: Remove all the remaining logic/tests to handle Managed Kueue component state. * Change api group for hardware profiles in admingroup role (#2631) * Change api group for hardware profiles in admingroup role * Apply suggestion from @zdtsw * Update internal/controller/services/auth/resources/admingroup-role.tmpl.yaml --------- Co-authored-by: Wen Zhou <[email protected]> * Enable CLI access via OpenShift OAuth bearer tokens (#2666) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> Co-authored-by: James Tanner <[email protected]> * Fix multiple run of upgrade tests (#2665) * test: fix multiple run of upgrade tests * test: fix v2tov3upgrade tests and test cli * test: add test on duplicated test names in cli * feat: add gen-ai image to the dashboard (#2673) * fixed bug where dup bearer token was added via cli access (#2684) * Remove Serverless Mode, Service Mesh, and Authorino Infrastructure from OpenDataHub Operator (#2560) * remove KServe Serverless mode, removed ServiceMesh, Serverless and Authorino infra add v2tov3 upgrade test remove ServiceMesh spec from DSCI v2 api remove servicemesh manifests remove servicemesh spec from dsci v1 sample, regenerate bundle disable webhook test retain only minimal servicemesh api, update docs generation ignore rules fix NoMatchError checking remove redundant istio/serverless predicates and status condition variables removed OSSMv2 pre-condition check from kserve controller remove serving spec and defaultDeploymentMode field from KServe spec, re-generate manifests and bundle remove leftover codeflare mention in CSV manifest * Clean up types, status and doc strings Signed-off-by: Christopher Sams <[email protected]> * re-enable webhook test, re-generate manifests and bundle * add ServiceMesh to removedCRDToCreate list, fix linter issues --------- Signed-off-by: Christopher Sams <[email protected]> Co-authored-by: Christopher Sams <[email protected]> Co-authored-by: Ugo Giordano <[email protected]> * chore: update manifest commit SHAs (#2668) Co-authored-by: openshift-merge-bot <[email protected]> * set the default mode for kube-auth-proxy (#2677) * fix: optimize failed test rerun (#2686) * RHOAIENG-34261 | Enable structured YAML for metrics exporters and fix UX inconsistency between metrics and traces (#2487) Co-authored-by: Dayakar Maruboena <[email protected]> * feat(RHOAIENG-35049): add support to pass RAGAS KFP image for downstream (#2566) Co-authored-by: Wen Zhou <[email protected]> * fix: remove test on old dashboard AProfile and HWProfile (#2696) Signed-off-by: Wen Zhou <[email protected]> * RHOAIENG-33159 | Replace accelerator and container size annotations to hardwareprofile annotations (#2675) * RHOAIENG-33159 | Replace accelerator and container size annotations to hardwareprofile annotations * Update: - remove custom-serving from config, dsci should only create default one - dynamically create custom-serving HWProfile CR in upgrade.go - move support functions into upgrade_utils.go - remove duplicated calls in the loops - simplify hwprofile name concat - remove unused gvk - change logging info. - remove old test on custom-serving Signed-off-by: Wen Zhou <[email protected]> * fix: lint Signed-off-by: Wen Zhou <[email protected]> * update: code review change - fix missing permisison on notebooks - fix namespace lookup for all namespace - fix missing annoataion for hwprofile namespace set Signed-off-by: Wen Zhou <[email protected]> * update: reduce unnecesary calls to get annoataions Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> Co-authored-by: Wen Zhou <[email protected]> * added kube auth proxy metrics address (#2693) * fix: annoatation for custom-serving HWProfile Signed-off-by: Wen Zhou <[email protected]> (cherry picked from commit 719aa32ce6d0fe24b62aa72aa8e2888e02e07c13) * update notebooks manifest hash for gateway support (#2705) * fix: Correct RAGAS KDP (#2702) Signed-off-by: Rui Vieira <[email protected]> * Make cookie expiration and refresh interval configurable (#2706) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * chore: update docs + api comments (#2712) Signed-off-by: Wen Zhou <[email protected]> * test: add default kueue configuration test (#2704) * fix: fixed quotes for HWP visibility annotation (#2717) * added validation for clientid/issuerurl (#2708) * Remove installedComponents field from DataScienceCluster v2 API (#2719) - Remove InstalledComponents field from v2 DataScienceClusterStatus - Keep InstalledComponents field in v1 for backward compatibility - Update conversion logic to construct v1 InstalledComponents from v2 component ManagementState - Add constructInstalledComponentsFromV2Status() function to build InstalledComponents map - Add getV1ComponentName() helper for component name mapping between versions - Update conversion tests with new construction logic - Fix integration tests to properly test v2→v1 conversion behavior - Remove InstalledComponents assignments from component handlers and templates The InstalledComponents field in v2 was redundant since component status already tracks management state. This change simplifies the v2 API while maintaining full backward compatibility for v1 users through automatic conversion during webhook processing. RHOAIENG-36418 * test: fix e2e resilience test to work also for rhoai (#2714) * chore: update manifest commit SHAs (#2694) Co-authored-by: openshift-merge-bot <[email protected]> * chore: update manifest commit SHAs (#2725) Co-authored-by: openshift-merge-bot <[email protected]> * fix: wrong image for kube-auth-proxy (#2716) Signed-off-by: Wen Zhou <[email protected]> * fix: improve test skip management to correctly manage unordered test run (#2721) * chore: update manifest commit SHAs (#2726) * chore: update manifest commit SHAs * Update get_all_manifests.sh Co-authored-by: Davide Bianchi <[email protected]> --------- Co-authored-by: openshift-merge-bot <[email protected]> Co-authored-by: Wen Zhou <[email protected]> Co-authored-by: Davide Bianchi <[email protected]> * update: remove unused oauth-proxy variable (#2727) Signed-off-by: Wen Zhou <[email protected]> * override image references from RHAIIS (#2734) * chore: update manifest commit SHAs (#2735) Co-authored-by: openshift-merge-bot <[email protected]> * GatewayConfig does not reconcile when DSCInitialization is created/updated/deleted (#2737) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * Use variable for manifest folder in get_all_manifests.sh (#2729) * fix: e2e test on main * fix: move ray sha to fix e2e tests. Improve get_all_manifests.sh * feat: remove rm of old manifests * feat: use annotation on kube-auth deployment when secret is changed (#2740) - calculate secret data to hash and set as annoatation to trigger deployment change => pod restart with new secret value - cleanu rbac - fix lint to use expectedODHDomain Signed-off-by: Wen Zhou <[email protected]> * Update: fix typos in code + update Oauthclient name (#2743) * chore: fix typos Signed-off-by: Wen Zhou <[email protected]> * update: change OAuthClient name to be platform agnostic Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2742) Co-authored-by: openshift-merge-bot <[email protected]> * fix: correct Unmanaged ManagementState description (#2746) Update the Unmanaged ManagementState description to accurately reflect the operator's behavior. The operator does not deploy or manage the component's lifecycle when in Unmanaged state, but may still create supporting configuration resources. This fixes the incorrect description that stated the operator is 'actively managing the component', which contradicted the expected semantics of an Unmanaged state. Fixes: RHOAIENG-32465 * feat: remove custom mark OdhDasboardConfig as unmanaged internally code (#2672) * chore: update manifest commit SHAs (#2752) Co-authored-by: openshift-merge-bot <[email protected]> * feat: add SNO-aware OpenTelemetry collector replica defaults (#2738) Set CollectorReplicas default to 1 for single-node clusters and 2 for multi-node clusters * cleaning up caikit-tgis-image (#2755) * update: cleanup kueue controller image + remove support on Managed status (#2757) * chore: cleanup kueue controller image --------- Signed-off-by: Wen Zhou <[email protected]> * refactor: keep gateway function in correct file (#2745) - gateway_auth_actions.go should only host functions : createKubeAuthProxyInfrastructure createEnvoyFilter - gateway_controller_actions.go only for funtions: createGatewayInfrastructure createDestinationRule syncGatewayConfigStatus - all the rest functions should be in gateway_support.go - gateway_util_test.go is for support testing function Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2761) Co-authored-by: openshift-merge-bot <[email protected]> * fix: remove appwrapper from kueue supported framework configuration (#2764) * chore: update bundle manifests for SNO-aware collector replicas (#2765) Update generated bundle manifests via make bundle * fix: Remove DSCI requirement for service reconciliation (#2750) * fix: Remove DSCI requirement for service reconciliation Services can now reconcile without DSCInitialization present: - Reconciler treats missing DSCI as acceptable (not an error) - Hash function handles nil DSCI with sentinel value - ApplicationNamespace helper implements on-demand DSCI fetch - Auth and Monitoring services use helpers for safe DSCI access - Templates handle nil DSCI gracefully This allows services to start before platform initialization and prevents them from getting stuck in failed state when DSCI doesn't exist yet. RHOAIENG-18035 * refactor: remove DSCI from ReconciliationRequest Make DSCI optional in the reconciliation framework by removing it from ReconciliationRequest. Actions now fetch DSCI on-demand using helper functions ApplicationNamespace() and MonitoringNamespace(). Changes: - Remove DSCI field from ReconciliationRequest struct - Update ApplicationNamespace/MonitoringNamespace helpers to fetch DSCI - Update template rendering to pass AppNamespace to templates - Update all actions and tests to work without rr.DSCI - Convert monitoring tests to use Gomega assertions This allows services to reconcile independently of DSCI existence. * Remove cleanup FeatureTrackers action from KServe controller * feat: remove secretegenerator controller (#2749) * feat: remove secretegenerator controller - we do not need that controller since OAuthClient will be created by Gateway service - move secret releated function into pkg Signed-off-by: Wen Zhou <[email protected]> * update: rewrite unit-test - remove over testing - only match two functions in secret.go Signed-off-by: Wen Zhou <[email protected]> * update: unit-test to use gomega Signed-off-by: Wen Zhou <[email protected]> * update: code review comments with more checks Signed-off-by: Wen Zhou <[email protected]> * update: remove docs content not valid any more Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2768) Co-authored-by: openshift-merge-bot <[email protected]> Co-authored-by: Wen Zhou <[email protected]> * update: to support olmv1 we need to replace operatorcondition with (#2766) subscription - with this, we lost the ability to check dependent operator version - need a new predicate on subscritpion on .spec.name (metadata.name wont be useful) Signed-off-by: Wen Zhou <[email protected]> * update: enable seccomprofile to runtimedefault (#2770) - we are on ocp 4.19 which support this function Signed-off-by: Wen Zhou <[email protected]> * fix: on managed cluster, we cannot set kueue to Managed, API wont work (#2774) Signed-off-by: Ugo Giordano <[email protected]> * test: add parallel components and services test execution (#2771) * chore: update manifest commit SHAs (#2773) Co-authored-by: openshift-merge-bot <[email protected]> * Revert "update: to support olmv1 we need to replace operatorcondition with (#…" (#2779) This reverts commit 6fb1439c98c4a29424bedfdeb8f145603ef14657. * chore: uplift version from 3.0 to 3.2 (#2780) - update comments, some feature we need to keep till first stable v3 release Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2781) Co-authored-by: openshift-merge-bot <[email protected]> * docs: add e2e testing tips and FAQ section to README (#2682) * docs: add e2e testing tips and FAQ section to README Add comprehensive e2e testing guidance including setup instructions, common workflows, and troubleshooting examples. Includes minimum/recommended configurations, selective test execution patterns, and component-specific testing commands. * Clarify that env vars are space-separated * chore: update manifest commit SHAs (#2787) Co-authored-by: openshift-merge-bot <[email protected]> * chore: cleanup comments and remove unused status for ArgoCD (#2788) Signed-off-by: Wen Zhou <[email protected]> * set defaults based on metric log data from perf test (#2791) * chore: update manifest commit SHAs (#2794) Co-authored-by: openshift-merge-bot <[email protected]> * Watch ingress certificate secrets to automatically sync gateway cert on rotation (#2772) * RHOAIENG-34484 | build: Remove generated files from build (#2329) * build: Remove generated files from build Instead generate them as needed. In order to allow the bundle to be built from the existing `bundle.Dockerfile` mechanism, I introduced some some logic to generate it as a multi-stage dockerfile, where the first stage runs `make bundle`. Testing ------- 1. Build the bundle from main (`make bundle-build`), make note of the image hash 2. Build the bundle from this branch (`make bundle-build`), make note of the image hash 3. Mount both images (`podman image mount $hash1 ; podman image mount $hash2`) 4. compare the directories. I use `meld` for this. Note the only difference is in timestamp. * Set controller image in manager.yaml to REPLACE_IMAGE * Also remove config/crd/external from source * Remove generated webhook manifest * Fix typo in .gitignore * Pass version information as build args * Handle empty env vars for build args * Remove generated gatewayconfigs and servicemeshes crds * fix(build): Address PR feedback Add documentation comments explaining: - OPERATOR_VERSION arg avoids clash with VERSION var - tests directory needed for package references - .dockerignore retains tests for bundle build Revert some changes to the manager image overriding, instead reworking the variable interpretation. Use SED_COMMAND * (feat): Upgrade automation (#2782) - Update release automation to support manifest shas - Remove duplication by using common utils * chore: update manifest commit SHAs (#2802) Co-authored-by: openshift-merge-bot <[email protected]> * fix: generate manifest before run unit test, fetch all needed external CRD (#2804) * chore: clean up more images which are not used for ai hub and ai pipeline (#2795) * chore: clean up more images which are not used for ai hub and ai pipeline Signed-off-by: Wen Zhou <[email protected]> * Update internal/controller/components/modelregistry/modelregistry_support.go Co-authored-by: Jon Burdo <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> Co-authored-by: Jon Burdo <[email protected]> * chore: remove unnecessary deployment mode configuration from KServe component (#2777) * Remove unnecessary DeployConfig struct from KServe component Since RawDeployment is the only supported deployment mode and it's hardcoded in the ConfigMap, we don't need the DeployConfig struct and getDeployConfig function. This cleanup is a follow-up to commit 8e20eeef which removed the deployment mode field from the API. Changes: - Remove DeployConfig struct and getDeployConfig function from config.go - Simplify updateInferenceCM to directly set hardcoded deployment mode - Update tests to parse JSON directly instead of using removed struct * Remove defaultDeploymentMode field from inference ConfigMap Since RawDeployment is the only supported mode, we no longer need to explicitly set the defaultDeploymentMode field in the inference ConfigMap. KServe will use RawDeployment by default. Changes: - Remove DeployConfigName constant (no longer needed) - Stop setting defaultDeploymentMode in updateInferenceCM - Remove deploy config validation from tests - Remove deploy config from test ConfigMap creation * chore: optimize SNO detection using Infrastructure ControlPlaneTopology (#2784) * chore: update manifest commit SHAs (#2810) Co-authored-by: openshift-merge-bot <[email protected]> * chore: update manifest commit SHAs (#2818) Co-authored-by: openshift-merge-bot <[email protected]> * feat: improve skip logic with parallel groups (#2819) * chore: fix default HWP description (#2821) * Skip oauth2_proxy cooike in upstream request (#2805) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * Make the subdomain configurable in gateway API (#2790) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * chore: update manifest commit SHAs (#2824) Co-authored-by: openshift-merge-bot <[email protected]> * feat: make more flex for the application namespace to be used for (#2814) Gateway (cherry picked from commit 01c7d6f6ecb0f49b3c906720982fe152e10a5ef3) Signed-off-by: Wen Zhou <[email protected]> * restrict httproutes to the application namespace (#2807) * restrict httproutes to the application namespace * fix unit tests * refactor: use cluster.GetApplicationNamespace() instead of custom getPlatformNamespace() * Call cluster.GetApplicationNamespace() directly instead of passing as parameter * linter fixes * adjust ext_authz timeout to 5s (#2823) * adjusted timetout to 5s * added env var and gateway config for auth timeout * chore: update manifest commit SHAs (#2830) Co-authored-by: openshift-merge-bot <[email protected]> * feat: use bot to update manifest sha (#2838) * docs: update pre-req for onboarding (#2835) * docs: update pre-req for onboarding Signed-off-by: Wen Zhou <[email protected]> * update: update prom rules for new stack * update: code review - typo - remove channel for cert-manager * update: wording and links * udpate: accurate info. --------- Signed-off-by: Wen Zhou <[email protected]> * feat: add env vars for configuring application and workbenches namespaces for e2e test suite (#2837) extend TestContext with the new field for workbenches namespace update CreateDSC() to support setting workbenches namespace via env var update README, set defaults for app and wb namespaces add variables for monitoring namespace update e2e test instructions for custom app namespace * chore(modelregistry): add PostgreSQL 16 image mapping (#2812) * chore(modelregistry): add PostgreSQL 16 image mapping Add IMAGES_POSTGRES to RELATED_IMAGE_RHEL9_POSTGRES16_IMAGE mapping in model registry image replacement configuration to support PostgreSQL 16 container image deployment. Related: https://github.com/opendatahub-io/model-registry-operator/pull/374 Co-Authored-By: Claude <[email protected]> Signed-off-by: Chris Hambridge <[email protected]> * Update internal/controller/components/modelregistry/modelregistry_support.go --------- Signed-off-by: Chris Hambridge <[email protected]> Co-authored-by: Claude <[email protected]> * chore: update manifest commit SHAs (#2843) Co-authored-by: openshift-merge-bot <[email protected]> * build: update image placeholder strategy (#2842) Jira: [RHOAIENG-38505](https://issues.redhat.com/browse/RHOAIENG-38505) Change the kustomize image reference from a hardcoded "controller" name to a "REPLACE_IMAGE" placeholder. The previous strategy didn't work properly for CI, because the [substitution logic](https://github.com/openshift/ci-tools/blob/2401b722f08b32d2a6926e1d918ef010ef37992c/pkg/steps/bundle_source.go#L90) modifies YAML files in the repo. * docs: update REAMDE for workbenchnamespace (#2845) Signed-off-by: Wen Zhou <[email protected]> * update: add check on dashboard's acceleratorprofile (#2846) - if CRD does not exist e.g from 3.0, no need own it -- no error into Operator Signed-off-by: Wen Zhou <[email protected]> * feat: add warning when create/update connectin API secret on S3 type but no AWS_S3_BUCKET or with "" as value (#2732) * feat: add warning when create/update connectin API secret on S3 type - if the secret does not have AWS_S3_BUCKET we show a warning but still allowed - warning if AWS_S3_BUCKET exist but has "" as value --------- Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2850) Co-authored-by: openshift-merge-bot <[email protected]> * (fix): Adapt release automation to the new make file changes (#2851) * feat: add e2e test case for verifying workbench namespace configuration (#2852) * Add comprehensive E2E tests for Gateway with OpenShift OAuth authentication (#2753) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * RHOAIENG-25593 | feat: adding perses instance (#2642) Co-authored-by: den-rgb <[email protected]> * RHOAIENG-34502 | feat: Enable rhoai build from main branch (#2220) * feat: Enable rhoai build from main branch This required sigificant changes to the Makefile and a few different strategies: - conditionally build different versions of some structs, where there is an irreconcilable difference between `main` and `rhoai` branches (using build tags) - maintain a separate overlay of manifests and separate bundle, tracking `rhoai` specific changes where necessary. Renamed directories: - `bundle` -> `odh-bundle` - `config` -> `odh-config` New directories: - `rhoai-bundle`: contains the RHOAI bundle - `rhoai-config`: contains the RHOAI manifests With these changes most Make targets now accept the `ODH_PLATFORM_TYPE` parameter, and operate in either an odh-mode by default, or a rhoai mode if overridden to any value other than `OpenDataHub`. `get_all_manifests.sh` now has a different mode when passed `ODH_PLATFORM_TYPE` other than `OpenDataHub`, where it looks at $VERSION and infers the downstream git reference to use. (It is most easily invoked via `make get-manifests ODH_PLATFORM_TYPE=rhoai`). This adds RHOAI-specific Dockerfiles for the operator and the bundle. See the difference between the rhoai versions and odh versions by using a diff tool, such as `meld` or `diff -u`. You can compare the resulting bundle for differences by checking out the rhoai branch, and comparing `bundle.rhoai` to `bundle` in the `rhoai` branch. There are a number of small differences related to changes that haven't been made to the `rhoai` branch. * Update rhoai contents * build: update CLEANFILES to use explicit paths Replace CONFIG_DIR variable with explicit odh-config and rhoai-config paths in CLEANFILES to ensure proper cleanup of generated files for both configurations. * refactor(api): consolidate build tag type definitions Extract platform-specific type definitions from monolithic files into separate build-tagged files. Move ODH and RHOAI variant type definitions into dedicated .odh.go and .rhoai.go files while keeping shared types in base files. Affected components: - ModelRegistry (components/v1alpha1) - Workbenches (components/v1alpha1) - DSCInitialization (dscinitialization/v1, v2) - Monitoring (services/v1alpha1) This refactoring improves code organization and maintainability by separating platform-specific implementations using Go build tags rather than duplicating entire files. * docs: fix incorrect reference in MonitoringCommonSpec description Update MonitoringCommonSpec documentation to correctly reference "Monitoring" instead of "Dashboard" as the shared desired state. Remove trailing blank lines at end of file. * fix: Remove stale DEFAULT_REF reference * build(makefile): update CLEANFILES to use specific bundle directories Replace generic BUNDLE_DIR variable with explicit bundle directory names (rhoai-bundle and odh-bundle) in CLEANFILES to improve clarity and ensure both bundle variants are properly cleaned. * chore(api): update copyright year to 2025 * docs: Restore extraneous lines in api docs Locally, I use a pre-commit hook to clean up ends-of-files, but this is incompatible with our CI check. * Fix VERSION for odh/rhoai and make them independent * build: update image placeholder strategy Jira: [RHOAIENG-38505](https://issues.redhat.com/browse/RHOAIENG-38505) Change the kustomize image reference from a hardcoded "controller" name to a "REPLACE_IMAGE" placeholder. The previous strategy didn't work properly for CI, because the [substitution logic](https://github.com/openshift/ci-tools/blob/2401b722f08b32d2a6926e1d918ef010ef37992c/pkg/steps/bundle_source.go#L90) modifies YAML files in the repo. * Update placeholder for rhoai manager image * upgrade Go to 1.24 and update rhoai Dockerfile - Upgrade GOLANG_VERSION from 1.23 to 1.24 - Set ODH_PLATFORM_TYPE=rhoai in manifest script - Replace kueue-configs with hardwareprofiles config - Install tar for component dev workflow compatibility - Remove redundant chown in favor of COPY --chown * build(makefile): use platform-specific Dockerfile for image builds Add DOCKERFILE_FILENAME variable to dynamically select between Dockerfile (ODH) and rhoai.Dockerfile (RHOAI) based on ODH_PLATFORM_TYPE. This ensures image-build target uses the correct Dockerfile for each platform variant. * docs(readme,testing): document RHOAI build mode and update test guidelines Update README with instructions for building operator and bundles in RHOAI mode using ODH_PLATFORM_TYPE=rhoai flag. Remove bundle manifests from mandatory integration testing requirements and fix minor formatting inconsistencies. * Move connectionApi * push the e2e test image with the tag main instead of adding the commit hash (#2856) * update: move Prom Rule tests into component folder (#2858) - test on the PrometheusRule CR not the old Prom config files - remove triage which is only for SRE - rename rule files to match component name - remove deadmansnitch which is only for SRE Signed-off-by: Wen Zhou <[email protected]> * add some enhancements to the e2e push gha to have a version for odh (main branch) and rhoai X.Y (rhoai-x.y branch) (#2860) * RHOAIENG-25595 | Add Perses-Tempo datasource integration (#2553) Co-authored-by: Dayakar Maruboena <[email protected]> * Remove load restrictions none for kustomize builds (#2863) * feat: remove LoadRestrictionsNone from kustomize * feat: move rhoai-config into odh-config/rhoai overlay * rebase with master * fix: generate rhoai manifest and webhook name * Update manifest sha in get_all_manifests.sh correctly (#2870) * fix: update manifest sha in get_all_manifests.sh correctly * fix: revert namespace name in makefile * feat: update rhoai manifests to rhoai-3.2 tag * docs: note bundle is still needed for rhoai branch (#2801) * docs: note bundle is still needed for rhoai branch * Fix typo Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * fix: dsc odh samples (#2869) * update: cleanup unnecessary upgrade logic (#2695) - remove functions and calls not needed for 3.2 (should be removed in 3.0 already) jupyterhub resource cleanup rbac for modelreg envoyfilter patch for serving watson resource docs patch odhdashboardconfig for trusty enablement - move and cleanup function for AP to HWP migration Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2875) Co-authored-by: openshift-merge-bot <148852131+openshift-merg…
Description
How Has This Been Tested?
Screenshot or short clip
Merge criteria
E2E test suite update requirement
When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.
To opt-out of this requirement:
E2E update requirement opt-out justificationsection belowE2E update requirement opt-out justification
cleanup code, no e2e
Summary by CodeRabbit
Bug Fixes
New Features
Refactor