update: add check on dashboard's acceleratorprofile #2846
- If the CRD does not exist (e.g. from 3.0 onward), there is no need to own it, so no error is raised in the operator. Signed-off-by: Wen Zhou <[email protected]>
Walkthrough: The dashboard component's watch for DashboardAcceleratorProfile is updated to include a CRD-existence guard, which is passed to the OwnsGVK call.
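The guard described above can be sketched in a few lines of Go. This is a simplified illustration, not the operator's actual code: the `GVK`, `Predicate`, and `Builder` types and the `crdExists` helper are hypothetical stand-ins, and the group/version used for AcceleratorProfile is an assumption. Only the name `OwnsGVK` comes from the walkthrough; its real signature may differ.

```go
package main

import "fmt"

// GVK is a hypothetical stand-in for a Kubernetes GroupVersionKind.
type GVK struct{ Group, Version, Kind string }

// Predicate decides whether a watch should be registered.
type Predicate func() bool

// crdExists (hypothetical helper) reports whether gvk is in the set of
// kinds whose CRDs are installed on the cluster.
func crdExists(installed map[GVK]bool, gvk GVK) Predicate {
	return func() bool { return installed[gvk] }
}

// Builder is a toy reconciler builder that records owned kinds.
type Builder struct{ owned []GVK }

// OwnsGVK registers ownership of gvk only if every guard passes.
// A failing guard is not an error: the kind is silently skipped.
func (b *Builder) OwnsGVK(gvk GVK, guards ...Predicate) *Builder {
	for _, guard := range guards {
		if !guard() {
			return b
		}
	}
	b.owned = append(b.owned, gvk)
	return b
}

func main() {
	// Assumed group/version for the dashboard's AcceleratorProfile.
	ap := GVK{Group: "dashboard.opendatahub.io", Version: "v1", Kind: "AcceleratorProfile"}

	// Cluster without the CRD: nothing is owned and nothing fails.
	without := (&Builder{}).OwnsGVK(ap, crdExists(map[GVK]bool{}, ap))
	fmt.Println("owned without CRD:", len(without.owned))

	// Cluster with the CRD: the kind is owned as before.
	with := (&Builder{}).OwnsGVK(ap, crdExists(map[GVK]bool{ap: true}, ap))
	fmt.Println("owned with CRD:", len(with.owned))
}
```

The point of the design is that a missing CRD becomes a skipped watch rather than a startup or reconcile error, which matches the stated intent of the change.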
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2846   +/-   ##
=======================================
  Coverage   50.07%   50.07%
=======================================
  Files         144      144
  Lines       10469    10469
=======================================
  Hits         5242     5242
  Misses       4669     4669
  Partials      558      558
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: carlkyrillos
Merged commit 70410f3 into opendatahub-io:main
/cherrypick rhoai
@carlkyrillos: new pull request created: #2849
* Remove servicemesh not installed check. (#2443) After adding the gatewayconfig CRD and controller, ODH and RHOAI will always have servicemesh installed automatically by the openshift ingress controller. It will not be possible to get into a state without OSSM 3.x installed. * remove redundant e2e test case (#2454) * Change naming convention in mod arch image for dashboard (#2449) * fix: fix catalog image related_image env variable name as per devops config (#2452) Signed-off-by: Dhiraj Bokde <[email protected]> * update: sample of DSCI and DSC + README (#2447) - since we have introduce GW API creation, we should update our sample - ossm v3 is installed out of the box, we should not ossm v2 still "Managed" as we advocate - set to Unmanaged than Removed, to avoid accidentally deletion - this whole setion should be taken away soon in 3.0 release Signed-off-by: Wen Zhou <[email protected]> * fix: connection API on ISVC if type is changed + only create serviceaccount if type is s3 (#2433) * fix: connection API on ISVC if type is changed - we previously only handle when type is removed then we cleanup injection - now we handle update as well - refactor some old code - handle serviceaccount creation and injection for s3 only Signed-off-by: Wen Zhou <[email protected]> * fix: lint Signed-off-by: Wen Zhou <[email protected]> * update: code review - correct log and skip Patch if no cleanup is needed Signed-off-by: Wen Zhou <[email protected]> * update: fix case when there was no annotation - if isvc exists but no annoatation: user might manually set certain part, e.g .spec.predictor.model.storageUri on update with annotation, old storageUri etc should be removed before new injection to avoid confliect Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * feat(RHOAIENG-30795): Add configuration support to TrustyAI DSC component (#2225) * feat(RHOAIENG-30795): Add configuration support to TrustyAI DSC component * fix: Add missing 
docs * fix: Fix linting errors * fix: Remove redundant labels and pointer to boolean fields Since they have default values, no need to use pointers. * fix: Add TrustyAI CM to operator's resources. Remove unneeded annotations. * fix: Change tests to account for AddResources * fix: Linting of trustyai_test.go * Merge test for default CM and enabled component * Fix linting (remove dupl) * chore: make bundle * fix: revert REPLACE_IMAGE * chore: skip HWP migration if dashboard HWP CRD missing (#2470) * feat: add support to pass Oauth proxy image for downstream (#2466) * add override for new workbenches (#2465) * prometheus test should check monitoring namespace not apps namespace (#2453) * feat: add new VAP to block create/update operation on dashboard's HWProfile CR and AcceleratorProfile CR (#2378) * feat: add new VAP to block create/update operation on dashboard's HWProfile CR and AcceleratorProfile CR - since this will only work on OCP 4.19+, no need check OCP versoin, as VAP/VAPB is enabled in the cluster - in the case if the cluster already have dashboard's AP/HWP CRD DSCI should deploy and own 4 new resources: 2 VAP and 2 VAPB to block if any attmpt to create dashboard AcceleratorProfile/HardwareProfile CR or update existing AcceleratorProfile/HardwareProfile CR these 4 resoruces will be re-created if user try to delete them when DSCI is deleted, these 4 resources should be removed from cluster Signed-off-by: Wen Zhou <[email protected]> * fix: testscases Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * update: remove logic set default value in odhdashboardconfig in upgrade (#2469) Signed-off-by: Wen Zhou <[email protected]> * feat: optimize E2E test framework performance and error handling (#2418) * feat: add comprehensive E2E testing framework with improved resource management - Add DeleteResources function for efficient bulk resource deletion with options - Implement EnsureResourceDeletedThenRecreated with 
UID-based verification - Add WithRemoveFinalizersOnDelete option for automatic finalizer handling - Support WithWaitForRecreation flag for controller-managed resources - Replace ExpectedErr with AcceptableErrMatcher using proper Kubernetes error codes - Fix tryRemoveFinalizers to handle success, NotFound, and NoKindMatchError gracefully - Add controllerCacheRefreshDelay to prevent cache staleness issues - Consolidate cleanup functions and remove redundant code patterns - Add ResourceQuota GVK definition and improve bulk deletion documentation This comprehensive framework eliminates race conditions in controller-managed resource testing and provides consistent, reliable patterns for deletion-recreation scenarios with enhanced safety and performance optimizations. * feat: enhance core component E2E tests and ServiceMesh organization - Add comprehensive component testing with enhanced determinism - Improve test organization and maintainability across core test suites - Move resilience tests after functional tests for cleaner failure attribution - Fix KnativeServing test synchronization with deletion/recreation flags - Add proper scoping to cleanupListResources with safety checks - Optimize ServiceMesh test organization with consolidated validation patterns - Extract common validation helpers following "with" pattern conventions - Consolidate repetitive operator configuration patterns using shared constants - Add reusable helpers for monitoring resource cleanup and DSCI validation - Move shared operator namespace constants to centralized helper file - Rename validation functions for consistency (buildCapabilityConditions → withCapabilityConditions) - Reorganize helper functions placement for better code structure - Optimize E2E performance with better resource management - Apply framework patterns across component test suites - Refactor UninstallOperator to use ResourceOpts pattern for better consistency - Replace hard sleeps with WithWaitForDeletion in ServiceMesh 
operator uninstallation - Remove custom uninstallOperatorWithChannel function in favor of standardized approach - Eliminate duplicate validation patterns with DRY principle application - Refactor Gateway tests and centralize GVKs This enhances the core testing infrastructure with improved coverage, resilience patterns, ServiceMesh test organization, and maintainable validation helpers that reduce code duplication while following established framework conventions. # Conflicts: # tests/e2e/servicemesh_test.go # Conflicts: # tests/e2e/creation_test.go # tests/e2e/dashboard_test.go # tests/e2e/monitoring_test.go * Removing the ServiceMesh controller and some ServiceMesh/Authorino e2e tests, as they will no longer be present in 3.0. * Commenting out gateway test cases for now, as the feature is still in the early stages of implementation. * feat: add support for llmisvc to use kueue (#2293) * feat: add support for llmisvc for kueue - rename old kueue one for isvc so we can have both isvc and llmisvc Signed-off-by: Wen Zhou <[email protected]> * fix: code review Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * feat: add GH Action to enforce e2e test suite update for code-related PRs (#2426) fix: split workflow into two to allow the bot to comment on fork PRs reuse existing error comments and add automatic error comment cleanup, to prevent PR noise add option to provide skip justification in PR descprition re-use same error comment for both error cases remove functionality to add justification via PR comment, to simplify the workflow and improve user experience * addressed comments in PR 2404 (#2444) * Use custom manifests without devFlags (#2427) * Use custom manifests without devFlags Jira: https://issues.redhat.com/browse/RHOAIENG-31642 This gives instructions for an alternative approach to using custom manifests with the operator, which don't require the use of "devFlags" in the DSC. 
The intention is that this will allow us to remove the devFlags functionality for 3.0. This is a primitive approach for now. I expect that folks will start to build better development patterns rather than needing to modify the patch file for their component, etc. They can even have their own approaches in their own repositories, and this may just be an example. * Don't re-chown files in Dockerfile They're already chowned as expected from the COPY instruction on L60. * feat: add comprehensive debug utilities for E2E test failures and timeouts (#2486) * feat: add comprehensive debug utilities for E2E test failures and timeouts - Add debug_utils_test.go with global panic and failure handlers - Integrate HandleGlobalPanic() and HandleTestFailure() at multiple test levels - SetGlobalDebugClient() automatically configured in NewTestContext() - Provides detailed diagnostics on ANY test failure (panics, timeouts, assertions): * Cluster state and node health with resource allocation percentages * Namespace resources (deployments, pods, container states) * Operator deployment status and pod logs * DSCI/DSC status and unhealthy conditions * Recent events from key namespaces (last 5 minutes) * Resource quota violations Addresses debugging challenges for deployment recreation failures, resource exhaustion, and other cluster-level issues in E2E tests. Activates automatically without requiring manual intervention. * Added security redaction for secrets, tokens, and credentials in logs * fix: limit e2e update requirement check GH action only to the main branch (#2497) chore: improve documentation and error messages * update: temp. 
disable e2e test on dashboard (#2499) - this should be reverted once dashboard is fixed Signed-off-by: Wen Zhou <[email protected]> * removing the py311 images (#2484) Co-authored-by: Wen Zhou <[email protected]> * RHOAIENG-29717 | feat: Deploy thanos querier frontend for metrics acces (#2448) Co-authored-by: den-rgb <[email protected]> * chore: e2e update requirement GH action: edit regex for justification section title, ensure guidelines are consistent (#2509) * fix: prevent secret data leakage in operator error logs (#2471) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * test: increase resource check timeout, fix debug (#2506) * Remove codeflare operator, monitoring and tests (#2468) * feat(RHOAIENG-34034): remove codeflare operator, monitoring and tests * test(RHOAIENG-34034): add e2e test to verify codeflare component is not removed * fix(RHOAIENG-34034): improve test and add test flag in README * test(RHOAIENG-34034): use consistently check * test: improve test readibility and reusability * fix: watch on Dashboard's AProfile/HWProfile by DSCI (#2512) - add watch on these two CRD on create/update, so VAP/VAPB need to be created - this is needed for the testcase on dashboard once it is enabled and still ship these two CRD - we probably can remove this logic alongwith the dashbaord tests on VAP/VAPB once dashboard cleanup CRD from their manifests Signed-off-by: Wen Zhou <[email protected]> * chore: log on webhook for notebook with HWProfile (#2491) - different for not able to find HWProfile VS client Get() failed - update e2e tests to add new error code Signed-off-by: Wen Zhou <[email protected]> * feat: auto detect auth, kube auth proxy, envoy filter (#2490) * fix gateway controller namespace handling, add status sync, and improve test patterns * feat: detect auth, kubeauthproxy, envoyfilter * added better support for oidc, fixed kubeproxy/envoyfilter issue * used apply instead of addresources for secret * fixed cookie mismatch for oauth/oidc * 
Initial effort at migrating hardwareprofiles from v1alpha1 to v1 (#2398) Rebased, included v1alpha1 hardwareprofile changes added deprecated markers to v1alpha1 hardwareprofiles Rebased on latest, incremental Makefile version bump, regenerated bundle v1alpha1 HardwareProfiles must be registered to scheme minor comment update to reflect infrav1 hardwareprofiles adding hardwareprofile test cases # Conflicts: # bundle/manifests/opendatahub-operator.clusterserviceversion.yaml Signed-off-by: Max Whittingham <[email protected]> * chore: remove unused pkg/feature and related integration tests (#2521) * chore: remove unused pkg/feature and related integration tests * chore: remove unused unit-tests1.yaml workflow * collector validation should check metrics resources or storage (#2515) * refactor: remove duplicated reconciler crd existence check (#2483) * chore: uplift version + fix bundle + fmt + fix monitoring tests (#2526) Signed-off-by: Ugo Giordano <[email protected]> * update: add cleanup when --fail-fast is called we still get leftover cleanup (#2478) Signed-off-by: Wen Zhou <[email protected]> * implemented destination rule for TLS to upstream services (#2536) * feat: add default HWProfile CR "default-profile" (#2461) * feat: add default HWProfile CR "default-profile" - remove the sample one we added before, with this PR it will create one in the cluster - user will be able to change this default-profile CR, operator wont reconcile - if user change DSCI or DSCI gets reconciled, it should not reset default-profile CR with user modified value. 
- if the user deletes this default-profile CR, the operator will create a new one with default values Signed-off-by: Wen Zhou <[email protected]> * update: code review Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * feat: switch isvc connection annotation from type to protocol (#2532) * feat: switch isvc connection annotation from type to protocol * fix: addressed PR comments * Update unit-tests2.yaml to codecov-actions v5.5.1 (#1) (#2542) Update unit-tests2.yaml to codecov-actions v5.5.1 * fix: version inconsistency (#2544) Signed-off-by: Wen Zhou <[email protected]> * add support for overriding kserve-llm-d (#2537) * feat: add oauth proxy parametrization to dashboard (#2547) * feat: add support for llmisvc to use HWProfile (#2424) * feat: add support for llmisvc to use HWProfile - if HWProfile has .spec.identifiers, add this into .spec.template.containers(llmisvc) - if HWProfile has .spec.scheduling - if it is type: kueue, set label "kueue.x-k8s.io/queue-name" with value .spec.scheduling.kueue.localqueue - if it is type: node and has .spec.scheduling.node .spec.scheduling.node.nodeSelector add into .spec.template.nodeSelector(llmisvc) .spec.scheduling.node.tolerations add into .spec.template.tolerations(llmisvc) Signed-off-by: Wen Zhou <[email protected]> * fix: add support for llmisvc with min config - if the user only sets spec: {} when creating llmisvc, we should: 1. create a container named "main" 2. 
inject identifier from HWProfile Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * Support override for runtime images (#2464) * support override for runtime images * add override for vllm-spyre image * fix: prevent cache false hits for resources stuck in deletion (#2527) * fix: bypass cache for objects with deletionTimestamp to prevent stuck resources - Add cache bypass logic for objects with deletionTimestamp in deploy action - Implement proactive cache cleanup in ShouldSkip method - Add Delete method to Cache for cleaning up stale entries - Extract cache logic into isCachedAndShouldSkip helper method - Add comprehensive unit tests for cache cleanup verification - Apply line-of-sight principle for better code readability Fixes cache bug where resources.Hash() removes deletionTimestamp causing identical hashes for objects with/without deletion timestamps. This led to false cache hits preventing recreation of any Kubernetes resource type (deployments, services, configmaps, secrets, CRDs, etc.) during E2E tests. 
Root cause: pkg/controller/predicates/resources/resources.go:295 Impact: E2E test failures when resources stuck in deletion while still in cache Solution: Objects with deletionTimestamp always bypass cache and trigger cleanup Scope: Affects all Kubernetes resources deployed via deploy action * refactor: simplify cache test suite and improve terminating object handling - Skip deployment for terminating objects instead of attempting and failing - Refactor ShouldSkip to handle terminating objects before cache logic - Rename ProcessCacheEntry to CleanupIfTerminating for clarity - Make Cache.Delete method idempotent - Convert TestDeployWithCacheAction to clean table-driven approach - Remove unused test helper functions (saved ~150+ lines) - Simplify test structure with direct function calls in table - Apply micro-optimization to avoid DeepCopy on skip path All tests pass with improved maintainability and performance. * Update action_deploy_cache.go Removed redundant check when deleting a key from the cache * fix(RHOAIENG-34533): resolve TrustyAI DSC validation error when patching DSC eval flags (#2525) * fix(RHOAIENG-34533): resolve TrustyAI DSC validation error when patching DSC eval flags This change resolves RHOAIENG-34533 by converting TrustyAI evaluation configuration fields from boolean to enum string values. Changes: - Convert PermitCodeExecution and PermitOnline from bool to string with enum validation (allow/deny) - Add proper default handling and constants for evaluation permissions - Update CRDs, bundle manifests, and test expectations - Maintain backward compatibility in ConfigMap generation * add updated API * Simplify CM value assignment, revert image name to v3.0.0 * test: revert "temp. disable e2e test on dashboard" from PR2507 and fix VAP/VAPB e2e tests (#2561) * Revert "update: temp. disable e2e test on dashboard (#2499)" This reverts commit 9e847d546cb75c2bb14bd32ef54df0f1fc1c5894. 
* fix: VAP/VAPB testcase - move from DSCI to Dashbaord Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * feat: add oauth proxy parametrization to dashboard (#2554) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * RHOAIENG-34055: add ray sanity check for codeflare removal (#2514) * feat(RHOAIENG-34055): add ray sanity check for codeflare removal * feat(RHOAIENG-34055): create a common sanitycheck action * feat: add GHA to auto build and push e2e tests image after each merge to main (#2543) * checkin (#2555) * remove useless kueue-batch-user-rolebinding (#2571) Removed the config/kueue-configs/ directory and all associated references as the kueue-batch-user-rolebinding functionality is no longer needed. * OCP console link to odh should match gateway address (#2569) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> Co-authored-by: James Tanner <[email protected]> * fix: add llmisvc kueue test case (#2573) * feat: add support for Connection API in LLMInferenceservice - add create/update/removal case - create $llmisvc-sa serviceaccount and inject into .spec.template.serviceAccountName - serect should support both cases: data.URI and data.http-hosts according to kserve for both isvc and llmisvc - simplify logging by remove duplicated error and caller print error - only remove connectoin added imagepullsecret from list if remove annotation - add support for only change conneciton-path value and .spec.mode.uri should updated too - when update, if manually set path in spec, use it; if annotation is set, use annotation; if neither set, and previously already has path (maybe was created by gitops or dashboard) keep it Signed-off-by: Wen Zhou <[email protected]> * update: code review - fix typo - update handleSA() to deal with servicename more precisely - cleanup func and const not in use any more Signed-off-by: Wen Zhou <[email protected]> * update: code review - remove duplicated function for 
serviceaccount - add check on both annotations Signed-off-by: Wen Zhou <[email protected]> * update: code review - serviceAccoutName should only be removed if it is not having the same value derived from serect's name - serviceAccountName should only be injected if it did not have a value already Signed-off-by: Wen Zhou <[email protected]> * RHOAIENG-31870: Add DSC and DSCI v2 versions. (#2505) RHOAIENG-35095: Webhook dsc, dsci and kueue integration_test.go unit-tests disabled Added hack/buildLocal.sh : a script to build locally and deploy to crc * cleanup unused referencegrant gvk code (#2576) * Add clusterrole for allowedgroups (#2494) * add clusterrole for allowedgroups fix unit test remove unused role for allowed group * fix comment and error message * Bug fix for gateway ODIC mode (#2591) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> Co-authored-by: James Tanner <[email protected]> * Removed monitoring namespace from flags, and documentation to set via env vars. Viper now sets the defaults vars for monitoring namespace instead. (#2538) Signed-off-by: Max Whittingham <[email protected]> * Update golangci-lint to v2.5.0 (#2582) * chore: fix api version (#2594) Signed-off-by: Wen Zhou <[email protected]> * fix: retrieve all files in a PR (#2595) * feat(modelregistry): add benchmark data image environment variable mapping (#2498) Add IMAGES_BENCHMARK_DATA environment variable mapping to support benchmark data injection in model registry deployments. This enables the model-registry-operator to configure benchmark data images via init containers for benchmark visualization and analysis features. 
Related to: https://github.com/opendatahub-io/model-registry-operator/pull/324 Follows the same pattern established in previous PRs: - https://github.com/opendatahub-io/opendatahub-operator/pull/2429 - https://github.com/opendatahub-io/opendatahub-operator/pull/2452 🤖 Generated with [Claude Code](https://claude.ai/code) Signed-off-by: Chris Hambridge <[email protected]> Co-authored-by: Claude <[email protected]> Co-authored-by: Wen Zhou <[email protected]> * RHOAIENG-33891: DSC v1 should not be created or updated with Kueue component state set to Managed (#2602) * Add kube-linter to checks Kubernetes manifests against various best practices, with a focus on production readiness and security (#2605) See: - https://docs.kubelinter.io/ - https://github.com/stackrox/kube-linter * Add webhook to prevent creation of dashbaord's HardwareProfile and AcceleratorProfile (#2599) * chore: use generic Getter[T] instead of StringGetter (#2608) Update StringGetter to use Go generics, allowing the type to work with any return type instead of being limited to strings. 
* chore: Update GitHub workflows and issue labels and fix bug (#2607) - Add v3 label to issue_label_bot.yaml - Rename workflow yaml: test for e2e, release for ODH release, - Remove 'incubation' branch from linter.yaml triggers - Update email - Update API docs config to exclude all List types Signed-off-by: Wen Zhou <[email protected]> * RHOAIENG-34045: cleanup codeflare manifests, remove codeflare from DSC v2 (#2596) * feat(RHOAIENG-34045): cleanup codeflare manifests, remove codeflare from DSC v2 * feat(RHOAIENG-34045): remove CodeFlare from api docs and remove finalizers update permissions * update: add new members (#2612) Signed-off-by: Wen Zhou <[email protected]> * feat(e2e): improve test framework resilience and fix monitoring controller cleanup (#2550) * feat(e2e): add explicit resource lifecycle methods and improve test framework - Add EventuallyResourceCreated, EventuallyResourceUpdated, EventuallyResourcePatched - Add Patch function in testf for explicit patch-only behavior - Replace CreateOrUpdate with CreateOrPatch for better concurrency - Fix resourceVersion conflicts and test timing issues - Add comprehensive testf test coverage and helper functions - Modernize kueue_test.go with jq.Match validation - Reduce test code duplication and improve error messages * refactor(monitoring): remove redundant management state checks and unify actions - Remove isMonitoringManaged() function and all management state checks from actions - Remove MonitoringNotManagedMessage constant (no longer needed) - Combine MonitoringStack and ThanosQuerier into single deployMonitoringStackWithQuerier action - Combine Tempo and Instrumentation into single deployTracingStack action - Extract validateRequiredCRDs helper to reduce CRD validation duplication - Add CRDRequirement struct for consistent condition handling - Rename setCRDNotFoundCondition to setConditionFalse for clarity The monitoring actions now focus purely on configuration validation and CRD availability checks. 
The optimization also reduces the monitoring controller from 5 deployment actions to 3 and eliminates condition duplication. Breaking changes: - Remove deployThanosQuerier, deployTempo, deployInstrumentation individual actions - Add deployMonitoringStackWithQuerier and deployTracingStack unified actions * test: temporarily disable RBAC and ServiceAccount deletion recovery tests These tests are experiencing timing issues with external dependencies and token refresh cycles. Temporarily disabling to unblock CI while we investigate proper solutions for testing these resource types. TODO: Re-enable after investigating timing issues. * chore: remove placeholder CRD and skip kubebuilder for dashboard actions (#2609) Remove empty CRD template file and add kubebuilder:skipmake directive to dashboard_controller_actions.go to prevent unintended CRD generation. * Use commit sha in manifest file (#2540) * feat(RHOAIENG-34447): add support to commit sha component target and add workflow to update it * feat: upgrade commit sha * chore(kube-lint): add CEL-based linter to prevent system group bindings in ClusterRoleBinding (#2617) Add custom kube-linter check to detect ClusterRoleBindings that target system groups (e.g., system:authenticated, system:unauthenticated). 
* feat: update IMAGES_BENCHMARK_DATA to new image (#2622) Signed-off-by: Alessio Pragliola <[email protected]> * add e2e tests for reconciliation resilience (#2364) * Remove ModelMesh component and infrastructure from OpenDataHub Operator (#2565) * feat(RHOAIENG-34026): remove modelmeshserving manifest download and component registration - Remove modelmeshserving from get_all_manifests.sh - Remove modelmeshserving import from cmd/main.go # Conflicts: # get_all_manifests.sh * feat(RHOAIENG-34026): comprehensive ModelMesh removal for RHOAI 3.0 This commit removes ModelMesh functionality while preserving v1 API compatibility: Core Functionality Removal: - Remove ModelMeshServing component controller and implementation - Remove ModelMeshServing monitoring configurations - Update ModelController to ignore ModelMeshServing in DSC v1 (always treats as Removed) - Update ModelController tests to reflect new RHOAI 3.0 behavior Cleanup and Infrastructure: - Remove ModelMeshServing RBAC editor/viewer roles - Remove ModelMeshServing E2E tests - Remove ModelMesh Prometheus unit tests - Remove ModelMeshServing from sample DSC configuration - Remove ModelMesh from component integration documentation - Remove KServe ModelMesh CRD conflict handling logic - Remove unused imports and monitoring embeds API Strategy: - Keep v1 DSC API intact with ModelMeshServing field for upgrade compatibility - ModelController ignores ModelMeshServing settings (always Removed in RHOAI 3.0) - Remove actual component functionality while maintaining API backward compatibility Testing Updates: - Update ModelController tests to reflect new behavior - Tests now validate that ModelMeshServing no longer enables ModelController - All tests pass with new RHOAI 3.0 logic Build verification: - All code compiles successfully - Manifest generation works correctly - ModelController unit tests pass with new behavior # Conflicts: # internal/controller/components/modelcontroller/modelcontroller.go # 
internal/controller/components/modelmeshserving/modelmeshserving.go # internal/controller/components/modelmeshserving/modelmeshserving_test.go * Remove ModelMeshServing component and API references This commit removes ModelMeshServing component from the opendatahub-operator codebase as part of RHOAI 3.0 migration to KServe-only model serving. Changes include: ## API and Controller Changes: - Keep ModelMeshServing API types for v1 DSC backward compatibility - Set ModelMeshServing ManagementState to 'Removed' in ModelController - Remove ModelMeshServing logic from modelcontroller_actions.go - Add deprecation markers to ModelMeshServing types ## E2E Test Updates: - Remove ModelMeshServing test cases from controller_test.go - Update modelcontroller_test.go to focus on KServe functionality - Remove ModelMeshServing from resilience_test.go and helper_test.go - Add partial upgrade test for ModelMeshServing resource preservation ## Monitoring and Configuration Cleanup: - Remove ModelMeshServing Prometheus scrape jobs and SLO rules - Remove ModelMeshServing alerting rules from prometheus-configs.yaml - Remove ModelMeshServing from monitoring controller mappings ## Sample and Documentation Updates: - Remove modelmeshserving from DSC sample configurations - Fix incorrect app.kubernetes.io/managed-by labels in samples (RHOAIENG-32728) - Remove ModelMeshServing references from README.md and DESIGN.md - Remove ModelMeshServing component from upgrade logic ## Bundle and Manifest Updates: - Update CSV to remove ModelMeshServing CRD references - Add ModelMeshServing CRD to bundle for compatibility Addresses: RHOAIENG-34026, RHOAIENG-32728 Partial: RHOAIENG-34036 (upgrade testing) # Conflicts: # docs/api-overview.md * feat: Remove ModelMeshServing component from RHOAI 3.0 - Remove ModelMeshServing CRD generation from kustomization.yaml - Delete ModelMeshServing CRD file (components.platform.opendatahub.io_modelmeshservings.yaml) - Clean up bundle manifests: remove ModelMeshServing 
references from CSV and CRDs - Update API documentation config to exclude ModelMeshServing types - Remove ModelMeshServing from v2 DataScienceCluster sample - Clean up catalog.yaml: remove ModelMeshServing GVK, CRD description, and keywords - Preserve v1 DataScienceCluster compatibility for smooth upgrades - Maintain RBAC permissions and e2e tests for upgrade safety - Update TrustyAI controller to remove obsolete model-mesh label check - Simplify ModelController types and remove ModelMeshServing dependencies - Update conversion logic between v1 and v2 DataScienceCluster APIs - Clean up documentation and design files * Update the DSPO commit to leverage kube-rbac-proxy (#2636) See: https://github.com/opendatahub-io/data-science-pipelines-operator/pull/920 https://issues.redhat.com/browse/RHOAIENG-34577 Signed-off-by: mprahl <[email protected]> * chore: update manifest commit SHAs (#2637) Co-authored-by: zdtsw <[email protected]> * fix: resolve RBAC errors and improve resilience test reliability (#2643) - Fix pod deletion RBAC issue by using individual DeleteResource instead of bulk deletion - Restore ModelRegistry component to quota tests after removing stuck finalizer - Optimize jq expressions to handle deployment name patterns and readyReplicas checks - Add pod restart after RBAC restoration for faster test recovery Resolves "server does not allow this method" error and makes tests more reliable. * RHOAIENG-29717 | fix: Adding missing test case (#2510) Co-authored-by: den-rgb <[email protected]> * Add a RELATED_IMAGE hook for kube-auth-proxy (#2632) Co-authored-by: jctanner <[email protected]> * removing wrong labels from samples in DSCI/DSC v2 (#2650) * RHOAIENG-33158 | feat: enhance hardware profile management by adding custom-serving profile (#2578) * RHOAIENG-33158 | feat: enhance hardware profile management by adding custom-serving profile - Introduced a new HardwareProfile named 'custom-serving' in the YAML configuration. 
- Updated the DSCInitializationReconciler to manage both default and custom hardware profiles. - Modified the logic to check for the existence of the custom hardware profile during reconciliation. * chore: Fixed calling renamed function by refering new name * chore: reworked default and custom hardwareprofile CRs deployment - split the yamls to seperate files to reduce confusion and easier identification - reworked the logic to check and deploy after checking both custom and default hwp errors. --------- Co-authored-by: Wen Zhou <[email protected]> * feat: update workflow to point to correct branch (#2652) * chore: update manifest commit SHAs (#2646) Co-authored-by: openshift-merge-bot <[email protected]> * RHOAIENG-30940: Remove devFlags support (#2588) * feat(RHOAIENG-30940): remove devFlags support * fix(RHOAIENG-34045): small typo * fix: lint * test: add e2e test on HWProfile v1alpha1 and v1 (#2635) * test: add e2e test on HWProfile v1alpha1 and v1 - create CR on v1, test it can be read by v1 and v1alpha1 - create CR on v1alpha, test it can be read by v1 and v1alpha1 * update: prolong timeout * update: move HWProfile test to a dedicated suite Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * RHOAIENG-33158 | feat: Implement migration of HardwareProfiles from AcceleratorProfiles and container sizes (#2529) * RHOAIENG-33158 | feat: Implement migration of HardwareProfiles from AcceleratorProfiles and container sizes - Added functions to migrate AcceleratorProfiles to HardwareProfiles, creating separate profiles for notebooks and serving. - Implemented migration for container sizes, generating HardwareProfiles based on specified resource limits. - Created a special HardwareProfile for InferenceServices without associated AcceleratorProfiles or container sizes. 
* RHOAIENG-33158 | chore: addressed PR comments - Removed custom-serving HWP out of upgrade.go - now trying to load odhDashboardConfig from manifests if its not available in cluster - other refactors and logs * chore: Renamed funcs to better describe the functionality - Adjusted comments to explain the code better. * RHOAIENG-33158 | chore: addressed PR comments - making sure hwp names are all lowercase and without spaces - changed string literal references to string constants - returning early if application namespace is empty - migration will proceed only if old major version is 2 and current version is 3 * fix: correct GatewayConfig validation error handling for OIDC configuration (#2654) * chore: update manifest commit SHAs (#2655) Co-authored-by: openshift-merge-bot <[email protected]> * chore: remove unused const (#2661) - rename GatewayKind to GatewayConfigKind Signed-off-by: Wen Zhou <[email protected]> * chore: fix previous HWProfile test (#2659) - revert timeout - update docs Signed-off-by: Wen Zhou <[email protected]> * CLI to retry flaky test in job (#2611) * feat(RHOAIENG-34877): add test cli to retry e2e tests during job execution to avoid retest * fix: improve retry skip filter and add github actions to run tests on CLI * (feat): Rename DatasciencePipelines to AIPipelines (#2589) * (feat): Rename DatasciencePipelines to AIPipelines Signed-off-by: Ajay Jaganathan <[email protected]> # Conflicts: # api/datasciencecluster/v1/datasciencecluster_conversion.go # api/datasciencecluster/v2/datasciencecluster_types.go # api/datasciencecluster/v2/zz_generated.deepcopy.go # bundle/manifests/datasciencecluster.opendatahub.io_datascienceclusters.yaml # bundle/manifests/opendatahub-operator.clusterserviceversion.yaml # config/crd/bases/datasciencecluster.opendatahub.io_datascienceclusters.yaml # docs/api-overview.md # pkg/upgrade/upgrade.go # tests/e2e/datasciencepipelines_test.go * add logic for converting installed components Signed-off-by: Ajay Jaganathan 
<[email protected]> --------- Signed-off-by: Ajay Jaganathan <[email protected]> * Fix session issue with kube-auth-proxy (#2625) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * update: remove DSCI and DSC v1 sample from CSV (#2667) Signed-off-by: Wen Zhou <[email protected]> * RHOAIENG-33892: Kueue controller should report an error in case a DSC has Managed state set in Kueue component. (#2645) RHOAIENG-33894: Remove all the remaining logic/tests to handle Managed Kueue component state. * Change api group for hardware profiles in admingroup role (#2631) * Change api group for hardware profiles in admingroup role * Apply suggestion from @zdtsw * Update internal/controller/services/auth/resources/admingroup-role.tmpl.yaml --------- Co-authored-by: Wen Zhou <[email protected]> * Enable CLI access via OpenShift OAuth bearer tokens (#2666) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> Co-authored-by: James Tanner <[email protected]> * Fix multiple run of upgrade tests (#2665) * test: fix multiple run of upgrade tests * test: fix v2tov3upgrade tests and test cli * test: add test on duplicated test names in cli * feat: add gen-ai image to the dashboard (#2673) * fixed bug where dup bearer token was added via cli access (#2684) * Remove Serverless Mode, Service Mesh, and Authorino Infrastructure from OpenDataHub Operator (#2560) * remove KServe Serverless mode, removed ServiceMesh, Serverless and Authorino infra add v2tov3 upgrade test remove ServiceMesh spec from DSCI v2 api remove servicemesh manifests remove servicemesh spec from dsci v1 sample, regenerate bundle disable webhook test retain only minimal servicemesh api, update docs generation ignore rules fix NoMatchError checking remove redundant istio/serverless predicates and status condition variables removed OSSMv2 pre-condition check from kserve controller remove serving spec and defaultDeploymentMode field from KServe spec, re-generate manifests and bundle remove leftover codeflare 
mention in CSV manifest * Clean up types, status and doc strings Signed-off-by: Christopher Sams <[email protected]> * re-enable webhook test, re-generate manifests and bundle * add ServiceMesh to removedCRDToCreate list, fix linter issues --------- Signed-off-by: Christopher Sams <[email protected]> Co-authored-by: Christopher Sams <[email protected]> Co-authored-by: Ugo Giordano <[email protected]> * chore: update manifest commit SHAs (#2668) Co-authored-by: openshift-merge-bot <[email protected]> * set the default mode for kube-auth-proxy (#2677) * fix: optimize failed test rerun (#2686) * RHOAIENG-34261 | Enable structured YAML for metrics exporters and fix UX inconsistency between metrics and traces (#2487) Co-authored-by: Dayakar Maruboena <[email protected]> * feat(RHOAIENG-35049): add support to pass RAGAS KFP image for downstream (#2566) Co-authored-by: Wen Zhou <[email protected]> * fix: remove test on old dashboard AProfile and HWProfile (#2696) Signed-off-by: Wen Zhou <[email protected]> * RHOAIENG-33159 | Replace accelerator and container size annotations to hardwareprofile annotations (#2675) * RHOAIENG-33159 | Replace accelerator and container size annotations to hardwareprofile annotations * Update: - remove custom-serving from config, dsci should only create default one - dynamically create custom-serving HWProfile CR in upgrade.go - move support functions into upgrade_utils.go - remove duplicated calls in the loops - simplify hwprofile name concat - remove unused gvk - change logging info. 
- remove old test on custom-serving Signed-off-by: Wen Zhou <[email protected]> * fix: lint Signed-off-by: Wen Zhou <[email protected]> * update: code review change - fix missing permisison on notebooks - fix namespace lookup for all namespace - fix missing annoataion for hwprofile namespace set Signed-off-by: Wen Zhou <[email protected]> * update: reduce unnecesary calls to get annoataions Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> Co-authored-by: Wen Zhou <[email protected]> * added kube auth proxy metrics address (#2693) * fix: annoatation for custom-serving HWProfile Signed-off-by: Wen Zhou <[email protected]> (cherry picked from commit 719aa32ce6d0fe24b62aa72aa8e2888e02e07c13) * update notebooks manifest hash for gateway support (#2705) * fix: Correct RAGAS KDP (#2702) Signed-off-by: Rui Vieira <[email protected]> * Make cookie expiration and refresh interval configurable (#2706) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * chore: update docs + api comments (#2712) Signed-off-by: Wen Zhou <[email protected]> * test: add default kueue configuration test (#2704) * fix: fixed quotes for HWP visibility annotation (#2717) * added validation for clientid/issuerurl (#2708) * Remove installedComponents field from DataScienceCluster v2 API (#2719) - Remove InstalledComponents field from v2 DataScienceClusterStatus - Keep InstalledComponents field in v1 for backward compatibility - Update conversion logic to construct v1 InstalledComponents from v2 component ManagementState - Add constructInstalledComponentsFromV2Status() function to build InstalledComponents map - Add getV1ComponentName() helper for component name mapping between versions - Update conversion tests with new construction logic - Fix integration tests to properly test v2→v1 conversion behavior - Remove InstalledComponents assignments from component handlers and templates The InstalledComponents field in v2 was redundant since 
component status already tracks management state. This change simplifies the v2 API while maintaining full backward compatibility for v1 users through automatic conversion during webhook processing. RHOAIENG-36418 * test: fix e2e resilience test to work also for rhoai (#2714) * chore: update manifest commit SHAs (#2694) Co-authored-by: openshift-merge-bot <[email protected]> * chore: update manifest commit SHAs (#2725) Co-authored-by: openshift-merge-bot <[email protected]> * fix: wrong image for kube-auth-proxy (#2716) Signed-off-by: Wen Zhou <[email protected]> * fix: improve test skip management to correctly manage unordered test run (#2721) * chore: update manifest commit SHAs (#2726) * chore: update manifest commit SHAs * Update get_all_manifests.sh Co-authored-by: Davide Bianchi <[email protected]> --------- Co-authored-by: openshift-merge-bot <[email protected]> Co-authored-by: Wen Zhou <[email protected]> Co-authored-by: Davide Bianchi <[email protected]> * update: remove unused oauth-proxy variable (#2727) Signed-off-by: Wen Zhou <[email protected]> * override image references from RHAIIS (#2734) * chore: update manifest commit SHAs (#2735) Co-authored-by: openshift-merge-bot <[email protected]> * GatewayConfig does not reconcile when DSCInitialization is created/updated/deleted (#2737) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * Use variable for manifest folder in get_all_manifests.sh (#2729) * fix: e2e test on main * fix: move ray sha to fix e2e tests. 
Improve get_all_manifests.sh * feat: remove rm of old manifests * feat: use annotation on kube-auth deployment when secret is changed (#2740) - calculate secret data to hash and set as annoatation to trigger deployment change => pod restart with new secret value - cleanu rbac - fix lint to use expectedODHDomain Signed-off-by: Wen Zhou <[email protected]> * Update: fix typos in code + update Oauthclient name (#2743) * chore: fix typos Signed-off-by: Wen Zhou <[email protected]> * update: change OAuthClient name to be platform agnostic Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2742) Co-authored-by: openshift-merge-bot <[email protected]> * fix: correct Unmanaged ManagementState description (#2746) Update the Unmanaged ManagementState description to accurately reflect the operator's behavior. The operator does not deploy or manage the component's lifecycle when in Unmanaged state, but may still create supporting configuration resources. This fixes the incorrect description that stated the operator is 'actively managing the component', which contradicted the expected semantics of an Unmanaged state. 
Fixes: RHOAIENG-32465 * feat: remove custom mark OdhDasboardConfig as unmanaged internally code (#2672) * chore: update manifest commit SHAs (#2752) Co-authored-by: openshift-merge-bot <[email protected]> * feat: add SNO-aware OpenTelemetry collector replica defaults (#2738) Set CollectorReplicas default to 1 for single-node clusters and 2 for multi-node clusters * cleaning up caikit-tgis-image (#2755) * update: cleanup kueue controller image + remove support on Managed status (#2757) * chore: cleanup kueue controller image --------- Signed-off-by: Wen Zhou <[email protected]> * refactor: keep gateway function in correct file (#2745) - gateway_auth_actions.go should only host functions : createKubeAuthProxyInfrastructure createEnvoyFilter - gateway_controller_actions.go only for funtions: createGatewayInfrastructure createDestinationRule syncGatewayConfigStatus - all the rest functions should be in gateway_support.go - gateway_util_test.go is for support testing function Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2761) Co-authored-by: openshift-merge-bot <[email protected]> * fix: remove appwrapper from kueue supported framework configuration (#2764) * chore: update bundle manifests for SNO-aware collector replicas (#2765) Update generated bundle manifests via make bundle * fix: Remove DSCI requirement for service reconciliation (#2750) * fix: Remove DSCI requirement for service reconciliation Services can now reconcile without DSCInitialization present: - Reconciler treats missing DSCI as acceptable (not an error) - Hash function handles nil DSCI with sentinel value - ApplicationNamespace helper implements on-demand DSCI fetch - Auth and Monitoring services use helpers for safe DSCI access - Templates handle nil DSCI gracefully This allows services to start before platform initialization and prevents them from getting stuck in failed state when DSCI doesn't exist yet. 
RHOAIENG-18035 * refactor: remove DSCI from ReconciliationRequest Make DSCI optional in the reconciliation framework by removing it from ReconciliationRequest. Actions now fetch DSCI on-demand using helper functions ApplicationNamespace() and MonitoringNamespace(). Changes: - Remove DSCI field from ReconciliationRequest struct - Update ApplicationNamespace/MonitoringNamespace helpers to fetch DSCI - Update template rendering to pass AppNamespace to templates - Update all actions and tests to work without rr.DSCI - Convert monitoring tests to use Gomega assertions This allows services to reconcile independently of DSCI existence. * Remove cleanup FeatureTrackers action from KServe controller * feat: remove secretegenerator controller (#2749) * feat: remove secretegenerator controller - we do not need that controller since OAuthClient will be created by Gateway service - move secret releated function into pkg Signed-off-by: Wen Zhou <[email protected]> * update: rewrite unit-test - remove over testing - only match two functions in secret.go Signed-off-by: Wen Zhou <[email protected]> * update: unit-test to use gomega Signed-off-by: Wen Zhou <[email protected]> * update: code review comments with more checks Signed-off-by: Wen Zhou <[email protected]> * update: remove docs content not valid any more Signed-off-by: Wen Zhou <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2768) Co-authored-by: openshift-merge-bot <[email protected]> Co-authored-by: Wen Zhou <[email protected]> * update: to support olmv1 we need to replace operatorcondition with (#2766) subscription - with this, we lost the ability to check dependent operator version - need a new predicate on subscritpion on .spec.name (metadata.name wont be useful) Signed-off-by: Wen Zhou <[email protected]> * update: enable seccomprofile to runtimedefault (#2770) - we are on ocp 4.19 which support this function Signed-off-by: Wen Zhou <[email protected]> 
* fix: on managed cluster, we cannot set kueue to Managed, API wont work (#2774) Signed-off-by: Ugo Giordano <[email protected]> * test: add parallel components and services test execution (#2771) * chore: update manifest commit SHAs (#2773) Co-authored-by: openshift-merge-bot <[email protected]> * Revert "update: to support olmv1 we need to replace operatorcondition with (#…" (#2779) This reverts commit 6fb1439c98c4a29424bedfdeb8f145603ef14657. * chore: uplift version from 3.0 to 3.2 (#2780) - update comments, some feature we need to keep till first stable v3 release Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2781) Co-authored-by: openshift-merge-bot <[email protected]> * docs: add e2e testing tips and FAQ section to README (#2682) * docs: add e2e testing tips and FAQ section to README Add comprehensive e2e testing guidance including setup instructions, common workflows, and troubleshooting examples. Includes minimum/recommended configurations, selective test execution patterns, and component-specific testing commands. * Clarify that env vars are space-separated * chore: update manifest commit SHAs (#2787) Co-authored-by: openshift-merge-bot <[email protected]> * chore: cleanup comments and remove unused status for ArgoCD (#2788) Signed-off-by: Wen Zhou <[email protected]> * set defaults based on metric log data from perf test (#2791) * chore: update manifest commit SHAs (#2794) Co-authored-by: openshift-merge-bot <[email protected]> * Watch ingress certificate secrets to automatically sync gateway cert on rotation (#2772) * RHOAIENG-34484 | build: Remove generated files from build (#2329) * build: Remove generated files from build Instead generate them as needed. In order to allow the bundle to be built from the existing `bundle.Dockerfile` mechanism, I introduced some some logic to generate it as a multi-stage dockerfile, where the first stage runs `make bundle`. Testing ------- 1. 
Build the bundle from main (`make bundle-build`), make note of the image hash 2. Build the bundle from this branch (`make bundle-build`), make note of the image hash 3. Mount both images (`podman image mount $hash1 ; podman image mount $hash2`) 4. compare the directories. I use `meld` for this. Note the only difference is in timestamp. * Set controller image in manager.yaml to REPLACE_IMAGE * Also remove config/crd/external from source * Remove generated webhook manifest * Fix typo in .gitignore * Pass version information as build args * Handle empty env vars for build args * Remove generated gatewayconfigs and servicemeshes crds * fix(build): Address PR feedback Add documentation comments explaining: - OPERATOR_VERSION arg avoids clash with VERSION var - tests directory needed for package references - .dockerignore retains tests for bundle build Revert some changes to the manager image overriding, instead reworking the variable interpretation. Use SED_COMMAND * (feat): Upgrade automation (#2782) - Update release automation to support manifest shas - Remove duplication by using common utils * chore: update manifest commit SHAs (#2802) Co-authored-by: openshift-merge-bot <[email protected]> * fix: generate manifest before run unit test, fetch all needed external CRD (#2804) * chore: clean up more images which are not used for ai hub and ai pipeline (#2795) * chore: clean up more images which are not used for ai hub and ai pipeline Signed-off-by: Wen Zhou <[email protected]> * Update internal/controller/components/modelregistry/modelregistry_support.go Co-authored-by: Jon Burdo <[email protected]> --------- Signed-off-by: Wen Zhou <[email protected]> Co-authored-by: Jon Burdo <[email protected]> * chore: remove unnecessary deployment mode configuration from KServe component (#2777) * Remove unnecessary DeployConfig struct from KServe component Since RawDeployment is the only supported deployment mode and it's hardcoded in the ConfigMap, we don't need the DeployConfig 
struct and getDeployConfig function. This cleanup is a follow-up to commit 8e20eeef which removed the deployment mode field from the API. Changes: - Remove DeployConfig struct and getDeployConfig function from config.go - Simplify updateInferenceCM to directly set hardcoded deployment mode - Update tests to parse JSON directly instead of using removed struct * Remove defaultDeploymentMode field from inference ConfigMap Since RawDeployment is the only supported mode, we no longer need to explicitly set the defaultDeploymentMode field in the inference ConfigMap. KServe will use RawDeployment by default. Changes: - Remove DeployConfigName constant (no longer needed) - Stop setting defaultDeploymentMode in updateInferenceCM - Remove deploy config validation from tests - Remove deploy config from test ConfigMap creation * chore: optimize SNO detection using Infrastructure ControlPlaneTopology (#2784) * chore: update manifest commit SHAs (#2810) Co-authored-by: openshift-merge-bot <[email protected]> * chore: update manifest commit SHAs (#2818) Co-authored-by: openshift-merge-bot <[email protected]> * feat: improve skip logic with parallel groups (#2819) * chore: fix default HWP description (#2821) * Skip oauth2_proxy cooike in upstream request (#2805) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * Make the subdomain configurable in gateway API (#2790) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * chore: update manifest commit SHAs (#2824) Co-authored-by: openshift-merge-bot <[email protected]> * feat: make more flex for the application namespace to be used for (#2814) Gateway (cherry picked from commit 01c7d6f6ecb0f49b3c906720982fe152e10a5ef3) Signed-off-by: Wen Zhou <[email protected]> * restrict httproutes to the application namespace (#2807) * restrict httproutes to the application namespace * fix unit tests * refactor: use cluster.GetApplicationNamespace() instead of custom getPlatformNamespace() * Call 
cluster.GetApplicationNamespace() directly instead of passing as parameter * linter fixes * adjust ext_authz timeout to 5s (#2823) * adjusted timetout to 5s * added env var and gateway config for auth timeout * chore: update manifest commit SHAs (#2830) Co-authored-by: openshift-merge-bot <[email protected]> * feat: use bot to update manifest sha (#2838) * docs: update pre-req for onboarding (#2835) * docs: update pre-req for onboarding Signed-off-by: Wen Zhou <[email protected]> * update: update prom rules for new stack * update: code review - typo - remove channel for cert-manager * update: wording and links * udpate: accurate info. --------- Signed-off-by: Wen Zhou <[email protected]> * feat: add env vars for configuring application and workbenches namespaces for e2e test suite (#2837) extend TestContext with the new field for workbenches namespace update CreateDSC() to support setting workbenches namespace via env var update README, set defaults for app and wb namespaces add variables for monitoring namespace update e2e test instructions for custom app namespace * chore(modelregistry): add PostgreSQL 16 image mapping (#2812) * chore(modelregistry): add PostgreSQL 16 image mapping Add IMAGES_POSTGRES to RELATED_IMAGE_RHEL9_POSTGRES16_IMAGE mapping in model registry image replacement configuration to support PostgreSQL 16 container image deployment. 
Related: https://github.com/opendatahub-io/model-registry-operator/pull/374 Co-Authored-By: Claude <[email protected]> Signed-off-by: Chris Hambridge <[email protected]> * Update internal/controller/components/modelregistry/modelregistry_support.go --------- Signed-off-by: Chris Hambridge <[email protected]> Co-authored-by: Claude <[email protected]> * chore: update manifest commit SHAs (#2843) Co-authored-by: openshift-merge-bot <[email protected]> * build: update image placeholder strategy (#2842) Jira: [RHOAIENG-38505](https://issues.redhat.com/browse/RHOAIENG-38505) Change the kustomize image reference from a hardcoded "controller" name to a "REPLACE_IMAGE" placeholder. The previous strategy didn't work properly for CI, because the [substitution logic](https://github.com/openshift/ci-tools/blob/2401b722f08b32d2a6926e1d918ef010ef37992c/pkg/steps/bundle_source.go#L90) modifies YAML files in the repo. * docs: update REAMDE for workbenchnamespace (#2845) Signed-off-by: Wen Zhou <[email protected]> * update: add check on dashboard's acceleratorprofile (#2846) - if CRD does not exist e.g from 3.0, no need own it -- no error into Operator Signed-off-by: Wen Zhou <[email protected]> * feat: add warning when create/update connectin API secret on S3 type but no AWS_S3_BUCKET or with "" as value (#2732) * feat: add warning when create/update connectin API secret on S3 type - if the secret does not have AWS_S3_BUCKET we show a warning but still allowed - warning if AWS_S3_BUCKET exist but has "" as value --------- Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2850) Co-authored-by: openshift-merge-bot <[email protected]> * (fix): Adapt release automation to the new make file changes (#2851) * feat: add e2e test case for verifying workbench namespace configuration (#2852) * Add comprehensive E2E tests for Gateway with OpenShift OAuth authentication (#2753) Signed-off-by: Gowtham Shanmugasundaram <[email protected]> * RHOAIENG-25593 | 
feat: adding perses instance (#2642) Co-authored-by: den-rgb <[email protected]> * RHOAIENG-34502 | feat: Enable rhoai build from main branch (#2220) * feat: Enable rhoai build from main branch This required sigificant changes to the Makefile and a few different strategies: - conditionally build different versions of some structs, where there is an irreconcilable difference between `main` and `rhoai` branches (using build tags) - maintain a separate overlay of manifests and separate bundle, tracking `rhoai` specific changes where necessary. Renamed directories: - `bundle` -> `odh-bundle` - `config` -> `odh-config` New directories: - `rhoai-bundle`: contains the RHOAI bundle - `rhoai-config`: contains the RHOAI manifests With these changes most Make targets now accept the `ODH_PLATFORM_TYPE` parameter, and operate in either an odh-mode by default, or a rhoai mode if overridden to any value other than `OpenDataHub`. `get_all_manifests.sh` now has a different mode when passed `ODH_PLATFORM_TYPE` other than `OpenDataHub`, where it looks at $VERSION and infers the downstream git reference to use. (It is most easily invoked via `make get-manifests ODH_PLATFORM_TYPE=rhoai`). This adds RHOAI-specific Dockerfiles for the operator and the bundle. See the difference between the rhoai versions and odh versions by using a diff tool, such as `meld` or `diff -u`. You can compare the resulting bundle for differences by checking out the rhoai branch, and comparing `bundle.rhoai` to `bundle` in the `rhoai` branch. There are a number of small differences related to changes that haven't been made to the `rhoai` branch. * Update rhoai contents * build: update CLEANFILES to use explicit paths Replace CONFIG_DIR variable with explicit odh-config and rhoai-config paths in CLEANFILES to ensure proper cleanup of generated files for both configurations. 
* refactor(api): consolidate build tag type definitions Extract platform-specific type definitions from monolithic files into separate build-tagged files. Move ODH and RHOAI variant type definitions into dedicated .odh.go and .rhoai.go files while keeping shared types in base files. Affected components: - ModelRegistry (components/v1alpha1) - Workbenches (components/v1alpha1) - DSCInitialization (dscinitialization/v1, v2) - Monitoring (services/v1alpha1) This refactoring improves code organization and maintainability by separating platform-specific implementations using Go build tags rather than duplicating entire files. * docs: fix incorrect reference in MonitoringCommonSpec description Update MonitoringCommonSpec documentation to correctly reference "Monitoring" instead of "Dashboard" as the shared desired state. Remove trailing blank lines at end of file. * fix: Remove stale DEFAULT_REF reference * build(makefile): update CLEANFILES to use specific bundle directories Replace generic BUNDLE_DIR variable with explicit bundle directory names (rhoai-bundle and odh-bundle) in CLEANFILES to improve clarity and ensure both bundle variants are properly cleaned. * chore(api): update copyright year to 2025 * docs: Restore extraneous lines in api docs Locally, I use a pre-commit hook to clean up ends-of-files, but this is incompatible with our CI check. * Fix VERSION for odh/rhoai and make them independent * build: update image placeholder strategy Jira: [RHOAIENG-38505](https://issues.redhat.com/browse/RHOAIENG-38505) Change the kustomize image reference from a hardcoded "controller" name to a "REPLACE_IMAGE" placeholder. The previous strategy didn't work properly for CI, because the [substitution logic](https://github.com/openshift/ci-tools/blob/2401b722f08b32d2a6926e1d918ef010ef37992c/pkg/steps/bundle_source.go#L90) modifies YAML files in the repo. 
* Update placeholder for rhoai manager image * upgrade Go to 1.24 and update rhoai Dockerfile - Upgrade GOLANG_VERSION from 1.23 to 1.24 - Set ODH_PLATFORM_TYPE=rhoai in manifest script - Replace kueue-configs with hardwareprofiles config - Install tar for component dev workflow compatibility - Remove redundant chown in favor of COPY --chown * build(makefile): use platform-specific Dockerfile for image builds Add DOCKERFILE_FILENAME variable to dynamically select between Dockerfile (ODH) and rhoai.Dockerfile (RHOAI) based on ODH_PLATFORM_TYPE. This ensures image-build target uses the correct Dockerfile for each platform variant. * docs(readme,testing): document RHOAI build mode and update test guidelines Update README with instructions for building operator and bundles in RHOAI mode using ODH_PLATFORM_TYPE=rhoai flag. Remove bundle manifests from mandatory integration testing requirements and fix minor formatting inconsistencies. * Move connectionApi * push the e2e test image with the tag main instead of adding the commit hash (#2856) * update: move Prom Rule tests into component folder (#2858) - test on the PrometheusRule CR not the old Prom config files - remove triage which is only for SRE - rename rule files to match component name - remove deadmansnitch which is only for SRE Signed-off-by: Wen Zhou <[email protected]> * add some enhancements to the e2e push gha to have a version for odh (main branch) and rhoai X.Y (rhoai-x.y branch) (#2860) * RHOAIENG-25595 | Add Perses-Tempo datasource integration (#2553) Co-authored-by: Dayakar Maruboena <[email protected]> * Remove load restrictions none for kustomize builds (#2863) * feat: remove LoadRestrictionsNone from kustomize * feat: move rhoai-config into odh-config/rhoai overlay * rebase with master * fix: generate rhoai manifest and webhook name * Update manifest sha in get_all_manifests.sh correctly (#2870) * fix: update manifest sha in get_all_manifests.sh correctly * fix: revert namespace name in makefile * 
feat: update rhoai manifests to rhoai-3.2 tag * docs: note bundle is still needed for rhoai branch (#2801) * docs: note bundle is still needed for rhoai branch * Fix typo Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * fix: dsc odh samples (#2869) * update: cleanup unnecessary upgrade logic (#2695) - remove functions and calls not needed for 3.2 (should be removed in 3.0 already) jupyterhub resource cleanup rbac for modelreg envoyfilter patch for serving watson resource docs patch odhdashboardconfig for trusty enablement - move and cleanup function for AP to HWP migration Signed-off-by: Wen Zhou <[email protected]> * chore: update manifest commit SHAs (#2875) Co-authored-by: openshift-merge-bot <148852131+openshift-merg…
Description
On the current 3.0 build, the operator log is flooded with the following error when running e2e:
{"level":"error","ts":"2025-11-11T10:51:56Z","logger":"controller-runtime.source.EventHandler","msg":"if kind is a CRD, it should be installed before calling Start","kind":"AcceleratorProfile.dashboard.opendatahub.io","error":"no matches for kind \"AcceleratorProfile\" in version \"dashboard.opendatahub.io/v1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:71\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2\n\t/opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:87\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:88\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
This change should eventually go away with #2698, but since we have not decided when to make the final switch, this PR just makes the log cleaner and gives the controller one less CRD to own.
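The idea behind the guard can be sketched in plain Go. This is a minimal illustration only, not the operator's code: `GVK`, `watch`, `installedIn`, and `activeWatches` are hypothetical names invented for this example, and the map lookup stands in for whatever CRD-existence check the controller actually performs against the API server.

```go
package main

import "fmt"

// GVK is a simplified, hypothetical stand-in for a Kubernetes GroupVersionKind.
type GVK struct {
	Group, Version, Kind string
}

// watch describes one owned type plus an optional guard.
// A nil guard means the type is always watched.
type watch struct {
	gvk   GVK
	guard func(GVK) bool
}

// installedIn returns a guard that passes only when the GVK is present in
// the given set. A real operator would ask the API server instead
// (assumption: this map is purely illustrative).
func installedIn(crds map[GVK]bool) func(GVK) bool {
	return func(g GVK) bool { return crds[g] }
}

// activeWatches keeps the watches whose guard is nil or returns true, so a
// type whose CRD is missing is skipped instead of failing at start-up.
func activeWatches(all []watch) []GVK {
	var out []GVK
	for _, w := range all {
		if w.guard == nil || w.guard(w.gvk) {
			out = append(out, w.gvk)
		}
	}
	return out
}

func main() {
	deployment := GVK{"apps", "v1", "Deployment"}
	accelerator := GVK{"dashboard.opendatahub.io", "v1", "AcceleratorProfile"}

	// Simulate a 3.0 cluster where the AcceleratorProfile CRD is not installed.
	installed := map[GVK]bool{deployment: true}

	watches := []watch{
		{gvk: deployment},
		{gvk: accelerator, guard: installedIn(installed)},
	}

	for _, g := range activeWatches(watches) {
		fmt.Println("watching", g.Kind)
	}
}
```

With the guard in place, the missing kind is never handed to the watch machinery, which is why the "should be installed before calling Start" error no longer appears in the log.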
How Has This Been Tested?
Screenshot or short clip
Merge criteria
E2E test suite update requirement
When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.
To opt-out of this requirement:
Provide a justification in the E2E update requirement opt-out justification section below.
E2E update requirement opt-out justification
This logic is deprecated in v3 but is being kept for a short while.
Summary by CodeRabbit