
Conversation

@zdtsw
Member

@zdtsw zdtsw commented Nov 11, 2025

  • If the CRD does not exist (e.g. from 3.0 onward), there is no need to own it, which avoids the error in the operator log

Description

On the current 3.0 build, the operator log is flooded with this error when running e2e:
{"level":"error","ts":"2025-11-11T10:51:56Z","logger":"controller-runtime.source.EventHandler","msg":"if kind is a CRD, it should be installed before calling Start","kind":"AcceleratorProfile.dashboard.opendatahub.io","error":"no matches for kind \"AcceleratorProfile\" in version \"dashboard.opendatahub.io/v1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:71\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2\n\t/opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:87\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:88\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}

This change should eventually become unnecessary once #2698 lands, but since we have not decided when to make the final switch, this PR just cleans up the log and leaves the controller with one less CRD to watch.

How Has This Been Tested?

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work
  • The developer has run the integration test pipeline and verified that it passed successfully

E2E test suite update requirement

When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.

To opt-out of this requirement:

  1. Please inspect the opt-out guidelines, to determine if the nature of the PR changes allows for skipping this requirement
  2. If opt-out is applicable, provide justification in the dedicated E2E update requirement opt-out justification section below
  3. Check the checkbox below:
  • Skip requirement to update E2E test suite for this PR
  4. Submit/save these changes to the PR description. This will automatically trigger the check.

E2E update requirement opt-out justification

Deprecated logic in v3, kept only for a short while

Summary by CodeRabbit

  • Bug Fixes
    • Improved dashboard component stability by adding a configuration existence check. The system now gracefully handles scenarios where optional dashboard components are not available, preventing unnecessary resource watching and potential conflicts.

- if the CRD does not exist (e.g. from 3.0 onward), there is no need to own it,
  which avoids the error in the operator log

Signed-off-by: Wen Zhou <[email protected]>
@coderabbitai
Contributor

coderabbitai bot commented Nov 11, 2025

Walkthrough

The dashboard component's watch for DashboardAcceleratorProfile is updated to include a CRD-existence guard. The OwnsGVK call now passes reconciler.Dynamic(reconciler.CrdExists(gvk.DashboardAcceleratorProfile)) instead of reconciler.Dynamic(), ensuring the controller only watches the CRD when it exists.
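
The guard described in the walkthrough can be sketched in isolation. The snippet below is a minimal, self-contained illustration and not the operator's actual code: `GVK`, `DynamicPredicate`, `crdExists`, `shouldWatch`, and the in-memory `installed` set are hypothetical stand-ins for `reconciler.Dynamic`, `reconciler.CrdExists`, and a lookup against the API server.

```go
package main

import "fmt"

// GVK is a simplified stand-in for a Kubernetes group/version/kind.
type GVK struct {
	Group, Version, Kind string
}

// DynamicPredicate decides at runtime whether a watch should be registered.
// Hypothetical type, used here only for illustration.
type DynamicPredicate func(installed map[GVK]bool) bool

// crdExists returns a predicate that passes only when the given kind is
// present in the set of installed CRDs (an in-memory stand-in for discovery).
func crdExists(gvk GVK) DynamicPredicate {
	return func(installed map[GVK]bool) bool {
		return installed[gvk]
	}
}

// shouldWatch reports whether a watch may be started: every predicate must pass.
// With no predicates the watch is unconditional in this sketch.
func shouldWatch(installed map[GVK]bool, preds ...DynamicPredicate) bool {
	for _, p := range preds {
		if !p(installed) {
			return false
		}
	}
	return true
}

func main() {
	ap := GVK{Group: "dashboard.opendatahub.io", Version: "v1", Kind: "AcceleratorProfile"}

	fresh := map[GVK]bool{}          // cluster without the CRD
	legacy := map[GVK]bool{ap: true} // cluster that still ships the CRD

	fmt.Println(shouldWatch(fresh, crdExists(ap)))  // false: watch skipped, nothing to log
	fmt.Println(shouldWatch(legacy, crdExists(ap))) // true: CRD present, watch registered
	fmt.Println(shouldWatch(fresh))                 // true: unguarded watch is attempted anyway
}
```

The last call shows why the unguarded form produces the "no matches for kind" error: the watch is attempted even though the kind is not installed.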

Changes

Cohort / File(s) | Change Summary
Dashboard Controller Guard (internal/controller/components/dashboard/dashboard_controller.go) | Modified OwnsGVK call for DashboardAcceleratorProfile to add CRD-existence guard via reconciler.CrdExists()

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

  • Verify the guard placement is correct within the reconciler setup
  • Confirm reconciler.CrdExists() is the appropriate guard function for this use case
  • Ensure no other DashboardAcceleratorProfile watches need the same guard applied

Poem

🐰 A guard stands watch at the CRD's door,
No phantom shadows to worry anymore!
The profile now checks—does it exist, truly?
Before dancing the reconcile routine, dutifully! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title directly relates to the main change: adding a CRD-existence check for the dashboard's AcceleratorProfile in the controller.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.

@zdtsw zdtsw requested review from carlkyrillos and removed request for chambridge and sefroberg November 11, 2025 12:31
@codecov

codecov bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 50.07%. Comparing base (6891bef) to head (2e67562).
⚠️ Report is 3 commits behind head on main.

Files with missing lines | Patch % | Lines
...oller/components/dashboard/dashboard_controller.go | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2846   +/-   ##
=======================================
  Coverage   50.07%   50.07%           
=======================================
  Files         144      144           
  Lines       10469    10469           
=======================================
  Hits         5242     5242           
  Misses       4669     4669           
  Partials      558      558           


@openshift-ci

openshift-ci bot commented Nov 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carlkyrillos


@openshift-merge-bot openshift-merge-bot bot merged commit 70410f3 into opendatahub-io:main Nov 11, 2025
18 of 19 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in ODH Platform Planning Nov 11, 2025
@carlkyrillos
Member

/cherrypick rhoai

@openshift-cherrypick-robot

@carlkyrillos: new pull request created: #2849


openshift-merge-bot bot added a commit that referenced this pull request Nov 27, 2025
* Remove servicemesh not installed check. (#2443)

After adding the gatewayconfig CRD and controller, ODH and RHOAI will
always have servicemesh installed automatically by the openshift ingress
controller. It will not be possible to get into a state without OSSM 3.x
installed.

* remove redundant e2e test case (#2454)

* Change naming convention in mod arch image for dashboard (#2449)

* fix: fix catalog image related_image env variable name as per devops config (#2452)

Signed-off-by: Dhiraj Bokde <[email protected]>

* update: sample of DSCI and DSC + README (#2447)

- since we have introduced GW API creation, we should update our sample
- OSSM v3 is installed out of the box, so we should no longer advocate keeping OSSM v2 as "Managed"
- set to Unmanaged rather than Removed, to avoid accidental deletion
- this whole section should be removed soon in the 3.0 release

Signed-off-by: Wen Zhou <[email protected]>

* fix: connection API on ISVC if type is changed + only create serviceaccount if type is s3 (#2433)

* fix: connection API on ISVC if type is changed

- previously we only handled the case where the type is removed, cleaning up the injection
- now we handle update as well
- refactor some old code
- handle serviceaccount creation and injection for s3 only

Signed-off-by: Wen Zhou <[email protected]>

* fix: lint

Signed-off-by: Wen Zhou <[email protected]>

* update: code review

- correct log and skip Patch if no cleanup is needed

Signed-off-by: Wen Zhou <[email protected]>

* update: fix case when there was no annotation

- if isvc exists but has no annotation: the user might have manually set certain parts, e.g. .spec.predictor.model.storageUri
 on update with the annotation, the old storageUri etc. should be removed before the new injection to avoid conflicts

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* feat(RHOAIENG-30795): Add configuration support to TrustyAI DSC component (#2225)

* feat(RHOAIENG-30795): Add configuration support to TrustyAI DSC component

* fix: Add missing docs

* fix: Fix linting errors

* fix: Remove redundant labels and pointer to boolean fields

Since they have default values, no need to use pointers.

* fix: Add TrustyAI CM to operator's resources. Remove unneeded annotations.

* fix: Change tests to account for AddResources

* fix: Linting of trustyai_test.go

* Merge test for default CM and enabled component

* Fix linting (remove dupl)

* chore: make bundle

* fix: revert REPLACE_IMAGE

* chore: skip HWP migration if dashboard HWP CRD missing (#2470)

* feat: add support to pass Oauth proxy image for downstream (#2466)

* add override for new workbenches (#2465)

* prometheus test should check monitoring namespace not apps namespace (#2453)

* feat: add new VAP to block create/update operation on dashboard's HWProfile CR and AcceleratorProfile CR (#2378)

* feat: add new VAP to block create/update operation on dashboard's HWProfile CR and AcceleratorProfile CR

- since this will only work on OCP 4.19+, there is no need to check the OCP version, as VAP/VAPB is enabled in the cluster
- in case the cluster already has dashboard's AP/HWP CRD
  	DSCI should deploy and own 4 new resources: 2 VAP and 2 VAPB, to block any attempt to create a dashboard AcceleratorProfile/HardwareProfile CR or update an existing AcceleratorProfile/HardwareProfile CR
	these 4 resources will be re-created if the user tries to delete them
	when DSCI is deleted, these 4 resources should be removed from the cluster

Signed-off-by: Wen Zhou <[email protected]>

* fix: testscases

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* update: remove logic set default value in odhdashboardconfig in upgrade (#2469)

Signed-off-by: Wen Zhou <[email protected]>

* feat: optimize E2E test framework performance and error handling (#2418)

* feat: add comprehensive E2E testing framework with improved resource management

- Add DeleteResources function for efficient bulk resource deletion with options
- Implement EnsureResourceDeletedThenRecreated with UID-based verification
- Add WithRemoveFinalizersOnDelete option for automatic finalizer handling
- Support WithWaitForRecreation flag for controller-managed resources
- Replace ExpectedErr with AcceptableErrMatcher using proper Kubernetes error codes
- Fix tryRemoveFinalizers to handle success, NotFound, and NoKindMatchError gracefully
- Add controllerCacheRefreshDelay to prevent cache staleness issues
- Consolidate cleanup functions and remove redundant code patterns
- Add ResourceQuota GVK definition and improve bulk deletion documentation

This comprehensive framework eliminates race conditions in controller-managed
resource testing and provides consistent, reliable patterns for deletion-recreation
scenarios with enhanced safety and performance optimizations.

* feat: enhance core component E2E tests and ServiceMesh organization

- Add comprehensive component testing with enhanced determinism
- Improve test organization and maintainability across core test suites
- Move resilience tests after functional tests for cleaner failure attribution
- Fix KnativeServing test synchronization with deletion/recreation flags
- Add proper scoping to cleanupListResources with safety checks
- Optimize ServiceMesh test organization with consolidated validation patterns
- Extract common validation helpers following "with" pattern conventions
- Consolidate repetitive operator configuration patterns using shared constants
- Add reusable helpers for monitoring resource cleanup and DSCI validation
- Move shared operator namespace constants to centralized helper file
- Rename validation functions for consistency (buildCapabilityConditions → withCapabilityConditions)
- Reorganize helper functions placement for better code structure
- Optimize E2E performance with better resource management
- Apply framework patterns across component test suites
- Refactor UninstallOperator to use ResourceOpts pattern for better consistency
- Replace hard sleeps with WithWaitForDeletion in ServiceMesh operator uninstallation
- Remove custom uninstallOperatorWithChannel function in favor of standardized approach
- Eliminate duplicate validation patterns with DRY principle application
- Refactor Gateway tests and centralize GVKs

This enhances the core testing infrastructure with improved coverage,
resilience patterns, ServiceMesh test organization, and maintainable
validation helpers that reduce code duplication while following
established framework conventions.

# Conflicts:
#	tests/e2e/servicemesh_test.go

# Conflicts:
#	tests/e2e/creation_test.go
#	tests/e2e/dashboard_test.go
#	tests/e2e/monitoring_test.go

* Removing the ServiceMesh controller and some ServiceMesh/Authorino e2e tests, as they will no longer be present in 3.0.

* Commenting out gateway test cases for now, as the feature is still in the early stages of implementation.

* feat: add support for llmisvc to use kueue (#2293)

* feat: add support for llmisvc for kueue

- rename old kueue one for isvc so we can have both isvc and llmisvc

Signed-off-by: Wen Zhou <[email protected]>

* fix: code review

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* feat: add GH Action to enforce e2e test suite update for code-related PRs (#2426)

fix: split workflow into two to allow the bot to comment on fork PRs

reuse existing error comments and add automatic error comment cleanup, to prevent PR noise

add option to provide skip justification in PR description

re-use same error comment for both error cases

remove functionality to add justification via PR comment, to simplify the workflow and improve user experience

* addressed comments in PR 2404 (#2444)

* Use custom manifests without devFlags (#2427)

* Use custom manifests without devFlags

Jira: https://issues.redhat.com/browse/RHOAIENG-31642

This gives instructions for an alternative approach to using custom
manifests with the operator, which don't require the use of "devFlags"
in the DSC.

The intention is that this will allow us to remove the devFlags
functionality for 3.0.

This is a primitive approach for now. I expect that folks will start
to build better development patterns rather than needing to modify the
patch file for their component, etc. They can even have their own
approaches in their own repositories, and this may just be an example.

* Don't re-chown files in Dockerfile

They're already chowned as expected from the COPY instruction on L60.

* feat: add comprehensive debug utilities for E2E test failures and timeouts (#2486)

* feat: add comprehensive debug utilities for E2E test failures and timeouts

- Add debug_utils_test.go with global panic and failure handlers
- Integrate HandleGlobalPanic() and HandleTestFailure() at multiple test levels
- SetGlobalDebugClient() automatically configured in NewTestContext()
- Provides detailed diagnostics on ANY test failure (panics, timeouts, assertions):
  * Cluster state and node health with resource allocation percentages
  * Namespace resources (deployments, pods, container states)
  * Operator deployment status and pod logs
  * DSCI/DSC status and unhealthy conditions
  * Recent events from key namespaces (last 5 minutes)
  * Resource quota violations

Addresses debugging challenges for deployment recreation failures,
resource exhaustion, and other cluster-level issues in E2E tests.
Activates automatically without requiring manual intervention.

* Added security redaction for secrets, tokens, and credentials in logs

* fix: limit e2e update requirement check GH action only to the main branch (#2497)

chore: improve documentation and error messages

* update: temp. disable e2e test on dashboard (#2499)

- this should be reverted once dashboard is fixed

Signed-off-by: Wen Zhou <[email protected]>

* removing the py311 images (#2484)

Co-authored-by: Wen Zhou <[email protected]>

* RHOAIENG-29717 | feat: Deploy thanos querier frontend for metrics access (#2448)

Co-authored-by: den-rgb <[email protected]>

* chore: e2e update requirement GH action: edit regex for justification section title, ensure guidelines are consistent (#2509)

* fix: prevent secret data leakage in operator error logs (#2471)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* test: increase resource check timeout, fix debug (#2506)

* Remove codeflare operator, monitoring and tests (#2468)

* feat(RHOAIENG-34034): remove codeflare operator, monitoring and tests

* test(RHOAIENG-34034): add e2e test to verify codeflare component is not removed

* fix(RHOAIENG-34034): improve test and add test flag in README

* test(RHOAIENG-34034): use consistently check

* test: improve test readability and reusability

* fix: watch on Dashboard's AProfile/HWProfile by DSCI (#2512)

- add a watch on these two CRDs for create/update events, so that the VAP/VAPB
  get created
- this is needed for the dashboard test case once dashboard is enabled and
  still ships these two CRDs
- we can probably remove this logic, along with the dashboard tests on
  VAP/VAPB, once dashboard removes the CRDs from its manifests

Signed-off-by: Wen Zhou <[email protected]>

* chore: log on webhook for notebook with HWProfile (#2491)

- distinguish between HWProfile not found and a failed client Get()
- update e2e tests to add new error code

Signed-off-by: Wen Zhou <[email protected]>

* feat:  auto detect auth, kube auth proxy, envoy filter (#2490)

* fix gateway controller namespace handling, add status sync, and improve test patterns

* feat: detect auth, kubeauthproxy, envoyfilter

* added better support for oidc, fixed kubeproxy/envoyfilter issue

* used apply instead of addresources for secret

* fixed cookie mismatch for oauth/oidc

* Initial effort at migrating hardwareprofiles from v1alpha1 to v1 (#2398)

Rebased, included v1alpha1 hardwareprofile changes

added deprecated markers to v1alpha1 hardwareprofiles

Rebased on latest, incremental Makefile version bump, regenerated bundle

v1alpha1 HardwareProfiles must be registered to scheme

minor comment update to reflect infrav1 hardwareprofiles

adding hardwareprofile test cases

# Conflicts:
#	bundle/manifests/opendatahub-operator.clusterserviceversion.yaml

Signed-off-by: Max Whittingham <[email protected]>

* chore: remove unused pkg/feature and related integration tests (#2521)

* chore: remove unused pkg/feature and related integration tests

* chore: remove unused unit-tests1.yaml workflow

* collector validation should check metrics resources or storage (#2515)

* refactor: remove duplicated reconciler crd existence check (#2483)

* chore: uplift version + fix bundle + fmt + fix monitoring tests (#2526)

Signed-off-by: Ugo Giordano <[email protected]>

* update: add cleanup when --fail-fast is called we still get leftover cleanup (#2478)

Signed-off-by: Wen Zhou <[email protected]>

* implemented destination rule for TLS to upstream services (#2536)

* feat: add default HWProfile CR "default-profile" (#2461)

* feat: add default HWProfile CR "default-profile"

- remove the sample we added before; with this PR the operator creates one in the cluster
- the user can change this default-profile CR; the operator won't reconcile it
- if the user changes DSCI or DSCI gets reconciled, a default-profile CR
  with user-modified values should not be reset
- if the user deletes this default-profile CR, the operator creates a new one with default values
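
The create-if-missing behaviour in the list above can be sketched as follows. This is an illustration under stated assumptions, not the operator's implementation: `Profile`, `Store`, `ensureDefaultProfile`, and the default CPU value are hypothetical, and a plain map stands in for the cluster API.

```go
package main

import "fmt"

// Profile is a simplified stand-in for a HardwareProfile CR.
type Profile struct {
	Name string
	CPU  string
}

// Store is a hypothetical in-memory stand-in for the cluster API.
type Store map[string]Profile

// ensureDefaultProfile creates "default-profile" only when it is absent.
// An existing object is never overwritten, so user edits survive reconciles.
// It returns true when a new object was created.
func ensureDefaultProfile(s Store) bool {
	const name = "default-profile"
	if _, found := s[name]; found {
		return false
	}
	s[name] = Profile{Name: name, CPU: "2"} // default value is illustrative only
	return true
}

func main() {
	cluster := Store{}
	fmt.Println(ensureDefaultProfile(cluster)) // true: created on first reconcile

	// A user edits the profile; a later reconcile must not reset it.
	cluster["default-profile"] = Profile{Name: "default-profile", CPU: "8"}
	fmt.Println(ensureDefaultProfile(cluster), cluster["default-profile"].CPU) // false 8

	// The user deletes the profile; the next reconcile recreates it with defaults.
	delete(cluster, "default-profile")
	fmt.Println(ensureDefaultProfile(cluster), cluster["default-profile"].CPU) // true 2
}
```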

Signed-off-by: Wen Zhou <[email protected]>

* update: code review

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* feat: switch isvc connection annotation from type to protocol (#2532)

* feat: switch isvc connection annotation from type to protocol

* fix: addressed PR comments

* Update unit-tests2.yaml to codecov-actions v5.5.1 (#1) (#2542)

Update unit-tests2.yaml to codecov-actions v5.5.1

* fix: version inconsistency (#2544)

Signed-off-by: Wen Zhou <[email protected]>

* add support for overriding kserve-llm-d (#2537)

* feat: add oauth proxy parametrization to dashboard (#2547)

* feat: add support for llmisvc to use HWProfile (#2424)

* feat: add support for llmisvc to use HWProfile

- if HWprofile has .spec.identifiers, add this into .spec.template.containers(llmisvc)
- if HWprofile has .spec.scheduling
  - if it is type: kueue, set label "kueue.x-k8s.io/queue-name" with value  .spec.scheduling.kueue.localqueue
  - if it is type: node and has .spec.scheduling.node
      .spec.scheduling.node.nodeSelector add into .spec.template.nodeSelector(llmisvc)
      .spec.scheduling.node.tolerations add into .spec.template.tolerations(llmisvc)

Signed-off-by: Wen Zhou <[email protected]>

* fix: add support for llmisvc with min config

- if user only set spec: {} when create llmisvc, we should:
  1. create a container named as "main"
  2. inject identifier from HWProfile

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* Support override for runtime images (#2464)

* support override for runtime images

* add override for vllm-spyre image

* fix: prevent cache false hits for resources stuck in deletion (#2527)

* fix: bypass cache for objects with deletionTimestamp to prevent stuck resources

- Add cache bypass logic for objects with deletionTimestamp in deploy action
- Implement proactive cache cleanup in ShouldSkip method
- Add Delete method to Cache for cleaning up stale entries
- Extract cache logic into isCachedAndShouldSkip helper method
- Add comprehensive unit tests for cache cleanup verification
- Apply line-of-sight principle for better code readability

Fixes cache bug where resources.Hash() removes deletionTimestamp causing
identical hashes for objects with/without deletion timestamps. This led
to false cache hits preventing recreation of any Kubernetes resource type
(deployments, services, configmaps, secrets, CRDs, etc.) during E2E tests.

Root cause: pkg/controller/predicates/resources/resources.go:295
Impact: E2E test failures when resources stuck in deletion while still in cache
Solution: Objects with deletionTimestamp always bypass cache and trigger cleanup
Scope: Affects all Kubernetes resources deployed via deploy action

* refactor: simplify cache test suite and improve terminating object handling

- Skip deployment for terminating objects instead of attempting and failing
- Refactor ShouldSkip to handle terminating objects before cache logic
- Rename ProcessCacheEntry to CleanupIfTerminating for clarity
- Make Cache.Delete method idempotent
- Convert TestDeployWithCacheAction to clean table-driven approach
- Remove unused test helper functions (saved ~150+ lines)
- Simplify test structure with direct function calls in table
- Apply micro-optimization to avoid DeepCopy on skip path

All tests pass with improved maintainability and performance.

* Update action_deploy_cache.go

Removed redundant check when deleting a key from the cache

* fix(RHOAIENG-34533): resolve TrustyAI DSC validation error when patching DSC eval flags (#2525)

* fix(RHOAIENG-34533): resolve TrustyAI DSC validation error when patching DSC eval flags

This change resolves RHOAIENG-34533 by converting TrustyAI evaluation configuration
fields from boolean to enum string values.

Changes:
- Convert PermitCodeExecution and PermitOnline from bool to string with enum validation (allow/deny)
- Add proper default handling and constants for evaluation permissions
- Update CRDs, bundle manifests, and test expectations
- Maintain backward compatibility in ConfigMap generation
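
The bool-to-enum conversion listed above might look roughly like the sketch below. The names `EvalPermission`, `resolve`, and `configMapValue` are hypothetical, and two details are assumptions not confirmed by the commit message: that an unset field defaults to "deny", and that the ConfigMap keeps a "true"/"false" string for backward compatibility.

```go
package main

import (
	"fmt"
	"strconv"
)

// EvalPermission is a string enum replacing a former boolean field.
type EvalPermission string

const (
	EvalPermissionAllow EvalPermission = "allow"
	EvalPermissionDeny  EvalPermission = "deny"
)

// resolve applies the default for an unset field and rejects unknown values.
// Treating an empty value as "deny" is an assumption of this sketch.
func resolve(p EvalPermission) (EvalPermission, error) {
	switch p {
	case "":
		return EvalPermissionDeny, nil
	case EvalPermissionAllow, EvalPermissionDeny:
		return p, nil
	default:
		return "", fmt.Errorf("invalid permission %q: must be %q or %q",
			p, EvalPermissionAllow, EvalPermissionDeny)
	}
}

// configMapValue renders the permission as a "true"/"false" string, the
// format assumed here for the previously boolean ConfigMap entry.
func configMapValue(p EvalPermission) string {
	return strconv.FormatBool(p == EvalPermissionAllow)
}

func main() {
	for _, in := range []EvalPermission{"", "allow", "deny", "true"} {
		p, err := resolve(in)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%q -> %s (ConfigMap value %s)\n", in, p, configMapValue(p))
	}
}
```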

* add updated API

* Simplify CM value assignment, revert image name to v3.0.0

* test: revert "temp. disable e2e test on dashboard" from PR2507 and fix VAP/VAPB e2e tests (#2561)

* Revert "update: temp. disable e2e test on dashboard (#2499)"

This reverts commit 9e847d546cb75c2bb14bd32ef54df0f1fc1c5894.

* fix: VAP/VAPB testcase

- move from DSCI to Dashboard

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* feat: add oauth proxy parametrization to dashboard (#2554)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* RHOAIENG-34055: add ray sanity check for codeflare removal (#2514)

* feat(RHOAIENG-34055): add ray sanity check for codeflare removal

* feat(RHOAIENG-34055): create a common sanitycheck action

* feat: add GHA to auto build and push e2e tests image after each merge to main (#2543)

* checkin (#2555)

* remove useless kueue-batch-user-rolebinding (#2571)

Removed the config/kueue-configs/ directory and all associated references
as the kueue-batch-user-rolebinding functionality is no longer needed.

* OCP console link to odh should match gateway address (#2569)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>
Co-authored-by: James Tanner <[email protected]>

* fix: add llmisvc kueue test case (#2573)

* feat: add support for Connection API in LLMInferenceservice

- add create/update/removal case
- create $llmisvc-sa serviceaccount and inject into .spec.template.serviceAccountName
- secret should support both cases, data.URI and data.http-hosts,
  according to kserve, for both isvc and llmisvc
- simplify logging by removing the duplicated error; the caller prints the error
- only remove the connection-added imagepullsecret from the list when the
  annotation is removed
- add support for changing only the connection-path value, in which case
  .spec.mode.uri should be updated too
- on update: if the path is set manually in the spec, use it; if the annotation
  is set, use the annotation; if neither is set and a path already exists
  (maybe created by gitops or dashboard), keep it

Signed-off-by: Wen Zhou <[email protected]>

* update: code review

- fix typo
- update handleSA() to deal with servicename more precisely
- cleanup func and const not in use any more

Signed-off-by: Wen Zhou <[email protected]>

* update: code review

- remove duplicated function for serviceaccount
- add check on both annotations

Signed-off-by: Wen Zhou <[email protected]>

* update: code review

- serviceAccountName should only be removed if it does not have the same
  value as the one derived from the secret's name
- serviceAccountName should only be injected if it did not have a value
  already

Signed-off-by: Wen Zhou <[email protected]>

* RHOAIENG-31870: Add DSC and DSCI v2 versions. (#2505)

RHOAIENG-35095: Webhook dsc, dsci and kueue integration_test.go unit-tests disabled
Added hack/buildLocal.sh : a script to build locally and deploy to crc

* cleanup unused referencegrant gvk code (#2576)

* Add clusterrole for allowedgroups (#2494)

* add clusterrole for allowedgroups

fix unit test

remove unused role for allowed group

* fix comment and error message

* Bug fix for gateway OIDC mode (#2591)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>
Co-authored-by: James Tanner <[email protected]>

* Removed monitoring namespace from flags, and documentation to set via env vars. Viper now sets the defaults vars for monitoring namespace instead. (#2538)

Signed-off-by: Max Whittingham <[email protected]>

* Update golangci-lint to v2.5.0 (#2582)

* chore: fix api version (#2594)

Signed-off-by: Wen Zhou <[email protected]>

* fix: retrieve all files in a PR (#2595)

* feat(modelregistry): add benchmark data image environment variable mapping (#2498)

Add IMAGES_BENCHMARK_DATA environment variable mapping to support
benchmark data injection in model registry deployments. This enables
the model-registry-operator to configure benchmark data images via
init containers for benchmark visualization and analysis features.

Related to: https://github.com/opendatahub-io/model-registry-operator/pull/324

Follows the same pattern established in previous PRs:
- https://github.com/opendatahub-io/opendatahub-operator/pull/2429
- https://github.com/opendatahub-io/opendatahub-operator/pull/2452

🤖 Generated with [Claude Code](https://claude.ai/code)

Signed-off-by: Chris Hambridge <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: Wen Zhou <[email protected]>

* RHOAIENG-33891: DSC v1 should not be created or updated with Kueue component state set to Managed (#2602)

* Add kube-linter to check Kubernetes manifests against various best practices, with a focus on production readiness and security (#2605)

See:
- https://docs.kubelinter.io/
- https://github.com/stackrox/kube-linter

* Add webhook to prevent creation of dashboard's HardwareProfile and AcceleratorProfile (#2599)

* chore: use generic Getter[T] instead of StringGetter (#2608)

Update StringGetter to use Go generics, allowing the type to work with
any return type instead of being limited to strings.
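
As an illustration of the idea, a generic getter could take the shape below. The real signature in the operator is not shown in this commit message, so `Getter[T]` as a zero-argument function, and the helpers `Static` and `GetOr`, are assumptions made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Getter lazily produces a value of type T or an error.
type Getter[T any] func() (T, error)

// StringGetter, the former string-only form, becomes one instantiation.
type StringGetter = Getter[string]

// Static wraps a fixed value in a Getter.
func Static[T any](v T) Getter[T] {
	return func() (T, error) { return v, nil }
}

// GetOr calls g and falls back to def when g is nil or fails.
func GetOr[T any](g Getter[T], def T) T {
	if g == nil {
		return def
	}
	v, err := g()
	if err != nil {
		return def
	}
	return v
}

func main() {
	var name StringGetter = Static("opendatahub")
	replicas := Static(3)
	failing := Getter[int](func() (int, error) { return 0, errors.New("unavailable") })

	fmt.Println(GetOr(name, "fallback")) // opendatahub
	fmt.Println(GetOr(replicas, 1))      // 3
	fmt.Println(GetOr(failing, 1))       // 1
}
```

Keeping `StringGetter` as an alias means existing call sites that name the old type would continue to compile in this sketch.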

* chore: Update GitHub workflows and issue labels and fix bug (#2607)

- Add v3 label to issue_label_bot.yaml
- Rename workflow yaml: test for e2e, release for ODH release,
- Remove 'incubation' branch from linter.yaml triggers
- Update email
- Update API docs config to exclude all List types

Signed-off-by: Wen Zhou <[email protected]>

* RHOAIENG-34045: cleanup codeflare manifests, remove codeflare from DSC v2 (#2596)

* feat(RHOAIENG-34045): cleanup codeflare manifests, remove codeflare from DSC v2

* feat(RHOAIENG-34045): remove CodeFlare from api docs and remove finalizers update permissions

* update: add new members (#2612)

Signed-off-by: Wen Zhou <[email protected]>

* feat(e2e): improve test framework resilience and fix monitoring controller cleanup (#2550)

* feat(e2e): add explicit resource lifecycle methods and improve test framework

- Add EventuallyResourceCreated, EventuallyResourceUpdated, EventuallyResourcePatched
- Add Patch function in testf for explicit patch-only behavior
- Replace CreateOrUpdate with CreateOrPatch for better concurrency
- Fix resourceVersion conflicts and test timing issues
- Add comprehensive testf test coverage and helper functions
- Modernize kueue_test.go with jq.Match validation
- Reduce test code duplication and improve error messages

* refactor(monitoring): remove redundant management state checks and unify actions

- Remove isMonitoringManaged() function and all management state checks from actions
- Remove MonitoringNotManagedMessage constant (no longer needed)
- Combine MonitoringStack and ThanosQuerier into single deployMonitoringStackWithQuerier action
- Combine Tempo and Instrumentation into single deployTracingStack action
- Extract validateRequiredCRDs helper to reduce CRD validation duplication
- Add CRDRequirement struct for consistent condition handling
- Rename setCRDNotFoundCondition to setConditionFalse for clarity

The monitoring actions now focus purely on configuration validation and CRD availability checks. The optimization also reduces the monitoring controller from 5 deployment actions to 3
and eliminates condition duplication.

Breaking changes:
- Remove deployThanosQuerier, deployTempo, deployInstrumentation individual actions
- Add deployMonitoringStackWithQuerier and deployTracingStack unified actions

* test: temporarily disable RBAC and ServiceAccount deletion recovery tests

These tests are experiencing timing issues with external dependencies
and token refresh cycles. Temporarily disabling to unblock CI while
we investigate proper solutions for testing these resource types.

TODO: Re-enable after investigating timing issues.

* chore: remove placeholder CRD and skip kubebuilder for dashboard actions (#2609)

Remove empty CRD template file and add kubebuilder:skipmake directive to
dashboard_controller_actions.go to prevent unintended CRD generation.

* Use commit sha in manifest file (#2540)

* feat(RHOAIENG-34447): add support to commit sha component target and add workflow to update it

* feat: upgrade commit sha

* chore(kube-lint): add CEL-based linter to prevent system group bindings in ClusterRoleBinding (#2617)

Add custom kube-linter check to detect ClusterRoleBindings that target
system groups (e.g., system:authenticated, system:unauthenticated).
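The actual check is written in CEL inside the kube-linter configuration. The Go sketch below only mirrors the idea under stated assumptions: `Subject` is a simplified stand-in for the RBAC subject type, and matching on the `system:` prefix is an assumption drawn from the two example groups named above.

```go
package main

import (
	"fmt"
	"strings"
)

// Subject is a simplified stand-in for a ClusterRoleBinding subject.
type Subject struct {
	Kind string
	Name string
}

// systemGroupSubjects returns the names of Group subjects whose name starts
// with "system:" (assumed rule), e.g. system:authenticated.
func systemGroupSubjects(subjects []Subject) []string {
	var found []string
	for _, s := range subjects {
		if s.Kind == "Group" && strings.HasPrefix(s.Name, "system:") {
			found = append(found, s.Name)
		}
	}
	return found
}

func main() {
	subjects := []Subject{
		{Kind: "Group", Name: "system:authenticated"},
		{Kind: "ServiceAccount", Name: "operator"},
		{Kind: "Group", Name: "data-scientists"},
	}
	fmt.Println(systemGroupSubjects(subjects))
}
```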

* feat: update IMAGES_BENCHMARK_DATA to new image (#2622)

Signed-off-by: Alessio Pragliola <[email protected]>

* add e2e tests for reconciliation resilience (#2364)

* Remove ModelMesh component and infrastructure from OpenDataHub Operator (#2565)

* feat(RHOAIENG-34026): remove modelmeshserving manifest download and component registration

- Remove modelmeshserving from get_all_manifests.sh
- Remove modelmeshserving import from cmd/main.go

# Conflicts:
#	get_all_manifests.sh

* feat(RHOAIENG-34026): comprehensive ModelMesh removal for RHOAI 3.0

This commit removes ModelMesh functionality while preserving v1 API compatibility:

Core Functionality Removal:
- Remove ModelMeshServing component controller and implementation
- Remove ModelMeshServing monitoring configurations
- Update ModelController to ignore ModelMeshServing in DSC v1 (always treats as Removed)
- Update ModelController tests to reflect new RHOAI 3.0 behavior

Cleanup and Infrastructure:
- Remove ModelMeshServing RBAC editor/viewer roles
- Remove ModelMeshServing E2E tests
- Remove ModelMesh Prometheus unit tests
- Remove ModelMeshServing from sample DSC configuration
- Remove ModelMesh from component integration documentation
- Remove KServe ModelMesh CRD conflict handling logic
- Remove unused imports and monitoring embeds

API Strategy:
- Keep v1 DSC API intact with ModelMeshServing field for upgrade compatibility
- ModelController ignores ModelMeshServing settings (always Removed in RHOAI 3.0)
- Remove actual component functionality while maintaining API backward compatibility

Testing Updates:
- Update ModelController tests to reflect new behavior
- Tests now validate that ModelMeshServing no longer enables ModelController
- All tests pass with new RHOAI 3.0 logic

Build verification:
- All code compiles successfully
- Manifest generation works correctly
- ModelController unit tests pass with new behavior

# Conflicts:
#	internal/controller/components/modelcontroller/modelcontroller.go
#	internal/controller/components/modelmeshserving/modelmeshserving.go
#	internal/controller/components/modelmeshserving/modelmeshserving_test.go

* Remove ModelMeshServing component and API references

This commit removes ModelMeshServing component from the opendatahub-operator
codebase as part of RHOAI 3.0 migration to KServe-only model serving.

Changes include:

## API and Controller Changes:
- Keep ModelMeshServing API types for v1 DSC backward compatibility
- Set ModelMeshServing ManagementState to 'Removed' in ModelController
- Remove ModelMeshServing logic from modelcontroller_actions.go
- Add deprecation markers to ModelMeshServing types

## E2E Test Updates:
- Remove ModelMeshServing test cases from controller_test.go
- Update modelcontroller_test.go to focus on KServe functionality
- Remove ModelMeshServing from resilience_test.go and helper_test.go
- Add partial upgrade test for ModelMeshServing resource preservation

## Monitoring and Configuration Cleanup:
- Remove ModelMeshServing Prometheus scrape jobs and SLO rules
- Remove ModelMeshServing alerting rules from prometheus-configs.yaml
- Remove ModelMeshServing from monitoring controller mappings

## Sample and Documentation Updates:
- Remove modelmeshserving from DSC sample configurations
- Fix incorrect app.kubernetes.io/managed-by labels in samples (RHOAIENG-32728)
- Remove ModelMeshServing references from README.md and DESIGN.md
- Remove ModelMeshServing component from upgrade logic

## Bundle and Manifest Updates:
- Update CSV to remove ModelMeshServing CRD references
- Add ModelMeshServing CRD to bundle for compatibility

Addresses: RHOAIENG-34026, RHOAIENG-32728
Partial: RHOAIENG-34036 (upgrade testing)

# Conflicts:
#	docs/api-overview.md

* feat: Remove ModelMeshServing component from RHOAI 3.0

- Remove ModelMeshServing CRD generation from kustomization.yaml
- Delete ModelMeshServing CRD file (components.platform.opendatahub.io_modelmeshservings.yaml)
- Clean up bundle manifests: remove ModelMeshServing references from CSV and CRDs
- Update API documentation config to exclude ModelMeshServing types
- Remove ModelMeshServing from v2 DataScienceCluster sample
- Clean up catalog.yaml: remove ModelMeshServing GVK, CRD description, and keywords
- Preserve v1 DataScienceCluster compatibility for smooth upgrades
- Maintain RBAC permissions and e2e tests for upgrade safety
- Update TrustyAI controller to remove obsolete model-mesh label check
- Simplify ModelController types and remove ModelMeshServing dependencies
- Update conversion logic between v1 and v2 DataScienceCluster APIs
- Clean up documentation and design files

* Update the DSPO commit to leverage kube-rbac-proxy (#2636)

See:
https://github.com/opendatahub-io/data-science-pipelines-operator/pull/920
https://issues.redhat.com/browse/RHOAIENG-34577

Signed-off-by: mprahl <[email protected]>

* chore: update manifest commit SHAs (#2637)

Co-authored-by: zdtsw <[email protected]>

* fix: resolve RBAC errors and improve resilience test reliability (#2643)

- Fix pod deletion RBAC issue by using individual DeleteResource instead of bulk deletion
- Restore ModelRegistry component to quota tests after removing stuck finalizer
- Optimize jq expressions to handle deployment name patterns and readyReplicas checks
- Add pod restart after RBAC restoration for faster test recovery

Resolves "server does not allow this method" error and makes tests more reliable.

* RHOAIENG-29717 | fix: Adding missing test case (#2510)

Co-authored-by: den-rgb <[email protected]>

* Add a RELATED_IMAGE hook for kube-auth-proxy (#2632)

Co-authored-by: jctanner <[email protected]>

* removing wrong labels from samples in DSCI/DSC v2 (#2650)

* RHOAIENG-33158 | feat: enhance hardware profile management by adding custom-serving profile (#2578)

* RHOAIENG-33158 | feat: enhance hardware profile management by adding custom-serving profile

- Introduced a new HardwareProfile named 'custom-serving' in the YAML configuration.
- Updated the DSCInitializationReconciler to manage both default and custom hardware profiles.
- Modified the logic to check for the existence of the custom hardware profile during reconciliation.

* chore: Fixed call to renamed function by referring to the new name

* chore: reworked default and custom hardwareprofile CRs deployment

- split the yamls into separate files to reduce confusion and ease identification
- reworked the logic to deploy only after checking both custom and default hwp errors.

---------

Co-authored-by: Wen Zhou <[email protected]>

* feat: update workflow to point to correct branch (#2652)

* chore: update manifest commit SHAs (#2646)

Co-authored-by: openshift-merge-bot <[email protected]>

* RHOAIENG-30940: Remove devFlags support (#2588)

* feat(RHOAIENG-30940): remove devFlags support

* fix(RHOAIENG-34045): small typo

* fix: lint

* test: add e2e test on HWProfile v1alpha1 and v1 (#2635)

* test: add e2e test on HWProfile v1alpha1 and v1

- create CR on v1, test it can be read by v1 and v1alpha1
- create CR on v1alpha1, test it can be read by v1 and v1alpha1
* update: prolong timeout
* update: move HWProfile test to a dedicated suite

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* RHOAIENG-33158 | feat: Implement migration of HardwareProfiles from AcceleratorProfiles and container sizes (#2529)

* RHOAIENG-33158 | feat: Implement migration of HardwareProfiles from AcceleratorProfiles and container sizes

- Added functions to migrate AcceleratorProfiles to HardwareProfiles, creating separate profiles for notebooks and serving.
- Implemented migration for container sizes, generating HardwareProfiles based on specified resource limits.
- Created a special HardwareProfile for InferenceServices without associated AcceleratorProfiles or container sizes.

* RHOAIENG-33158 | chore: addressed PR comments

- Removed custom-serving HWP out of upgrade.go
- now trying to load odhDashboardConfig from manifests if it's not available in cluster
- other refactors and logs

* chore: Renamed funcs to better describe the functionality

- Adjusted comments to explain the code better.

* RHOAIENG-33158 | chore: addressed PR comments

- making sure hwp names are all lowercase and without spaces
- changed string literal references to string constants
- returning early if application namespace is empty
- migration will proceed only if old major version is 2 and current version is 3
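The naming and version rules in this list can be sketched as follows. This is an illustration only: both function names are hypothetical, and replacing whitespace with `-` is an assumption, since the commit message only says names end up lowercase and without spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeProfileName lowercases the name and joins whitespace-separated
// parts with "-" (the separator is an assumption for this sketch).
func normalizeProfileName(name string) string {
	return strings.Join(strings.Fields(strings.ToLower(name)), "-")
}

// shouldMigrate reports whether the migration applies: only when upgrading
// from major version 2 to major version 3.
func shouldMigrate(oldMajor, currentMajor uint64) bool {
	return oldMajor == 2 && currentMajor == 3
}

func main() {
	fmt.Println(normalizeProfileName("NVIDIA GPU  Large"))
	fmt.Println(shouldMigrate(2, 3), shouldMigrate(3, 3))
}
```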

* fix: correct GatewayConfig validation error handling for OIDC configuration (#2654)

* chore: update manifest commit SHAs (#2655)

Co-authored-by: openshift-merge-bot <[email protected]>

* chore: remove unused const (#2661)

- rename GatewayKind to GatewayConfigKind

Signed-off-by: Wen Zhou <[email protected]>

* chore: fix previous HWProfile test (#2659)

- revert timeout
- update docs

Signed-off-by: Wen Zhou <[email protected]>

* CLI to retry flaky test in job (#2611)

* feat(RHOAIENG-34877): add test cli to retry e2e tests during job execution to avoid retest

* fix: improve retry skip filter and add github actions to run tests on CLI

* (feat): Rename DatasciencePipelines to AIPipelines (#2589)

* (feat): Rename DatasciencePipelines to AIPipelines

Signed-off-by: Ajay Jaganathan <[email protected]>

# Conflicts:
#	api/datasciencecluster/v1/datasciencecluster_conversion.go
#	api/datasciencecluster/v2/datasciencecluster_types.go
#	api/datasciencecluster/v2/zz_generated.deepcopy.go
#	bundle/manifests/datasciencecluster.opendatahub.io_datascienceclusters.yaml
#	bundle/manifests/opendatahub-operator.clusterserviceversion.yaml
#	config/crd/bases/datasciencecluster.opendatahub.io_datascienceclusters.yaml
#	docs/api-overview.md
#	pkg/upgrade/upgrade.go
#	tests/e2e/datasciencepipelines_test.go

* add logic for converting installed components

Signed-off-by: Ajay Jaganathan <[email protected]>

---------

Signed-off-by: Ajay Jaganathan <[email protected]>

* Fix session issue with kube-auth-proxy (#2625)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* update: remove DSCI and DSC v1 sample from CSV (#2667)

Signed-off-by: Wen Zhou <[email protected]>

* RHOAIENG-33892: Kueue controller should report an error in case a DSC has Managed state set in Kueue component. (#2645)

RHOAIENG-33894: Remove all the remaining logic/tests to handle Managed Kueue component state.

* Change api group for hardware profiles in admingroup role (#2631)

* Change api group for hardware profiles in admingroup role

* Apply suggestion from @zdtsw

* Update internal/controller/services/auth/resources/admingroup-role.tmpl.yaml

---------

Co-authored-by: Wen Zhou <[email protected]>

* Enable CLI access via OpenShift OAuth bearer tokens (#2666)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>
Co-authored-by: James Tanner <[email protected]>

* Fix multiple run of upgrade tests (#2665)

* test: fix multiple run of upgrade tests

* test: fix v2tov3upgrade tests and test cli

* test: add test on duplicated test names in cli

* feat: add gen-ai image to the dashboard (#2673)

* fixed bug where dup bearer token was added via cli access (#2684)

* Remove Serverless Mode, Service Mesh, and Authorino Infrastructure from OpenDataHub Operator (#2560)

* remove KServe Serverless mode, removed ServiceMesh, Serverless and Authorino infra

add v2tov3 upgrade test

remove ServiceMesh spec from DSCI v2 api

remove servicemesh manifests

remove servicemesh spec from dsci v1 sample, regenerate bundle

disable webhook test

retain only minimal servicemesh api, update docs generation ignore rules

fix NoMatchError checking

remove redundant istio/serverless predicates and status condition variables

removed OSSMv2 pre-condition check from kserve controller

remove serving spec and defaultDeploymentMode field from KServe spec, re-generate manifests and bundle

remove leftover codeflare mention in CSV manifest

* Clean up types, status and doc strings

Signed-off-by: Christopher Sams <[email protected]>

* re-enable webhook test, re-generate manifests and bundle

* add ServiceMesh to removedCRDToCreate list, fix linter issues

---------

Signed-off-by: Christopher Sams <[email protected]>
Co-authored-by: Christopher Sams <[email protected]>
Co-authored-by: Ugo Giordano <[email protected]>

* chore: update manifest commit SHAs (#2668)

Co-authored-by: openshift-merge-bot <[email protected]>

* set the default mode for kube-auth-proxy (#2677)

* fix: optimize failed test rerun (#2686)

* RHOAIENG-34261 | Enable structured YAML for metrics exporters and fix UX inconsistency between metrics and traces (#2487)

Co-authored-by: Dayakar Maruboena <[email protected]>

* feat(RHOAIENG-35049): add support to pass RAGAS KFP image for downstream (#2566)

Co-authored-by: Wen Zhou <[email protected]>

* fix: remove test on old dashboard AProfile and HWProfile (#2696)

Signed-off-by: Wen Zhou <[email protected]>

* RHOAIENG-33159 | Replace accelerator and container size annotations to hardwareprofile annotations (#2675)

* RHOAIENG-33159 | Replace accelerator and container size annotations to hardwareprofile annotations

* Update:

- remove custom-serving from config, dsci should only create default one
- dynamically create custom-serving HWProfile CR in upgrade.go
- move support functions into upgrade_utils.go
- remove duplicated calls in the loops
- simplify hwprofile name concat
- remove unused gvk
- change logging info.
- remove old test on custom-serving

Signed-off-by: Wen Zhou <[email protected]>

* fix: lint

Signed-off-by: Wen Zhou <[email protected]>

* update: code review change

- fix missing permission on notebooks
- fix namespace lookup for all namespaces
- fix missing annotation for hwprofile namespace set

Signed-off-by: Wen Zhou <[email protected]>

* update: reduce unnecessary calls to get annotations

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>
Co-authored-by: Wen Zhou <[email protected]>

* added kube auth proxy metrics address (#2693)

* fix: annotation for custom-serving HWProfile

Signed-off-by: Wen Zhou <[email protected]>
(cherry picked from commit 719aa32ce6d0fe24b62aa72aa8e2888e02e07c13)

* update notebooks manifest hash for gateway support (#2705)

* fix: Correct RAGAS KDP (#2702)

Signed-off-by: Rui Vieira <[email protected]>

* Make cookie expiration and refresh interval configurable (#2706)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* chore: update docs + api comments (#2712)

Signed-off-by: Wen Zhou <[email protected]>

* test: add default kueue configuration test (#2704)

* fix: fixed quotes for HWP visibility annotation (#2717)

* added validation for clientid/issuerurl (#2708)

* Remove installedComponents field from DataScienceCluster v2 API (#2719)

- Remove InstalledComponents field from v2 DataScienceClusterStatus
- Keep InstalledComponents field in v1 for backward compatibility
- Update conversion logic to construct v1 InstalledComponents from v2 component ManagementState
- Add constructInstalledComponentsFromV2Status() function to build InstalledComponents map
- Add getV1ComponentName() helper for component name mapping between versions
- Update conversion tests with new construction logic
- Fix integration tests to properly test v2→v1 conversion behavior
- Remove InstalledComponents assignments from component handlers and templates

The InstalledComponents field in v2 was redundant since component status
already tracks management state. This change simplifies the v2 API while
maintaining full backward compatibility for v1 users through automatic
conversion during webhook processing.
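A minimal sketch of how a v1 `InstalledComponents` map might be derived from v2 management states is shown below. The real `constructInstalledComponentsFromV2Status()` and `getV1ComponentName()` are not reproduced here; the function names, the name table, and the rule "installed means Managed" are assumptions made for the example.

```go
package main

import "fmt"

// ManagementState mirrors the Managed/Removed states used by components.
type ManagementState string

const (
	Managed ManagementState = "Managed"
	Removed ManagementState = "Removed"
)

// v1Names is a hypothetical v2-to-v1 name table; only the AIPipelines
// rename is taken from this changelog, the key spelling is illustrative.
var v1Names = map[string]string{
	"aipipelines": "datasciencepipelines",
}

// v1ComponentName returns the v1 name for a v2 component, or the same name
// when no rename applies.
func v1ComponentName(v2Name string) string {
	if n, ok := v1Names[v2Name]; ok {
		return n
	}
	return v2Name
}

// installedComponents builds the v1-style map, assuming a component counts
// as installed exactly when its state is Managed.
func installedComponents(states map[string]ManagementState) map[string]bool {
	out := make(map[string]bool, len(states))
	for name, st := range states {
		out[v1ComponentName(name)] = st == Managed
	}
	return out
}

func main() {
	states := map[string]ManagementState{"aipipelines": Managed, "kserve": Removed}
	fmt.Println(installedComponents(states))
}
```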

RHOAIENG-36418

* test: fix e2e resilience test to work also for rhoai (#2714)

* chore: update manifest commit SHAs (#2694)

Co-authored-by: openshift-merge-bot <[email protected]>

* chore: update manifest commit SHAs (#2725)

Co-authored-by: openshift-merge-bot <[email protected]>

* fix: wrong image for kube-auth-proxy (#2716)

Signed-off-by: Wen Zhou <[email protected]>

* fix: improve test skip management to correctly manage unordered test run (#2721)

* chore: update manifest commit SHAs (#2726)

* chore: update manifest commit SHAs

* Update get_all_manifests.sh

Co-authored-by: Davide Bianchi <[email protected]>

---------

Co-authored-by: openshift-merge-bot <[email protected]>
Co-authored-by: Wen Zhou <[email protected]>
Co-authored-by: Davide Bianchi <[email protected]>

* update: remove unused oauth-proxy variable (#2727)

Signed-off-by: Wen Zhou <[email protected]>

* override image references from RHAIIS (#2734)

* chore: update manifest commit SHAs (#2735)

Co-authored-by: openshift-merge-bot <[email protected]>

* GatewayConfig does not reconcile when DSCInitialization is created/updated/deleted (#2737)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* Use variable for manifest folder in get_all_manifests.sh (#2729)

* fix: e2e test on main

* fix: move ray sha to fix e2e tests. Improve get_all_manifests.sh

* feat: remove rm of old manifests

* feat: use annotation on kube-auth deployment when secret is changed (#2740)

- calculate a hash of the secret data and set it as an annotation to trigger
  deployment change => pod restart with new secret value
- cleanup rbac
- fix lint to use expectedODHDomain
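The hash-as-annotation technique from the first bullet can be sketched like this. It is a sketch under assumptions: the annotation key `example.com/secret-hash` and the function name are hypothetical, and SHA-256 over sorted keys is one reasonable choice rather than the confirmed implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// secretDataHash returns a deterministic SHA-256 hex digest of the secret
// data. Keys are sorted so map iteration order does not change the result.
func secretDataHash(data map[string][]byte) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0}) // separator between key and value
		h.Write(data[k])
		h.Write([]byte{0}) // separator between entries
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	secret := map[string][]byte{"clientSecret": []byte("s3cr3t")}
	// Hypothetical annotation key placed on the pod template.
	annotations := map[string]string{"example.com/secret-hash": secretDataHash(secret)}
	fmt.Println(annotations)
}
```

Because the annotation sits on the pod template, any change to the secret changes the template and causes a rollout, so pods restart with the new value.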

Signed-off-by: Wen Zhou <[email protected]>

* Update: fix typos in code + update Oauthclient name (#2743)

* chore:  fix typos

Signed-off-by: Wen Zhou <[email protected]>

* update: change OAuthClient name to be platform agnostic

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* chore: update manifest commit SHAs (#2742)

Co-authored-by: openshift-merge-bot <[email protected]>

* fix: correct Unmanaged ManagementState description (#2746)

Update the Unmanaged ManagementState description to accurately reflect
the operator's behavior. The operator does not deploy or manage the
component's lifecycle when in Unmanaged state, but may still create
supporting configuration resources.

This fixes the incorrect description that stated the operator is
'actively managing the component', which contradicted the expected
semantics of an Unmanaged state.

Fixes: RHOAIENG-32465

* feat: remove custom code that internally marks OdhDashboardConfig as unmanaged (#2672)

* chore: update manifest commit SHAs (#2752)

Co-authored-by: openshift-merge-bot <[email protected]>

* feat: add SNO-aware OpenTelemetry collector replica defaults (#2738)

Set CollectorReplicas default to 1 for single-node clusters and 2 for multi-node clusters
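The defaulting rule can be sketched as below. The function name is hypothetical, and detecting a single-node cluster through the Infrastructure resource's `controlPlaneTopology` value is an assumption for this sketch (a later commit in this log mentions ControlPlaneTopology for SNO detection).

```go
package main

import "fmt"

// singleReplicaTopology is the controlPlaneTopology value OpenShift reports
// for single-node clusters.
const singleReplicaTopology = "SingleReplica"

// defaultCollectorReplicas returns 1 for single-node clusters and 2 otherwise.
func defaultCollectorReplicas(controlPlaneTopology string) int32 {
	if controlPlaneTopology == singleReplicaTopology {
		return 1
	}
	return 2
}

func main() {
	fmt.Println(defaultCollectorReplicas("SingleReplica"))
	fmt.Println(defaultCollectorReplicas("HighlyAvailable"))
}
```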

* cleaning up caikit-tgis-image (#2755)

* update: cleanup kueue controller image + remove support on Managed status (#2757)

* chore: cleanup kueue controller image
---------

Signed-off-by: Wen Zhou <[email protected]>

* refactor: keep gateway function in correct file (#2745)

- gateway_auth_actions.go should only host functions: createKubeAuthProxyInfrastructure createEnvoyFilter
- gateway_controller_actions.go only for functions: createGatewayInfrastructure createDestinationRule syncGatewayConfigStatus
- all the remaining functions should be in gateway_support.go
- gateway_util_test.go is for test support functions

Signed-off-by: Wen Zhou <[email protected]>

* chore: update manifest commit SHAs (#2761)

Co-authored-by: openshift-merge-bot <[email protected]>

* fix: remove appwrapper from kueue supported framework configuration (#2764)

* chore: update bundle manifests for SNO-aware collector replicas (#2765)

Update generated bundle manifests via make bundle

* fix: Remove DSCI requirement for service reconciliation (#2750)

* fix: Remove DSCI requirement for service reconciliation

Services can now reconcile without DSCInitialization present:
- Reconciler treats missing DSCI as acceptable (not an error)
- Hash function handles nil DSCI with sentinel value
- ApplicationNamespace helper implements on-demand DSCI fetch
- Auth and Monitoring services use helpers for safe DSCI access
- Templates handle nil DSCI gracefully

This allows services to start before platform initialization and
prevents them from getting stuck in failed state when DSCI doesn't
exist yet.

RHOAIENG-18035

* refactor: remove DSCI from ReconciliationRequest

Make DSCI optional in the reconciliation framework by removing it from
ReconciliationRequest. Actions now fetch DSCI on-demand using helper
functions ApplicationNamespace() and MonitoringNamespace().

Changes:
- Remove DSCI field from ReconciliationRequest struct
- Update ApplicationNamespace/MonitoringNamespace helpers to fetch DSCI
- Update template rendering to pass AppNamespace to templates
- Update all actions and tests to work without rr.DSCI
- Convert monitoring tests to use Gomega assertions

This allows services to reconcile independently of DSCI existence.

* Remove cleanup FeatureTrackers action from KServe controller

* feat: remove secretgenerator controller (#2749)

* feat: remove secretgenerator controller

- we do not need that controller since OAuthClient will be created by
  Gateway service
- move secret related functions into pkg

Signed-off-by: Wen Zhou <[email protected]>

* update: rewrite unit-test

- remove over testing
- only match two functions in secret.go

Signed-off-by: Wen Zhou <[email protected]>

* update: unit-test to use gomega

Signed-off-by: Wen Zhou <[email protected]>

* update: code review comments with more checks

Signed-off-by: Wen Zhou <[email protected]>

* update: remove docs content not valid any more

Signed-off-by: Wen Zhou <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>

* chore: update manifest commit SHAs (#2768)

Co-authored-by: openshift-merge-bot <[email protected]>
Co-authored-by: Wen Zhou <[email protected]>

* update: to support olmv1 we need to replace operatorcondition with (#2766)

subscription

- with this, we lost the ability to check dependent operator version
- need a new predicate on subscription on .spec.name (metadata.name won't
  be useful)

Signed-off-by: Wen Zhou <[email protected]>

* update: enable seccomprofile to runtimedefault (#2770)

- we are on ocp 4.19 which supports this function

Signed-off-by: Wen Zhou <[email protected]>

* fix: on managed cluster, we cannot set kueue to Managed, API won't work (#2774)

Signed-off-by: Ugo Giordano <[email protected]>

* test: add parallel components and services test execution (#2771)

* chore: update manifest commit SHAs (#2773)

Co-authored-by: openshift-merge-bot <[email protected]>

* Revert "update: to support olmv1 we need to replace operatorcondition with (#…" (#2779)

This reverts commit 6fb1439c98c4a29424bedfdeb8f145603ef14657.

* chore: uplift version from 3.0 to 3.2 (#2780)

- update comments, some features we need to keep till first stable v3
  release

Signed-off-by: Wen Zhou <[email protected]>

* chore: update manifest commit SHAs (#2781)

Co-authored-by: openshift-merge-bot <[email protected]>

* docs: add e2e testing tips and FAQ section to README (#2682)

* docs: add e2e testing tips and FAQ section to README

Add comprehensive e2e testing guidance including setup
instructions, common workflows, and troubleshooting examples.
Includes minimum/recommended configurations, selective test
execution patterns, and component-specific testing commands.

* Clarify that env vars are space-separated

* chore: update manifest commit SHAs (#2787)

Co-authored-by: openshift-merge-bot <[email protected]>

* chore: cleanup comments and remove unused status for ArgoCD (#2788)

Signed-off-by: Wen Zhou <[email protected]>

* set defaults based on metric log data from perf test (#2791)

* chore: update manifest commit SHAs (#2794)

Co-authored-by: openshift-merge-bot <[email protected]>

* Watch ingress certificate secrets to automatically sync gateway cert on rotation (#2772)

* RHOAIENG-34484 | build: Remove generated files from build (#2329)

* build: Remove generated files from build

Instead generate them as needed. In order to allow the bundle to be
built from the existing `bundle.Dockerfile` mechanism, I introduced
some logic to generate it as a multi-stage dockerfile, where the first
stage runs `make bundle`.

Testing
-------

1. Build the bundle from main (`make bundle-build`), make note of
   the image hash
2. Build the bundle from this branch (`make bundle-build`), make note of
   the image hash
3. Mount both images (`podman image mount $hash1 ; podman image mount
   $hash2`)
4. compare the directories. I use `meld` for this.

Note the only difference is in timestamp.

* Set controller image in manager.yaml to REPLACE_IMAGE

* Also remove config/crd/external from source

* Remove generated webhook manifest

* Fix typo in .gitignore

* Pass version information as build args

* Handle empty env vars for build args

* Remove generated gatewayconfigs and servicemeshes crds

* fix(build): Address PR feedback

Add documentation comments explaining:
- OPERATOR_VERSION arg avoids clash with VERSION var
- tests directory needed for package references
- .dockerignore retains tests for bundle build

Revert some changes to the manager image overriding, instead reworking
the variable interpretation.

Use SED_COMMAND

* (feat): Upgrade automation (#2782)

- Update release automation to support manifest shas
- Remove duplication by using common utils

* chore: update manifest commit SHAs (#2802)

Co-authored-by: openshift-merge-bot <[email protected]>

* fix: generate manifest before run unit test, fetch all needed external CRD (#2804)

* chore: clean up more images which are not used for ai hub and ai pipeline (#2795)

* chore: clean up more images which are not used for ai hub and ai
pipeline

Signed-off-by: Wen Zhou <[email protected]>

* Update internal/controller/components/modelregistry/modelregistry_support.go

Co-authored-by: Jon Burdo <[email protected]>

---------

Signed-off-by: Wen Zhou <[email protected]>
Co-authored-by: Jon Burdo <[email protected]>

* chore: remove unnecessary deployment mode configuration from KServe component (#2777)

* Remove unnecessary DeployConfig struct from KServe component

Since RawDeployment is the only supported deployment mode and it's
hardcoded in the ConfigMap, we don't need the DeployConfig struct
and getDeployConfig function.

This cleanup is a follow-up to commit 8e20eeef which removed the
deployment mode field from the API.

Changes:
- Remove DeployConfig struct and getDeployConfig function from config.go
- Simplify updateInferenceCM to directly set hardcoded deployment mode
- Update tests to parse JSON directly instead of using removed struct

* Remove defaultDeploymentMode field from inference ConfigMap

Since RawDeployment is the only supported mode, we no longer need to
explicitly set the defaultDeploymentMode field in the inference ConfigMap.
KServe will use RawDeployment by default.

Changes:
- Remove DeployConfigName constant (no longer needed)
- Stop setting defaultDeploymentMode in updateInferenceCM
- Remove deploy config validation from tests
- Remove deploy config from test ConfigMap creation

* chore: optimize SNO detection using Infrastructure ControlPlaneTopology (#2784)

* chore: update manifest commit SHAs (#2810)

Co-authored-by: openshift-merge-bot <[email protected]>

* chore: update manifest commit SHAs (#2818)

Co-authored-by: openshift-merge-bot <[email protected]>

* feat: improve skip logic with parallel groups (#2819)

* chore: fix default HWP description (#2821)

* Skip oauth2_proxy cookie in upstream request (#2805)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* Make the subdomain configurable in gateway API (#2790)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* chore: update manifest commit SHAs (#2824)

Co-authored-by: openshift-merge-bot <[email protected]>

* feat: make more flex for the application namespace to be used for (#2814)

Gateway


(cherry picked from commit 01c7d6f6ecb0f49b3c906720982fe152e10a5ef3)

Signed-off-by: Wen Zhou <[email protected]>

* restrict httproutes to the application namespace (#2807)

* restrict httproutes to the application namespace

* fix unit tests

* refactor: use cluster.GetApplicationNamespace() instead of custom getPlatformNamespace()

* Call cluster.GetApplicationNamespace() directly instead of passing as parameter

* linter fixes

* adjust ext_authz timeout to 5s (#2823)

* adjusted timeout to 5s

* added env var and gateway config for auth timeout

* chore: update manifest commit SHAs (#2830)

Co-authored-by: openshift-merge-bot <[email protected]>

* feat: use bot to update manifest sha (#2838)

* docs: update pre-req for onboarding (#2835)

* docs: update pre-req for onboarding

Signed-off-by: Wen Zhou <[email protected]>

* update: update prom rules for new stack

* update: code review
- typo
- remove channel for cert-manager

* update: wording and links

* update: accurate info.

---------

Signed-off-by: Wen Zhou <[email protected]>

* feat: add env vars for configuring application and workbenches namespaces for e2e test suite (#2837)

extend TestContext with the new field for workbenches namespace

update CreateDSC() to support setting workbenches namespace via env var

update README, set defaults for app and wb namespaces

add variables for monitoring namespace

update e2e test instructions for custom app namespace

* chore(modelregistry): add PostgreSQL 16 image mapping (#2812)

* chore(modelregistry): add PostgreSQL 16 image mapping

Add IMAGES_POSTGRES to RELATED_IMAGE_RHEL9_POSTGRES16_IMAGE mapping
in model registry image replacement configuration to support PostgreSQL
16 container image deployment.

Related: https://github.com/opendatahub-io/model-registry-operator/pull/374

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Chris Hambridge <[email protected]>

* Update internal/controller/components/modelregistry/modelregistry_support.go

---------

Signed-off-by: Chris Hambridge <[email protected]>
Co-authored-by: Claude <[email protected]>

* chore: update manifest commit SHAs (#2843)

Co-authored-by: openshift-merge-bot <[email protected]>

* build: update image placeholder strategy (#2842)

Jira: [RHOAIENG-38505](https://issues.redhat.com/browse/RHOAIENG-38505)

Change the kustomize image reference from a hardcoded "controller"
name to a "REPLACE_IMAGE" placeholder.

The previous strategy didn't work properly for CI, because the
[substitution logic](https://github.com/openshift/ci-tools/blob/2401b722f08b32d2a6926e1d918ef010ef37992c/pkg/steps/bundle_source.go#L90)
modifies YAML files in the repo.

* docs: update README for workbench namespace (#2845)

Signed-off-by: Wen Zhou <[email protected]>

* update: add check on dashboard's acceleratorprofile (#2846)

- if the CRD does not exist (e.g. from 3.0), there is no need to own it,
  so the operator no longer logs an error for it

Signed-off-by: Wen Zhou <[email protected]>
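
The idea of owning a kind only when its CRD is present can be sketched as below. This is an illustration under stated assumptions, not the operator's actual code: `installedKinds` stands in for a real API discovery check, and `ownedKinds` and `ExampleRequiredKind` are hypothetical names.

```go
package main

import "fmt"

// installedKinds stands in for a cluster discovery lookup. A real operator
// would ask the API server; a plain map keeps this sketch self-contained.
type installedKinds map[string]bool

// ownedKinds returns every required kind plus each optional kind whose CRD
// is installed. Optional kinds that are missing are skipped, so no watch is
// started for a kind the API server does not serve.
func ownedKinds(required, optional []string, installed installedKinds) []string {
	owned := append([]string{}, required...)
	for _, kind := range optional {
		if installed[kind] {
			owned = append(owned, kind)
		}
	}
	return owned
}

func main() {
	required := []string{"ExampleRequiredKind"} // hypothetical placeholder
	optional := []string{"AcceleratorProfile"}

	// Cluster without the AcceleratorProfile CRD: the kind is not owned.
	fmt.Println(ownedKinds(required, optional, installedKinds{}))

	// Cluster with the CRD installed: the kind is owned as before.
	fmt.Println(ownedKinds(required, optional, installedKinds{"AcceleratorProfile": true}))
}
```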

* feat: add warning when creating/updating a connection API secret of S3 type without AWS_S3_BUCKET or with "" as its value (#2732)

* feat: add warning when creating/updating a connection API secret of S3 type
- if the secret does not have AWS_S3_BUCKET, a warning is shown but the
  request is still allowed
- a warning is also shown if AWS_S3_BUCKET exists but has "" as its value
---------

Signed-off-by: Wen Zhou <[email protected]>
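
A rough sketch of the warn-but-allow check on `AWS_S3_BUCKET`. The function name `bucketWarnings` and the warning texts are hypothetical; the real webhook may word its warnings differently and may read the secret in another way.

```go
package main

import "fmt"

// bucketKey is the secret key named in the commit message.
const bucketKey = "AWS_S3_BUCKET"

// bucketWarnings inspects the secret data and returns warnings only.
// It never returns an error, because the request is still allowed.
func bucketWarnings(data map[string][]byte) []string {
	value, ok := data[bucketKey]
	switch {
	case !ok:
		return []string{bucketKey + " is not set in the secret"}
	case len(value) == 0:
		return []string{bucketKey + " is set but has an empty value"}
	}
	return nil
}

func main() {
	fmt.Println(bucketWarnings(map[string][]byte{}))
	fmt.Println(bucketWarnings(map[string][]byte{bucketKey: []byte("")}))
	fmt.Println(bucketWarnings(map[string][]byte{bucketKey: []byte("my-bucket")}))
}
```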

* chore: update manifest commit SHAs (#2850)

Co-authored-by: openshift-merge-bot <[email protected]>

* (fix): Adapt release automation to the new make file changes (#2851)

* feat: add e2e test case for verifying workbench namespace configuration (#2852)

* Add comprehensive E2E tests for Gateway with OpenShift OAuth authentication (#2753)

Signed-off-by: Gowtham Shanmugasundaram <[email protected]>

* RHOAIENG-25593 | feat: adding perses instance (#2642)

Co-authored-by: den-rgb <[email protected]>

* RHOAIENG-34502 | feat: Enable rhoai build from main branch (#2220)

* feat: Enable rhoai build from main branch

This required significant changes to the Makefile and a few different strategies:

- conditionally build different versions of some structs, where there is an irreconcilable difference between `main` and `rhoai` branches (using build tags)
- maintain a separate overlay of manifests and separate bundle, tracking `rhoai` specific changes where necessary.

Renamed directories:
- `bundle` -> `odh-bundle`
- `config` -> `odh-config`

New directories:
- `rhoai-bundle`: contains the RHOAI bundle
- `rhoai-config`: contains the RHOAI manifests

With these changes most Make targets now accept the `ODH_PLATFORM_TYPE` parameter, and operate in either an odh-mode by default, or a rhoai mode if overridden to any value other than `OpenDataHub`.

`get_all_manifests.sh` now has a different mode when passed `ODH_PLATFORM_TYPE` other than `OpenDataHub`, where it looks at $VERSION and infers the downstream git reference to use. (It is most easily invoked via `make get-manifests ODH_PLATFORM_TYPE=rhoai`).

This adds RHOAI-specific Dockerfiles for the operator and the bundle.

See the difference between the rhoai versions and odh versions by using a diff tool, such as `meld` or `diff -u`.

You can compare the resulting bundle for differences by checking out the rhoai branch, and comparing `bundle.rhoai` to `bundle` in the `rhoai` branch.

There are a number of small differences related to changes that haven't been made to the `rhoai` branch.

* Update rhoai contents

* build: update CLEANFILES to use explicit paths

Replace CONFIG_DIR variable with explicit odh-config and
rhoai-config paths in CLEANFILES to ensure proper cleanup of
generated files for both configurations.

* refactor(api): consolidate build tag type definitions

Extract platform-specific type definitions from monolithic files
into separate build-tagged files. Move ODH and RHOAI variant type
definitions into dedicated .odh.go and .rhoai.go files while
keeping shared types in base files.

Affected components:
- ModelRegistry (components/v1alpha1)
- Workbenches (components/v1alpha1)
- DSCInitialization (dscinitialization/v1, v2)
- Monitoring (services/v1alpha1)

This refactoring improves code organization and maintainability
by separating platform-specific implementations using Go build
tags rather than duplicating entire files.
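
As an illustration of the build-tag split, one platform variant could be laid out as below. The tag name `rhoai`, the file name, and the `platformDefaults` type are assumptions made for this sketch; the actual tags and types in the repository may differ.

```go
//go:build !rhoai

// Hypothetical file name for this sketch: platform.odh.go. A sibling file
// guarded by "//go:build rhoai" would declare the same identifiers with
// RHOAI-specific values, so exactly one variant is compiled per build.
package main

import "fmt"

// platformDefaults is a hypothetical struct standing in for the
// platform-specific type definitions.
type platformDefaults struct {
	Name      string
	ConfigDir string
}

// defaults returns the ODH variant; the RHOAI file would define its own.
func defaults() platformDefaults {
	return platformDefaults{Name: "OpenDataHub", ConfigDir: "odh-config"}
}

func main() {
	fmt.Printf("%+v\n", defaults())
}
```

With this layout, a plain `go build` compiles the ODH variant, while `go build -tags rhoai` would exclude this file and pick up the sibling instead.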

* docs: fix incorrect reference in MonitoringCommonSpec description

Update MonitoringCommonSpec documentation to correctly reference
"Monitoring" instead of "Dashboard" as the shared desired state.
Remove trailing blank lines at end of file.

* fix: Remove stale DEFAULT_REF reference

* build(makefile): update CLEANFILES to use specific bundle directories

Replace generic BUNDLE_DIR variable with explicit bundle directory
names (rhoai-bundle and odh-bundle) in CLEANFILES to improve clarity
and ensure both bundle variants are properly cleaned.

* chore(api): update copyright year to 2025

* docs: Restore extraneous lines in api docs

Locally, I use a pre-commit hook to clean up ends-of-files, but this is
incompatible with our CI check.

* Fix VERSION for odh/rhoai and make them independent

* build: update image placeholder strategy

Jira: [RHOAIENG-38505](https://issues.redhat.com/browse/RHOAIENG-38505)

Change the kustomize image reference from a hardcoded "controller"
name to a "REPLACE_IMAGE" placeholder.

The previous strategy didn't work properly for CI, because the
[substitution logic](https://github.com/openshift/ci-tools/blob/2401b722f08b32d2a6926e1d918ef010ef37992c/pkg/steps/bundle_source.go#L90)
modifies YAML files in the repo.

* Update placeholder for rhoai manager image

* upgrade Go to 1.24 and update rhoai Dockerfile

- Upgrade GOLANG_VERSION from 1.23 to 1.24
- Set ODH_PLATFORM_TYPE=rhoai in manifest script
- Replace kueue-configs with hardwareprofiles config
- Install tar for component dev workflow compatibility
- Remove redundant chown in favor of COPY --chown

* build(makefile): use platform-specific Dockerfile for image builds

Add DOCKERFILE_FILENAME variable to dynamically select between
Dockerfile (ODH) and rhoai.Dockerfile (RHOAI) based on
ODH_PLATFORM_TYPE. This ensures image-build target uses the correct
Dockerfile for each platform variant.
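
The selection rule can be expressed in a few lines. The real logic lives in the Makefile; this Go version is only an illustration, and treating an empty value as the ODH default is an assumption of the sketch.

```go
package main

import "fmt"

// odhPlatform is the value that keeps the build in ODH mode.
const odhPlatform = "OpenDataHub"

// dockerfileFor maps a platform type to a Dockerfile name. Any value other
// than "OpenDataHub" selects the RHOAI Dockerfile; an empty value is treated
// as the default (an assumption of this sketch).
func dockerfileFor(platformType string) string {
	if platformType == "" || platformType == odhPlatform {
		return "Dockerfile"
	}
	return "rhoai.Dockerfile"
}

func main() {
	for _, p := range []string{"", "OpenDataHub", "rhoai"} {
		fmt.Printf("%q -> %s\n", p, dockerfileFor(p))
	}
}
```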

* docs(readme,testing): document RHOAI build mode and update test guidelines

Update README with instructions for building operator and bundles
in RHOAI mode using ODH_PLATFORM_TYPE=rhoai flag. Remove bundle
manifests from mandatory integration testing requirements and fix
minor formatting inconsistencies.

* Move connectionApi

* push the e2e test image with the tag main instead of adding the commit hash (#2856)

* update: move Prom Rule tests into component folder (#2858)

- test on the PrometheusRule CR not the old Prom config files
- remove triage which is only for SRE
- rename rule files to match component name
- remove deadmansnitch which is only for SRE

Signed-off-by: Wen Zhou <[email protected]>

* add some enhancements to the e2e push gha to have a version for odh (main branch) and rhoai X.Y (rhoai-x.y branch) (#2860)

* RHOAIENG-25595 | Add Perses-Tempo datasource integration (#2553)

Co-authored-by: Dayakar Maruboena <[email protected]>

* Remove load restrictions none for kustomize builds (#2863)

* feat: remove LoadRestrictionsNone from kustomize

* feat: move rhoai-config into odh-config/rhoai overlay

* rebase with master

* fix: generate rhoai manifest and webhook name

* Update manifest sha in get_all_manifests.sh correctly (#2870)

* fix: update manifest sha in get_all_manifests.sh correctly

* fix: revert namespace name in makefile

* feat: update rhoai manifests to rhoai-3.2 tag

* docs: note bundle is still needed for rhoai branch (#2801)

* docs: note bundle is still needed for rhoai branch

* Fix typo

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix: dsc odh samples (#2869)

* update: cleanup unnecessary upgrade logic (#2695)

- remove functions and calls not needed for 3.2 (these should already have
  been removed in 3.0):
  - jupyterhub resource cleanup
  - rbac for modelreg
  - envoyfilter patch for serving
  - watson resource docs
  - patch odhdashboardconfig for trusty enablement
- move and clean up function for AP to HWP migration

Signed-off-by: Wen Zhou <[email protected]>

* chore: update manifest commit SHAs (#2875)

Co-authored-by: openshift-merge-bot <148852131+openshift-merg…
