@CFSNM
Member

@CFSNM CFSNM commented Dec 1, 2025

Description

We need to support a new framework (TrainJob) in the Kueue CR creation for RHOAI 3.2, for the new trainer component relying on the jobset operator.

If Kueue 1.2 is enabled, we need to add the new TrainJob framework to the Kueue CR. Otherwise nothing is added, since this framework is only supported starting in Kueue 1.2.

The operatorExists method has been reworked to also return the installed version of an operator when that operator is installed.
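
A minimal sketch of the version gate described above, for readers who want to see the rule in code. It is not the code from this PR: the helper names (`parseVersion`, `supportsTrainJob`, `frameworksFor`) and the `ExampleFramework` placeholder are hypothetical, and the comparison is a standard-library stand-in that only understands plain `MAJOR.MINOR.PATCH` strings (no pre-release suffixes).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minTrainJobVersion is the first Kueue version that supports TrainJob,
// per the PR description (1.2).
var minTrainJobVersion = [3]int{1, 2, 0}

// parseVersion is a hypothetical helper: it turns "1.2.0" or "v1.2.0" into
// three integers and reports false when the string is not made of exactly
// three dot-separated non-negative integers.
func parseVersion(v string) ([3]int, bool) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, false
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, false
		}
		out[i] = n
	}
	return out, true
}

// supportsTrainJob reports whether kueueVersion is >= 1.2.0.
// Unparsable versions are treated as unsupported.
func supportsTrainJob(kueueVersion string) bool {
	v, ok := parseVersion(kueueVersion)
	if !ok {
		return false
	}
	for i := range v {
		if v[i] != minTrainJobVersion[i] {
			return v[i] > minTrainJobVersion[i]
		}
	}
	return true // exactly the minimum version
}

// frameworksFor copies base and appends TrainJob only when supported.
func frameworksFor(base []string, kueueVersion string) []string {
	out := append([]string{}, base...)
	if supportsTrainJob(kueueVersion) {
		out = append(out, "TrainJob")
	}
	return out
}

func main() {
	base := []string{"ExampleFramework"} // placeholder name, for illustration only
	fmt.Println(frameworksFor(base, "1.1.0"))
	fmt.Println(frameworksFor(base, "1.2.0"))
}
```

With RHBoK 1.1 the list stays unchanged, and with 1.2 or newer TrainJob is appended, which matches the behavior tested manually above.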

https://issues.redhat.com/browse/RHOAIENG-40544

How Has This Been Tested?

Deploying Kueue as Unmanaged and checking the Kueue CR yaml using both RHBoK 1.1 and 1.2

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work
  • The developer has run the integration test pipeline and verified that it passed successfully

E2E test suite update requirement

When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.

To opt-out of this requirement:

  1. Please inspect the opt-out guidelines, to determine if the nature of the PR changes allows for skipping this requirement
  2. If opt-out is applicable, provide justification in the dedicated E2E update requirement opt-out justification section below
  3. Check the checkbox below:
  • Skip requirement to update E2E test suite for this PR
  4. Submit/save these changes to the PR description. This will automatically trigger the check.

E2E update requirement opt-out justification

Summary by CodeRabbit

  • New Features

    • Conditional support for TrainJob when Kueue is v1.2.0 or newer.
  • Bug Fixes / Improvements

    • Improved operator detection: precondition checks now distinguish installed vs absent operators and use operator version info to adapt behavior across operator versions.
  • Chores

    • Added semantic-version handling dependency.
  • Tests

    • Expanded tests to validate TrainJob inclusion and version-gated behavior across operator versions.


@CFSNM
Member Author

CFSNM commented Dec 1, 2025

/hold

@coderabbitai
Contributor

coderabbitai bot commented Dec 1, 2025

Walkthrough

Operator discovery now returns an OperatorInfo pointer with Version; callers check nil and use Version. Kueue conversion became semver-aware: convertIntegrations now accepts kueueVersion and adds the TrainJob framework when Kueue ≥ v1.2.0. Tests and go.mod updated for versioned OperatorCondition fixtures.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Core operator API: pkg/cluster/operator.go | Added exported OperatorInfo (Version string) and changed OperatorExists to return (*OperatorInfo, error). Returns (nil, err) on list errors, (nil, nil) when not found, and a normalized OperatorInfo when matched. |
| Cluster config: pkg/cluster/cluster_config.go | Callers updated to use OperatorInfo pointer and nil-checks; detectSelfManaged uses non-nil operatorInfo to set self-managed detection. |
| Kueue config & conversion: internal/controller/components/kueue/kueue_config.go | Imported golang.org/x/mod/semver; changed convertIntegrations signature to accept kueueVersion string; createKueueCR now fetches kueue operator info/version and returns ErrKueueOperatorNotInstalled if absent; conversion conditionally adds TrainJob when kueueVersion >= v1.2.0. |
| Controller precondition checks: internal/controller/components/kueue/kueue_controller_actions.go, internal/controller/components/trainer/trainer_controller_actions.go, internal/controller/services/monitoring/monitoring_controller_support.go | Replaced boolean exists checks with operatorInfo nil/non-nil checks; preserved error propagation; some sites return Err*OperatorNotInstalled when info is nil and use operatorInfo.Version where required. |
| Kueue tests, version behavior: internal/controller/components/kueue/kueue_config_test.go, internal/controller/components/kueue/kueue_controller_actions_test.go | Added TestCreateKueueConfigurationCR_TrainJobFramework verifying TrainJob absent for 1.1.0 and present for 1.2.0; test harness now creates versioned OperatorCondition entries (e.g., kueue-operator.1.2.0) and updated expected CR fixtures. |
| Trainer & trainer tests: internal/controller/components/trainer/trainer_controller_actions_test.go | Tests updated to suffix OperatorCondition names with a version string (e.g., jobset-operator.1.1.0) to match new operator discovery behavior. |
| E2E / test helpers: tests/e2e/test_context_test.go | CheckOperatorExists updated to call OperatorExists and return operatorInfo != nil, err. |
| Kueue controller actions tests: internal/controller/components/kueue/kueue_controller_actions_test.go | Test setups updated to use versioned OperatorCondition names and added constant for minimum trainer-supporting kueue version. |
| Build dependencies: go.mod | Added module requirement: golang.org/x/mod v0.24.0 (used for semver comparisons). |
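
To make the return contract summarized for pkg/cluster/operator.go easier to picture, here is a small sketch under stated assumptions. It is not the implementation from this PR: the real function lists OperatorCondition objects through a Kubernetes client, while this version takes a hypothetical `lister` callback that returns plain names, and `operatorExists` here is a local illustration of prefix matching, version extraction and "v" normalization.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// OperatorInfo mirrors the shape described in the summary: just a version.
type OperatorInfo struct {
	Version string
}

// lister is a hypothetical stand-in for listing OperatorCondition names.
type lister func() ([]string, error)

// operatorExists returns:
//   - (nil, err) when listing fails,
//   - (nil, nil) when no name matches the prefix,
//   - (&OperatorInfo{...}, nil) with a "v"-prefixed version on a match.
func operatorExists(list lister, operatorPrefix string) (*OperatorInfo, error) {
	names, err := list()
	if err != nil {
		return nil, fmt.Errorf("listing operator conditions: %w", err)
	}
	for _, name := range names {
		if !strings.HasPrefix(name, operatorPrefix) {
			continue
		}
		// Names look like "kueue-operator.1.2.0" or "kueue-operator.v1.2.0".
		version := strings.TrimPrefix(name, operatorPrefix)
		version = strings.TrimPrefix(version, ".")
		if !strings.HasPrefix(version, "v") {
			version = "v" + version
		}
		return &OperatorInfo{Version: version}, nil
	}
	return nil, nil
}

func main() {
	list := func() ([]string, error) {
		return []string{"jobset-operator.1.1.0", "kueue-operator.1.2.0"}, nil
	}
	info, err := operatorExists(list, "kueue-operator")
	fmt.Println(info.Version, err)

	failing := func() ([]string, error) { return nil, errors.New("list failed") }
	info, err = operatorExists(failing, "kueue-operator")
	fmt.Println(info == nil, err != nil)
}
```

Callers have to check the error first and the pointer second; dereferencing `info` without the nil-check would panic in the not-found case.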

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Files/areas to focus on:
    • pkg/cluster/operator.go — prefix matching, version extraction/normalization (leading 'v'), and (nil, err) vs (nil, nil) return semantics.
    • internal/controller/components/kueue/kueue_config.go — correct use of semver comparisons, passing kueueInfo.Version into convertIntegrations, and conditional insertion of TrainJob.
    • Call sites converted from boolean to *OperatorInfo — ensure consistent nil-checks and accurate error vs missing-operator returns (StopError vs Err*OperatorNotInstalled).
    • Tests under internal/controller/components/kueue/ and trainer tests — verify OperatorCondition naming, fixtures, and assertions for both 1.1.0 and 1.2.0 scenarios.

🔒 Security Findings

  • OperatorCondition name parsing can be spoofed to claim arbitrary operator versions. Remediation: verify OperatorCondition ownerReferences and namespace, or derive version from a trusted field (e.g., status or explicit label) rather than parsing the object name.
  • Semver inputs may be malformed or attacker-controlled. Remediation: validate/parse versions strictly (e.g., semver.Canonical or regex) and treat unparsable versions as "unknown" (do not enable features) rather than attempting comparisons.
  • Do not use parsed version strings from OperatorCondition names for privileged decisions without an allow-list or explicit normalization; add an allow-list of supported operator names/versions before enabling feature gates.
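
One way the remediation above could look, as a rough sketch rather than a recommendation for this codebase. The allow-list contents, the `trustedVersion` helper and the strict `vMAJOR.MINOR.PATCH` pattern are all assumptions made for illustration; the point is only that anything malformed or not allow-listed is reported as unknown, so the caller keeps the feature disabled.

```go
package main

import (
	"fmt"
	"regexp"
)

// strictVersion accepts only "vMAJOR.MINOR.PATCH" without leading zeros.
// Pre-release and build suffixes are rejected on purpose in this sketch.
var strictVersion = regexp.MustCompile(`^v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$`)

// allowedOperators is a hypothetical allow-list of operator names whose
// versions may drive feature decisions.
var allowedOperators = map[string]bool{
	"kueue-operator": true,
}

// trustedVersion returns the version and true only when the operator is
// allow-listed and the version string is well formed. Anything else is
// reported as unknown so callers can leave the feature off.
func trustedVersion(operator, version string) (string, bool) {
	if !allowedOperators[operator] {
		return "", false
	}
	if !strictVersion.MatchString(version) {
		return "", false
	}
	return version, true
}

func main() {
	for _, v := range []string{"v1.2.0", "v1.2", "v1.2.0-evil", ""} {
		_, ok := trustedVersion("kueue-operator", v)
		fmt.Printf("%q trusted=%v\n", v, ok)
	}
}
```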

💡 Other

  • Tests: include realistic ownerReferences/namespace in OperatorCondition fixtures to better mirror production and reduce false positives.
  • Backwards compatibility: audit all callers that previously used boolean results to avoid accidental pointer misuse; consider documenting the (nil, err) vs (nil, nil) contract in a CHANGELOG or code comment near OperatorExists.
  • Error semantics: add brief comments near OperatorExists describing when it returns (nil, nil) vs (nil, err) to prevent misinterpretation by future consumers.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 59.38% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding TrainJob framework support to the Kueue CR managed by the operator. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e41f729 and 71d5309.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum, !**/*.sum
📒 Files selected for processing (11)
  • go.mod (1 hunks)
  • internal/controller/components/kueue/kueue_config.go (5 hunks)
  • internal/controller/components/kueue/kueue_config_test.go (18 hunks)
  • internal/controller/components/kueue/kueue_controller_actions.go (1 hunks)
  • internal/controller/components/kueue/kueue_controller_actions_test.go (4 hunks)
  • internal/controller/components/trainer/trainer_controller_actions.go (1 hunks)
  • internal/controller/components/trainer/trainer_controller_actions_test.go (4 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (3 hunks)
  • pkg/cluster/cluster_config.go (1 hunks)
  • pkg/cluster/operator.go (2 hunks)
  • tests/e2e/test_context_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • internal/controller/components/kueue/kueue_controller_actions.go
  • pkg/cluster/operator.go
  • internal/controller/components/kueue/kueue_controller_actions_test.go
  • pkg/cluster/cluster_config.go
🧰 Additional context used
📓 Path-based instructions (1)
internal/controller/**/*.go

⚙️ CodeRabbit configuration file

internal/controller/**/*.go: CRITICAL SECURITY CHECKS:

  1. Hardcoded secrets/credentials
  2. Reconciliation loop safety (infinite loops, resource exhaustion)
  3. Proper error handling for security-critical operations
  4. Validation of CR spec fields before use

Files:

  • internal/controller/services/monitoring/monitoring_controller_support.go
  • internal/controller/components/kueue/kueue_config_test.go
  • internal/controller/components/trainer/trainer_controller_actions_test.go
  • internal/controller/components/trainer/trainer_controller_actions.go
  • internal/controller/components/kueue/kueue_config.go
🧬 Code graph analysis (4)
tests/e2e/test_context_test.go (1)
pkg/cluster/operator.go (1)
  • OperatorExists (66-93)
internal/controller/services/monitoring/monitoring_controller_support.go (1)
pkg/cluster/operator.go (1)
  • OperatorExists (66-93)
internal/controller/components/trainer/trainer_controller_actions.go (1)
pkg/cluster/operator.go (1)
  • OperatorExists (66-93)
internal/controller/components/kueue/kueue_config.go (2)
pkg/cluster/operator.go (1)
  • OperatorExists (66-93)
internal/controller/components/kueue/kueue.go (1)
  • ErrKueueOperatorNotInstalled (32-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint
🔇 Additional comments (18)
go.mod (1)

22-22: LGTM! Dependency appropriately added for version comparison.

The golang.org/x/mod dependency at v0.24.0 is correctly added to support semver-based version comparison logic for TrainJob framework inclusion. This is used in internal/controller/components/kueue/kueue_config.go to conditionally enable TrainJob based on Kueue operator version.

tests/e2e/test_context_test.go (1)

1366-1367: LGTM! Correctly adapted to new OperatorExists API.

The function properly consumes the updated OperatorExists API that returns (*OperatorInfo, error). The nil-check semantics (operatorInfo != nil) correctly replace the previous boolean check, maintaining backward compatibility with the existing function signature.

internal/controller/components/trainer/trainer_controller_actions.go (1)

30-36: LGTM! Correct adaptation to new OperatorExists API.

The precondition check correctly consumes the updated OperatorExists API:

  • Stores *OperatorInfo in jobSetInfo
  • Uses nil-check (jobSetInfo == nil) instead of boolean negation
  • Preserves error handling semantics (StopErrorW on error, ErrJobSetOperatorNotInstalled on nil info)
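
The three outcomes listed above can be sketched as follows. This is an illustration only: `checkPreConditions`, the `lookup` callback and the lowercase `errJobSetOperatorNotInstalled` are hypothetical stand-ins, and the real code wraps errors with the operator's own StopErrorW helper rather than `fmt.Errorf`.

```go
package main

import (
	"errors"
	"fmt"
)

// OperatorInfo is a minimal stand-in for the type described in the review.
type OperatorInfo struct {
	Version string
}

// errJobSetOperatorNotInstalled is a local stand-in for the sentinel
// error named in the review (ErrJobSetOperatorNotInstalled).
var errJobSetOperatorNotInstalled = errors.New("jobset operator not installed")

// checkPreConditions is a hypothetical helper. lookup plays the role of
// the operator discovery call and returns (*OperatorInfo, error).
func checkPreConditions(lookup func() (*OperatorInfo, error)) error {
	jobSetInfo, err := lookup()
	if err != nil {
		// The lookup itself failed: propagate with context.
		return fmt.Errorf("failed to check for jobset operator: %w", err)
	}
	if jobSetInfo == nil {
		// The lookup worked, but the operator is absent.
		return errJobSetOperatorNotInstalled
	}
	return nil // installed; jobSetInfo.Version is available if needed
}

func main() {
	installed := func() (*OperatorInfo, error) { return &OperatorInfo{Version: "v1.1.0"}, nil }
	absent := func() (*OperatorInfo, error) { return nil, nil }
	broken := func() (*OperatorInfo, error) { return nil, errors.New("list failed") }

	fmt.Println(checkPreConditions(installed))
	fmt.Println(checkPreConditions(absent))
	fmt.Println(checkPreConditions(broken))
}
```

Keeping "lookup failed" and "operator absent" as separate branches is what lets the controller report a missing operator to the user without masking genuine API errors.
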
internal/controller/components/kueue/kueue_config_test.go (3)

5-5: LGTM! Imports appropriately added for test enhancements.

The new imports support:

  • fmt: String formatting for versioned OperatorCondition names
  • ofapiv2: Creating OperatorCondition fixtures for version-gated tests
  • unstructured: Extracting frameworks from generated CR for assertions

Also applies to: 8-8, 12-12


572-579: LGTM! OperatorCondition fixture enables version-aware testing.

The test correctly seeds an OperatorCondition with version "v1.2.0" to enable TrainJob framework inclusion in generated Kueue CRs. This aligns with the version-gating logic in kueue_config.go.


69-69: LGTM! TrainJob correctly included in expected Kueue CRs.

TrainJob is properly added to the expected frameworks across all test scenarios, consistent with the v1.2.0+ version gating behavior implemented in kueue_config.go.

Also applies to: 100-100, 145-145, 172-172, 199-199, 230-230, 269-269, 300-300, 330-330, 363-363, 398-398, 435-435, 475-475, 508-508, 544-544

internal/controller/components/trainer/trainer_controller_actions_test.go (3)

5-5: LGTM! Import added for versioned fixture naming.

The fmt import is appropriately added to support fmt.Sprintf for constructing versioned OperatorCondition names in test fixtures.


24-24: LGTM! Centralized version constant for test fixtures.

The jobSetOperatorRndVersion constant provides a single source of truth for the operator version used across test fixtures, improving maintainability.


55-55: LGTM! OperatorCondition names correctly versioned for test alignment.

The test fixtures now use versioned OperatorCondition names ("jobset-operator.1.1.0") that align with the updated OperatorExists API semantics. This ensures tests accurately reflect production behavior where operator versions are extracted from OperatorCondition names.

Also applies to: 94-94

internal/controller/components/kueue/kueue_config.go (6)

9-9: LGTM! Import added for semver-based version comparison.

The golang.org/x/mod/semver import enables version-aware framework inclusion logic for TrainJob support.


38-38: LGTM! TrainJob mapping correctly added.

The framework mapping "trainer.kubeflow.org/trainjob" → "TrainJob" properly translates the Kueue configuration format to the operator CR format.
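
As a rough illustration of the translation step mentioned above: only the `trainer.kubeflow.org/trainjob` entry comes from this review, and the `convert` helper plus its choice to skip unknown names silently are assumptions, not a description of the actual convertIntegrations function.

```go
package main

import "fmt"

// frameworkMapping translates Kueue integration names into the names used
// in the operator CR. Only the TrainJob entry is taken from the review;
// the real table contains more entries that are not reproduced here.
var frameworkMapping = map[string]string{
	"trainer.kubeflow.org/trainjob": "TrainJob",
}

// convert maps known integration names in order and skips unknown ones
// (skipping is an assumption made for this sketch).
func convert(integrations []string) []string {
	out := make([]string, 0, len(integrations))
	for _, name := range integrations {
		if mapped, ok := frameworkMapping[name]; ok {
			out = append(out, mapped)
		}
	}
	return out
}

func main() {
	fmt.Println(convert([]string{"trainer.kubeflow.org/trainjob", "example.org/unknown"}))
}
```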


84-91: LGTM! Operator existence check properly gates CR creation.

The function correctly checks for Kueue operator presence before proceeding with CR creation:

  • Returns ErrKueueOperatorNotInstalled when operator is absent
  • Captures version information for downstream logic
  • Proper error wrapping maintains context

96-96: LGTM! Version parameter correctly threaded to conversion logic.

The kueueInfo.Version is properly passed to convertIntegrations to enable version-aware framework inclusion.


161-161: LGTM! Function signature updated for version-aware conversion.

The convertIntegrations signature correctly accepts kueueVersion string to support conditional TrainJob inclusion based on operator version.


182-187: Verify semver comparison handles version prefix correctly.

The version-gating logic uses semver.Compare(kueueVersion, "v1.2.0") to determine TrainJob inclusion. Based on the OperatorExists implementation in pkg/cluster/operator.go:65-92, the version string is guaranteed to have a "v" prefix (lines 84-86 of that file ensure this).

However, I cannot see the complete implementation in the annotated code. Please verify:

Run the following script to confirm the version prefix handling is consistent:

#!/bin/bash
# Verify that OperatorExists always returns versions with "v" prefix
# and that semver.Compare handles this correctly

# Check OperatorExists version prefix logic
echo "=== OperatorExists version handling ==="
rg -A 5 'if !strings.HasPrefix\(version, "v"\)' pkg/cluster/operator.go

# Check all uses of kueueVersion in convertIntegrations
echo ""
echo "=== kueueVersion usage in convertIntegrations ==="
rg -B 2 -A 2 'kueueVersion' internal/controller/components/kueue/kueue_config.go

# Verify semver.Compare usage pattern
echo ""
echo "=== semver.Compare usage ==="
rg 'semver\.Compare' internal/controller/components/kueue/
internal/controller/services/monitoring/monitoring_controller_support.go (3)

394-399: LGTM! OpenTelemetry operator check correctly updated.

The precondition check properly consumes the new OperatorExists API:

  • Uses nil-check (openTelemetryInfo == nil) instead of boolean negation
  • Preserves error handling: returns StopErrorW on error, appends StopError with message on nil info
  • Variable naming is clear and consistent

404-409: LGTM! Cluster Observability Operator check correctly updated.

The precondition check follows the same correct pattern as the OpenTelemetry operator check, with appropriate nil-checking and error handling for clusterObservabilityOperatorInfo.


414-419: LGTM! Tempo operator check correctly updated.

The precondition check follows the same correct pattern as the other operator checks, with appropriate nil-checking and error handling for tempoOperatorInfo.



Member

@zdtsw zdtsw left a comment

LGTM

@zdtsw
Member

zdtsw commented Dec 2, 2025

Hmm, I think the PR can be merged while 1.2 is not out yet,
as long as nobody is trying to use or test it.
Otherwise, you will force anyone who already has pre-1.2 installed to upgrade Kueue.

@zdtsw zdtsw requested review from davidebianchi and valdar and removed request for carlkyrillos and grdryn December 2, 2025 06:07
@CFSNM CFSNM force-pushed the train_job_kueue_cr branch from d2510ac to fdee3c0 Compare December 2, 2025 08:51
@CFSNM
Member Author

CFSNM commented Dec 2, 2025

@zdtsw the issue is that this TrainJob framework is not supported in Kueue 1.1, so if anybody tries to enable the kueue component using Kueue 1.1, it will throw an error when creating the Kueue CR

@zdtsw
Member

zdtsw commented Dec 2, 2025

> @zdtsw the issue is that this TrainJob framework is not supported in Kueue 1.1, so if anybody tries to enable the kueue component using Kueue 1.1, it will throw an error when creating the Kueue CR

But nobody is going to use a nightly build. The ODH 3.2 release is to be cut on 8th Dec and released no earlier than 9th Dec.
Unless the plan is that this support won't be included in ODH 3.2?
Plus, if someone really wants to use Kueue with TO2 from the officially released ODH 3.2/RHOAI 3.2, they will need to upgrade RHBOK to 1.2 anyhow. In that case, you will need to add more logic in this PR to check the RHBOK version.

@davidebianchi
Member

davidebianchi commented Dec 2, 2025

If the Kueue configuration already exists, the operator keeps it as is, so if RHBOK 1.1 is already installed, the user should probably set TrainJob manually in the Kueue configuration.

To give the correct information to the user, we should:

  • in the kueue controller (if the Kueue configuration does not already exist), check the RHBOK version. If >= 1.2, we should add TrainJob, otherwise not
  • maybe also add a check: if kueue and trainer v2 are both enabled, the trainer v2 controller should verify that Kueue is configured with the TrainJob framework, and otherwise raise an error asking to add it manually. This would need to be verified with the trainer team
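
The second suggestion could be sketched like this. Nothing here exists in the PR: `verifyTrainJobFramework`, its boolean parameters and the error text are hypothetical, and how the trainer controller would actually read the Kueue frameworks list is left open.

```go
package main

import (
	"errors"
	"fmt"
)

// errTrainJobNotConfigured is a hypothetical error asking the user to add
// the framework by hand.
var errTrainJobNotConfigured = errors.New("kueue configuration is missing the TrainJob framework; add it manually")

// verifyTrainJobFramework only checks when both components are enabled.
// frameworks is the list read from the existing Kueue configuration.
func verifyTrainJobFramework(kueueEnabled, trainerEnabled bool, frameworks []string) error {
	if !kueueEnabled || !trainerEnabled {
		return nil // nothing to verify unless both components are on
	}
	for _, f := range frameworks {
		if f == "TrainJob" {
			return nil
		}
	}
	return errTrainJobNotConfigured
}

func main() {
	// "ExampleFramework" is a placeholder name, for illustration only.
	fmt.Println(verifyTrainJobFramework(true, true, []string{"ExampleFramework"}))
	fmt.Println(verifyTrainJobFramework(true, true, []string{"ExampleFramework", "TrainJob"}))
	fmt.Println(verifyTrainJobFramework(true, false, nil))
}
```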

@CFSNM CFSNM marked this pull request as draft December 2, 2025 10:16
@CFSNM CFSNM force-pushed the train_job_kueue_cr branch from fdee3c0 to beb3692 Compare December 2, 2025 12:08
@github-actions
Contributor

github-actions bot commented Dec 2, 2025

This PR can't be merged just yet 😢

Please run make generate manifests api-docs and commit the changes.

For more info: https://github.com/opendatahub-io/opendatahub-operator/actions/runs/19858055330

@CFSNM CFSNM force-pushed the train_job_kueue_cr branch from beb3692 to 639497f Compare December 2, 2025 12:10
@github-actions
Contributor

github-actions bot commented Dec 2, 2025

This PR can't be merged just yet 😢

Please run make generate manifests api-docs and commit the changes.

For more info: https://github.com/opendatahub-io/opendatahub-operator/actions/runs/19858123976

@CFSNM CFSNM force-pushed the train_job_kueue_cr branch from 639497f to 2d51f98 Compare December 2, 2025 13:29
@CFSNM CFSNM force-pushed the train_job_kueue_cr branch from 7210935 to 7cd001d Compare December 10, 2025 08:00
@openshift-ci openshift-ci bot removed the lgtm label Dec 10, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
internal/controller/components/trainer/trainer_controller_actions_test.go (1)

5-25: Version constant introduction in tests looks fine; naming could be slightly clearer

🔒 Security Findings

  • None. This is test-only code with no secrets, RBAC, or external I/O.

💡 Other

  • Using a shared jobSetOperatorRndVersion constant is a good way to keep the OperatorCondition naming consistent across tests.
  • If this constant is meant to represent a specific supported JobSet operator version (vs. just a random placeholder), consider renaming to something like jobSetOperatorTestVersion or adding a short comment to clarify its intent. This is purely cosmetic and optional.
internal/controller/components/kueue/kueue_config_test.go (2)

664-725: Well-structured version-gated test for TrainJob framework.

💡 Other

The test properly exercises both Kueue versions:

  • v1.1.0: TrainJob should NOT be in frameworks
  • v1.2.0: TrainJob should be in frameworks

The previous review issue (switch cases not matching table values) has been correctly addressed - both now use "v1.1.0" and "v1.2.0".

One minor suggestion: consider adding a v1.3.0 or v2.0.0 test case to verify forward compatibility (versions greater than v1.2.0 should also include TrainJob).

 		{
 			name:         "TestCreateKueueConfigurationCR_TrainJob_Framework_WithRHBoKv120",
 			kueueVersion: "v1.2.0",
 		},
+		{
+			name:         "TestCreateKueueConfigurationCR_TrainJob_Framework_WithRHBoKv130",
+			kueueVersion: "v1.3.0",
+		},
 	}

And update the switch to handle this:

 			case "v1.1.0":
 				g.Expect(frameworks).ShouldNot(ContainElement("TrainJob"))
-			case "v1.2.0":
+			case "v1.2.0", "v1.3.0":
 				g.Expect(frameworks).Should(ContainElement("TrainJob"))

609-662: Test still valid despite missing OperatorCondition.

💡 Other

This test doesn't create an OperatorCondition, but it still works correctly because lookupKueueManagerConfig is called before cluster.OperatorExists in createKueueCR. The invalid YAML error is returned first, matching the expected error message "failed to lookup kueue manager config".

However, if the YAML were valid, this test would fail at the operator check. Consider adding a comment to clarify this dependency on execution order, or add an OperatorCondition for consistency with other tests.

 	s, err := scheme.New()
 	g.Expect(err).ShouldNot(HaveOccurred())
 	fakeClient := fake.NewClientBuilder().WithScheme(s).Build()
+
+	// Note: OperatorCondition not created since YAML parsing error occurs first
+	// in createKueueCR before the operator existence check
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7210935 and 7cd001d.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum, !**/*.sum
📒 Files selected for processing (11)
  • go.mod (1 hunks)
  • internal/controller/components/kueue/kueue_config.go (5 hunks)
  • internal/controller/components/kueue/kueue_config_test.go (18 hunks)
  • internal/controller/components/kueue/kueue_controller_actions.go (1 hunks)
  • internal/controller/components/kueue/kueue_controller_actions_test.go (4 hunks)
  • internal/controller/components/trainer/trainer_controller_actions.go (1 hunks)
  • internal/controller/components/trainer/trainer_controller_actions_test.go (4 hunks)
  • internal/controller/services/monitoring/monitoring_controller_support.go (3 hunks)
  • pkg/cluster/cluster_config.go (1 hunks)
  • pkg/cluster/operator.go (2 hunks)
  • tests/e2e/test_context_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • tests/e2e/test_context_test.go
  • go.mod
  • internal/controller/components/kueue/kueue_controller_actions_test.go
  • pkg/cluster/cluster_config.go
  • internal/controller/services/monitoring/monitoring_controller_support.go
  • internal/controller/components/trainer/trainer_controller_actions.go
  • pkg/cluster/operator.go
  • internal/controller/components/kueue/kueue_controller_actions.go
🧰 Additional context used
📓 Path-based instructions (1)
internal/controller/**/*.go

⚙️ CodeRabbit configuration file

internal/controller/**/*.go: CRITICAL SECURITY CHECKS:

  1. Hardcoded secrets/credentials
  2. Reconciliation loop safety (infinite loops, resource exhaustion)
  3. Proper error handling for security-critical operations
  4. Validation of CR spec fields before use

Files:

  • internal/controller/components/kueue/kueue_config.go
  • internal/controller/components/kueue/kueue_config_test.go
  • internal/controller/components/trainer/trainer_controller_actions_test.go
🧬 Code graph analysis (1)
internal/controller/components/kueue/kueue_config.go (2)
pkg/cluster/operator.go (1)
  • OperatorExists (66-93)
internal/controller/components/kueue/kueue.go (1)
  • ErrKueueOperatorNotInstalled (32-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build/push catalog image
  • GitHub Check: Run tests and collect coverage on internal and pkg
  • GitHub Check: golangci-lint
🔇 Additional comments (7)
internal/controller/components/trainer/trainer_controller_actions_test.go (2)

55-56: OperatorCondition name format change correctly aligns with version-qualified discovery

🔒 Security Findings

  • None. Fake OperatorCondition objects are only used for local test assertions.

💡 Other

  • Switching the OperatorCondition name to fmt.Sprintf("%s.%s", jobSetOperator, jobSetOperatorRndVersion) matches the new convention where conditions are suffixed with the operator version.
  • This should keep TestCheckPreConditions_Managed_JobSetCRDNotInstalled properly in sync with the updated operatorExists behavior.

94-95: Consistent versioned naming in CRD-installed test

🔒 Security Findings

  • None. Test fixture objects don’t introduce security risks.

💡 Other

  • Using the same version-qualified name format for jobSetOperatorCondition in TestCheckPreConditions_Managed_JobSetCRDInstalled keeps the positive-path test aligned with the discovery logic.
  • The fixture setup now mirrors the expected real-world OperatorCondition name, which should make these tests more robust to future changes in operator discovery.
internal/controller/components/kueue/kueue_config.go (3)

84-91: LGTM! Proper operator existence check with version retrieval.

The nil check on kueueInfo correctly returns ErrKueueOperatorNotInstalled (a stop error per the relevant snippet), preventing reconciliation from proceeding when the Kueue operator isn't installed. Error wrapping provides good context.


26-40: Framework mapping extension looks good.

The addition of the trainer.kubeflow.org/trainjob → TrainJob mapping aligns with the PR objective to support the new TrainJob framework for RHOAI 3.2's trainer component.


182-187: Version-gated TrainJob inclusion logic is correct.

The semver comparison correctly gates TrainJob on Kueue ≥ v1.2.0. The code snippet shows proper version checking without obvious issues. An edge case exists if kueueVersion is empty (which OperatorExists may return), where semver.Compare("", "v1.2.0") would return -1, preventing TrainJob addition—a safe default behavior. However, this edge case behavior depends on the actual implementation of upstream dependencies and how kueueVersion is populated, which requires runtime or integration testing to fully verify.

internal/controller/components/kueue/kueue_config_test.go (2)

572-579: Good addition of OperatorCondition setup for test helper.

This ensures all existing tests run against the v1.2.0 behavior where TrainJob is included. The naming format kueue-operator.v1.2.0 matches the pattern expected by OperatorExists.


100-101: All expected CRs correctly updated to include TrainJob.

The expected YAML outputs for all existing tests have been updated to include TrainJob in the frameworks list, consistent with the v1.2.0 OperatorCondition now seeded by runKueueCRTest.

Also applies to: 145-146, 172-173, 199-200, 230-231, 269-270, 300-301, 330-331, 363-364, 398-399, 435-436, 475-476, 508-509, 544-545

@github-actions
Contributor

/test-integration

@rhods-ci-bot
Collaborator

ODH Operator Integration Tests #68 triggered by #2945 (comment)

@CFSNM CFSNM force-pushed the train_job_kueue_cr branch from 7cd001d to e41f729 Compare December 10, 2025 08:15
@github-actions
Contributor

/test-integration

@rhods-ci-bot
Collaborator

ODH Operator Integration Tests #69 triggered by #2945 (comment)

@rhods-ci-bot
Collaborator

ODH Operator Integration Tests Results

Build: #68 rhoai-test-flow #8679
Result: FAILURE (50% pass rate)

Test Summary:

  • 📊 Total Tests: 2
  • Passed: 1
  • Failed: 1
  • ⚠️ Errors: 0
  • ⏭️ Skipped: 0

Details: View Full Test Report

@rhods-ci-bot
Collaborator

ODH Operator Integration Tests Results

Build: #69 rhoai-test-flow #8681
Result: FAILURE (50% pass rate)

Test Summary:

  • 📊 Total Tests: 2
  • Passed: 1
  • Failed: 1
  • ⚠️ Errors: 0
  • ⏭️ Skipped: 0

Details: View Full Test Report

Member

@davidebianchi davidebianchi left a comment

LGTM

@openshift-ci

openshift-ci bot commented Dec 10, 2025

@CFSNM: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/opendatahub-operator-e2e-hypershift | e41f729 | link | false | /test opendatahub-operator-e2e-hypershift |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD b868dd5 and 2 for PR HEAD e41f729 in total

@CFSNM
Member Author

CFSNM commented Dec 10, 2025

/test opendatahub-operator-e2e-hypershift

…d by the operator

Update pkg/cluster/operator.go

Co-authored-by: Davide Bianchi

Update pkg/cluster/operator.go

Co-authored-by: Davide Bianchi
@CFSNM CFSNM force-pushed the train_job_kueue_cr branch from e41f729 to 71d5309 Compare December 10, 2025 13:38
@openshift-ci openshift-ci bot removed the lgtm label Dec 10, 2025
@github-actions
Contributor

/test-integration

@rhods-ci-bot
Collaborator

ODH Operator Integration Tests #70 triggered by #2945 (comment)

@openshift-ci openshift-ci bot added the lgtm label Dec 10, 2025
@openshift-ci

openshift-ci bot commented Dec 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidebianchi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rhods-ci-bot
Collaborator

ODH Operator Integration Tests Results

Build: #70 rhoai-test-flow #8707
Result: FAILURE (50% pass rate)

Test Summary:

  • 📊 Total Tests: 2
  • Passed: 1
  • Failed: 1
  • ⚠️ Errors: 0
  • ⏭️ Skipped: 0

Details: View Full Test Report

@CFSNM
Member Author

CFSNM commented Dec 10, 2025

/bypass opendatahub-operator-e2e-hypershif

@openshift-merge-bot openshift-merge-bot bot merged commit d2d1481 into opendatahub-io:main Dec 10, 2025
19 of 20 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in ODH Platform Planning Dec 10, 2025
@CFSNM CFSNM deleted the train_job_kueue_cr branch December 10, 2025 15:27
Projects

Status: Done

7 participants