-
Notifications
You must be signed in to change notification settings - Fork 215
RHOAIENG-40544: support new TrainJob framework in the Kueue CR managed by the operator #2945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RHOAIENG-40544: support new TrainJob framework in the Kueue CR managed by the operator #2945
Conversation
|
/hold |
WalkthroughOperator discovery now returns an OperatorInfo pointer with Version; callers check nil and use Version. Kueue conversion became semver-aware: Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
🔒 Security Findings
💡 Other
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (11)
🚧 Files skipped from review as they are similar to previous changes (4)
🧰 Additional context used📓 Path-based instructions (1)internal/controller/**/*.go⚙️ CodeRabbit configuration file
Files:
🧬 Code graph analysis (4)tests/e2e/test_context_test.go (1)
internal/controller/services/monitoring/monitoring_controller_support.go (1)
internal/controller/components/trainer/trainer_controller_actions.go (1)
internal/controller/components/kueue/kueue_config.go (2)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (18)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
1bc122b to
d2510ac
Compare
zdtsw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
hum, i think the PR can be merged before 1.2 is not out. |
d2510ac to
fdee3c0
Compare
|
@zdtsw the issue is that this TrainJob framework is not supported in Kueue 1.1, so if anybody tries to enable the kueue component using Kueue 1.1, it will throw an error when creating the Kueue CR |
but nobody is going to use a nightly build. ODH 3.2 release, is to cut on 8th Dec, release no early than 9th Dec. |
|
If the Kueue configuration already exists it will be kept as is by the operator, so if it's already installed RHBOK 1.1, probably the user should manually set To give the correct information to the user, we should:
|
fdee3c0 to
beb3692
Compare
This PR can't be merged just yet 😢Please run For more info: https://github.com/opendatahub-io/opendatahub-operator/actions/runs/19858055330 |
beb3692 to
639497f
Compare
This PR can't be merged just yet 😢Please run For more info: https://github.com/opendatahub-io/opendatahub-operator/actions/runs/19858123976 |
639497f to
2d51f98
Compare
7210935 to
7cd001d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
internal/controller/components/trainer/trainer_controller_actions_test.go (1)
5-25: Version constant introduction in tests looks fine; naming could be slightly clearer🔒 Security Findings
- None. This is test-only code with no secrets, RBAC, or external I/O.
💡 Other
- Using a shared
jobSetOperatorRndVersionconstant is a good way to keep the OperatorCondition naming consistent across tests.- If this constant is meant to represent a specific supported JobSet operator version (vs. just a random placeholder), consider renaming to something like
jobSetOperatorTestVersionor adding a short comment to clarify its intent. This is purely cosmetic and optional.internal/controller/components/kueue/kueue_config_test.go (2)
664-725: Well-structured version-gated test for TrainJob framework.💡 Other
The test properly exercises both Kueue versions:
- v1.1.0: TrainJob should NOT be in frameworks
- v1.2.0: TrainJob should be in frameworks
The previous review issue (switch cases not matching table values) has been correctly addressed - both now use
"v1.1.0"and"v1.2.0".One minor suggestion: consider adding a v1.3.0 or v2.0.0 test case to verify forward compatibility (versions greater than v1.2.0 should also include TrainJob).
{ name: "TestCreateKueueConfigurationCR_TrainJob_Framework_WithRHBoKv120", kueueVersion: "v1.2.0", }, + { + name: "TestCreateKueueConfigurationCR_TrainJob_Framework_WithRHBoKv130", + kueueVersion: "v1.3.0", + }, }And update the switch to handle this:
case "v1.1.0": g.Expect(frameworks).ShouldNot(ContainElement("TrainJob")) - case "v1.2.0": + case "v1.2.0", "v1.3.0": g.Expect(frameworks).Should(ContainElement("TrainJob"))
609-662: Test still valid despite missing OperatorCondition.💡 Other
This test doesn't create an OperatorCondition, but it still works correctly because
lookupKueueManagerConfigis called beforecluster.OperatorExistsincreateKueueCR. The invalid YAML error is returned first, matching the expected error message"failed to lookup kueue manager config".However, if the YAML were valid, this test would fail at the operator check. Consider adding a comment to clarify this dependency on execution order, or add an OperatorCondition for consistency with other tests.
s, err := scheme.New() g.Expect(err).ShouldNot(HaveOccurred()) fakeClient := fake.NewClientBuilder().WithScheme(s).Build() + + // Note: OperatorCondition not created since YAML parsing error occurs first + // in createKueueCR before the operator existence check
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum,!**/*.sum
📒 Files selected for processing (11)
go.mod(1 hunks)internal/controller/components/kueue/kueue_config.go(5 hunks)internal/controller/components/kueue/kueue_config_test.go(18 hunks)internal/controller/components/kueue/kueue_controller_actions.go(1 hunks)internal/controller/components/kueue/kueue_controller_actions_test.go(4 hunks)internal/controller/components/trainer/trainer_controller_actions.go(1 hunks)internal/controller/components/trainer/trainer_controller_actions_test.go(4 hunks)internal/controller/services/monitoring/monitoring_controller_support.go(3 hunks)pkg/cluster/cluster_config.go(1 hunks)pkg/cluster/operator.go(2 hunks)tests/e2e/test_context_test.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- tests/e2e/test_context_test.go
- go.mod
- internal/controller/components/kueue/kueue_controller_actions_test.go
- pkg/cluster/cluster_config.go
- internal/controller/services/monitoring/monitoring_controller_support.go
- internal/controller/components/trainer/trainer_controller_actions.go
- pkg/cluster/operator.go
- internal/controller/components/kueue/kueue_controller_actions.go
🧰 Additional context used
📓 Path-based instructions (1)
internal/controller/**/*.go
⚙️ CodeRabbit configuration file
internal/controller/**/*.go: CRITICAL SECURITY CHECKS:
- Hardcoded secrets/credentials
- Reconciliation loop safety (infinite loops, resource exhaustion)
- Proper error handling for security-critical operations
- Validation of CR spec fields before use
Files:
internal/controller/components/kueue/kueue_config.gointernal/controller/components/kueue/kueue_config_test.gointernal/controller/components/trainer/trainer_controller_actions_test.go
🧬 Code graph analysis (1)
internal/controller/components/kueue/kueue_config.go (2)
pkg/cluster/operator.go (1)
OperatorExists(66-93)internal/controller/components/kueue/kueue.go (1)
ErrKueueOperatorNotInstalled(32-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build/push catalog image
- GitHub Check: Run tests and collect coverage on internal and pkg
- GitHub Check: golangci-lint
🔇 Additional comments (7)
internal/controller/components/trainer/trainer_controller_actions_test.go (2)
55-56: OperatorCondition name format change correctly aligns with version-qualified discovery🔒 Security Findings
- None. Fake OperatorCondition objects are only used for local test assertions.
💡 Other
- Switching the
OperatorConditionname tofmt.Sprintf("%s.%s", jobSetOperator, jobSetOperatorRndVersion)matches the new convention where conditions are suffixed with the operator version.- This should keep
TestCheckPreConditions_Managed_JobSetCRDNotInstalledproperly in sync with the updatedoperatorExistsbehavior.
94-95: Consistent versioned naming in CRD-installed test🔒 Security Findings
- None. Test fixture objects don’t introduce security risks.
💡 Other
- Using the same version-qualified name format for
jobSetOperatorConditioninTestCheckPreConditions_Managed_JobSetCRDInstalledkeeps the positive-path test aligned with the discovery logic.- The fixture setup now mirrors the expected real-world
OperatorConditionname, which should make these tests more robust to future changes in operator discovery.internal/controller/components/kueue/kueue_config.go (3)
84-91: LGTM! Proper operator existence check with version retrieval.The nil check on
kueueInfocorrectly returnsErrKueueOperatorNotInstalled(a stop error per the relevant snippet), preventing reconciliation from proceeding when the Kueue operator isn't installed. Error wrapping provides good context.
26-40: Framework mapping extension looks good.The addition of
trainer.kubeflow.org/trainjob→TrainJobmapping aligns with the PR objective to support the new TrainJob framework for RHOAI 3.2's trainer component.
182-187: Version-gated TrainJob inclusion logic is correct.The semver comparison correctly gates TrainJob on Kueue ≥ v1.2.0. The code snippet shows proper version checking without obvious issues. An edge case exists if
kueueVersionis empty (whichOperatorExistsmay return), wheresemver.Compare("", "v1.2.0")would return -1, preventing TrainJob addition—a safe default behavior. However, this edge case behavior depends on the actual implementation of upstream dependencies and howkueueVersionis populated, which requires runtime or integration testing to fully verify.internal/controller/components/kueue/kueue_config_test.go (2)
572-579: Good addition of OperatorCondition setup for test helper.This ensures all existing tests run against the v1.2.0 behavior where TrainJob is included. The naming format
kueue-operator.v1.2.0matches the pattern expected byOperatorExists.
100-101: All expected CRs correctly updated to include TrainJob.The expected YAML outputs for all existing tests have been updated to include
TrainJobin the frameworks list, consistent with the v1.2.0 OperatorCondition now seeded byrunKueueCRTest.Also applies to: 145-146, 172-173, 199-200, 230-231, 269-270, 300-301, 330-331, 363-364, 398-399, 435-436, 475-476, 508-509, 544-545
|
/test-integration |
|
ODH Operator Integration Tests #68 triggered by #2945 (comment) |
7cd001d to
e41f729
Compare
|
/test-integration |
|
ODH Operator Integration Tests #69 triggered by #2945 (comment) |
|
❌ ODH Operator Integration Tests Results Build: #68 → rhoai-test-flow #8679 Test Summary:
Details: View Full Test Report |
|
❌ ODH Operator Integration Tests Results Build: #69 → rhoai-test-flow #8681 Test Summary:
Details: View Full Test Report |
davidebianchi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
@CFSNM: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/test opendatahub-operator-e2e-hypershift |
…d by the operator Update pkg/cluster/operator.go Co-authored-by: Davide Bianchi <[email protected]> Update pkg/cluster/operator.go Co-authored-by: Davide Bianchi <[email protected]>
e41f729 to
71d5309
Compare
|
/test-integration |
|
ODH Operator Integration Tests #70 triggered by #2945 (comment) |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: davidebianchi The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
❌ ODH Operator Integration Tests Results Build: #70 → rhoai-test-flow #8707 Test Summary:
Details: View Full Test Report |
|
/bypass opendatahub-operator-e2e-hypershif |
d2d1481
into
opendatahub-io:main
Description
We need to support a new framework (TrainJob) in the Kueue CR creation for RHOAI 3.2, for the new trainer component relying on the jobset operator.
If Kueue 1.2 is enabled, then we need to add the new TrainJob framework to the Kueue CR. If not, do not add anything, since this framework is only supportes starting in kueue 1.2.
The operatorExists method has been reworked to also return the installed version of an operator if this operator is installed
https://issues.redhat.com/browse/RHOAIENG-40544
How Has This Been Tested?
Deploying Kueue as Unmanaged and checking the Kueue CR yaml using both RHBoK 1.1 and 1.2
Screenshot or short clip
Merge criteria
E2E test suite update requirement
When bringing new changes to the operator code, such changes are by default required to be accompanied by extending and/or updating the E2E test suite accordingly.
To opt-out of this requirement:
E2E update requirement opt-out justificationsection belowE2E update requirement opt-out justification
Summary by CodeRabbit
New Features
Bug Fixes / Improvements
Chores
Tests
✏️ Tip: You can customize this high-level summary in your review settings.