[RHOAIENG-38499] - OCI model storage is not working with multi-node f… #998

spolti · 2025-11-28T21:09:51Z

…eature

chore: Fix the scenario where using the OCI model cache with the multi-node feature fails with:
denied the request: no container found with name kserve-container

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Type of changes
Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Feature/Issue validation/testing:

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

Test A
Test B
Logs

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Checklist:

Have you added unit/e2e tests that prove your fix is effective or that this feature works?
Has code been commented, particularly in hard-to-understand areas?
Have you made corresponding changes to the documentation?
Have you linked the JIRA issue(s) to this PR?

Release note:

Re-running failed tests

/rerun-all - rerun all failed workflows.
/rerun-workflow <workflow name> - rerun a specific failed workflow. Only one workflow name can be specified. Multiple /rerun-workflow commands are allowed per comment.

Summary by CodeRabbit

New Features
- Enhanced model storage initialization to support multi-container pod configurations with intelligent container prioritization.
- Improved configuration to automatically select appropriate containers when multiple options are available.
Bug Fixes
- Added error handling for scenarios where no valid container is found for model storage configuration.
Tests
- Expanded test coverage for multi-node model storage injection scenarios and container prioritization logic.


openshift-ci · 2025-11-28T21:10:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: spolti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [spolti]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2025-11-28T21:10:13Z

Walkthrough

The storage initializer injector now supports multi-container pod configurations for modelcar injection. It prioritizes InferenceService containers, falls back to Worker containers if absent, and conditionally configures Transformer containers. Comprehensive test coverage validates container prioritization, error handling, and multi-container scenarios.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Storage Initializer Injection Logic `pkg/webhook/admission/pod/storage_initializer_injector.go` | Enhanced `InjectModelcar` to handle multi-container awareness: prioritizes InferenceService container, falls back to Worker container, and configures Transformer container when present. Returns error if no valid container is found. |
| Test Suite for Multi-Node Scenarios `pkg/webhook/admission/pod/storage_initializer_injector_test.go` | Added `TestInjectModelcarMultiNode` test suite with subtests validating worker-only injection, container prioritization (kserve over worker), error cases, and combined transformer/worker scenarios. Includes helper Pod constructors for multi-node setup. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Areas requiring attention:
- Container prioritization logic in InjectModelcar to ensure fallback chain is correct and error handling is appropriate
- Test assertion accuracy for each multi-node scenario, particularly the container mount validation and prioritization logic
- Edge cases where containers might be configured unexpectedly or conflict with existing configurations

Poem

🐰 A pod with many containers bright,
We pick the right one—what a sight!
Worker waits when kserve's away,
Transformers join to save the day,
Storage flows through each with care, ✨
Multinode magic floating there!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly identifies the main change: fixing OCI model storage functionality for multi-node scenarios, which is the core purpose of the changeset. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

pkg/webhook/admission/pod/storage_initializer_injector_test.go (1)

4457-4555: Good multi-node coverage; consider tightening assertions and comments

The new multi-node tests and helpers nicely cover:

Worker-only pods getting modelcar + volume mounts.

Explicit error when neither kserve nor worker container exists.

Preference for kserve-container when both kserve and worker are present.

Worker + transformer both receiving mounts when there’s no kserve container.

Two small polish suggestions:

In the "prioritizes kserve-container over worker-container" subtest, the comment says “Both containers should have volume mounts but kserve-container should be prioritized” while the assertions only verify the mount on kserve-container. Either add an explicit check for the worker’s mounts (or lack thereof, depending on the intended behavior) or reword the comment so it matches what is actually asserted.

If you want parity with the single-node tests, you might also assert invariants like ShareProcessNamespace or the model-init env var in at least one multi-node case, but that’s optional.

Functionally this suite looks solid and should catch regressions in the new selection logic.
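The first polish suggestion above (state an expectation for every container, not only the preferred one) could be approached with a small helper like the one below. This is only a sketch: `Container` and `VolumeMount` are simplified stand-ins for the Kubernetes types, `checkMounts` is a hypothetical helper that does not exist in the PR, and the volume name and mount path are assumed values. Whether the worker container should carry the mount when `kserve-container` is present is left to the author; the sketch simply forces the test to say so.

```go
package main

import "fmt"

// VolumeMount and Container are hypothetical stand-ins for the
// Kubernetes corev1 types; they carry only what this sketch needs.
type VolumeMount struct {
	Name      string
	MountPath string
}

type Container struct {
	Name         string
	VolumeMounts []VolumeMount
}

// hasMount reports whether the container has a mount for the named volume.
func hasMount(c Container, volume string) bool {
	for _, m := range c.VolumeMounts {
		if m.Name == volume {
			return true
		}
	}
	return false
}

// checkMounts compares every container against an explicit expectation and
// returns one message per mismatch. A container without a declared
// expectation is itself reported, so nothing is left unasserted.
func checkMounts(containers []Container, volume string, want map[string]bool) []string {
	var problems []string
	for _, c := range containers {
		expected, listed := want[c.Name]
		if !listed {
			problems = append(problems, fmt.Sprintf("%s: no expectation declared", c.Name))
			continue
		}
		if got := hasMount(c, volume); got != expected {
			problems = append(problems, fmt.Sprintf("%s: mount present = %t, want %t", c.Name, got, expected))
		}
	}
	return problems
}

func main() {
	const volume = "kserve-provision-location" // assumed volume name

	containers := []Container{
		{Name: "kserve-container", VolumeMounts: []VolumeMount{{Name: volume, MountPath: "/mnt/models"}}},
		{Name: "worker-container"},
	}
	// The expectation for the worker is an assumption; flip it if the
	// intended behavior is that both containers receive the mount.
	want := map[string]bool{"kserve-container": true, "worker-container": false}

	for _, p := range checkMounts(containers, volume, want) {
		fmt.Println(p)
	}
}
```

In an actual subtest each returned message would typically become a test failure, which keeps the code comment and the assertions from drifting apart.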

Also applies to: 4559-4619

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0eaa31 and 7c18959.

📒 Files selected for processing (3)

config/default/manager_image_patch.yaml (1 hunks)
pkg/webhook/admission/pod/storage_initializer_injector.go (1 hunks)
pkg/webhook/admission/pod/storage_initializer_injector_test.go (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

pkg/webhook/admission/pod/storage_initializer_injector_test.go (5)

pkg/webhook/admission/pod/storage_initializer_injector.go (1)

StorageInitializerInjector (48-52)

pkg/types/config.go (1)

StorageInitializerConfig (19-32)

pkg/utils/utils.go (1)

GetContainerWithName (249-256)

pkg/constants/constants.go (7)

ModelcarContainerName (408-408)

WorkerContainerName (405-405)

StorageInitializerVolumeName (186-186)

InferenceServiceContainerName (398-398)

TransformerContainerName (402-402)

StorageInitializerSourceUriInternalAnnotationKey (135-135)

OciURIPrefix (181-181)

pkg/apis/serving/v1beta1/podspec.go (1)

PodSpec (28-339)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)

GitHub Check: test
GitHub Check: Test
GitHub Check: build (3.11)
GitHub Check: build (3.12)
GitHub Check: build (3.9)
GitHub Check: build (3.10)
GitHub Check: test
GitHub Check: test
GitHub Check: precommit-check
GitHub Check: test
GitHub Check: test

🔇 Additional comments (1)

pkg/webhook/admission/pod/storage_initializer_injector.go (1)

95-112: Multi-node modelcar container selection logic looks sound

The updated InjectModelcar correctly:

Prefers constants.InferenceServiceContainerName when present.

Falls back to constants.WorkerContainerName when the inference container is absent, matching the pattern already used in InjectStorageInitializer.

Returns a clear configuration error only when neither container exists.

Optionally configures the transformer container when present.

This matches the intended multi-node OCI behavior and aligns with the new tests that cover worker-only, no-valid-container, and worker+transformer cases.

Also applies to: 114-117
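The selection order described in this comment can be sketched roughly as follows. This is not the code from `storage_initializer_injector.go`: the types are simplified stand-ins for the Kubernetes ones, `find`, `selectModelContainer`, and `injectMounts` are hypothetical names, and the transformer container name and mount path are assumed values. The sketch also assumes the worker container is left untouched when `kserve-container` is present, which the review does not confirm.

```go
package main

import (
	"errors"
	"fmt"
)

// Container and PodSpec are hypothetical, simplified stand-ins for the
// Kubernetes corev1 types.
type Container struct {
	Name         string
	VolumeMounts []string
}

type PodSpec struct {
	Containers []Container
}

const (
	inferenceContainer   = "kserve-container"
	workerContainer      = "worker-container"
	transformerContainer = "transformer-container" // assumed value
	modelMount           = "/mnt/models"           // assumed mount path
)

var errNoTarget = errors.New("no kserve-container or worker-container found")

// find returns a pointer to the named container, or nil if it is absent.
func find(spec *PodSpec, name string) *Container {
	for i := range spec.Containers {
		if spec.Containers[i].Name == name {
			return &spec.Containers[i]
		}
	}
	return nil
}

// selectModelContainer prefers the inference container, falls back to the
// worker container, and fails only when neither exists.
func selectModelContainer(spec *PodSpec) (*Container, error) {
	if c := find(spec, inferenceContainer); c != nil {
		return c, nil
	}
	if c := find(spec, workerContainer); c != nil {
		return c, nil
	}
	return nil, errNoTarget
}

// injectMounts mounts the model path on the selected container and, when
// one is present, on the transformer container as well.
func injectMounts(spec *PodSpec) error {
	target, err := selectModelContainer(spec)
	if err != nil {
		return err
	}
	target.VolumeMounts = append(target.VolumeMounts, modelMount)
	if t := find(spec, transformerContainer); t != nil {
		t.VolumeMounts = append(t.VolumeMounts, modelMount)
	}
	return nil
}

func main() {
	// Worker-only multi-node pod with a transformer, no kserve-container.
	spec := &PodSpec{Containers: []Container{{Name: workerContainer}, {Name: transformerContainer}}}
	if err := injectMounts(spec); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range spec.Containers {
		fmt.Println(c.Name, c.VolumeMounts)
	}
}
```

Keeping the selection in one function makes the fallback chain easy to test in isolation from the mount and sidecar wiring.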

config/default/manager_image_patch.yaml

…eature chore: Fix the scenario where using the OCI model cache with the multi-node feature fails with: `denied the request: no container found with name kserve-container` Signed-off-by: Spolti <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

pkg/webhook/admission/pod/storage_initializer_injector.go (1)

95-117: Worker-container fallback in InjectModelcar looks correct; consider clarifying error message

The new fallback to constants.WorkerContainerName when kserve-container is absent is consistent with the InjectStorageInitializer behavior and should resolve the multi-node/worker-only scenario without impacting existing single-node behavior. The only nit is that the error still reports only kserve-container even though both the inference and worker containers are now considered valid targets; if you expect operators to debug worker-only pods directly, consider mentioning both acceptable container names in the error text.
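One way to act on the error-message nit is to build the message from the list of accepted container names, so it stays accurate if the list changes. The helper below is a hypothetical sketch, not the wording or function used in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// noTargetContainerError is a hypothetical helper that names every
// container the injector would have accepted.
func noTargetContainerError(accepted ...string) error {
	return fmt.Errorf("no container found with any of the names: %s", strings.Join(accepted, ", "))
}

func main() {
	err := noTargetContainerError("kserve-container", "worker-container")
	fmt.Println(err)
	// prints: no container found with any of the names: kserve-container, worker-container
}
```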

pkg/webhook/admission/pod/storage_initializer_injector_test.go (1)

4457-4555: Multi-node InjectModelcar tests cover key paths; minor comment/assertion mismatch

The new multi-node tests and helpers exercise the worker-only, no-valid-container, inference+worker, and worker+transformer cases and align well with the updated injector logic. In the “prioritizes kserve-container over worker-container” subtest, the comment says both containers should have volume mounts but the code only asserts the mount on kserve-container; either add an assertion for the worker container or relax the comment to match the actual expectation.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c18959 and 7b8a63d.

📒 Files selected for processing (2)

pkg/webhook/admission/pod/storage_initializer_injector.go (1 hunks)
pkg/webhook/admission/pod/storage_initializer_injector_test.go (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: Test
GitHub Check: build (3.10)
GitHub Check: build (3.12)
GitHub Check: build (3.11)
GitHub Check: build (3.9)
GitHub Check: test
GitHub Check: test
GitHub Check: precommit-check
GitHub Check: test
GitHub Check: test
GitHub Check: test
GitHub Check: test

spolti requested review from Jooho and VedantMahabaleshwarkar November 28, 2025 21:09

github-project-automation bot moved this to New/Backlog in ODH Model Serving Planning Nov 28, 2025

github-project-automation bot added this to ODH Model Serving Planning Nov 28, 2025

openshift-ci bot added the approved label Nov 28, 2025

spolti force-pushed the RHOAIENG-38499 branch from 6b74c84 to 7c18959 Compare November 28, 2025 21:10

coderabbitai bot reviewed Nov 28, 2025


config/default/manager_image_patch.yaml Outdated Show resolved Hide resolved

spolti force-pushed the RHOAIENG-38499 branch from 7c18959 to 7b8a63d Compare December 1, 2025 13:45

coderabbitai bot reviewed Dec 1, 2025
