
Conversation

@spolti spolti (Member) commented Nov 24, 2025

… RawDeployment mode only

chore: Fix tests.
Adapt the setup and run e2e tests to work with Raw Deployment
Update failing tests to work with Raw Deployment mode

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Feature/Issue validation/testing:

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A

  • Test B

  • Logs

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?
  • Have you linked the JIRA issue(s) to this PR?

Release note:


Re-running failed tests

  • /rerun-all - rerun all failed workflows.
  • /rerun-workflow <workflow name> - rerun a specific failed workflow. Only one workflow name can be specified. Multiple /rerun-workflow commands are allowed per comment.

Summary by CodeRabbit

  • New Features

    • Services can be labeled to expose inference endpoints via networking visibility.
  • Bug Fixes

    • CA certificate presence check added to avoid startup failures.
    • Improved TLS secret selection with safer fallback handling.
  • Configuration Changes

    • Default deployment mode switched from Serverless to RawDeployment.
    • KServe management state adjusted for test overlays.
  • Tests

    • E2E tests adapted for RawDeployment: many Knative/Canary tests skipped; test manifests updated to include networking-visibility labels.


@coderabbitai coderabbitai bot commented Nov 24, 2025

Walkthrough

Changed default InferenceService deployment mode to RawDeployment, added networking-visibility constants and labels across tests, switched some tests from Knative service reads to Kubernetes Deployment reads, introduced env-driven InferenceService endpoint selection, and updated OpenShift CI gating and cert checks.

Changes

  • Config overlays (config/overlays/odh/inferenceservice-config-patch.yaml, config/overlays/test/dsc.yaml): Default deployment mode changed from Serverless to RawDeployment; the test overlay adjusts spec.kserve managementState and adds/updates the nim block.
  • KServe constants (python/kserve/kserve/constants/constants.py): Added KSERVE_LABEL_NETWORKING_VISIBILITY and KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED constants; see the sketch after this list.
  • Tests: metadata labels (test/e2e/..., e.g. test/e2e/batcher/test_batcher.py, test/e2e/predictor/test_*, test/e2e/storagespec/*, test/e2e/logger/test_logger.py, test/e2e/*/test_mlflow.py): Added metadata.labels with KSERVE_LABEL_NETWORKING_VISIBILITY: KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED to many V1ObjectMeta InferenceService creations; minor formatting adjustments (name/namespace on separate lines).
  • Tests: skip / Knative-related (test/e2e/predictor/test_autoscaling.py, test/e2e/predictor/test_canary.py, test/e2e/storagespec/test_s3_tls_storagespec.py): Added/updated pytest skip markers for tests that depend on Knative/Serverless, with messages reflecting RawDeployment incompatibility.
  • Probe / resource retrieval (test/e2e/predictor/test_multi_container_probing.py): Replaced Knative CustomObjectsApi usage with AppsV1Api deployment retrieval; renamed get_ksvc to get_deployment; updated the readiness wait (longer timeout) and probe/container field access and log messages.
  • Client & endpoint utils (test/e2e/common/utils.py): Removed gRPC target logging and ssl_target_name_override channel args; added CI_USE_ISVC_HOST env-driven endpoint selection and network-layer (istio/istio-ingress) routing logic and logging.
  • Triton test updates (test/e2e/predictor/test_triton.py): Switched service lookup to label-based selection, added transformer image/env checks and explicit transformer port config, applied networking-visibility labels, and adjusted the skip message.
  • OpenShift CI scripts (test/scripts/openshift-ci/setup-e2e-tests.sh, test/scripts/openshift-ci/deploy.serverless.sh, test/scripts/openshift-ci/run-e2e-tests.sh): Added MARKERS_SKIP_SERVERLESS and a skip_serverless() helper; replaced raw/graph gating with the helper; conditional secret selection for ingress certs; CA cert pre-check and exports; added --force-conflicts where applicable and adjusted readiness messaging.
  • Miscellaneous tests & scripts (test/scripts/openshift-ci/*, various test/e2e/* files): Small formatting tweaks, added labels/skip decorators, and minor control-flow/validation changes (blank lines, logging tweaks, small error checks).
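For orientation, a minimal sketch of how the new label constants could be defined and applied to InferenceService metadata. The key/value follow the networking.kserve.io/visibility=exposed convention cited later in this review; the service name and namespace are illustrative:

    from kubernetes import client

    # Assumed values, matching the networking.kserve.io/visibility=exposed
    # convention mentioned in this review; verify against
    # python/kserve/kserve/constants/constants.py in this PR.
    KSERVE_LABEL_NETWORKING_VISIBILITY = "networking.kserve.io/visibility"
    KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED = "exposed"

    metadata = client.V1ObjectMeta(
        name="sklearn-example",          # illustrative
        namespace="kserve-ci-e2e-test",  # illustrative
        labels={
            KSERVE_LABEL_NETWORKING_VISIBILITY: KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED
        },
    )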

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Suite
    participant Utils as test/e2e/common/utils.py
    participant Env as Environment
    participant K8s as Kubernetes API

    Note over Test,Utils: Resolve InferenceService endpoint
    Test->>Utils: get_isvc_endpoint(isvc_name, network_layer)
    Utils->>Env: read CI_USE_ISVC_HOST
    alt CI_USE_ISVC_HOST == "1"
        Utils->>Env: read external host env vars
        Utils-->>Test: return external host endpoint
    else network_layer == "istio" or "istio-ingress"
        Utils->>K8s: get_cluster_ip()
        K8s-->>Utils: cluster IP
        Utils-->>Test: return internal cluster IP endpoint
    else
        Utils->>Utils: fallback logic (existing)
        Utils-->>Test: return fallback endpoint
    end
sequenceDiagram
    participant Test as Test Suite
    participant K8s as Kubernetes API

    Note over Test,K8s: Deployment readiness (RawDeployment)
    Test->>K8s: read_namespaced_deployment(service_name, namespace)
    K8s-->>Test: V1Deployment
    Test->>Test: wait up to 60s for status.ready_replicas > 0
    alt ready
        Test->>Test: extract containers & probes (container.liveness_probe, container.readiness_probe)
    else timeout
        Test->>Test: log "Timeout waiting for deployment to be ready"
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

  • Attention areas:
    • test/e2e/predictor/test_multi_container_probing.py — API client switch, type hints, readiness logic, probe access.
    • test/e2e/common/utils.py — endpoint selection branches and removal of ssl_target_name_override; verify env handling and logging.
    • test/scripts/openshift-ci/setup-e2e-tests.sh & deploy.serverless.sh — gating logic, secret lookup, and oc apply flags.
    • Broad review: ensure all added label usages reference the new constants and necessary imports are present across updated tests.

Poem

🐰 I hopped from Serverless clouds down to plain,
Labeled each doorway so traffic finds the lane.
Deployments wake steady, probes blink awake,
CI checks its certs and gates the test lake.
A cheerful rabbit nods — RawDeployment, great!

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 4.26%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check (❓ Inconclusive): The title references investigating and refactoring the KServe E2E test suite, which aligns with the substantial changes to test files throughout the changeset. However, it is incomplete (ends with "for…") and lacks clarity on the specific goal of adapting tests for RawDeployment mode. Complete the title and explicitly state the primary objective, e.g., "[RHOAIENG-36673] - Refactor KServe E2E test suite for RawDeployment mode", to clearly convey the main change.

✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.




@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (11)
test/scripts/openshift-ci/run-e2e-tests.sh (1)

63-67: Strengthen CA bundle check to ensure non-empty file

Using -f only verifies that /tmp/ca.crt exists; it can still be empty if extraction failed earlier. Using -s makes the guard more robust.

-# Use certify go module to get the CA certs
-if [ ! -f "/tmp/ca.crt" ]; then
+# Use certify go module to get the CA certs
+if [ ! -s "/tmp/ca.crt" ]; then
   echo "❌ Error: CA certificate file '/tmp/ca.crt' not found. Please ensure the setup script ran successfully."
   exit 1
 fi
test/e2e/predictor/test_lightgbm.py (1)

57-61: Networking visibility labels for LightGBM services look correct

Adding constants.KSERVE_LABEL_NETWORKING_VISIBILITY: constants.KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED to all LightGBM InferenceServices aligns them with the RawDeployment exposure model and keeps behavior consistent across tests.

If you touch this again, you could consider a small helper for the common V1ObjectMeta(name, namespace, labels=...) pattern to reduce duplication across predictor tests, but it’s not required for this PR.

Also applies to: 100-104, 159-163, 216-220
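A minimal sketch of such a helper, assuming the constants introduced in this PR (the function name and import path are hypothetical):

    from kubernetes import client
    from kserve.constants import constants


    def exposed_metadata(name: str, namespace: str) -> client.V1ObjectMeta:
        # Hypothetical helper: centralizes the repeated name/namespace/labels
        # pattern so the label key and value live in one place.
        return client.V1ObjectMeta(
            name=name,
            namespace=namespace,
            labels={
                constants.KSERVE_LABEL_NETWORKING_VISIBILITY: (
                    constants.KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED
                )
            },
        )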

config/overlays/odh/inferenceservice-config-patch.yaml (1)

80-80: Default deployment mode switch to RawDeployment matches the new test strategy

Updating "defaultDeploymentMode" to "RawDeployment" in the ODH overlay is coherent with the rest of this PR (tests and setup moving away from Knative). Just be aware this affects any consumers of this overlay beyond CI; if they still need serverless defaults, they’ll now have to override explicitly.

test/e2e/predictor/test_canary.py (1)

36-36: Consider conditional skip instead of permanently disabling canary tests

Unconditionally skipping both canary tests will disable them even in environments that do have Knative Serving. If these tests are still valuable upstream, consider a skipif based on deployment mode or a dedicated pytest marker so they can still run in serverless setups.

Please confirm whether these tests are intended to be disabled globally or only for RawDeployment; if the latter, wiring them to your existing serverless/raw gating would be safer.

Also applies to: 101-101
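If the conditional route is taken, one possible shape, assuming the deployment mode is advertised through an environment variable (the variable name is an assumption, not something this PR defines):

    import os

    import pytest

    # Hypothetical gating: skip only when the suite targets RawDeployment,
    # so the same tests still run in a Serverless/Knative CI lane.
    skip_on_raw_deployment = pytest.mark.skipif(
        os.environ.get("KSERVE_DEPLOYMENT_MODE", "RawDeployment") == "RawDeployment",
        reason="Canary traffic splitting requires Knative (Serverless mode)",
    )


    @skip_on_raw_deployment
    def test_canary_rollout():
        ...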

test/e2e/predictor/test_sklearn.py (2)

325-325: Use of protocol_version="grpc-v2" matches the GRPC v2 enum value

Explicitly setting protocol_version="grpc-v2" on the GRPC sklearn runtime looks correct and aligns with PredictorProtocol.GRPC_V2. As a minor improvement, you could consider referencing constants.PredictorProtocol.GRPC_V2.value to avoid string drift.

Please verify against your deployed runtime/docs that "grpc-v2" is still the expected protocol identifier.
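For example, assuming the SDK enum carries the same value (a sketch, not the test's current code; the storage URI is illustrative):

    from kserve import V1beta1ModelFormat, V1beta1ModelSpec
    from kserve.constants import constants

    model = V1beta1ModelSpec(
        model_format=V1beta1ModelFormat(name="sklearn"),
        # Enum reference instead of the literal "grpc-v2" string, so the
        # test tracks the SDK if the identifier ever changes.
        protocol_version=constants.PredictorProtocol.GRPC_V2.value,
        storage_uri="gs://kfserving-examples/models/sklearn/1.0/model",
    )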


446-447: Optional: align mixed GRPC test metadata with new visibility labeling

test_sklearn_v2_mixed_grpc remains skipped but its InferenceService metadata still lacks the new networking visibility label, unlike the other sklearn tests. For future readiness and consistency, you may want to add the same label here as well.

If you plan to re-enable this GRPC test later, consider updating its metadata now to avoid RawDeployment routing issues when that happens.

test/e2e/common/utils.py (1)

311-321: CI_USE_ISVC_HOST override is straightforward and preserves existing behavior

Using CI_USE_ISVC_HOST == "1" to short-circuit to the external host while keeping the network_layer-based paths unchanged for istio/gateway modes is clear and low risk. The additional logging around env, host, and chosen cluster_ip should help a lot when debugging CI routing issues.
If you foresee non-"1" truthy values (e.g. "true"), you might later generalize the check, but it’s not required now.
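A condensed sketch of the selection order as described above (names approximate the walkthrough; this is not a verbatim copy of utils.py):

    import logging
    import os

    logger = logging.getLogger(__name__)


    def resolve_endpoint(isvc_host, network_layer, get_cluster_ip):
        # CI override: "1" short-circuits to the external host from status.url.
        use_isvc_host = os.environ.get("CI_USE_ISVC_HOST", "0")
        logger.info("CI_USE_ISVC_HOST=%s network_layer=%s", use_isvc_host, network_layer)
        if use_isvc_host == "1":
            return isvc_host
        if network_layer in ("istio", "istio-ingress"):
            cluster_ip = get_cluster_ip()
            logger.info("Using internal cluster IP: %s", cluster_ip)
            return cluster_ip
        # Other network layers keep their existing resolution paths.
        return isvc_host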

test/e2e/predictor/test_triton.py (2)

71-76: Label-based service discovery for debugging is a good improvement

Switching the failure diagnostics to list_namespaced_service with serving.kserve.io/inferenceservice=<service_name> and iterating services is more robust than relying on a single hard-coded Service name, especially across modes.
You might eventually replace print(svc) with logger calls for consistency, but the current change is already a net improvement.

Also applies to: 161-167
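A sketch of that lookup with the kubernetes client (namespace illustrative; the label key is the standard KServe ownership label):

    from kubernetes import client, config

    config.load_kube_config()
    core_v1 = client.CoreV1Api()

    service_name = "isvc-triton"  # illustrative
    services = core_v1.list_namespaced_service(
        namespace="kserve-ci-e2e-test",  # illustrative
        label_selector=f"serving.kserve.io/inferenceservice={service_name}",
    )
    for svc in services.items:
        # Log selected fields rather than the whole object to limit noise.
        print(svc.metadata.name, svc.spec.cluster_ip, svc.spec.ports)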


98-99: Transformer test image/env wiring is clearer; consider using pytest semantics for missing env

The updated skip reason, explicit IMAGE_TRANSFORMER_IMG_TAG check, and wiring transformer_image into the container (with a named 8080/http1 port) make the transformer test configuration much clearer and CI‑friendly.
As a minor refinement, when this test is re‑enabled you might convert the missing env case to pytest.skip(...) instead of raising ValueError, so it’s reported as an environment precondition rather than a test failure.

Also applies to: 116-122, 127-130
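If that refinement is adopted, a sketch of the guard inside the test (the surrounding test body is elided):

    import os

    import pytest


    def test_triton_transformer():
        transformer_image = os.environ.get("IMAGE_TRANSFORMER_IMG_TAG")
        if not transformer_image:
            # Report a missing CI variable as a precondition, not a failure.
            pytest.skip("IMAGE_TRANSFORMER_IMG_TAG is not set")
        # ... build the transformer container from transformer_image ...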

test/e2e/predictor/test_multi_container_probing.py (2)

133-135: RawDeployment readiness and probe verification logic looks solid

  • Adding the networking visibility label to the InferenceService metadata aligns this test with the rest of the suite.
  • Using AppsV1Api + TimeoutSampler to wait until deployment.status.ready_replicas > 0 before reading the final Deployment and inspecting spec.template.spec.containers is a good adaptation for RawDeployment.
  • Selecting kserve-container and kserve-agent by container.name and asserting non‑None liveness/readiness probes on both containers matches the intent of the test.

You could reuse the deployment from the sampler instead of doing a second get_deployment call, but that’s a micro‑optimization and not necessary.

Also applies to: 145-176
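For reference, an equivalent readiness wait written against the kubernetes client alone; the test itself uses TimeoutSampler, and the -predictor suffix follows the RawDeployment naming convention noted elsewhere in this review (namespace and timeouts are illustrative):

    import time

    from kubernetes import client, config

    config.load_kube_config()
    apps_v1 = client.AppsV1Api()
    NAMESPACE = "kserve-ci-e2e-test"  # illustrative


    def get_deployment(service_name):
        # RawDeployment names the predictor Deployment "<isvc>-predictor".
        return apps_v1.read_namespaced_deployment(
            name=service_name + "-predictor", namespace=NAMESPACE
        )


    def wait_ready(service_name, timeout=60, sleep=2):
        deadline = time.time() + timeout
        while time.time() < deadline:
            deployment = get_deployment(service_name)
            if (deployment.status.ready_replicas or 0) > 0:
                return deployment
            time.sleep(sleep)
        raise TimeoutError(f"{service_name}-predictor not ready after {timeout}s")


    deployment = wait_ready("multi-container-probing")
    containers = deployment.spec.template.spec.containers
    probes = {c.name: (c.liveness_probe, c.readiness_probe) for c in containers}
    for name in ("kserve-container", "kserve-agent"):
        liveness, readiness = probes[name]
        assert liveness is not None and readiness is not None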


179-179: Prefer logger.exception here to capture the stack trace

In the timeout handler, switching from logger.error("Timeout waiting for deployment to be ready") to logger.exception(...) would automatically include the traceback for the TimeoutExpiredError, which can be very helpful when debugging CI flakiness, especially since you immediately re‑raise.

-    except TimeoutExpiredError as e:
-        logger.error("Timeout waiting for deployment to be ready")
+    except TimeoutExpiredError as e:
+        logger.exception("Timeout waiting for deployment to be ready")
         raise e
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 788bfca and 6c67f00.

📒 Files selected for processing (21)
  • config/overlays/odh/inferenceservice-config-patch.yaml (1 hunks)
  • config/overlays/test/dsc.yaml (1 hunks)
  • python/kserve/kserve/constants/constants.py (1 hunks)
  • test/e2e/batcher/test_batcher.py (1 hunks)
  • test/e2e/batcher/test_batcher_custom_port.py (1 hunks)
  • test/e2e/common/utils.py (2 hunks)
  • test/e2e/logger/test_logger.py (3 hunks)
  • test/e2e/predictor/test_autoscaling.py (5 hunks)
  • test/e2e/predictor/test_canary.py (4 hunks)
  • test/e2e/predictor/test_lightgbm.py (4 hunks)
  • test/e2e/predictor/test_multi_container_probing.py (3 hunks)
  • test/e2e/predictor/test_paddle.py (3 hunks)
  • test/e2e/predictor/test_pmml.py (3 hunks)
  • test/e2e/predictor/test_sklearn.py (8 hunks)
  • test/e2e/predictor/test_tensorflow.py (2 hunks)
  • test/e2e/predictor/test_triton.py (7 hunks)
  • test/e2e/predictor/test_xgboost.py (7 hunks)
  • test/e2e/storagespec/test_s3_storagespec.py (1 hunks)
  • test/scripts/openshift-ci/deploy.serverless.sh (1 hunks)
  • test/scripts/openshift-ci/run-e2e-tests.sh (1 hunks)
  • test/scripts/openshift-ci/setup-e2e-tests.sh (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
test/e2e/predictor/test_paddle.py (1)
python/kserve/kserve/models/v1beta1_inference_service.py (2)
  • metadata (136-143)
  • metadata (146-154)
test/e2e/common/utils.py (1)
test/e2e/conftest.py (1)
  • network_layer (80-81)
test/e2e/predictor/test_triton.py (1)
python/kserve/kserve/api/kserve_client.py (1)
  • get (166-223)
🪛 Ruff (0.14.5)
test/e2e/predictor/test_multi_container_probing.py

179-179: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: build (3.9)
  • GitHub Check: build (3.10)
  • GitHub Check: test
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.11)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: precommit-check
  • GitHub Check: test
🔇 Additional comments (18)
config/overlays/test/dsc.yaml (1)

22-23: Adding NIM component with managementState: Removed aligns with test simplification.

Disabling the NIM component is consistent with the PR objective to focus e2e tests on RawDeployment mode only, reducing deployment complexity.

test/scripts/openshift-ci/deploy.serverless.sh (1)

175-181: Ingress TLS secret selection logic is more robust

Preferring default-ingress-cert when available and cleanly falling back to router-certs-default makes this script safer across different OpenShift configurations. The control flow under set -eu looks correct.

test/scripts/openshift-ci/setup-e2e-tests.sh (1)

28-40: Serverless gating + apply semantics are cleaner and align with RawDeployment focus

  • MARKERS_SKIP_SERVERLESS plus skip_serverless() centralize the “raw/graph/predictor don’t need serverless” logic and remove the scattered regex checks. The regex match preserves the old substring semantics while making future extensions trivial.
  • Using skip_serverless consistently for:
    • deciding whether to install OSSM/Serverless,
    • choosing raw vs serverless inferenceservice-config patches,
    • installing authorino/gateways,
    • wiring the test namespace into ServiceMesh,
    • and adding the Knative passthrough annotation
      keeps the flow much easier to reason about.
  • Switching the main KServe install to oc apply --server-side=true --force-conflicts -f - and adding the explicit odh-model-controller readiness log/oc wait improve robustness for CI runs where resources may already exist or be reconciling.

This all looks consistent with the move to RawDeployment-by-default while still keeping serverless paths available where explicitly desired.

Also applies to: 110-116, 129-129, 136-146, 152-158, 164-165, 227-239, 251-253

test/e2e/predictor/test_paddle.py (1)

55-61: Paddle InferenceServices correctly use networking visibility labels

The added labels blocks using constants.KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED on the three Paddle services are consistent with the rest of the suite and should ensure they remain reachable under RawDeployment.

Also applies to: 109-115, 164-170

test/e2e/storagespec/test_s3_storagespec.py (1)

56-60: S3 storagespec test now aligned with visibility labeling

Adding constants.KSERVE_LABEL_NETWORKING_VISIBILITY: constants.KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED to this InferenceService metadata keeps its exposure model consistent with other RawDeployment tests.

test/e2e/predictor/test_autoscaling.py (1)

53-54: Knative autoscaling tests are clearly marked as skipped

Marking the Knative-specific autoscaling tests as skipped with explicit reasons clarifies that coverage has intentionally shifted away from Knative, while keeping the test code around for potential future reuse. The small metadata formatting tweaks are neutral.

Also applies to: 72-73, 99-100, 119-120, 144-144

test/e2e/predictor/test_tensorflow.py (1)

51-55: Networking visibility labels look consistent with the new constants

Both TensorFlow InferenceServices correctly apply KSERVE_LABEL_NETWORKING_VISIBILITY: KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED and should behave consistently with other updated tests.

Please confirm the label is honored by the RawDeployment routing setup in your cluster when you run the e2e suite.

Also applies to: 97-101

test/e2e/logger/test_logger.py (2)

56-62: Exposed visibility labels on logger and sink services are consistent

Both the logger ISVC and the message-dumper ISVC now carry the exposed networking visibility label, which aligns with the rest of the RawDeployment-focused changes.

Please verify during e2e runs that the labeled services are reachable as expected under your RawDeployment networking configuration.

Also applies to: 89-93


74-74: Confirm -predictor service DNS name in RawDeployment

Pointing the logger to http://{msg_dumper}-predictor.<ns>.svc.cluster.local makes sense if RawDeployment exposes a *-predictor Service; just ensure this matches the actual Service name generated by the controller in your environment.

If possible, confirm via kubectl get svc -n <ns> that the sink service is indeed named {msg_dumper}-predictor.

test/e2e/batcher/test_batcher_custom_port.py (1)

59-63: Batcher custom-port service correctly marked as exposed

The added networking visibility label on this custom-port batcher service is consistent with other tests and should help RawDeployment routing.

Please confirm the test passes in a RawDeployment cluster and that the service is reachable on the custom port through your configured ingress path.

test/e2e/predictor/test_canary.py (1)

56-58: Metadata formatting change is benign

Splitting name and namespace over separate lines in V1ObjectMeta is purely stylistic and keeps semantics unchanged.

No action needed beyond ensuring the tests remain discoverable under your updated test selection logic.

Also applies to: 124-126

test/e2e/predictor/test_xgboost.py (1)

56-60: XGBoost tests consistently apply exposed networking label

Each XGBoost InferenceService now uses the shared KSERVE_LABEL_NETWORKING_VISIBILITY/KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED constants, which keeps the tests aligned with RawDeployment exposure requirements.

Please validate that all these XGBoost tests can reach their services successfully in a RawDeployment cluster with your current ingress/gateway setup.

Also applies to: 105-109, 160-164, 208-212, 260-264, 309-313, 362-366

python/kserve/kserve/constants/constants.py (1)

40-42: New networking visibility constants are straightforward; just confirm they match controller usage

Defining KSERVE_LABEL_NETWORKING_VISIBILITY and KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED here centralizes the label semantics nicely and lets tests avoid hard-coded strings.

Please double-check that the KServe/ODH controllers and manifests are also using networking.kserve.io/visibility=exposed so tests and controllers stay in sync.

test/e2e/predictor/test_sklearn.py (1)

62-66: Sklearn InferenceServices correctly tagged with networking visibility

All updated sklearn tests now label their InferenceServices as exposed via the new constants, matching the RawDeployment pattern across the suite.

Please confirm end-to-end that these services are reachable in your RawDeployment environment under the paths your test utilities expect.

Also applies to: 108-112, 157-161, 228-232, 277-281, 344-348, 395-399

test/e2e/batcher/test_batcher.py (1)

55-59: Batcher service exposure label matches the common pattern

Marking the sklearn batcher InferenceService as exposed via the shared label constants is consistent with the rest of the suite.

Please ensure this test passes under RawDeployment and the batch endpoint is accessible using your configured ingress.

test/e2e/predictor/test_pmml.py (1)

54-58: Networking visibility labels look consistent across PMML tests

Adding KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED to the three InferenceServices is consistent with the RawDeployment routing changes and keeps the tests aligned with the rest of the suite. No issues from my side here.

Also applies to: 107-111, 157-161

test/e2e/predictor/test_triton.py (1)

53-57: InferenceService labels align with the new networking visibility convention

Both Triton InferenceServices now include the KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED label, which matches the pattern used elsewhere in the suite and should keep routing consistent under RawDeployment. No further changes needed here.

Also applies to: 142-147

test/e2e/predictor/test_multi_container_probing.py (1)

49-54: get_deployment helper cleanly encapsulates the RawDeployment naming convention

Centralizing the service_name + "-predictor" lookup (scoped to KSERVE_TEST_NAMESPACE) in get_deployment keeps the main test body readable and makes it easy to adjust if the naming scheme ever changes. No issues here.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/scripts/openshift-ci/setup-e2e-tests.sh (1)

90-90: Use centralized helper for consistency.

Line 90 still uses inline regex check if [[ "$1" =~ raw ]]; then instead of the new skip_serverless() helper. This inconsistency could lead to maintenance issues if deployment type patterns need to change later. Refactor to use the centralized function for consistency.

Apply this diff to use the helper:

-# Add CA certificate extraction for raw deployments
-if [[ "$1" =~ raw ]]; then
+# Add CA certificate extraction for raw deployments
+if skip_serverless "$1"; then
🧹 Nitpick comments (5)
test/scripts/openshift-ci/setup-e2e-tests.sh (1)

28-40: Solid refactor: Centralized serverless-skipping logic.

The new skip_serverless() helper and MARKERS_SKIP_SERVERLESS array reduce duplication and make future deployment type changes easier to maintain. Note: Line 35 uses regex pattern matching (=~), which works for simple strings but could be fragile if patterns contain regex metacharacters. Consider using exact string matching (==) instead if the intent is to match literal deployment type names.

test/e2e/predictor/test_lightgbm.py (1)

57-62: Networking visibility label looks correct; consider deduplicating pattern.

Using constants.KSERVE_LABEL_NETWORKING_VISIBILITY with KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED in metadata.labels is a good, explicit way to mark this InferenceService as externally exposed, and it aligns with the rest of the PR.

Since the same label dict is repeated in other tests in this file, you could optionally factor it into a small helper (or a shared DEFAULT_LABELS_EXPOSED constant) to avoid drift if the label key/value changes again. Not blocking.

test/e2e/predictor/test_canary.py (1)

36-36: Canary tests permanently skipped in RawDeployment suite

Marking both canary tests as skipped with a clear reason is consistent with RawDeployment mode lacking Knative-based traffic splitting and avoids noisy failures in this job. Longer term, you may want to gate these via a marker or config (e.g., @pytest.mark.serverless_only plus environment selection) instead of a hard skip so the same tests can still be exercised in a Knative/Serverless CI lane, but that’s an optional follow-up outside this PR’s scope.

Also applies to: 101-101

test/e2e/predictor/test_triton.py (1)

116-122: Environment variable guard for transformer image is clear, but consider skip vs failure

Requiring IMAGE_TRANSFORMER_IMG_TAG and raising ValueError gives a loud signal when CI is misconfigured. If you expect some environments to run the suite without this var, you might instead pytest.skip with this message; otherwise this is fine for enforcing CI configuration.

test/e2e/predictor/test_multi_container_probing.py (1)

177-180: Prefer logger.exception and consider always cleaning up the InferenceService

Two small improvements here:

  1. To capture the traceback when the TimeoutExpiredError is raised, prefer logger.exception over logger.error in the except block (this also satisfies the Ruff TRY400 hint):
-    except TimeoutExpiredError as e:
-        logger.error("Timeout waiting for deployment to be ready")
-        raise e
+    except TimeoutExpiredError as e:
+        logger.exception("Timeout waiting for deployment to be ready")
+        raise e
  2. Optional: if you want to ensure the InferenceService is deleted even when the deployment never becomes ready, you can move the kserve_client.delete call into a finally block:
-    try:
-        for deployment in TimeoutSampler(
-            wait_timeout=60,
-            sleep=2,
-            func=lambda: get_deployment(k8s_client, service_name),
-        ):
-            # Wait for Deployment to be ready
-            if deployment.status.ready_replicas and deployment.status.ready_replicas > 0:
-                break
-
-        # Get latest deployment state after ready condition is met
-        ready_deployment = get_deployment(k8s_client, service_name)
-        containers = ready_deployment.spec.template.spec.containers
-        ...
-
-        kserve_client.delete(service_name, KSERVE_TEST_NAMESPACE)
-    except TimeoutExpiredError as e:
-        logger.error("Timeout waiting for deployment to be ready")
-        raise e
+    try:
+        for deployment in TimeoutSampler(
+            wait_timeout=60,
+            sleep=2,
+            func=lambda: get_deployment(k8s_client, service_name),
+        ):
+            # Wait for Deployment to be ready
+            if deployment.status.ready_replicas and deployment.status.ready_replicas > 0:
+                break
+
+        # Get latest deployment state after ready condition is met
+        ready_deployment = get_deployment(k8s_client, service_name)
+        containers = ready_deployment.spec.template.spec.containers
+        ...
+    except TimeoutExpiredError as e:
+        logger.exception("Timeout waiting for deployment to be ready")
+        raise e
+    finally:
+        kserve_client.delete(service_name, KSERVE_TEST_NAMESPACE)

(Only apply the finally change if you’re comfortable always cleaning up the resource on failure; if you rely on leaked resources for post‑mortem debugging, you may want to keep the current behavior.)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6c67f00 and 9412a0e.

📒 Files selected for processing (22)
  • config/overlays/odh/inferenceservice-config-patch.yaml (1 hunks)
  • config/overlays/test/dsc.yaml (1 hunks)
  • python/kserve/kserve/constants/constants.py (1 hunks)
  • test/e2e/batcher/test_batcher.py (1 hunks)
  • test/e2e/batcher/test_batcher_custom_port.py (1 hunks)
  • test/e2e/common/utils.py (2 hunks)
  • test/e2e/logger/test_logger.py (3 hunks)
  • test/e2e/predictor/test_autoscaling.py (5 hunks)
  • test/e2e/predictor/test_canary.py (4 hunks)
  • test/e2e/predictor/test_lightgbm.py (4 hunks)
  • test/e2e/predictor/test_multi_container_probing.py (3 hunks)
  • test/e2e/predictor/test_paddle.py (3 hunks)
  • test/e2e/predictor/test_pmml.py (3 hunks)
  • test/e2e/predictor/test_sklearn.py (8 hunks)
  • test/e2e/predictor/test_tensorflow.py (2 hunks)
  • test/e2e/predictor/test_triton.py (7 hunks)
  • test/e2e/predictor/test_xgboost.py (7 hunks)
  • test/e2e/storagespec/test_s3_storagespec.py (1 hunks)
  • test/e2e/storagespec/test_s3_tls_storagespec.py (3 hunks)
  • test/scripts/openshift-ci/deploy.serverless.sh (1 hunks)
  • test/scripts/openshift-ci/run-e2e-tests.sh (1 hunks)
  • test/scripts/openshift-ci/setup-e2e-tests.sh (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
  • test/scripts/openshift-ci/deploy.serverless.sh
  • test/e2e/predictor/test_xgboost.py
  • test/e2e/logger/test_logger.py
  • config/overlays/odh/inferenceservice-config-patch.yaml
  • test/e2e/predictor/test_pmml.py
  • test/e2e/common/utils.py
  • config/overlays/test/dsc.yaml
  • test/scripts/openshift-ci/run-e2e-tests.sh
  • python/kserve/kserve/constants/constants.py
  • test/e2e/storagespec/test_s3_storagespec.py
  • test/e2e/predictor/test_autoscaling.py
  • test/e2e/predictor/test_tensorflow.py
  • test/e2e/predictor/test_paddle.py
🧰 Additional context used
🧬 Code graph analysis (1)
test/e2e/predictor/test_sklearn.py (2)
python/kserve/kserve/protocol/infer_type.py (3)
  • name (147-153)
  • name (420-427)
  • name (948-954)
python/kserve/kserve/models/v1beta1_explainer_spec.py (2)
  • labels (857-865)
  • labels (868-877)
🪛 Ruff (0.14.5)
test/e2e/predictor/test_multi_container_probing.py

179-179: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: test
  • GitHub Check: Test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: precommit-check
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: build (3.12)
  • GitHub Check: test
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.9)
🔇 Additional comments (31)
test/scripts/openshift-ci/setup-e2e-tests.sh (4)

89-108: Good: Defensive CA certificate extraction with validation.

The CA bundle extraction includes proper validation (lines 103-107) and continues gracefully if extraction fails. This supports raw deployments without blocking the setup. The use of || true combined with downstream validation is a solid pattern here.


129-129: Appropriate: Server-side apply with conflict handling.

The addition of --server-side=true --force-conflicts flags to oc apply is appropriate for managing conflicts and reducing server load in CI environments.


111-111: Consistent use of new helper across script.

All conditional blocks that check for serverless deployment modes now consistently use the centralized skip_serverless() function, improving maintainability.

Also applies to: 136-136, 152-152, 227-227, 251-251


164-164: Improved observability: Added informative log message.

The new echo statement provides users with context on what the script is waiting for and an expected timeout, improving the debugging experience for CI runs.

test/e2e/predictor/test_lightgbm.py (3)

100-105: Consistent use of networking visibility label for runtime-based LightGBM test.

Same pattern as above; label addition is correct and consistent with the RawDeployment-focused changes.


159-164: Label addition for v2 MLServer runtime is consistent and correct.

metadata.labels mirrors the earlier tests and should ensure this v2 MLServer-based service is exposed in the same way.


216-221: Label addition for v2 KServe LightGBM test is correct.

This keeps networking visibility behavior aligned across all HTTP-based LightGBM tests in this file.

test/e2e/predictor/test_canary.py (1)

56-58: Metadata formatting change is non-functional

The client.V1ObjectMeta initializations still pass the same name and namespace keyword arguments; splitting them across lines is purely stylistic and has no behavioral impact.

Also applies to: 124-126

test/e2e/storagespec/test_s3_tls_storagespec.py (3)

115-121: Networking visibility label for global-custom-cert positive case looks correct

Adding labels on V1ObjectMeta with KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED is consistent with exposing this InferenceService in RawDeployment and should not affect existing behavior.


196-202: Label addition for custom-cert positive/negative test pair is consistent

Same networking-visibility label pattern as other tests; matches the RawDeployment exposure model and keeps reuse of the isvc object safe.


261-267: Serving-cert storagespec test correctly adopts networking visibility label

Metadata update to include the visibility label mirrors the other TLS tests and should help keep routing behavior consistent in RawDeployment.

test/e2e/batcher/test_batcher.py (1)

55-59: Batcher InferenceService now correctly labeled for networking visibility

Applying KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED at metadata level aligns this test with the RawDeployment model and the rest of the suite.

test/e2e/batcher/test_batcher_custom_port.py (1)

59-63: Custom-port batcher test adopts networking visibility label correctly

Labeling this InferenceService as exposed keeps behavior consistent with other RawDeployment tests; no issues with the custom port configuration.

test/e2e/predictor/test_sklearn.py (8)

62-66: sklearn HTTP test metadata labeling is consistent and safe

Adding the networking visibility label here matches the broader test suite changes and should help expose the service correctly in RawDeployment.


108-112: v2 HTTP (mlserver) test picks up the correct visibility label

The metadata label for networking visibility is applied in the same way as other tests; no issues spotted.


157-161: Runtime-based sklearn test correctly labeled for exposure

The label addition on test_sklearn_runtime_kserve keeps runtime-backed services aligned with the RawDeployment exposure pattern.


228-232: v2 runtime (mlserver) HTTP test uses the shared visibility label

Consistent use of KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED on metadata here is appropriate and matches the other v2 tests.


277-281: v2 sklearn HTTP test adopts networking visibility label as expected

Labeling this InferenceService as exposed is in line with the rest of the suite and should help keep routing behavior uniform.


325-325: Use of protocol_version="grpc-v2" in ModelSpec

Assuming the deployed sklearn server supports the "grpc-v2" protocol identifier, this is the right place to declare it and should make the gRPC v2 test configuration clearer.

Please double-check against the KServe / runtime docs that "grpc-v2" is the expected protocol_version string for this model server.


344-348: gRPC v2 sklearn test metadata labeled for networking visibility

Even though this test is currently skipped, adding the label now ensures it will behave consistently once re-enabled.


394-398: Mixed-type v2 sklearn HTTP test correctly adds networking visibility label

This brings the mixed-type test in line with the other sklearn predictors from a routing/visibility standpoint.

test/e2e/predictor/test_triton.py (6)

53-57: Base Triton test metadata now labeled for networking visibility

Applying the visibility label here is consistent with other predictors and should help RawDeployment routing without side effects.


71-76: Switch to listing Services by label improves failure diagnostics in RawDeployment

Using list_namespaced_service with the serving.kserve.io/inferenceservice label is a good fit for RawDeployment and avoids assuming a specific Service name; the loop printing each Service is appropriate for debugging.


98-98: Skip reason updated with concrete reactivation condition

Tying the skip reason to a specific PR makes it clearer when this test should be revisited and re-enabled.


127-130: Transformer container now parameterized by env-provided image and explicit HTTP port

Using transformer_image from the environment and adding an explicit 8080 HTTP port (name="http1") makes the transformer configuration more flexible and clearer; no issues spotted.


142-146: Transformer Triton test metadata labeled for networking visibility

The added label matches the pattern used elsewhere and should ensure this composite InferenceService is discoverable in RawDeployment.


161-167: Failure logging updated to list Services by label for transformer test

Mirroring the base Triton test, listing Services via the inferenceservice label improves diagnostics without depending on Knative-specific resources.

test/e2e/predictor/test_multi_container_probing.py (4)

49-54: Deployment helper matches RawDeployment usage

The get_deployment helper correctly targets the predictor Deployment via AppsV1Api.read_namespaced_deployment in KSERVE_TEST_NAMESPACE, which is consistent with the RawDeployment test setup.


133-135: Networking visibility label aligns with exposed RawDeployment behavior

Adding constants.KSERVE_LABEL_NETWORKING_VISIBILITY: constants.KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED on the InferenceService metadata is appropriate for ensuring the service is exposed in RawDeployment mode and matches the broader PR direction.


145-160: Deployment readiness wait via ready_replicas is reasonable

Using TimeoutSampler with a 60s timeout and checking deployment.status.ready_replicas > 0 is a sensible adaptation from Knative readiness to Deployment readiness for this single‑replica test. Re‑fetching the Deployment into ready_deployment before inspecting containers is also a clean pattern.


162-166: Container selection and probe assertions look correct

Selecting kserve-container and kserve-agent from ready_deployment.spec.template.spec.containers and asserting non‑None liveness_probe and readiness_probe on both containers directly validates that the probes are attached as intended.

Also applies to: 168-175

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
config/overlays/test/dsc.yaml (1)

20-20: Configuration inconsistency persists: verify RawDeployment mode is set.

The past review flagged that defaultDeploymentMode: Serverless (line 20) is incompatible with serving.managementState: Removed (line 28), as Serverless mode requires Knative Serving to be enabled. The comment indicates this was addressed in earlier commits, but the code still shows Serverless. Verify that the correct configuration is applied or that an overlay/patch properly sets defaultDeploymentMode to RawDeployment.

Run this script to verify the effective configuration:

#!/bin/bash
# Check if RawDeployment is set in any overlay or patch files
echo "=== Searching for defaultDeploymentMode settings ==="
rg -n "defaultDeploymentMode" config/overlays/test/ -A 1 -B 1

echo -e "\n=== Checking for kserve config patches ==="
fd -e yaml -e yml . config/overlays/test/ --exec cat {} \; --exec echo "---" \;
🧹 Nitpick comments (4)
test/e2e/predictor/test_multi_container_probing.py (1)

179-179: Optional: use logger.exception for exception logging.

Static analysis suggests using logger.exception instead of logger.error within exception handlers, as it automatically includes traceback information.

Apply this diff:

-        logger.error("Timeout waiting for deployment to be ready")
+        logger.exception("Timeout waiting for deployment to be ready")
test/e2e/common/utils.py (1)

311-321: Env-based CI_USE_ISVC_HOST override and logging look reasonable.

The new logic cleanly lets CI force use of the external host from status.url, with clear logging of the env var, host, and network layer, and sensible fallbacks for istio / istio-ingress. For parity, you might optionally add similar “Using … cluster IP” logs for the envoy-gatewayapi and istio-gatewayapi branches or document that CI_USE_ISVC_HOST takes precedence over network_layer in all modes.

test/e2e/storagespec/test_s3_tls_storagespec.py (1)

73-76: Temporary skips for TLS S3 tests are clearly annotated.

Marking these TLS S3 tests as skipped with a direct reference to RHOAIENG-39707 is a reasonable way to unblock RawDeployment work. Just ensure that ticket is tracking re‑enabling them so TLS coverage doesn’t remain disabled indefinitely.

Also applies to: 155-158, 237-240

test/e2e/predictor/test_sklearn.py (1)

442-449: Preemptively aligning test_sklearn_v2_mixed_grpc metadata with new labeling.

Even though this gRPC mixed test remains skipped, its InferenceService metadata still lacks the networking-visibility label that other tests now set. For future re‑enablement, you might consider adding the same label here to avoid surprises later.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9412a0e and dc2074c.

📒 Files selected for processing (23)
  • config/overlays/odh/inferenceservice-config-patch.yaml (1 hunks)
  • config/overlays/test/dsc.yaml (1 hunks)
  • python/kserve/kserve/constants/constants.py (1 hunks)
  • test/e2e/batcher/test_batcher.py (1 hunks)
  • test/e2e/batcher/test_batcher_custom_port.py (1 hunks)
  • test/e2e/common/utils.py (2 hunks)
  • test/e2e/logger/test_logger.py (3 hunks)
  • test/e2e/predictor/test_autoscaling.py (5 hunks)
  • test/e2e/predictor/test_canary.py (4 hunks)
  • test/e2e/predictor/test_lightgbm.py (4 hunks)
  • test/e2e/predictor/test_mlflow.py (1 hunks)
  • test/e2e/predictor/test_multi_container_probing.py (3 hunks)
  • test/e2e/predictor/test_paddle.py (3 hunks)
  • test/e2e/predictor/test_pmml.py (3 hunks)
  • test/e2e/predictor/test_sklearn.py (8 hunks)
  • test/e2e/predictor/test_tensorflow.py (2 hunks)
  • test/e2e/predictor/test_triton.py (7 hunks)
  • test/e2e/predictor/test_xgboost.py (7 hunks)
  • test/e2e/storagespec/test_s3_storagespec.py (1 hunks)
  • test/e2e/storagespec/test_s3_tls_storagespec.py (6 hunks)
  • test/scripts/openshift-ci/deploy.serverless.sh (1 hunks)
  • test/scripts/openshift-ci/run-e2e-tests.sh (2 hunks)
  • test/scripts/openshift-ci/setup-e2e-tests.sh (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
  • test/scripts/openshift-ci/deploy.serverless.sh
  • test/scripts/openshift-ci/setup-e2e-tests.sh
  • test/e2e/predictor/test_autoscaling.py
  • test/e2e/predictor/test_pmml.py
  • test/e2e/batcher/test_batcher.py
  • test/scripts/openshift-ci/run-e2e-tests.sh
  • test/e2e/predictor/test_lightgbm.py
  • test/e2e/predictor/test_tensorflow.py
  • config/overlays/odh/inferenceservice-config-patch.yaml
  • test/e2e/predictor/test_canary.py
🧰 Additional context used
🧬 Code graph analysis (3)
test/e2e/predictor/test_paddle.py (1)
python/kserve/kserve/models/v1beta1_inference_service.py (2)
  • metadata (136-143)
  • metadata (146-154)
test/e2e/common/utils.py (1)
test/e2e/conftest.py (1)
  • network_layer (80-81)
test/e2e/predictor/test_triton.py (1)
python/kserve/kserve/api/kserve_client.py (1)
  • get (166-223)
🪛 Ruff (0.14.5)
test/e2e/predictor/test_multi_container_probing.py

179-179: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.9)
  • GitHub Check: build (3.10)
  • GitHub Check: Test
  • GitHub Check: build (3.12)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: precommit-check
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
🔇 Additional comments (21)
test/e2e/predictor/test_xgboost.py (1)

56-60: LGTM — consistent networking visibility labeling.

All InferenceService instances now include the networking visibility label, properly using the constants defined in constants.py. The formatting is consistent across all test functions.

Also applies to: 105-109, 160-164, 208-212, 260-264, 309-313, 362-366

python/kserve/kserve/constants/constants.py (1)

40-42: LGTM — well-defined networking visibility constants.

The new label constants follow KServe naming conventions and provide a clean way to manage networking visibility across the test suite.

test/e2e/predictor/test_mlflow.py (1)

64-68: LGTM — consistent with test suite pattern.

Networking visibility label properly added to InferenceService metadata.

test/e2e/predictor/test_paddle.py (2)

55-61: LGTM — networking visibility labels added consistently.

Three Paddle test scenarios properly include the networking visibility label in InferenceService metadata.

Also applies to: 109-115, 164-170


227-227: I'm experiencing technical difficulties accessing the repository to verify your concern. The repository cloning is failing, which prevents me from:

  1. Examining the full test_paddle_v2_grpc function and its metadata configuration
  2. Comparing with test_xgboost_v2_grpc and other skipped GRPC tests in similar files
  3. Confirming whether KSERVE_LABEL_NETWORKING_VISIBILITY labels are present or consistently applied
  4. Determining if the omission is intentional or an oversight

Skipped test missing networking visibility labels: the test_paddle_v2_grpc function at line 227 doesn't include the networking visibility labels, unlike other tests in this file. Other skipped tests in the PR (e.g., test_xgboost_v2_grpc in test_xgboost.py) still have the labels added. Is this omission intentional?

test/e2e/predictor/test_triton.py (3)

53-57: LGTM — networking visibility labels properly added.

Both Triton test scenarios now include the networking visibility label in InferenceService metadata.

Also applies to: 142-146


116-121: LGTM — defensive environment variable validation.

The pre-flight check for IMAGE_TRANSFORMER_IMG_TAG provides clear, actionable error messages if the required environment variable is missing.


71-76: LGTM — error handling adapted for label-based service discovery.

Switching from direct service lookups to label-based listing aligns with the test suite's move to RawDeployment mode.

Also applies to: 161-166

test/e2e/predictor/test_multi_container_probing.py (3)

49-54: LGTM — properly refactored for RawDeployment mode.

The function correctly switches from reading Knative Services (CustomObjectsApi) to reading Kubernetes Deployments (AppsV1Api), aligning with the PR objective.


148-175: LGTM — Deployment readiness check and probe verification updated correctly.

The wait logic properly checks deployment.status.ready_replicas, and container/probe access now uses object attributes (e.g., container.name, container.liveness_probe) instead of dictionary indexing, matching the Kubernetes Python client API.


133-135: LGTM — networking visibility label added.

The InferenceService metadata now includes the networking visibility label, consistent with other tests.

test/e2e/batcher/test_batcher_custom_port.py (1)

59-63: LGTM — consistent labeling.

Networking visibility label properly added to the batcher test InferenceService metadata.

test/e2e/storagespec/test_s3_storagespec.py (1)

56-60: Networking visibility label on S3 storagespec ISVC is consistent.

Setting KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED on the InferenceService metadata aligns with the rest of the suite and should help ensure consistent routing in RawDeployment mode.

test/e2e/common/utils.py (1)

271-277: Minor whitespace tweak in predict_grpc only.

This hunk only adds spacing around the model_name check and grpc_client call; it has no behavioral impact.

test/e2e/logger/test_logger.py (2)

56-62: Networking visibility labels on logger ISVCs are aligned with the suite.

Applying KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED to both the message-dumper and logger InferenceServices matches the new pattern elsewhere and should keep routing behavior consistent under RawDeployment.

Also applies to: 89-93


72-75: Please confirm logger sink URL matches the actual RawDeployment Service name.

Switching the logger URL to http://{msg_dumper}-predictor.{namespace}.svc.cluster.local looks like the right direction for RawDeployment, but service naming sometimes includes extra suffixes (e.g., -default). It’s worth double-checking that this hostname matches the created Service so that log events are not silently dropped.
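One way to double-check from a scratch script (the ISVC name and namespace are illustrative):

    from kubernetes import client, config

    config.load_kube_config()
    core_v1 = client.CoreV1Api()

    msg_dumper = "message-dumper"     # illustrative ISVC name
    namespace = "kserve-ci-e2e-test"  # illustrative

    names = [s.metadata.name for s in core_v1.list_namespaced_service(namespace).items]
    expected = f"{msg_dumper}-predictor"
    assert expected in names, f"{expected} not found among services: {names}"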

test/e2e/storagespec/test_s3_tls_storagespec.py (2)

117-121: Networking visibility labels on TLS S3 ISVCs are consistent.

Adding KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED to the TLS S3 InferenceServices brings them in line with other tests and should help keep routing behavior uniform.

Also applies to: 199-203, 265-269


283-287: Secret patch for invalid TLS data connection is correctly base64-encoded.

Using b64encode(invalid_data_connection.encode()).decode() when patching the storage-config Secret matches Kubernetes’ expectation that data values are base64 strings; this looks correct.
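A sketch of that patch pattern (the secret key, namespace, and payload are assumptions; the base64 encoding of the value is the point here):

    from base64 import b64encode

    from kubernetes import client, config

    config.load_kube_config()
    core_v1 = client.CoreV1Api()

    invalid_data_connection = '{"type": "s3", "cabundle": "bogus"}'  # illustrative

    # Secret .data values must be base64-encoded strings.
    core_v1.patch_namespaced_secret(
        name="storage-config",
        namespace="kserve-ci-e2e-test",  # illustrative
        body={
            "data": {
                # Key name is hypothetical; patch only the entry under test.
                "localMinIO": b64encode(invalid_data_connection.encode()).decode()
            }
        },
    )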

test/e2e/predictor/test_sklearn.py (3)

62-66: Consistent networking visibility labeling across sklearn tests.

All updated sklearn InferenceServices now set KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED in metadata.labels, which is consistent with the rest of the e2e suite and should help keep endpoint exposure behavior uniform in RawDeployment mode.

Also applies to: 108-112, 157-161, 228-232, 277-281, 344-348, 394-398


323-337: gRPC v2 protocol_version flag in test_sklearn_v2_grpc.

Setting protocol_version="grpc-v2" on V1beta1ModelSpec matches the intent of this test name and payload. Since this test is still skipped, impact is low, but please verify that "grpc-v2" is exactly the value expected by the serving runtime for gRPC v2.


371-417: Unskipping test_sklearn_v2_mixed—ensure environment is ready.

With the skip removed, this mixed‑type sklearn v2 test will now run and assert on a specific numeric output for the 1.3/mixedtype model. Make sure the target clusters have that model artifact available and that the RawDeployment networking changes don’t introduce flakiness here.

… RawDeployment mode only

chore:  Fix tests.
        Adapt the setup and run e2e tests to work with Raw Deployment
        Update failing tests to work with Raw Deployment mode

Signed-off-by: Spolti <[email protected]>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (6)
test/e2e/storagespec/test_s3_tls_storagespec.py (4)

155-158: Duplicate: same temporary skip pattern as first S3 TLS test

This skip marker follows the same RHOAIENG-39707 pattern as the earlier test; treating it the same way (temporary infra/behavior issue) makes sense.


198-203: Duplicate: same metadata visibility/namespace pattern as first ISVC

This InferenceService metadata uses the same explicit namespace and networking-visibility label as the earlier S3 TLS test, which is consistent and appropriate for RawDeployment.


237-240: Duplicate: third S3 TLS test is also cleanly skipped with the same tracking reference

Same skip approach and rationale as the other two TLS tests; keeping them aligned under the same tracking ticket is good.


264-269: Duplicate: consistent networking-visibility labeling across all S3 TLS ISVCs

This final InferenceService’s metadata matches the visibility/namespace pattern used above, which keeps these tests uniform and easier to reason about.

test/e2e/predictor/test_triton.py (2)

141-146: Duplicate: transformer ISVC metadata uses the same namespace/visibility convention

The transformer scenario’s InferenceService metadata mirrors the namespace and visibility labeling used in the plain Triton test, which keeps behavior consistent.


160-166: Duplicate: service listing on transformer test failures follows same debug pattern

The failure-path logging for services in the transformer test matches the pattern from test_triton, giving you consistent diagnostics across both tests.

🧹 Nitpick comments (4)
test/e2e/common/utils.py (1)

311-320: Environment-driven endpoint selection looks good.

The CI_USE_ISVC_HOST environment variable provides flexibility for RawDeployment mode tests, and the debug logging will help diagnose CI issues.

Minor observation: the envoy-gatewayapi and istio-gatewayapi branches (lines 321-330) lack the corresponding logger.info(f"Using internal cluster IP: {cluster_ip}") statement. Consider adding similar logging there for consistency if those paths are exercised in your CI.

     elif network_layer == "envoy-gatewayapi":
         cluster_ip = get_cluster_ip(
             namespace="envoy-gateway-system",
             labels={"serving.kserve.io/gateway": "kserve-ingress-gateway"},
         )
+        logger.info(f"Using envoy-gatewayapi cluster IP: {cluster_ip}")
     elif network_layer == "istio-gatewayapi":
         cluster_ip = get_cluster_ip(
             namespace="kserve",
             labels={"serving.kserve.io/gateway": "kserve-ingress-gateway"},
         )
+        logger.info(f"Using istio-gatewayapi cluster IP: {cluster_ip}")
     else:
         raise ValueError(f"Unknown network layer {network_layer}")
test/scripts/openshift-ci/setup-e2e-tests.sh (1)

129-129: --force-conflicts may mask legitimate resource conflicts.

Adding --force-conflicts allows overwriting fields managed by other controllers. While this can resolve CI flakiness, it may hide configuration drift or conflicts that should be investigated.

Consider documenting why this flag is needed (e.g., in a comment) to help future maintainers understand when it's safe to remove.

test/e2e/predictor/test_multi_container_probing.py (1)

178-180: Use logging.exception instead of logging.error to capture the traceback.

Per static analysis hint, logging.exception automatically includes exception info which aids debugging.

     except TimeoutExpiredError as e:
-        logger.error("Timeout waiting for deployment to be ready")
+        logger.exception("Timeout waiting for deployment to be ready")
         raise e
test/e2e/predictor/test_triton.py (1)

70-76: Extra service listing on failure improves debuggability

Listing services by the inferenceservice label and printing them in the wait_isvc_ready failure path should make it much easier to debug routing or selector issues when this test flakes. This is a reasonable trade-off since it only runs on failure.

If the printed objects become too noisy, consider logging only selected fields (name, labels, ports) instead of the full object, but this is non-blocking.
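
A compact variant could look like this (a sketch; the namespace, ISVC name, and module-level logger are assumptions):

    import logging

    from kubernetes import client, config

    logging.basicConfig()
    logger = logging.getLogger(__name__)

    config.load_kube_config()
    svcs = client.CoreV1Api().list_namespaced_service(
        namespace="kserve-ci-e2e-test",
        label_selector="serving.kserve.io/inferenceservice=isvc-triton",
    )
    for svc in svcs.items:
        # Log only the fields that matter for routing/selector debugging.
        ports = [(p.name, p.port) for p in (svc.spec.ports or [])]
        logger.error("service=%s labels=%s ports=%s", svc.metadata.name, svc.metadata.labels, ports)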

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc2074c and ddcdf3a.

📒 Files selected for processing (23)
  • config/overlays/odh/inferenceservice-config-patch.yaml (1 hunks)
  • config/overlays/test/dsc.yaml (1 hunks)
  • python/kserve/kserve/constants/constants.py (1 hunks)
  • test/e2e/batcher/test_batcher.py (1 hunks)
  • test/e2e/batcher/test_batcher_custom_port.py (1 hunks)
  • test/e2e/common/utils.py (2 hunks)
  • test/e2e/logger/test_logger.py (3 hunks)
  • test/e2e/predictor/test_autoscaling.py (5 hunks)
  • test/e2e/predictor/test_canary.py (4 hunks)
  • test/e2e/predictor/test_lightgbm.py (4 hunks)
  • test/e2e/predictor/test_mlflow.py (1 hunks)
  • test/e2e/predictor/test_multi_container_probing.py (3 hunks)
  • test/e2e/predictor/test_paddle.py (3 hunks)
  • test/e2e/predictor/test_pmml.py (3 hunks)
  • test/e2e/predictor/test_sklearn.py (8 hunks)
  • test/e2e/predictor/test_tensorflow.py (2 hunks)
  • test/e2e/predictor/test_triton.py (7 hunks)
  • test/e2e/predictor/test_xgboost.py (7 hunks)
  • test/e2e/storagespec/test_s3_storagespec.py (1 hunks)
  • test/e2e/storagespec/test_s3_tls_storagespec.py (6 hunks)
  • test/scripts/openshift-ci/deploy.serverless.sh (1 hunks)
  • test/scripts/openshift-ci/run-e2e-tests.sh (2 hunks)
  • test/scripts/openshift-ci/setup-e2e-tests.sh (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
  • test/e2e/predictor/test_tensorflow.py
  • test/e2e/logger/test_logger.py
  • python/kserve/kserve/constants/constants.py
  • test/e2e/predictor/test_mlflow.py
  • test/scripts/openshift-ci/deploy.serverless.sh
  • config/overlays/test/dsc.yaml
  • test/e2e/predictor/test_xgboost.py
  • test/e2e/batcher/test_batcher_custom_port.py
  • test/e2e/batcher/test_batcher.py
  • config/overlays/odh/inferenceservice-config-patch.yaml
  • test/e2e/predictor/test_sklearn.py
  • test/e2e/predictor/test_canary.py
  • test/e2e/storagespec/test_s3_storagespec.py
🧰 Additional context used
🧬 Code graph analysis (3)
test/e2e/predictor/test_paddle.py (1)
python/kserve/kserve/models/v1beta1_inference_service.py (2)
  • metadata (136-143)
  • metadata (146-154)
test/e2e/predictor/test_triton.py (1)
python/kserve/kserve/api/kserve_client.py (1)
  • get (166-223)
test/e2e/common/utils.py (1)
test/e2e/conftest.py (1)
  • network_layer (80-81)
🪛 Ruff (0.14.6)
test/e2e/predictor/test_multi_container_probing.py

179-179: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test
  • GitHub Check: precommit-check
  • GitHub Check: test
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.9)
  • GitHub Check: build (3.10)
  • GitHub Check: test
  • GitHub Check: build (3.12)
  • GitHub Check: test
🔇 Additional comments (14)
test/e2e/common/utils.py (1)

275-275: Formatting-only change.

No behavioral impact from the blank line addition.

test/e2e/predictor/test_autoscaling.py (1)

53-53: LGTM - Skip markers appropriately disable Knative-dependent tests.

The skip markers correctly identify that these tests rely on Knative-specific autoscaling annotations (autoscaling.knative.dev/target, autoscaling.knative.dev/metric) which are not applicable in RawDeployment mode.

Also applies to: 99-99, 144-144
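
Concretely, these tests drive scaling through annotations like the following, which only the Knative autoscaler acts on (values are illustrative):

    annotations = {
        "autoscaling.knative.dev/target": "1",
        "autoscaling.knative.dev/metric": "concurrency",
    }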

test/e2e/predictor/test_multi_container_probing.py (1)

49-54: LGTM - Function correctly refactored for RawDeployment mode.

The transition from CustomObjectsApi (Knative services) to AppsV1Api (Kubernetes Deployments) aligns with the RawDeployment mode requirements.
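
The new read path looks roughly like this (a sketch; the Deployment name assumes the usual "<isvc-name>-predictor" convention):

    from kubernetes import client, config

    config.load_kube_config()
    apps_v1 = client.AppsV1Api()

    # RawDeployment: read the predictor's Deployment directly instead of a
    # Knative Service via CustomObjectsApi.
    deployment = apps_v1.read_namespaced_deployment(
        name="multi-container-probing-predictor",  # hypothetical name
        namespace="kserve-ci-e2e-test",
    )
    print(deployment.status.ready_replicas)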

test/scripts/openshift-ci/run-e2e-tests.sh (1)

64-67: LGTM - Good defensive check for CA certificate.

The pre-check ensures the setup script completed successfully before proceeding with tests, providing a clear error message if the CA certificate is missing.

test/e2e/predictor/test_paddle.py (1)

55-61: LGTM - Networking visibility labels consistently applied.

The labels are correctly added using the constants KSERVE_LABEL_NETWORKING_VISIBILITY and KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED across all active test cases.

Also applies to: 109-115, 164-170

test/e2e/predictor/test_lightgbm.py (1)

57-61: LGTM - Networking visibility labels correctly applied to all active tests.

The labeling pattern is consistent with other test files in this PR.

Also applies to: 100-104, 159-163, 216-220

test/e2e/predictor/test_pmml.py (1)

54-58: LGTM - Networking visibility labels correctly applied.

The pattern follows the established convention from other test files in this PR.

Also applies to: 107-111, 157-161

test/scripts/openshift-ci/setup-e2e-tests.sh (1)

28-40: This needs to be flagged for manual verification, since I cannot access the repository directly to confirm the actual test patterns and intent. That said, the code logic supports the following assessment:

The original review comment raises a valid concern that requires confirmation from maintainers.

The code logic demonstrates the issue:

  • The skip_serverless() function at lines 28-40 uses regex matching: if [[ "$deployment_type" =~ $type ]]
  • Including "predictor" in MARKERS_SKIP_SERVERLESS means ANY deployment type containing "predictor" will skip serverless installation
  • Since pytest markers can be combined (e.g., @pytest.mark.predictor @pytest.mark.raw), this regex will match all predictor variants
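
A Python stand-in makes the pitfall concrete (the script itself uses bash's [[ "$deployment_type" =~ $type ]]; this only illustrates the unanchored match):

    import re

    MARKERS_SKIP_SERVERLESS = ["raw", "graph", "predictor"]

    def skips_serverless(deployment_type: str) -> bool:
        # Unanchored search: any deployment type merely containing one of
        # the listed words matches, mirroring the bash regex test.
        return any(re.search(t, deployment_type) for t in MARKERS_SKIP_SERVERLESS)

    assert skips_serverless("predictor")
    assert skips_serverless("predictor-grpc")  # also skipped, intended or not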

The concern is valid because:

  1. "predictor" is a broad category that subsumes other test types
  2. The distinction between "raw", "graph", and "predictor" suggests these are separate concerns
  3. The regex matching approach may inadvertently skip serverless for predictor tests that should require it

However, without access to the actual test files, git history, and test execution requirements, I cannot definitively determine if this is:

  • Intentional (all predictor tests should skip serverless)
  • A mistake (some predictor tests need serverless)
  • Redundant (the "raw", "graph" markers already cover the necessary cases)

Maintainers should confirm whether the intent is to skip serverless for all predictor tests, or whether some predictor variants still need serverless mode. Consider verifying against the test specifications or confirming the intended test execution matrix.

test/e2e/storagespec/test_s3_tls_storagespec.py (2)

73-76: Skip marker for broken S3 TLS test is fine and clearly scoped

Tying the skip reason to RHOAIENG-39707 makes it clear this is temporary and why the test is disabled. No change needed here as long as the ticket tracks re-enabling the test later.


116-121: Explicit namespace + networking-visibility label align this ISVC with RawDeployment routing

Setting namespace=KSERVE_TEST_NAMESPACE and adding the KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED label on the InferenceService metadata matches the RawDeployment exposure pattern used elsewhere in the suite and should help ensure the service is reachable in the test namespace.

Please re-run this e2e test once RawDeployment wiring is in place to confirm the controller is honoring the visibility label as expected.

test/e2e/predictor/test_triton.py (4)

52-57: InferenceService metadata now correctly pins namespace and exposes networking visibility

Adding namespace=KSERVE_TEST_NAMESPACE and the KSERVE_LABEL_NETWORKING_VISIBILITY_EXPOSED label directly on the Triton InferenceService is a good fit for RawDeployment and brings this test in line with the rest of the suite.

Please confirm in CI runs that the resulting Service/Route wiring behaves as expected with this label (i.e., no regressions in how the Triton endpoint is reached).


95-99: Skip marker for transformer test is clearly tied to external controller PR

Marking the transformer test as skipped until the referenced odh-model-controller PR is merged is a clear and maintainable way to document why it isn’t running yet.


116-121: Env-var guard for transformer image makes configuration failures explicit

Checking IMAGE_TRANSFORMER_IMG_TAG up front and raising a ValueError with a clear message prevents the test from silently deploying an invalid or empty image, which is useful for catching CI misconfigurations.

If you’d rather treat a missing image tag as “test not applicable” instead of an infra failure, you could switch this to pytest.skip(error_msg); otherwise, this hard failure behavior is fine as-is.
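
The guard amounts to something like this (a sketch; the message wording is illustrative):

    import os

    transformer_image = os.environ.get("IMAGE_TRANSFORMER_IMG_TAG")
    if not transformer_image:
        error_msg = "IMAGE_TRANSFORMER_IMG_TAG is not set or empty"
        raise ValueError(error_msg)
        # Softer alternative, as suggested above:
        #   import pytest; pytest.skip(error_msg)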


127-130: Transformer container now correctly uses env-derived image and exposes HTTP port 8080

Wiring image=transformer_image and explicitly defining port 8080 with name http1 matches the usual RawDeployment assumptions for the user container’s HTTP endpoint and should work well with the routing layer.

Once the skip is removed, please verify in a test run that the transformer Pod comes up with the expected image and port configuration (e.g., by inspecting the created Deployment/Pod).
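
For reference, the resulting container wiring is roughly (a sketch; the container name is an assumption):

    import os

    from kubernetes.client import V1Container, V1ContainerPort

    transformer_image = os.environ["IMAGE_TRANSFORMER_IMG_TAG"]
    container = V1Container(
        name="kserve-container",  # hypothetical container name
        image=transformer_image,
        ports=[V1ContainerPort(container_port=8080, name="http1", protocol="TCP")],
    )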

test/scripts/openshift-ci/run-e2e-tests.sh (excerpt):

export GITHUB_SHA=master

if [ "$BUILD_KSERVE_IMAGES" = "true" ]; then
    echo "asd"

⚠️ Potential issue | 🟡 Minor

Remove debug output echo "asd".

This appears to be leftover debug output that should be removed before merging.

   if [ "$BUILD_KSERVE_IMAGES" = "true" ]; then
-    echo "asd"
     pushd $PROJECT_ROOT >/dev/null
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (the block with the debug line removed):

    if [ "$BUILD_KSERVE_IMAGES" = "true" ]; then
        pushd $PROJECT_ROOT >/dev/null
🤖 Prompt for AI Agents
In test/scripts/openshift-ci/run-e2e-tests.sh around line 40, remove the
leftover debug output echo "asd" from the script; simply delete that line so the
script contains only intended production/test commands and no stray debugging
prints.

@spolti
Copy link
Member Author

spolti commented Nov 27, 2025

/test e2e-graph

@Jooho Jooho left a comment

/lgtm

@openshift-ci

openshift-ci bot commented Nov 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jooho, spolti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit f0eaa31 into opendatahub-io:master Nov 28, 2025
32 checks passed
@github-project-automation github-project-automation bot moved this from New/Backlog to Done in ODH Model Serving Planning Nov 28, 2025

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants