Refactor the mechanism for specifying the container images

Currently, our deployment scripts and configurations primarily allow overriding the image tag (e.g., via `EPP_TAG`), while usually assuming a hardcoded base image path (e.g., `ghcr.io/llm-d/llm-d-inference-scheduler`). This limitation creates significant friction and potential for errors in various development, testing, and CI/CD scenarios.
Note: While this document focuses on the `llm-d-inference-scheduler` (EPP) image for simplicity, the same problems and proposed solutions apply equally to the configuration of the `llm-d-inference-sim` and `llm-d-routing-sidecar` images.

### Problem Statement
The inability to easily override the full image path (including the registry and repository) with a single environment variable introduces several concrete problems:

**Creates Misleading Configuration Options and Confusion:**
The Makefile allows developers to override the base registry via the `IMAGE_REGISTRY` variable, suggesting flexibility. However, this override is ignored by downstream components like the _run_e2e.sh_ script and the Go e2e tests, which contain hardcoded base paths pointing to _ghcr.io_. This discrepancy creates confusion, as developers expect their override to work but find the system still uses the default registry, leading to failed tests and wasted debugging time trying to understand why their configuration isn't being respected.

**Creates Configuration Ambiguity and Potential for Errors:**
Allowing users to set the image via three different variables (`EPP_IMAGE`, `EPP_IMAGE_BASE`, `EPP_TAG`) without a documented order of precedence invites conflicting settings. A user might, for example, provide a full `EPP_IMAGE` together with a different `EPP_TAG`; the result then depends on which variable the script happens to prioritize and on how it parses the combined value. This increases the likelihood of configuration errors and makes debugging more difficult.
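One way to remove the ambiguity is to state a single precedence rule and enforce it in one place. The sketch below is illustrative only: `resolve_epp_image` is a hypothetical helper, not something that exists in the repository, and the `latest` default tag is an assumption. It lets a full `EPP_IMAGE` win outright and warns when a conflicting `EPP_TAG` is also set.

```shell
# Hypothetical helper (not in the repository): resolve the EPP image
# using one explicit precedence rule.
#   1. EPP_IMAGE (a full image reference) wins when set and non-empty.
#   2. Otherwise combine EPP_IMAGE_BASE and EPP_TAG, each with a default.
# The "latest" default tag is an assumption made for this sketch.
resolve_epp_image() {
  if [ -n "${EPP_IMAGE:-}" ]; then
    # Surface the conflict instead of silently ignoring EPP_TAG.
    if [ -n "${EPP_TAG:-}" ]; then
      echo "warning: EPP_IMAGE is set; ignoring EPP_TAG=${EPP_TAG}" >&2
    fi
    printf '%s\n' "${EPP_IMAGE}"
    return 0
  fi
  printf '%s:%s\n' \
    "${EPP_IMAGE_BASE:-ghcr.io/llm-d/llm-d-inference-scheduler}" \
    "${EPP_TAG:-latest}"
}
```

With this rule, `EPP_IMAGE=quay.io/username/llm-d-inference-scheduler:my-fix` is used verbatim, while setting only `EPP_TAG=my-fix` keeps the default base and changes the tag.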

**Hinders Developer Testing and Forks:**
Developers frequently build images from feature branches or forks and push them to personal registries (e.g., `quay.io/username/llm-d-inference-scheduler:my-fix`). The current setup makes it difficult or impossible to deploy and test these custom images in local (e.g., Kind) or shared development environments without modifying the core deployment scripts.

**Impedes CI/CD Integration:**
Promotion workflows often involve moving images between different registries (e.g., dev -> staging -> prod registries). The hardcoded base path prevents seamless configuration across these environments.

**Increases Script Complexity and Brittleness:**
To work around the limitation, our current test scripts contain logic to parse full image URLs (if provided) into separate `_BASE` and `_TAG` variables. This adds unnecessary complexity, introduces potential parsing errors, and makes the scripts harder to maintain.
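To illustrate why this kind of parsing is brittle, the sketch below contrasts a naive split on the first `:` with a split that only treats a `:` in the final path component as the tag separator. The function names are hypothetical and are not taken from the project's scripts. The naive version breaks on registries that include a port, such as a local registry at `localhost:5000`.

```shell
# Hypothetical helpers for illustration; not the project's actual parsing code.

# Naive split: text before the first ':' is the base, the rest is the tag.
naive_base() { printf '%s\n' "${1%%:*}"; }
naive_tag()  { printf '%s\n' "${1#*:}"; }

# Safer split: only a ':' in the final path component separates the tag,
# so a registry port such as "localhost:5000" is left untouched.
# Digest references ("repo@sha256:...") are not handled by this sketch.
split_base() {
  case "${1##*/}" in
    *:*) printf '%s\n' "${1%:*}" ;;  # strip the trailing ":tag"
    *)   printf '%s\n' "$1" ;;       # no tag present
  esac
}

img="localhost:5000/llm-d-inference-scheduler:dev"
naive_base "$img"   # prints "localhost"
naive_tag  "$img"   # prints "5000/llm-d-inference-scheduler:dev"
split_base "$img"   # prints "localhost:5000/llm-d-inference-scheduler"
```

Accepting a single full image reference and passing it through unparsed avoids this class of bug entirely.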

### Example
**Example 1: The Current Problem - Testing a Forked Image**
_Scenario:_ 
A developer builds a custom image from their feature branch (a fork of the main repository) and pushes it to their personal Quay.io repository. The image is available at `quay.io/developer-username/llm-d-inference-scheduler:my-feature-fix`.

_Attempted Configuration:_ 
To test this image, the developer runs the make command, overriding the default registry and tag:
```
# Point the build at a personal registry and a custom tag, then run the e2e suite.
export IMAGE_REGISTRY=quay.io/developer-username
export EPP_TAG=my-feature-fix
make test-e2e
```

_Current Behavior (Failure):_ 
The Makefile correctly sets the `IMG` variable to `quay.io/developer-username/llm-d-inference-scheduler:my-feature-fix`. However, the _e2e_suite_test.go_ code does not read this combined `IMG` variable. It reads only `EPP_TAG` (`my-feature-fix`) and ignores the `IMAGE_REGISTRY` override. It then constructs the wrong image path by combining the hardcoded base `ghcr.io/llm-d/llm-d-inference-scheduler` with the overridden tag.
- _Source:_ https://github.com/llm-d/llm-d-inference-scheduler/blob/main/test/e2e/e2e_suite_test.go#L65 and https://github.com/llm-d/llm-d-inference-scheduler/blob/main/test/e2e/e2e_suite_test.go#L123 

The _run_e2e.sh_ script receives the `EPP_TAG=my-feature-fix` environment variable but has no knowledge of the `IMAGE_REGISTRY` override, so its own `EPP_TAG` resolves to `my-feature-fix`. It then attempts to pull an image using a hardcoded base path combined with the overridden tag: `docker pull ghcr.io/llm-d/llm-d-inference-scheduler:my-feature-fix`. This pull likely fails or pulls an unrelated image, but the script might continue.

- _Source:_ https://github.com/llm-d/llm-d-inference-scheduler/blob/main/test/scripts/run_e2e.sh#L23

_Result:_ 
The test suite attempts to load and use `ghcr.io/llm-d/llm-d-inference-scheduler:my-feature-fix` instead of the intended image `quay.io/developer-username/llm-d-inference-scheduler:my-feature-fix`. The test run therefore either uses the wrong image or fails because the `ghcr.io` image with that tag doesn't exist or is incompatible. Developers are forced into complex workarounds such as manually re-tagging images or modifying test code.
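The mismatch can be reproduced in isolation. The snippet below is a simplified model of the behavior described in this example, not the actual Makefile, script, or test code; the variable names `makefile_img`, `e2e_img`, and `HARDCODED_BASE` are made up for illustration.

```shell
# Simplified model of the mismatch; not the actual Makefile or test code.
IMAGE_REGISTRY="quay.io/developer-username"
EPP_TAG="my-feature-fix"

# What the Makefile computes for IMG, per the description above.
makefile_img="${IMAGE_REGISTRY}/llm-d-inference-scheduler:${EPP_TAG}"

# What the e2e script and Go tests effectively compute:
# a hardcoded base combined with the overridden tag.
HARDCODED_BASE="ghcr.io/llm-d/llm-d-inference-scheduler"
e2e_img="${HARDCODED_BASE}:${EPP_TAG}"

if [ "$makefile_img" != "$e2e_img" ]; then
  echo "mismatch: built ${makefile_img}, tests use ${e2e_img}"
fi
```

A pre-flight comparison along these lines, run before the test suite starts, would turn the silent mismatch into an immediate, readable error.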

