Refactor the mechanism for specifying the container images #408

@hdefazio

Description

Currently, our deployment scripts and configurations primarily allow overriding the image tag (e.g., via EPP_TAG), while usually assuming a hardcoded base image path (e.g., ghcr.io/llm-d/llm-d-inference-scheduler). This limitation creates significant friction and potential for errors in various development, testing, and CI/CD scenarios.
Note: While this document focuses on the llm-d-inference-scheduler (EPP) image for simplicity, the same problems and proposed solutions apply equally to the configuration of the llm-d-inference-sim and llm-d-routing-sidecar images.

Problem Statement

The inability to easily override the full image path (including the registry and repository) with a single environment variable introduces several concrete problems:

Creates Misleading Configuration Options and Confusion:
The Makefile allows developers to override the base registry via the IMAGE_REGISTRY variable, suggesting flexibility. However, this override is ignored by downstream components such as the run_e2e.sh script and the Go e2e tests, which contain hardcoded base paths pointing to ghcr.io. Developers expect their override to work but find that the system still uses the default registry. The result is failed tests and wasted debugging time spent working out why the configuration isn't being respected.

Creates Configuration Ambiguity and Potential for Errors:
Allowing users to set the image via three different variables (EPP_IMAGE, EPP_IMAGE_BASE, EPP_TAG) without a clear order of precedence creates confusion. Users might set conflicting values (e.g., providing a full EPP_IMAGE and a different EPP_TAG), leading to unpredictable behavior that depends on which variable the script happens to prioritize and how it parses the values. This increases the likelihood of configuration errors and makes debugging more difficult.
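
One way to remove this ambiguity would be a single, documented precedence rule. The following is a minimal sketch, not existing project code: the helper name resolve_epp_image, the rule that a full EPP_IMAGE wins, and the "latest" default tag are all assumptions made for illustration.

```shell
# Hypothetical helper: resolve the EPP image with one explicit precedence rule.
#   1. EPP_IMAGE, if set, is used verbatim.
#   2. Otherwise EPP_IMAGE_BASE and EPP_TAG are joined, each falling back to a default.
resolve_epp_image() {
  if [ -n "${EPP_IMAGE:-}" ]; then
    # Warn instead of silently ignoring a conflicting setting.
    if [ -n "${EPP_IMAGE_BASE:-}" ] || [ -n "${EPP_TAG:-}" ]; then
      echo "warning: EPP_IMAGE is set; ignoring EPP_IMAGE_BASE/EPP_TAG" >&2
    fi
    printf '%s\n' "$EPP_IMAGE"
    return 0
  fi
  # Default base is the path named in this issue; the "latest" tag is an assumption.
  printf '%s:%s\n' \
    "${EPP_IMAGE_BASE:-ghcr.io/llm-d/llm-d-inference-scheduler}" \
    "${EPP_TAG:-latest}"
}
```

With a rule like this, setting both EPP_IMAGE and EPP_TAG produces a visible warning rather than behavior that depends on parsing order.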

Hinders Developer Testing and Forks:
Developers frequently build images from feature branches or forks and push them to personal registries (e.g., quay.io/username/llm-d-inference-scheduler:my-fix). The current setup makes it difficult or impossible to deploy and test these custom images in local (e.g., Kind) or shared development environments without modifying the core deployment scripts.

Impedes CI/CD Integration:
Promotion workflows often involve moving images between different registries (e.g., dev -> staging -> prod registries). The hardcoded base path prevents seamless configuration across these environments.

Increases Script Complexity and Brittleness:
To work around the limitation, our current test scripts contain logic to parse full image URLs (if provided) into separate _BASE and _TAG variables. This adds unnecessary complexity, introduces potential for parsing errors, and makes the scripts harder to maintain.
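
To illustrate why this kind of parsing is brittle, here is a minimal sketch of a naive split at the last colon. It is not the logic from our scripts; the function name split_image and the localhost:5000 registry are hypothetical examples.

```shell
# Hypothetical illustration: split an image reference at its last colon.
split_image() {
  image="$1"
  IMAGE_BASE="${image%:*}"   # everything before the last ':'
  IMAGE_TAG="${image##*:}"   # everything after the last ':'
}

# Works for a typical reference with a tag.
split_image "quay.io/username/llm-d-inference-scheduler:my-fix"
echo "$IMAGE_BASE | $IMAGE_TAG"
# prints: quay.io/username/llm-d-inference-scheduler | my-fix

# Breaks for a registry with a port and no tag: the port's colon is taken as the separator.
split_image "localhost:5000/llm-d-inference-scheduler"
echo "$IMAGE_BASE | $IMAGE_TAG"
# prints: localhost | 5000/llm-d-inference-scheduler
```

Passing the full image reference through unchanged avoids this class of error entirely.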

Example

Example 1: The Current Problem - Testing a Forked Image
Scenario:
A developer builds a custom image from a feature branch in their fork of the main repository and pushes it to their personal Quay.io repository. The image is available at quay.io/developer-username/llm-d-inference-scheduler:my-feature-fix.

Attempted Configuration:
To test this image, the developer runs the make command, overriding the default registry and tag:

export IMAGE_REGISTRY=quay.io/developer-username
export EPP_TAG=my-feature-fix
make test-e2e

Current Behavior (Failure):
The Makefile correctly sets the IMG variable to quay.io/developer-username/llm-d-inference-scheduler:my-feature-fix. However, the e2e_suite_test.go code does not read this combined IMG variable. It reads only the EPP_TAG (my-feature-fix) and ignores the IMAGE_REGISTRY override. It then incorrectly constructs the image path by combining the hardcoded base ghcr.io/llm-d/llm-d-inference-scheduler with the overridden tag.

The run_e2e.sh script receives the EPP_TAG=my-feature-fix environment variable but has no knowledge of the IMAGE_REGISTRY override. Because EPP_TAG is already set, the script keeps my-feature-fix instead of falling back to its default tag. It then attempts to pull an image using a hardcoded base path combined with the overridden tag: docker pull ghcr.io/llm-d/llm-d-inference-scheduler:my-feature-fix. This pull likely fails or pulls an unrelated image, but the script might continue.

Result:
The test suite attempts to load and use ghcr.io/llm-d/llm-d-inference-scheduler:my-feature-fix instead of the intended image quay.io/developer-username/llm-d-inference-scheduler:my-feature-fix. The test run either uses the wrong image or fails because a ghcr.io image with that tag doesn't exist or is incompatible. Developers are then forced into complex workarounds such as manually re-tagging images or modifying test code.
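
The mismatch in this example can be reproduced in isolation. The sketch below only mimics the pattern described above; it is not the actual content of the Makefile or run_e2e.sh, and the variable names MAKEFILE_IMG and SCRIPT_IMG are hypothetical.

```shell
# Overrides from the example scenario.
IMAGE_REGISTRY=quay.io/developer-username
EPP_TAG=my-feature-fix

# What a component that honors the registry override computes.
MAKEFILE_IMG="${IMAGE_REGISTRY}/llm-d-inference-scheduler:${EPP_TAG}"

# What a component with a hardcoded base computes (registry override ignored).
SCRIPT_IMG="ghcr.io/llm-d/llm-d-inference-scheduler:${EPP_TAG}"

echo "Makefile: $MAKEFILE_IMG"
echo "Script:   $SCRIPT_IMG"

# The two components disagree about which image is under test.
if [ "$MAKEFILE_IMG" != "$SCRIPT_IMG" ]; then
  echo "mismatch: components would use different images" >&2
fi
```

If every component read one full image reference instead of assembling its own, this comparison could not diverge.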
