feat: add support for podman + bump version for llm-d and all images#857
feat: add support for podman + bump version for llm-d and all images#857zdtsw wants to merge 11 commits intollm-d:mainfrom
Conversation
|
cc @shuynh2017 |
There was a problem hiding this comment.
Pull request overview
Adds configurability to deployment scripts to better support local development environments (notably Podman) and to centralize llm-d version/image selection so deployments stay consistent across components.
Changes:
- Introduces
LLM_D_RELEASE-driven defaults for llm-d component images indeploy/install.shand updates the default release. - Adds
CONTAINER_TOOLsupport (docker vs podman) for Kind-related image handling and container exec usage. - Updates docs/ignore files to reflect the new workflow and local repo cloning behavior.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| deploy/kubernetes/create-kind-cluster-with-nvidia.sh | Parameterizes container exec calls via CONTAINER_TOOL. |
| deploy/kind-emulator/install.sh | Adds CONTAINER_TOOL and a podman-specific Kind image loading path. |
| deploy/install.sh | Bumps default llm-d release, centralizes image defaults, and changes llm-d repo clone behavior. |
| deploy/README.md | Documents Podman support and centralized LLM_D_RELEASE configuration. |
| Makefile | Passes CONTAINER_TOOL through to deploy script in e2e infra target. |
| .gitignore | Removes outdated ignore entries for old llm-d directory names. |
| .dockerignore | Excludes llm-d/ from Docker build context. |
|
|
||
| if [ ! -d "$LLM_D_PROJECT" ]; then | ||
| log_info "Cloning $LLM_D_PROJECT repository (release: $LLM_D_RELEASE)" | ||
| git clone -b $LLM_D_RELEASE -- https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git $LLM_D_PROJECT &> /dev/null |
There was a problem hiding this comment.
git clone ... &> /dev/null suppresses all clone output. When the clone fails (e.g., tag doesn’t exist, network/auth issues), the script will exit due to set -e but without any actionable error details. Prefer letting stderr through or capturing it and emitting via log_error for easier troubleshooting.
| git clone -b $LLM_D_RELEASE -- https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git $LLM_D_PROJECT &> /dev/null | |
| git clone -b $LLM_D_RELEASE -- https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git $LLM_D_PROJECT > /dev/null |
|
|
||
| # Container tool (docker or podman) | ||
| CONTAINER_TOOL=${CONTAINER_TOOL:-docker} | ||
|
|
There was a problem hiding this comment.
Since CONTAINER_TOOL is executed directly (e.g. $CONTAINER_TOOL exec ...), validate it up front (only allow docker/podman) and fail fast with a clear message if the command is not found. This prevents hard-to-debug failures later and avoids running unexpected commands when the env var is mis-set.
| # Validate container tool selection early to avoid confusing failures and unexpected commands | |
| case "${CONTAINER_TOOL}" in | |
| docker|podman) | |
| if ! command -v "${CONTAINER_TOOL}" >/dev/null 2>&1; then | |
| echo "Error: CONTAINER_TOOL is set to '${CONTAINER_TOOL}', but the command was not found in PATH." >&2 | |
| exit 1 | |
| fi | |
| ;; | |
| *) | |
| echo "Error: CONTAINER_TOOL must be either 'docker' or 'podman', but is set to '${CONTAINER_TOOL}'." >&2 | |
| exit 1 | |
| ;; | |
| esac |
| # Pull a single-platform image so kind load does not hit "content digest not found" | ||
| # (multi-platform manifests can reference blobs that are not in the docker save stream). | ||
| # (multi-platform manifests can reference blobs that are not in the $CONTAINER_TOOL save stream). | ||
| local platform="${KIND_IMAGE_PLATFORM:-}" | ||
| if [ -z "$platform" ]; then | ||
| case "$(uname -m)" in |
There was a problem hiding this comment.
This block introduces $CONTAINER_TOOL in the comment/logic, but the subsequent image pull/inspect in this function still uses hard-coded docker commands. With CONTAINER_TOOL=podman, the pull path will fail. Please switch the pull/inspect commands to use $CONTAINER_TOOL (or branch per tool if flags differ, e.g. --platform).
| local tmp_tar="/tmp/wva-image-$(date +%s).tar" | ||
| $CONTAINER_TOOL save -o "$tmp_tar" "$WVA_IMAGE_REPO:$WVA_IMAGE_TAG" | ||
| kind load image-archive "$tmp_tar" --name "$CLUSTER_NAME" | ||
| rm -f "$tmp_tar" |
There was a problem hiding this comment.
The temporary tar archive is only removed on the success path. If save or kind load image-archive fails, the script can leak files in /tmp. Use mktemp and a trap to ensure the archive is cleaned up on all exits (including errors).
| local tmp_tar="/tmp/wva-image-$(date +%s).tar" | |
| $CONTAINER_TOOL save -o "$tmp_tar" "$WVA_IMAGE_REPO:$WVA_IMAGE_TAG" | |
| kind load image-archive "$tmp_tar" --name "$CLUSTER_NAME" | |
| rm -f "$tmp_tar" | |
| local tmp_tar | |
| tmp_tar="$(mktemp /tmp/wva-image-XXXXXX.tar)" | |
| trap 'rm -f "$tmp_tar"' EXIT INT TERM | |
| $CONTAINER_TOOL save -o "$tmp_tar" "$WVA_IMAGE_REPO:$WVA_IMAGE_TAG" | |
| kind load image-archive "$tmp_tar" --name "$CLUSTER_NAME" | |
| rm -f "$tmp_tar" | |
| trap - EXIT INT TERM |
|
|
||
| # Container tool (docker or podman can pass from Makefile) | ||
| CONTAINER_TOOL=${CONTAINER_TOOL:-docker} | ||
|
|
There was a problem hiding this comment.
CONTAINER_TOOL is used as a command later (e.g. $CONTAINER_TOOL image inspect/save). To avoid confusing failures (or accidental execution of unexpected commands), validate the value early (e.g. only allow docker or podman) and error out with a clear message if it’s unsupported or not installed.
| # Validate container tool early to avoid confusing failures or unexpected commands | |
| case "$CONTAINER_TOOL" in | |
| docker|podman) | |
| if ! command -v "$CONTAINER_TOOL" >/dev/null 2>&1; then | |
| echo -e "${RED}Error:${NC} Container tool '$CONTAINER_TOOL' is not installed or not in PATH." >&2 | |
| exit 1 | |
| fi | |
| ;; | |
| *) | |
| echo -e "${RED}Error:${NC} Unsupported CONTAINER_TOOL '$CONTAINER_TOOL'. Only 'docker' or 'podman' are supported." >&2 | |
| exit 1 | |
| ;; | |
| esac |
| rm -rf "$LLM_D_PROJECT" | ||
| else | ||
| log_info "$LLM_D_PROJECT directory already exists with correct version ($LLM_D_RELEASE)" | ||
| fi | ||
| elif [ -d "$LLM_D_PROJECT" ]; then | ||
| log_warning "$LLM_D_PROJECT exists but is not a git repository - removing it" | ||
| rm -rf "$LLM_D_PROJECT" |
There was a problem hiding this comment.
This re-clone logic is destructive: if $LLM_D_PROJECT exists at a different tag (or git describe --exact-match returns unknown), the script unconditionally rm -rfs the directory. That can delete local changes/worktrees unexpectedly. Consider a safer approach (e.g. git fetch --tags + git checkout $LLM_D_RELEASE, or gate deletion behind an explicit LLM_D_FORCE_RECLONE=true / confirmation).
| rm -rf "$LLM_D_PROJECT" | |
| else | |
| log_info "$LLM_D_PROJECT directory already exists with correct version ($LLM_D_RELEASE)" | |
| fi | |
| elif [ -d "$LLM_D_PROJECT" ]; then | |
| log_warning "$LLM_D_PROJECT exists but is not a git repository - removing it" | |
| rm -rf "$LLM_D_PROJECT" | |
| if [ "${LLM_D_FORCE_RECLONE:-false}" = "true" ]; then | |
| log_warning "LLM_D_FORCE_RECLONE=true, removing existing directory '$LLM_D_PROJECT' and re-cloning" | |
| rm -rf "$LLM_D_PROJECT" | |
| else | |
| log_info "Attempting to update existing repository in-place to release '$LLM_D_RELEASE'" | |
| ( | |
| cd "$LLM_D_PROJECT" | |
| git fetch --tags origin >/dev/null 2>&1 | |
| if git rev-parse "$LLM_D_RELEASE" >/dev/null 2>&1; then | |
| git checkout "$LLM_D_RELEASE" >/dev/null 2>&1 | |
| log_success "Updated existing $LLM_D_PROJECT repository to $LLM_D_RELEASE" | |
| else | |
| log_error "Release '$LLM_D_RELEASE' not found in existing $LLM_D_PROJECT repo. Set LLM_D_FORCE_RECLONE=true to allow deleting and re-cloning, or update the repository manually." | |
| fi | |
| ) | |
| fi | |
| else | |
| log_info "$LLM_D_PROJECT directory already exists with correct version ($LLM_D_RELEASE)" | |
| fi | |
| elif [ -d "$LLM_D_PROJECT" ]; then | |
| if [ "${LLM_D_FORCE_RECLONE:-false}" = "true" ]; then | |
| log_warning "$LLM_D_PROJECT exists but is not a git repository - removing it because LLM_D_FORCE_RECLONE=true" | |
| rm -rf "$LLM_D_PROJECT" | |
| else | |
| log_error "$LLM_D_PROJECT exists but is not a git repository. Move or remove it, or set LLM_D_FORCE_RECLONE=true to allow deletion." | |
| fi |
| if ! docker image inspect "$WVA_IMAGE_REPO:$WVA_IMAGE_TAG" >/dev/null 2>&1; then | ||
| log_error "Image '$WVA_IMAGE_REPO:$WVA_IMAGE_TAG' not found locally - Please build the image first (e.g., 'make docker-build IMG=$WVA_IMAGE_REPO:$WVA_IMAGE_TAG')" | ||
| if ! $CONTAINER_TOOL image inspect "$WVA_IMAGE_REPO:$WVA_IMAGE_TAG" >/dev/null 2>&1; then | ||
| log_error "Image '$WVA_IMAGE_REPO:$WVA_IMAGE_TAG' not found locally - Please build the image first (e.g., 'make $CONTAINER_TOOL-build IMG=$WVA_IMAGE_REPO:$WVA_IMAGE_TAG')" |
There was a problem hiding this comment.
The suggested fix command in this error message (make $CONTAINER_TOOL-build ...) doesn’t match the Makefile: there is a docker-build target that uses $(CONTAINER_TOOL) internally, but no podman-build target. Update the message to point to the actual build invocation (e.g., make docker-build CONTAINER_TOOL=$CONTAINER_TOOL ...).
| log_error "Image '$WVA_IMAGE_REPO:$WVA_IMAGE_TAG' not found locally - Please build the image first (e.g., 'make $CONTAINER_TOOL-build IMG=$WVA_IMAGE_REPO:$WVA_IMAGE_TAG')" | |
| log_error "Image '$WVA_IMAGE_REPO:$WVA_IMAGE_TAG' not found locally - Please build the image first (e.g., 'make docker-build CONTAINER_TOOL=$CONTAINER_TOOL IMG=$WVA_IMAGE_REPO:$WVA_IMAGE_TAG')" |
| # Clone llm-d repo if not exists, or re-clone if version mismatch detected | ||
| if [ -d "$LLM_D_PROJECT/.git" ]; then | ||
| # Check current version (try tag first, then branch) | ||
| CURRENT_VERSION=$(cd "$LLM_D_PROJECT" && git describe --tags --exact-match 2>/dev/null || git rev-parse --abbrev-ref HEAD 2>/dev/null || echo "unknown") | ||
| if [ "$CURRENT_VERSION" != "$LLM_D_RELEASE" ]; then | ||
| log_warning "$LLM_D_PROJECT exists but has version '$CURRENT_VERSION' (expected: $LLM_D_RELEASE)" | ||
| rm -rf "$LLM_D_PROJECT" | ||
| else | ||
| log_info "$LLM_D_PROJECT directory already exists with correct version ($LLM_D_RELEASE)" | ||
| fi | ||
| elif [ -d "$LLM_D_PROJECT" ]; then | ||
| log_warning "$LLM_D_PROJECT exists but is not a git repository - removing it" | ||
| rm -rf "$LLM_D_PROJECT" |
There was a problem hiding this comment.
Automatically rm -rf-ing an existing $LLM_D_PROJECT directory on version mismatch can unexpectedly delete local changes (or other content if the directory is user-managed). Consider a safer approach: git fetch --tags + git checkout "$LLM_D_RELEASE", or cloning into a versioned directory (e.g., llm-d-$LLM_D_RELEASE) and reusing it when present.
| # Clone llm-d repo if not exists, or re-clone if version mismatch detected | |
| if [ -d "$LLM_D_PROJECT/.git" ]; then | |
| # Check current version (try tag first, then branch) | |
| CURRENT_VERSION=$(cd "$LLM_D_PROJECT" && git describe --tags --exact-match 2>/dev/null || git rev-parse --abbrev-ref HEAD 2>/dev/null || echo "unknown") | |
| if [ "$CURRENT_VERSION" != "$LLM_D_RELEASE" ]; then | |
| log_warning "$LLM_D_PROJECT exists but has version '$CURRENT_VERSION' (expected: $LLM_D_RELEASE)" | |
| rm -rf "$LLM_D_PROJECT" | |
| else | |
| log_info "$LLM_D_PROJECT directory already exists with correct version ($LLM_D_RELEASE)" | |
| fi | |
| elif [ -d "$LLM_D_PROJECT" ]; then | |
| log_warning "$LLM_D_PROJECT exists but is not a git repository - removing it" | |
| rm -rf "$LLM_D_PROJECT" | |
| # Clone llm-d repo if not exists, or update in-place if version mismatch detected | |
| if [ -d "$LLM_D_PROJECT/.git" ]; then | |
| # Check current version (try tag first, then branch) | |
| CURRENT_VERSION=$(cd "$LLM_D_PROJECT" && git describe --tags --exact-match 2>/dev/null || git rev-parse --abbrev-ref HEAD 2>/dev/null || echo "unknown") | |
| if [ "$CURRENT_VERSION" != "$LLM_D_RELEASE" ]; then | |
| log_warning "$LLM_D_PROJECT exists but has version '$CURRENT_VERSION' (expected: $LLM_D_RELEASE)" | |
| log_info "Attempting to fetch and check out release '$LLM_D_RELEASE' in existing repository" | |
| if ! (cd "$LLM_D_PROJECT" && git fetch --tags origin && git fetch origin "$LLM_D_RELEASE" && git checkout "$LLM_D_RELEASE"); then | |
| log_error "Failed to switch $LLM_D_PROJECT to release '$LLM_D_RELEASE'. Please clean or backup '$LLM_D_PROJECT' and try again, or set LLM_D_PROJECT to a different path." | |
| exit 1 | |
| fi | |
| else | |
| log_info "$LLM_D_PROJECT directory already exists with correct version ($LLM_D_RELEASE)" | |
| fi | |
| elif [ -d "$LLM_D_PROJECT" ]; then | |
| log_error "$LLM_D_PROJECT exists but is not a git repository. Please remove or move this directory, or set LLM_D_PROJECT to a different path." | |
| exit 1 |
|
|
||
| if [ ! -d "$LLM_D_PROJECT" ]; then | ||
| log_info "Cloning $LLM_D_PROJECT repository (release: $LLM_D_RELEASE)" | ||
| git clone -b $LLM_D_RELEASE -- https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git $LLM_D_PROJECT &> /dev/null |
There was a problem hiding this comment.
git clone arguments here are unquoted. Since these values come from environment variables, whitespace or unexpected characters in $LLM_D_RELEASE, $LLM_D_OWNER, or $LLM_D_PROJECT can break the command (and make debugging harder). Quote the variables and the URL/path arguments.
| git clone -b $LLM_D_RELEASE -- https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git $LLM_D_PROJECT &> /dev/null | |
| git clone -b "$LLM_D_RELEASE" -- "https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git" "$LLM_D_PROJECT" &> /dev/null |
| # Sanitize model name for k8s label (if MODEL_ID is unsloth/Meta-Llama-3.1-8B, label uses unsloth-Meta-Llama-3.1-8B) | ||
| MODEL_ID_SANITIZED=$(echo "$MODEL_ID" | tr '/' '-') |
There was a problem hiding this comment.
MODEL_ID_SANITIZED only replaces / with -, but Kubernetes label values must be <=63 chars and match the label-value character rules. Many HuggingFace model IDs can exceed 63 chars or include other invalid characters, causing the deployment/helmfile apply to fail. Consider a more robust sanitization (replace any invalid chars, trim leading/trailing non-alphanumerics, and truncate to 63).
| # Sanitize model name for k8s label (if MODEL_ID is unsloth/Meta-Llama-3.1-8B, label uses unsloth-Meta-Llama-3.1-8B) | |
| MODEL_ID_SANITIZED=$(echo "$MODEL_ID" | tr '/' '-') | |
| # Sanitize model name for k8s label: | |
| # - replace '/' and whitespace with '-' | |
| # - replace any remaining invalid chars with '-' | |
| # - trim leading/trailing non-alphanumerics | |
| # - truncate to 63 characters and ensure it ends with an alphanumeric | |
| MODEL_ID_SANITIZED=$(echo "$MODEL_ID" \ | |
| | tr '/[:space:]' '-' \ | |
| | sed 's/[^A-Za-z0-9_.-]/-/g') | |
| # Trim leading non-alphanumeric characters | |
| MODEL_ID_SANITIZED=${MODEL_ID_SANITIZED##[^A-Za-z0-9]*} | |
| # Trim trailing non-alphanumeric characters | |
| MODEL_ID_SANITIZED=${MODEL_ID_SANITIZED%%[^A-Za-z0-9]*} | |
| # Truncate to 63 characters (Kubernetes label value max length) | |
| MODEL_ID_SANITIZED=${MODEL_ID_SANITIZED:0:63} | |
| # After truncation, ensure it does not end with non-alphanumeric characters | |
| MODEL_ID_SANITIZED=$(echo "$MODEL_ID_SANITIZED" | sed 's/[^A-Za-z0-9]*$//') |
| | `LLM_D_RELEASE` | llm-d release version (controls all llm-d images) | `v0.5.1` | | ||
| | `LLM_D_INFERENCE_SCHEDULER_IMG` | Override llm-d inference scheduler image | `ghcr.io/llm-d/llm-d-inference-scheduler:$LLM_D_RELEASE` | | ||
| | `LLM_D_INFERENCE_SIM_IMG` | Override llm-d inference simulator image | `ghcr.io/llm-d/llm-d-inference-sim:$LLM_D_RELEASE` | | ||
| | `CONTAINER_TOOL` | Container tool to use (docker or podman) | `docker` | | ||
|
|
||
| **Centralized llm-d Version Management**: Setting `LLM_D_RELEASE` automatically configures all llm-d component images to use the same release version. This ensures version consistency across the llm-d inference scheduler and simulator. Individual image variables can override this if needed. |
There was a problem hiding this comment.
The docs imply LLM_D_RELEASE “controls all llm-d component images”, but in the scripts some llm-d images are still configured independently (e.g., the inference-sim decode/prefill image is set via LLM_D_INFERENCE_SIM_IMG_REPO/LLM_D_INFERENCE_SIM_IMG_TAG, which defaults to latest in the kind-emulator env script). Consider narrowing the wording to the images that are actually tied to LLM_D_RELEASE, or updating the scripts so the simulator tag defaults to LLM_D_RELEASE as well.
| | `LLM_D_RELEASE` | llm-d release version (controls all llm-d images) | `v0.5.1` | | |
| | `LLM_D_INFERENCE_SCHEDULER_IMG` | Override llm-d inference scheduler image | `ghcr.io/llm-d/llm-d-inference-scheduler:$LLM_D_RELEASE` | | |
| | `LLM_D_INFERENCE_SIM_IMG` | Override llm-d inference simulator image | `ghcr.io/llm-d/llm-d-inference-sim:$LLM_D_RELEASE` | | |
| | `CONTAINER_TOOL` | Container tool to use (docker or podman) | `docker` | | |
| **Centralized llm-d Version Management**: Setting `LLM_D_RELEASE` automatically configures all llm-d component images to use the same release version. This ensures version consistency across the llm-d inference scheduler and simulator. Individual image variables can override this if needed. | |
| | `LLM_D_RELEASE` | llm-d release version (controls default tags for core llm-d component images used by WVA) | `v0.5.1` | | |
| | `LLM_D_INFERENCE_SCHEDULER_IMG` | Override llm-d inference scheduler image | `ghcr.io/llm-d/llm-d-inference-scheduler:$LLM_D_RELEASE` | | |
| | `LLM_D_INFERENCE_SIM_IMG` | Override llm-d inference simulator image | `ghcr.io/llm-d/llm-d-inference-sim:$LLM_D_RELEASE` | | |
| | `CONTAINER_TOOL` | Container tool to use (docker or podman) | `docker` | | |
| **Centralized llm-d Version Management**: Setting `LLM_D_RELEASE` configures the default release tag for the llm-d component images used by this deployment (specifically, the inference scheduler and simulator images above). This helps maintain version consistency across these core components. Additional llm-d images defined in platform-specific scripts may use their own configuration variables and are not automatically controlled by `LLM_D_RELEASE`, but individual image variables can always override the defaults if needed. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 12 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
deploy/kind-emulator/install.sh:200
- In
load_image, the pull path still hard-codesdocker pull/docker image inspect. WhenCONTAINER_TOOL=podman(or docker isn’t installed), this will fail even though the script advertises Podman support. Use$CONTAINER_TOOLconsistently for pull/inspect (and note thatpodman pulldoesn’t support--platformin the same way as docker, so you may need a tool-specific branch).
# Pull a single-platform image so kind load does not hit "content digest not found"
# (multi-platform manifests can reference blobs that are not in the $CONTAINER_TOOL save stream).
local platform="${KIND_IMAGE_PLATFORM:-}"
if [ -z "$platform" ]; then
case "$(uname -m)" in
aarch64|arm64) platform="linux/arm64" ;;
*) platform="linux/amd64" ;;
esac
fi
log_info "Pulling single-platform image for KIND (platform=$platform) to avoid load errors..."
if ! docker pull --platform "$platform" "$WVA_IMAGE_REPO:$WVA_IMAGE_TAG"; then
log_warning "Failed to pull image '$WVA_IMAGE_REPO:$WVA_IMAGE_TAG' (platform=$platform)"
log_info "Attempting to use existing local image..."
if ! docker image inspect "$WVA_IMAGE_REPO:$WVA_IMAGE_TAG" >/dev/null 2>&1; then
log_error "Image '$WVA_IMAGE_REPO:$WVA_IMAGE_TAG' not found locally - Please build or pull the image"
exit 1
| ) | ||
|
|
||
| // Default to v1alpha2 group if empty | ||
| // Default to v1 (since llm-d already by default use v1 than v1alpha2) if empty |
There was a problem hiding this comment.
The comment here is grammatically unclear ("already by default use v1 than v1alpha2") and makes the reasoning harder to follow. Please reword it (e.g., "Default to v1 if empty, since llm-d defaults to v1 now").
| // Default to v1 (since llm-d already by default use v1 than v1alpha2) if empty | |
| // Default to v1 if empty, since llm-d now defaults to v1 instead of v1alpha2. |
| $CONTAINER_TOOL exec -ti kind-control-plane umount -R /proc/driver/nvidia | ||
|
|
||
| # According to https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html | ||
| echo "> Adding/updateding the NVIDIA Helm repository" |
There was a problem hiding this comment.
Spelling in the log message: updateding → updating (makes CLI output look unpolished).
| echo "> Adding/updateding the NVIDIA Helm repository" | |
| echo "> Adding/updating the NVIDIA Helm repository" |
| if ! git clone -b "$LLM_D_RELEASE" -- https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git "$LLM_D_PROJECT" 2>&1 | grep -v "Cloning into"; then | ||
| log_error "Failed to clone $LLM_D_PROJECT repository (release: $LLM_D_RELEASE)" |
There was a problem hiding this comment.
The git clone error check is broken because the command is piped into grep -v. With set -o pipefail, the pipeline’s exit status can become grep’s, and a successful clone that only prints “Cloning into …” will likely make grep -v exit 1, causing the script to treat the clone as failed. Avoid piping the clone output into grep for suppression; redirect output instead (or capture output without affecting the exit code).
| if ! git clone -b "$LLM_D_RELEASE" -- https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git "$LLM_D_PROJECT" 2>&1 | grep -v "Cloning into"; then | |
| log_error "Failed to clone $LLM_D_PROJECT repository (release: $LLM_D_RELEASE)" | |
| if ! clone_output=$(git clone -b "$LLM_D_RELEASE" -- "https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git" "$LLM_D_PROJECT" 2>&1); then | |
| log_error "Failed to clone $LLM_D_PROJECT repository (release: $LLM_D_RELEASE)" | |
| printf '%s\n' "$clone_output" |
| # TODO: update once the llm-d-inference-scheduler v0.5.0 is released | ||
| LLM_D_INFERENCE_SCHEDULER_IMG=${LLM_D_INFERENCE_SCHEDULER_IMG:-"ghcr.io/llm-d/llm-d-inference-scheduler:v0.5.0-rc.1"} | ||
| LLM_D_INFERENCE_SCHEDULER_IMG=${LLM_D_INFERENCE_SCHEDULER_IMG:-"ghcr.io/llm-d/llm-d-inference-scheduler:$LLM_D_RELEASE"} | ||
| LLM_D_INFERENCE_SIM_IMG=${LLM_D_INFERENCE_SIM_IMG:-"ghcr.io/llm-d/llm-d-inference-sim:$LLM_D_RELEASE"} |
There was a problem hiding this comment.
This script now defines LLM_D_INFERENCE_SIM_IMG based on LLM_D_RELEASE, but later it still patches the llm-d values using LLM_D_INFERENCE_SIM_IMG_REPO:$LLM_D_INFERENCE_SIM_IMG_TAG (and LLM_D_INFERENCE_SIM_IMG_REPO is not defined anywhere in this script). To avoid producing an invalid image reference, either define the repo/tag variables here or standardize on a single variable (LLM_D_INFERENCE_SIM_IMG) and update the downstream patching logic accordingly.
| LLM_D_INFERENCE_SIM_IMG=${LLM_D_INFERENCE_SIM_IMG:-"ghcr.io/llm-d/llm-d-inference-sim:$LLM_D_RELEASE"} | |
| LLM_D_INFERENCE_SIM_IMG=${LLM_D_INFERENCE_SIM_IMG:-"ghcr.io/llm-d/llm-d-inference-sim:$LLM_D_RELEASE"} | |
| LLM_D_INFERENCE_SIM_IMG_REPO=${LLM_D_INFERENCE_SIM_IMG_REPO:-"ghcr.io/llm-d/llm-d-inference-sim"} | |
| LLM_D_INFERENCE_SIM_IMG_TAG=${LLM_D_INFERENCE_SIM_IMG_TAG:-"$LLM_D_RELEASE"} |
| elif [ "$POOL_GROUP" = "inference.networking.x-k8s.io" ]; then | ||
| POOL_VERSION="v1alpha2" | ||
| else | ||
| log_error "Unknown POOL_GROUP: $POOL_GROUP (expected inference.networking.k8s.io or inference.networking.x-k8s.io)" |
There was a problem hiding this comment.
log_error is called here before it is defined later in the script. If an invalid POOL_GROUP is provided, this will result in log_error: command not found and a less clear failure mode. Consider moving the helper function definitions above this validation block, or replace this call with a plain echo/printf + exit 1 at this early point in the script.
| log_error "Unknown POOL_GROUP: $POOL_GROUP (expected inference.networking.k8s.io or inference.networking.x-k8s.io)" | |
| echo -e "${RED}Error:${NC} Unknown POOL_GROUP: $POOL_GROUP (expected inference.networking.k8s.io or inference.networking.x-k8s.io)" >&2 |
|
|
||
| # Create InferenceModel for second model (maps model name to pool) | ||
| # Note: InferenceModel CRD may not be available in all environments | ||
| # TODO: InferenceModel only exists in inference.networking.x-k8s.io/v1alpha2, should use InfereceModelRewrite instead |
There was a problem hiding this comment.
Typo in the TODO comment: InfereceModelRewrite is misspelled, which makes it harder to search for the intended resource/name later. Please correct it to InferenceModelRewrite.
| # TODO: InferenceModel only exists in inference.networking.x-k8s.io/v1alpha2, should use InfereceModelRewrite instead | |
| # TODO: InferenceModel only exists in inference.networking.x-k8s.io/v1alpha2, should use InferenceModelRewrite instead |
| if ! git clone -b "$LLM_D_RELEASE" -- https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git "$LLM_D_PROJECT" 2>&1 | grep -v "Cloning into"; then | ||
| log_error "Failed to clone $LLM_D_PROJECT repository (release: $LLM_D_RELEASE)" | ||
| return 1 | ||
| fi |
There was a problem hiding this comment.
The git clone command is piped to grep -v, which changes the pipeline exit code semantics: if clone output is fully filtered (or empty), grep can exit non-zero and make a successful clone look like a failure. Prefer checking git clone’s exit status directly (e.g., run it normally/with -q, redirecting stdout as needed while preserving the clone exit code).
|
|
||
| # Create InferenceModel for second model (maps model name to pool) | ||
| # Note: InferenceModel CRD may not be available in all environments | ||
| # TODO: InferenceModel only exists in inference.networking.x-k8s.io/v1alpha2, should use InfereceModelRewrite instead |
There was a problem hiding this comment.
Typo in comment: InfereceModelRewrite should be InferenceModelRewrite (and/or clarify the intended resource name).
| # (multi-platform manifests can reference blobs that are not in the $CONTAINER_TOOL save stream). | ||
| local platform="${KIND_IMAGE_PLATFORM:-}" | ||
| if [ -z "$platform" ]; then | ||
| case "$(uname -m)" in |
There was a problem hiding this comment.
In load_image(), Podman support is only partially applied: this hunk updates the comment to mention $CONTAINER_TOOL, but the subsequent pull/inspect logic still uses docker pull / docker image inspect. With CONTAINER_TOOL=podman and WVA_IMAGE_PULL_POLICY!=IfNotPresent, the script will still require Docker. Suggest switching those commands to $CONTAINER_TOOL (or branching like the Kind load step) so the pull path works with Podman too.
| WVA provides a **single consolidated E2E suite** that runs on Kind clusters with emulated GPUs. Tests create VA, HPA, and model services dynamically as part of the test workflow. | ||
|
|
||
| > **Note**: E2E tests are only supported on Kind clusters (`ENVIRONMENT=kind-emulator`). The test fixtures and labels are configured specifically for the Kind emulator deployment which uses the `simulated-accelerators` guide from llm-d. Running E2E tests on other environments (OpenShift, generic Kubernetes) is not supported for now. | ||
|
|
||
| - **Location**: `test/e2e/` | ||
| - **Environments**: Kind (emulated), OpenShift, or generic Kubernetes | ||
| - **Supported Environment**: Kind (emulated GPUs only) |
There was a problem hiding this comment.
This documentation change says E2E tests are only supported on Kind and that OpenShift/generic Kubernetes is not supported. That conflicts with the Makefile’s stated goal of “environment-agnostic” consolidated E2E targets (which accept ENVIRONMENT=$(ENVIRONMENT)) and the presence of an OpenShift E2E workflow in .github/workflows/ci-e2e-openshift.yaml. Please align the docs with actual supported environments (either restore OpenShift/Kubernetes support in the docs, or update the Makefile/CI if support was intentionally dropped).
| // Default to v1 (since llm-d already by default use v1 than v1alpha2) if empty | ||
| if poolGroup == "" { | ||
| poolGroup = PoolGroupV1Alpha2 | ||
| poolGroup = PoolGroupV1 |
There was a problem hiding this comment.
The inline comment is hard to read and grammatically incorrect (“llm-d already by default use v1 than v1alpha2”). Please rephrase so it clearly explains why v1 is the default (e.g., “Default to v1 when poolGroup is empty”).
| if ! $CONTAINER_TOOL image inspect "$WVA_IMAGE_REPO:$WVA_IMAGE_TAG" >/dev/null 2>&1; then | ||
| log_error "Image '$WVA_IMAGE_REPO:$WVA_IMAGE_TAG' not found locally - Please build the image first (e.g., 'make $CONTAINER_TOOL-build IMG=$WVA_IMAGE_REPO:$WVA_IMAGE_TAG')" | ||
| else |
There was a problem hiding this comment.
This error message suggests running make $CONTAINER_TOOL-build ..., but the Makefile target is still named docker-build (it just uses CONTAINER_TOOL internally). Consider updating the example to something that actually works, e.g. make docker-build CONTAINER_TOOL=$CONTAINER_TOOL IMG=....
| local tmp_tar="/tmp/wva-image-$(date +%s).tar" | ||
| $CONTAINER_TOOL save -o "$tmp_tar" "$WVA_IMAGE_REPO:$WVA_IMAGE_TAG" | ||
| kind load image-archive "$tmp_tar" --name "$CLUSTER_NAME" | ||
| rm -f "$tmp_tar" |
There was a problem hiding this comment.
For Podman loading, the temp tar path is predictable and won’t be cleaned up if podman save or kind load fails. Prefer using mktemp (or similar) and a trap to ensure the archive is always removed on exit/error.
| // double ehck flowcontrol is enabled as env var | ||
| for _, container := range pod.Spec.Containers { |
There was a problem hiding this comment.
Typo in comment: double ehck → double check.
|
/test-e2e-full |
|
@zdtsw one failed test, not sure due to your changes. |
|
/ok-to-test |
|
@lionelvillard @asm582 there are few changes here but they look ok. Pls review. |
|
@zdtsw, pls check one failing test |
- use env variable LLM_D_RELEASE to control all image in the deploy/install.sh - clone llm-d to local based on local version if match required release version - use env variable CONTAINER_TOOL to support podmano on fedora - remove/update *ignore files Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
- old test was based on v1alph2 of GIE for infpool - new default is on v1 for infpool Signed-off-by: Wen Zhou <wenzhou@redhat.com>
- the label we should use is "llm-d.ai/inference-serving:true" Signed-off-by: Wen Zhou <wenzhou@redhat.com>
- if wva was started before CRD regardless v1 or v1alph2 is installed, it hit empty Signed-off-by: Wen Zhou <wenzhou@redhat.com>
- since sim image has a different release we cannot use same env variable Signed-off-by: Wen Zhou <wenzhou@redhat.com>
- if e2e is only tested on kind with specific guide gaie-sim then make it clear and fix the label value - llm-d.ai/guide:simlulated-accelerators - llm-d.ai/model:random Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
| | `WVA_IMAGE_PULL_POLICY` | Image pull policy | `Always` | | ||
| | `LLM_D_RELEASE` | llm-d release version (controls all llm-d images) | `v0.5.1` | | ||
| | `LLM_D_INFERENCE_SCHEDULER_IMG` | Override llm-d inference scheduler image | `ghcr.io/llm-d/llm-d-inference-scheduler:$LLM_D_RELEASE` | | ||
| | `LLM_D_INFERENCE_SIM_IMG` | Override llm-d inference simulator image | `ghcr.io/llm-d/llm-d-inference-sim:$LLM_D_RELEASE` | |
There was a problem hiding this comment.
The config reference documents LLM_D_INFERENCE_SIM_IMG as the way to override the inference-sim image, but deploy/install.sh updates the primary modelservice values using LLM_D_INFERENCE_SIM_IMG_REPO + LLM_D_INFERENCE_SIM_IMG_TAG (and only uses LLM_D_INFERENCE_SIM_IMG for the optional second model). Either align the script to use LLM_D_INFERENCE_SIM_IMG consistently, or update this documentation to describe the actual override variables that take effect.
| | `LLM_D_INFERENCE_SIM_IMG` | Override llm-d inference simulator image | `ghcr.io/llm-d/llm-d-inference-sim:$LLM_D_RELEASE` | | |
| | `LLM_D_INFERENCE_SIM_IMG_REPO` | Override primary llm-d inference simulator image repository | `ghcr.io/llm-d/llm-d-inference-sim` | | |
| | `LLM_D_INFERENCE_SIM_IMG_TAG` | Override primary llm-d inference simulator image tag | `$LLM_D_RELEASE` | | |
| | `LLM_D_INFERENCE_SIM_IMG` | Override llm-d inference simulator image for the optional second model | `ghcr.io/llm-d/llm-d-inference-sim:$LLM_D_RELEASE` | |
| if !pendingRequestExist { | ||
| // Log INFO only when queue exists but model doesn't match | ||
| if queueMetricFound { | ||
| logger.Info("Scale-from-zero: queue has pending requests but model not matched", | ||
| "va", va.Name, | ||
| "vaModelID", va.Spec.ModelID, | ||
| "queueModels", queueMetricModels) | ||
| } |
There was a problem hiding this comment.
The new INFO log inside the 100ms scale-from-zero loop can flood logs when the queue has pending requests for other models (queueMetricFound=true but no match). This effectively reintroduces high-volume logging at INFO level; consider downgrading this to DEBUG, adding rate limiting, or logging only on state transition (e.g., first time mismatch is observed).
| WVA provides a **single consolidated E2E suite** that runs on Kind clusters with emulated GPUs. Tests create VA, HPA, and model services dynamically as part of the test workflow. | ||
|
|
||
| > **Note**: E2E tests are only supported on Kind clusters (`ENVIRONMENT=kind-emulator`). The test fixtures and labels are configured specifically for the Kind emulator deployment which uses the `simulated-accelerators` guide from llm-d. Running E2E tests on other environments (OpenShift, generic Kubernetes) is not supported for now. | ||
|
|
||
| - **Location**: `test/e2e/` | ||
| - **Environments**: Kind (emulated), OpenShift, or generic Kubernetes | ||
| - **Supported Environment**: Kind (emulated GPUs only) | ||
| - **Tiers**: Smoke (~5–10 min) for PRs; full suite (~15–25 min) for comprehensive validation |
There was a problem hiding this comment.
The note says E2E tests are only supported on Kind, but this doc section still includes OpenShift-specific quick start instructions and later presents an OpenShift E2E column in the comparison matrix. Either remove/update those OpenShift references, or reword this note to clarify which suites run on which environments (e.g., Kind-only for test/e2e/ but separate OpenShift suites exist).
| if [ ! -d "$LLM_D_PROJECT" ]; then | ||
| log_info "Cloning $LLM_D_PROJECT repository (release: $LLM_D_RELEASE)" | ||
| git clone -b $LLM_D_RELEASE -- https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git $LLM_D_PROJECT &> /dev/null | ||
| else | ||
| log_warning "$LLM_D_PROJECT directory already exists, skipping clone" | ||
| if ! git clone -b "$LLM_D_RELEASE" -- https://github.com/$LLM_D_OWNER/$LLM_D_PROJECT.git "$LLM_D_PROJECT" 2>&1 | grep -v "Cloning into"; then | ||
| log_error "Failed to clone $LLM_D_PROJECT repository (release: $LLM_D_RELEASE)" | ||
| return 1 | ||
| fi | ||
| log_success "Successfully cloned $LLM_D_PROJECT repository (release: $LLM_D_RELEASE)" |
There was a problem hiding this comment.
The git clone success/failure check is unreliable because the pipeline exit status comes from grep, not git clone. If git clone succeeds but only prints “Cloning into…”, grep -v will return exit code 1 (no output) and this branch will incorrectly treat the clone as failed. Capture git clone's exit code (e.g., avoid a pipe, or use PIPESTATUS) and only filter output for logging after verifying success.
| if !isReady { | ||
| continue | ||
| } | ||
| // double ehck flowcontrol is enabled as env var |
There was a problem hiding this comment.
Typo in comment: "double ehck" → "double check".
| // double ehck flowcontrol is enabled as env var | |
| // double check flowcontrol is enabled as env var |
|
|
||
| # Create InferenceModel for second model (maps model name to pool) | ||
| # Note: InferenceModel CRD may not be available in all environments | ||
| # TODO: InferenceModel only exists in inference.networking.x-k8s.io/v1alpha2, should use InfereceModelRewrite instead |
There was a problem hiding this comment.
Typo in TODO comment: "InfereceModelRewrite" should be "InferenceModelRewrite".
| # TODO: InferenceModel only exists in inference.networking.x-k8s.io/v1alpha2, should use InfereceModelRewrite instead | |
| # TODO: InferenceModel only exists in inference.networking.x-k8s.io/v1alpha2, should use InferenceModelRewrite instead |
Signed-off-by: Wen Zhou <wenzhou@redhat.com>
|
This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the |
Changes
Notes
without the change ,it runs with
Test