136 changes: 113 additions & 23 deletions .github/workflows/test-spyre.yml
@@ -1,29 +1,119 @@
 name: test-sypre
 
-on: pull_request
+on:
+  # Don't use pull_request.paths filter since this workflow is required for
+  # all pull requests on main irrespective of file type or location
+  pull_request:
+    branches:
+      - main
+  push:
+    branches:
+      - main
+    paths:
+      - "tests/**/*.py"
+      - "vllm_spyre/**/*.py"
+      - pyproject.toml
+      - .github/workflows/test-spyre.yml
+  workflow_dispatch:
 
+env:
+  # force output to be colored for non-tty GHA runner shell
+  FORCE_COLOR: "1"
+  # prefer index for torch cpu version and match pip's extra index policy
+  UV_EXTRA_INDEX_URL: "https://download.pytorch.org/whl/cpu"
+  UV_INDEX_STRATEGY: "unsafe-best-match"
+  # facilitate testing by building vLLM for CPU when needed
+  VLLM_CPU_DISABLE_AVX512: "true"
+  VLLM_TARGET_DEVICE: "cpu"
+  VLLM_PLUGINS: "spyre"
+  VLLM_SPYRE_TEST_MODEL_DIR: "${{ github.workspace }}/models"
+  HF_HUB_CACHE: "${{ github.workspace }}/.cache/huggingface/hub"
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+  cancel-in-progress: true
+
 jobs:
   test-spyre:
-    runs-on: ubuntu-latest
-    timeout-minutes: 20
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: ["ubuntu-latest"]
+        python_version: ["3.12"]
+        vllm_version:
+          - name: "vLLM:v0.8.0"
+            repo: "git+https://github.com/vllm-project/vllm --tag v0.8.0"
+          - name: "vLLM:main"
+            repo: "git+https://github.com/vllm-project/vllm --branch main"
+          - name: "ODH:main"
+            repo: "git+https://github.com/opendatahub-io/vllm --branch main"
+        test_suite:
+          - name: "V0"
+            tests: "V0 and eager"
+            flags: "--timeout=300"
+          - name: "V1"
+            tests: "(V1- and eager) or test_sampling_metadata_in_input_batch"
+            flags: "--timeout=300 --forked"
+        exclude:
+          - vllm_version: { name: "vLLM:main" }
+            test_suite: { name: "V1" }
+          - vllm_version: { name: "ODH:main" }
+            test_suite: { name: "V1" }
+
+    name: "${{ matrix.test_suite.name }} (${{ matrix.vllm_version.name }})"
+
     steps:
-      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
-      - name: Build docker image
-        run: docker build . -t vllm-spyre -f Dockerfile.spyre
-      - name: Run Spyre tests within docker container
-        run: |
-          docker run -i --rm --entrypoint /bin/bash vllm-spyre -c '''
-          source vllm-spyre/.venv/bin/activate && \
-          python -c "from transformers import pipeline; pipeline(\"text-generation\", model=\"JackFram/llama-160m\")" && \
-          export VARIANT=$(ls /root/.cache/huggingface/hub/models--JackFram--llama-160m/snapshots/) && \
-          mkdir -p /models && \
-          ln -s /root/.cache/huggingface/hub/models--JackFram--llama-160m/snapshots/${VARIANT} /models/llama-194m && \
-          python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer(\"sentence-transformers/all-roberta-large-v1\")" && \
-          export VARIANT=$(ls /root/.cache/huggingface/hub/models--sentence-transformers--all-roberta-large-v1/snapshots/) && \
-          ln -s /root/.cache/huggingface/hub/models--sentence-transformers--all-roberta-large-v1/snapshots/${VARIANT} /models/all-roberta-large-v1 && \
-          export MASTER_PORT=12355 && \
-          export MASTER_ADDR=localhost && \
-          export DISTRIBUTED_STRATEGY_IGNORE_MODULES=WordEmbedding && \
-          cd vllm-spyre && \
-          python -m pytest --timeout=300 tests -v -k "V0 and eager" && \
-          python -m pytest --forked --timeout=300 tests -v -k "(V1- and eager) or test_sampling_metadata_in_input_batch"
-          '''
+      - name: "Checkout"
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 1
+
+      - name: "Install PyTorch"
+        run: |
+          pip install torch=="2.5.1+cpu" --index-url https://download.pytorch.org/whl/cpu
+      - name: "Install uv"
+        uses: astral-sh/setup-uv@v5
+        with:
+          version: "latest"
+          python-version: ${{ matrix.python_version }}
+          enable-cache: true
+          ignore-nothing-to-cache: true
+          cache-dependency-glob: |
+            pyproject.toml
+      - name: "Install vLLM"
+        env:
+          VLLM_TARGET_DEVICE: empty
+        run: |
+          # Install markupsafe from PyPI, Torch CPU index only has wheels for Python 3.13
+          uv add markupsafe --index force_pypi_index=https://pypi.org/simple
+          uv add ${{ matrix.vllm_version.repo }}
+          uv venv .venv --system-site-packages
+          source .venv/bin/activate
+          uv pip install -v -e .
+          uv sync --frozen --group dev
+      - name: "Download models"
Collaborator:

Bonus points if we could cache these, but definitely not necessary for this PR.

Collaborator Author (@ckadner), Apr 8, 2025:

Minus points actually :-)

  • the download time from GHA cache is about equal to the download time from HF using the Python processes
    • restoring from GHA cache: ~21s
    • downloading from HF: ~19s
  • the two models take up about 1.8 GB of cache (against the 10 GB per-repository limit)

GHA cache makes the most sense for operations that cost a lot of compute time, not when the time is spent on downloads.

Collaborator Author (@ckadner), Apr 8, 2025:

I can speed up the HF download times by a few seconds by running the two Python processes in "parallel":

      - name: "Download models"
        run: |
          mkdir -p "${VLLM_SPYRE_TEST_MODEL_DIR}"
          download_jackfram_llama() {
            python -c "from transformers import pipeline; pipeline('text-generation', model='JackFram/llama-160m')"
            VARIANT=$(ls "${HF_HUB_CACHE}/models--JackFram--llama-160m/snapshots/")
            ln -s "${HF_HUB_CACHE}/models--JackFram--llama-160m/snapshots/${VARIANT}" "${VLLM_SPYRE_TEST_MODEL_DIR}/llama-194m"
          }
          download_roberta_large() {
            python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('sentence-transformers/all-roberta-large-v1')"
            VARIANT=$(ls "${HF_HUB_CACHE}/models--sentence-transformers--all-roberta-large-v1/snapshots/")
            ln -s "${HF_HUB_CACHE}/models--sentence-transformers--all-roberta-large-v1/snapshots/${VARIANT}" "${VLLM_SPYRE_TEST_MODEL_DIR}/all-roberta-large-v1"
          }
          download_jackfram_llama &
          download_roberta_large &
          wait
  • in sequence: 22s
  • in parallel: 16s

Collaborator:

Ah, nice!

I was thinking more along the lines of reliability rather than speed here, since the upstream vLLM CI downloads tons of models in parallel from HF and often flakes out when a download fails. But this test suite is still small enough that it's probably fine to keep pulling from HF for now. We can always switch to the GHA cache if it becomes a problem.

Collaborator Author (@ckadner):

I see your point :-)

https://github.com/vllm-project/vllm-spyre/actions/runs/14342866903/job/40206277191?pr=70#step:6:34

huggingface_hub.errors.HfHubHTTPError: 403 Forbidden: None.
Cannot access content at: https://huggingface.co/JackFram/llama-160m/resolve/main/config.json.
Make sure your token has the correct permissions.

I will do the hub cache
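
For reference, a minimal sketch of what caching the hub directory with actions/cache could look like; the step placement and the cache key below are illustrative assumptions, not the exact change that was pushed:

      # Restore/save the Hugging Face hub cache between runs (illustrative sketch).
      # Keep an eye on size: GHA allows roughly 10 GB of cache per repository,
      # so this only works while the test models stay small.
      - name: "Cache HF hub models"
        uses: actions/cache@v4
        with:
          path: ${{ env.HF_HUB_CACHE }}
          key: hf-hub-cache-${{ runner.os }}

With the combined actions/cache action, the cache is saved automatically in a post-job step whenever the primary key was not found during restore.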

Collaborator:

oh lol, that was fast!

And of course, any comments about the limitations would be great, so the next maintainer knows not to try to stick a 7 GB model in here.

Collaborator Author (@ckadner):

@joerunde -- took me a bit to get cache updates to work properly with immutable caches. I pushed another commit that should:

  • only create cache blobs for one of the matrix jobs
  • not create cache blobs for PR branches
  • update cache blobs on push to main when new models get added or old ones are removed
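
A restore/save split along these lines could implement that; this is only an illustrative sketch (step names, cache key, and the matrix condition are assumptions, not the actual commit):

      # Every job restores the model cache. Cache entries are immutable, so the key
      # includes a hash of the workflow file: when the model list changes, the key
      # changes and a fresh blob can be saved under the new key.
      - name: "Restore HF hub cache"
        id: hf-cache
        uses: actions/cache/restore@v4
        with:
          path: ${{ env.HF_HUB_CACHE }}
          key: hf-hub-cache-${{ runner.os }}-${{ hashFiles('.github/workflows/test-spyre.yml') }}
          restore-keys: |
            hf-hub-cache-${{ runner.os }}-

      # ... download any models the restore missed ...

      # Only one matrix job, and only on push to main (not on PR branches), saves a
      # new blob, and only when the exact key was not already present.
      - name: "Save HF hub cache"
        if: >-
          github.event_name == 'push' &&
          steps.hf-cache.outputs.cache-hit != 'true' &&
          matrix.vllm_version.name == 'vLLM:main' &&
          matrix.test_suite.name == 'V0'
        uses: actions/cache/save@v4
        with:
          path: ${{ env.HF_HUB_CACHE }}
          key: ${{ steps.hf-cache.outputs.cache-primary-key }}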

+        run: |
+          mkdir -p "${VLLM_SPYRE_TEST_MODEL_DIR}"
+          python -c "from transformers import pipeline; pipeline(\"text-generation\", model=\"JackFram/llama-160m\")"
+          VARIANT=$(ls "${HF_HUB_CACHE}/models--JackFram--llama-160m/snapshots/")
+          ln -s "${HF_HUB_CACHE}/models--JackFram--llama-160m/snapshots/${VARIANT}" "${VLLM_SPYRE_TEST_MODEL_DIR}/llama-194m"
+          python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer(\"sentence-transformers/all-roberta-large-v1\")"
+          VARIANT=$(ls "${HF_HUB_CACHE}/models--sentence-transformers--all-roberta-large-v1/snapshots/")
+          ln -s "${HF_HUB_CACHE}/models--sentence-transformers--all-roberta-large-v1/snapshots/${VARIANT}" "${VLLM_SPYRE_TEST_MODEL_DIR}/all-roberta-large-v1"
+      - name: "Run tests"
+        env:
+          MASTER_PORT: 12355
+          MASTER_ADDR: localhost
+          DISTRIBUTED_STRATEGY_IGNORE_MODULES: WordEmbedding
+        run: |
+          source .venv/bin/activate
+          uv run pytest ${{ matrix.test_suite.flags }} \
+            tests -v -k "${{ matrix.test_suite.tests }}"
1 change: 1 addition & 0 deletions pyproject.toml
@@ -102,6 +102,7 @@ use_parentheses = true
 skip_gitignore = true
 
 [tool.pytest.ini_options]
+pythonpath = ["."]
 markers = [
     "skip_global_cleanup",
     "core_model: enable this model test in each PR instead of only nightly",