[CI] Test against vllm/main #70
- Install Python package on GH action runner instead of building Docker image
- Use uv instead of pip as package manager
- Use matrix to test against different build targets
- Use matrix strategy to run v0 and v1 tests in parallel

Resolves vllm-project#41

Signed-off-by: Christian Kadner <[email protected]>
👋 Hi! Thank you for contributing to vLLM support on Spyre.
Or this can be done with
Now you are good to go 🚀
There is potential for another speed-up of about 10 seconds if we cache the entire installed Python site-packages directory after PyTorch has been installed. The cache key would depend on the PyTorch version, Python version, and OS.
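A minimal sketch of what such a cache step could look like (the .venv path, matrix.python_version, and env.TORCH_VERSION are assumptions for illustration, not taken from this PR):

- name: "Cache Python site-packages"
  uses: actions/cache@v4
  with:
    # assumed location of the uv-created virtual environment; adjust to the actual path
    path: .venv
    # key invalidates whenever the OS, Python version, or PyTorch version changes
    key: site-packages-${{ runner.os }}-py${{ matrix.python_version }}-torch${{ env.TORCH_VERSION }}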
@joerunde do you have the powers to kick off a workflow run?
.github/workflows/test-spyre.yml

uv pip install -v -e .
uv sync --frozen --group dev
- name: "Download models"
Bonus points if we could cache these, but definitely not necessary for this PR
Minus points actually :-)

- the download time from the GHA cache is about equal to the download time from HF using the Python processes
- the two models take up about 1.8 GB of cache (against the 10 GB limit)

GHA cache makes the most sense for operations that cost a lot of compute time, not when the time is spent on downloads.
I can speed up the HF download times by a few seconds by running the two Python processes in "parallel":
- name: "Download models"
  run: |
    mkdir -p "${VLLM_SPYRE_TEST_MODEL_DIR}"
    # each helper downloads a model into the HF hub cache, then symlinks
    # its snapshot directory into the test model directory
    download_jackfram_llama() {
      python -c "from transformers import pipeline; pipeline('text-generation', model='JackFram/llama-160m')"
      VARIANT=$(ls "${HF_HUB_CACHE}/models--JackFram--llama-160m/snapshots/")
      ln -s "${HF_HUB_CACHE}/models--JackFram--llama-160m/snapshots/${VARIANT}" "${VLLM_SPYRE_TEST_MODEL_DIR}/llama-194m"
    }
    download_roberta_large() {
      python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('sentence-transformers/all-roberta-large-v1')"
      VARIANT=$(ls "${HF_HUB_CACHE}/models--sentence-transformers--all-roberta-large-v1/snapshots/")
      ln -s "${HF_HUB_CACHE}/models--sentence-transformers--all-roberta-large-v1/snapshots/${VARIANT}" "${VLLM_SPYRE_TEST_MODEL_DIR}/all-roberta-large-v1"
    }
    # run both downloads as background jobs and wait for both to finish
    download_jackfram_llama &
    download_roberta_large &
    wait
Ah, nice!
I was thinking more along the lines of reliability rather than speed here, since the upstream vLLM CI downloads a ton of models in parallel from HF and often flakes out when a download fails. But this test suite is still small enough that it's probably fine to keep pulling from HF for now. We can always switch to the GHA cache if it becomes a problem.
I see your point :-)
https://github.com/vllm-project/vllm-spyre/actions/runs/14342866903/job/40206277191?pr=70#step:6:34
huggingface_hub.errors.HfHubHTTPError: 403 Forbidden: None.
Cannot access content at: https://huggingface.co/JackFram/llama-160m/resolve/main/config.json.
Make sure your token has the correct permissions.
I will do the hub cache
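For reference, a hedged sketch of caching the hub directory with actions/cache; the key choice, hashing the workflow file as a stand-in for "the model list changed", is an assumption:

- name: "Cache HF models"
  uses: actions/cache@v4
  with:
    path: ${{ env.HF_HUB_CACHE }}
    # any file that enumerates the test models would work as the key input
    key: hf-models-${{ hashFiles('.github/workflows/test-spyre.yml') }}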
oh lol, that was fast!
And of course any comments about the limitations would be great so the next maintainer knows not to try to stick a 7GB model in here
@joerunde -- took me a bit to get cache updates to work properly with immutable caches. I pushed another commit that should:

- only create cache blobs for one of the matrix jobs
- not create cache blobs for PR branches
- update cache blobs on push to main when new models get added or old ones removed

One way to express those rules is sketched below.
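A minimal sketch using the actions/cache/restore and actions/cache/save split; the step id, key choice, and matrix.test_suite value are assumptions, not necessarily what the commit does:

- name: "Restore HF models from cache"
  id: hf-cache
  uses: actions/cache/restore@v4
  with:
    path: ${{ env.HF_HUB_CACHE }}
    key: hf-models-${{ hashFiles('.github/workflows/test-spyre.yml') }}

# ... download models only if the cache missed ...

- name: "Save HF models to cache"
  uses: actions/cache/save@v4
  # write only on a cache miss, only on pushes to main, and only from one matrix job
  if: steps.hf-cache.outputs.cache-hit != 'true' && github.ref == 'refs/heads/main' && matrix.test_suite == 'V0'
  with:
    path: ${{ env.HF_HUB_CACHE }}
    key: ${{ steps.hf-cache.outputs.cache-primary-key }}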
Thanks @ckadner! I mainly just had a lot of little questions; overall this looks super great.
- Remove extra uv settings:
  - Don't prefer PyTorch package index
  - Don't force PyPI for markupsafe package
- Download HF models in parallel

Signed-off-by: Christian Kadner <[email protected]>
cache HF models
Signed-off-by: Christian Kadner <[email protected]>
Looks like we need another main merge and to test against 0.8.3
update to vLLM:v0.8.3
Signed-off-by: Christian Kadner <[email protected]>
@joerunde -- Merged. Updated to 0.8.3. Should I make the
don't explicitly set vLLM 0.8.x version
Signed-off-by: Christian Kadner <[email protected]>
I made a quick change to update the job matrix to test against the project default instead of a pinned version. A prettier version of this workflow could have a pre-test job to pull the actual version out of the project's dependency spec.
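A hedged sketch of such a pre-test job; the job name, the assumption that the pin lives in pyproject.toml, and the grep pattern are all illustrative:

resolve-vllm-version:
  runs-on: ubuntu-latest
  outputs:
    vllm_version: ${{ steps.get_version.outputs.version }}
  steps:
    - uses: actions/checkout@v4
    - name: "Extract pinned vLLM version"
      id: get_version
      # assumes a vllm==X.Y.Z pin in pyproject.toml; fall back to main if absent
      run: |
        VERSION=$(grep -oP 'vllm==\K[0-9.]+' pyproject.toml || echo "main")
        echo "version=${VERSION}" >> "$GITHUB_OUTPUT"

The test job's matrix could then consume needs.resolve-vllm-version.outputs.vllm_version instead of a hard-coded version string.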
* [CI] Test against vllm/main
  - Install Python package on GH action runner instead of building Docker image
  - Use uv instead of pip as package manager
  - Use matrix to test against different build targets
  - Use matrix strategy to run v0 and v1 tests in parallel
  Resolves vllm-project#41
* review updates
  - Remove extra uv settings: don't prefer PyTorch package index, don't force PyPI for markupsafe package
  - Download HF models in parallel
* cache HF models
* update to vLLM:v0.8.3
* don't explicitly set vLLM 0.8.x version

---------

Signed-off-by: Christian Kadner <[email protected]>
Changes:

- Install Python package on GH action runner instead of building Docker image
- Use uv instead of pip as package manager
- Use matrix to test against different build targets
- Use matrix strategy to run v0 and v1 tests in parallel

Resolves #41

TODO:

- remove test-spyre as merge requirement
- add Test/V0 (vLLM:0.8.3), Test/V1 (vLLM:0.8.3), Test/V0 (vLLM:main), Test/V1 (vLLM:main) as a required check