
Conversation

@joerunde
Collaborator

Description

In an effort to reduce test runtime, this uses functools.lru_cache to cache text generation results from hf transformers. This should shave about 15% off the cpu runtime based on some quick measurements on an M3.

NB: This probably will not work when running with --forked, but we don't fork the tests on github actions runs.
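
For illustration, here is a minimal sketch of the approach, assuming a hypothetical helper that wraps transformers generation (the function name, model name, and parameters are placeholders, not the actual test code):

import functools

from transformers import AutoModelForCausalLM, AutoTokenizer

@functools.lru_cache
def generate_hf_output(model_name: str, prompt: str, max_new_tokens: int) -> str:
    """Run a prompt through a transformers model; repeated calls with the same
    (model_name, prompt, max_new_tokens) are served from the in-process cache."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

Because the cache lives in the test process's memory, forked workers (as with --forked) each start with an empty cache, which is why the caveat above matters.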

@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure your code passes all the linting checks, otherwise your PR can't be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@jberkhahn
Collaborator

Looks good to me! I'm seeing the same intermittent test failures on my PR, though; not sure what they're from?

Collaborator

@maxdebayser maxdebayser left a comment

nice!

@joerunde
Collaborator Author

@jberkhahn the failures on all vLLM:main jobs are expected; they're just a signal that something has changed in vllm that we need to address. Catching those early like this helps us be ready when a new vllm version is released.

It does look like I missed a list -> tuple conversion in the continuous batching tests though :(
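
For context: functools.lru_cache hashes its arguments to build the cache key, so list arguments raise TypeError: unhashable type: 'list' and must be converted to tuples first. A minimal illustration with hypothetical names, not the suite's actual code:

import functools

@functools.lru_cache
def cached_generate(model_name: str, prompts: tuple[str, ...], max_tokens: int):
    ...  # run the prompts through transformers here

prompts = ["Hello, my name is", "The capital of France is"]
# Passing the list directly would fail because lists aren't hashable,
# so convert it to a tuple before calling the cached function.
cached_generate("my-test-model", tuple(prompts), 20)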

# This uses lru_cache to cache the generated text so that we don't have to
# always load and run the transformers model, nor manage a set of files of
# expected results.
@functools.lru_cache
Collaborator

Should we limit the cache size?

https://docs.python.org/3/library/functools.html#functools.lru_cache

@functools.lru_cache(maxsize=128)

Though I don't suppose there's a reason to be concerned about a growing cache?

Collaborator Author

Yeah, I wasn't too concerned because this should be caching relatively small objects.

@joerunde
Collaborator Author

🤔 This also doesn't appear to be reducing the runtime on github actions at all, which is a bit odd. Something seems off

Signed-off-by: Joe Runde <[email protected]>
@prashantgupta24
Collaborator

🤔 This also doesn't appear to be reducing the runtime on github actions at all, which is a bit odd. Something seems off

I was hoping the unhashable error was the reason the caching wasn't working

@joerunde
Collaborator Author

Ah, so the reduction in time here is actually much better when the static batching and continuous batching tests are run together, since they use the same prompts and would share the cache of expected results. Run separately, there are far fewer cache hits :(

Also, tests that generate specific-length prompts on the fly, like the tkv scheduler tests, don't benefit either. So the speedup here isn't great, but it could be better if we manage to do things like:

  • Run sb and cb tests together
  • Standardize the prompts we run more across tests
  • Pre-fetch prompts to be used in the suite and run them through hf up-front (see the sketch below)
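
As a rough sketch of the pre-fetch idea, a session-scoped pytest fixture could push a standardized prompt set through transformers once per run (the fixture name, prompt constant, and helper here are hypothetical, building on the sketch in the description above):

import pytest

# Hypothetical shared prompt set, reused by both static and continuous
# batching tests so their expected results hit the same cache entries.
STANDARD_PROMPTS = (
    "Hello, my name is",
    "The capital of France is",
)

@pytest.fixture(scope="session")
def hf_expected_results():
    """Run every standard prompt through transformers once per test session."""
    # generate_hf_output is the cached helper sketched in the description.
    return {
        prompt: generate_hf_output("my-test-model", prompt, max_new_tokens=20)
        for prompt in STANDARD_PROMPTS
    }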

@joerunde
Collaborator Author

Okay, swapping this over to use a file-based cache, which should avoid loading the model with transformers at all in these test runs.
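
Roughly, the shape of such a file-backed cache, inferred from the snippets visible in this review (the class name and details here are assumptions, not necessarily the merged implementation):

import json
from pathlib import Path

class HFResultCache:
    """JSON-file-backed cache of expected results, keyed model -> prompt -> max_tokens."""

    def __init__(self, cache_file: Path):
        self.cache_file = cache_file
        self.cached_results = (json.loads(cache_file.read_text())
                               if cache_file.exists() else {})
        self.dirty = False

    def write_cache(self) -> None:
        """Persist to disk, but only if new results were added this run."""
        if self.dirty:
            with open(self.cache_file, "w") as f:
                json.dump(self.cached_results, f)
            self.dirty = False

    def get_cached_result(self, model: str, prompt: str, max_tokens: int):
        # JSON object keys are always strings, so max_tokens is stringified.
        return self.cached_results.get(model, {}).get(prompt,
                                                      {}).get(str(max_tokens), {})

    def add_to_cache(self, model: str, prompt: str, max_tokens: int,
                     result) -> None:
        self.cached_results.setdefault(model, {}).setdefault(
            prompt, {})[str(max_tokens)] = result
        self.dirty = True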

Signed-off-by: Joe Runde <[email protected]>
json.dump(self.cached_results, f)
self.dirty = False

def get_cached_result(self, model: str, prompt: str,
Collaborator

Shouldn't the type annotation for prompt be Union[str, list[int]]?

return self.cached_results.get(model, {}).get(prompt,
{}).get(max_tokens, {})

def add_to_cache(self, model: str, prompt: str, max_tokens: int,
Collaborator

Same comment about the prompt type annotation

"""Use a string to represent a list of token ids, so that it can be
hashed and used as a json key."""

return "__tokens__" + "_".join(str(token_id) for token_id in token_ids)
Collaborator

nice, no tokenizer required.

Collaborator

@maxdebayser maxdebayser left a comment

LGTM. The failing main tests are due to a new upstream change fixed in #380

@joerunde joerunde enabled auto-merge (squash) August 14, 2025 19:25
@github-actions github-actions bot added the ready label Aug 14, 2025
@joerunde joerunde merged commit e6c0d33 into main Aug 14, 2025
23 checks passed
@joerunde joerunde deleted the cache-hf-results branch August 14, 2025 19:32