⚡ cache hf results in tests #373
Conversation
Signed-off-by: Joe Runde <[email protected]>
👋 Hi! Thank you for contributing to vLLM support on Spyre. Now you are good to go 🚀
Looks good to me! I'm seeing the same intermittent test failures on my PR though, not sure what they're from?
nice!
@jberkhahn the failures on all … It does look like I missed a list -> tuple conversion in the continuous batching tests though :(
tests/spyre_util.py
Outdated
# This uses lru_cache to cache the generated text so that we don't have to
# always load and run the transformers model, nor manage a set of files of
# expected results.
@functools.lru_cache
Should we limit the cache size?
https://docs.python.org/3/library/functools.html#functools.lru_cache
@functools.lru_cache(maxsize=128)
Though, I don't suppose there's a reason to be concerned about a growing cache?
Yeah I wasn't too concerned because this should be caching relatively small objects
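For reference, a minimal sketch of the two decorator forms being discussed; the helper name and body below are placeholders for illustration, not the actual test code:

import functools

# Unbounded cache: every distinct (model, prompt, max_tokens) result is kept
# for the life of the test process.
@functools.lru_cache
def expected_text(model: str, prompt: str, max_tokens: int) -> str:
    return f"placeholder result for {model}/{prompt} ({max_tokens} tokens)"

# Bounded cache: at most 128 entries are retained; least-recently-used entries
# are evicted once the limit is hit.
@functools.lru_cache(maxsize=128)
def expected_text_bounded(model: str, prompt: str, max_tokens: int) -> str:
    return f"placeholder result for {model}/{prompt} ({max_tokens} tokens)"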
🤔 This also doesn't appear to be reducing the runtime on github actions at all, which is a bit odd. Something seems off
Signed-off-by: Joe Runde <[email protected]>
I was hoping the unhashable error was the reason the caching wasn't working
Ah, so the reduction in time here is actually much better when static batching and continuous batching tests are run together, since they use the same prompts and would share cache for the expected results. Separated, there are far fewer cache hits :( Also the tests like the tkv scheduler tests that generate specific-length prompts on the fly don't benefit either. So the speedup here isn't great, but could work better if we can manage to do things like:
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Okay, swapping this over to use a file-based cache, which should avoid loading the model w/ transformers at all in these test runs
Signed-off-by: Joe Runde <[email protected]>
tests/hf_result_cache.py
Outdated
json.dump(self.cached_results, f)
self.dirty = False

def get_cached_result(self, model: str, prompt: str,
Shouldn't the type annotation for prompt be Union[str, list[int]]?
tests/hf_result_cache.py
Outdated
return self.cached_results.get(model, {}).get(prompt,
                                              {}).get(max_tokens, {})

def add_to_cache(self, model: str, prompt: str, max_tokens: int,
Same comment about the prompt type annotation
| """Use a string to represent a list of token ids, so that it can be | ||
| hashed and used as a json key.""" | ||
|
|
||
| return "__tokens__" + "_".join(str(token_id) for token_id in token_ids) |
nice, no tokenizer required.
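Pulling the snippets above together, here is a rough sketch of how such a file-backed cache could look; the class name, default file path, and exact JSON layout are assumptions for illustration, not necessarily what this PR ships:

import json
from pathlib import Path
from typing import Union


class HFResultCache:
    """File-backed cache of expected HF transformers generations, keyed by
    model -> prompt -> max_tokens and stored as JSON."""

    def __init__(self, cache_file: Union[str, Path] = "hf_cache.json"):
        self.cache_file = Path(cache_file)
        if self.cache_file.exists():
            self.cached_results = json.loads(self.cache_file.read_text())
        else:
            self.cached_results = {}
        self.dirty = False

    @staticmethod
    def _prompt_key(prompt: Union[str, list[int]]) -> str:
        """Use a string to represent a list of token ids, so that it can be
        used as a json key."""
        if isinstance(prompt, str):
            return prompt
        return "__tokens__" + "_".join(str(token_id) for token_id in prompt)

    def get_cached_result(self, model: str, prompt: Union[str, list[int]],
                          max_tokens: int):
        # JSON object keys are always strings, so max_tokens is stringified.
        return self.cached_results.get(model, {}).get(
            self._prompt_key(prompt), {}).get(str(max_tokens), {})

    def add_to_cache(self, model: str, prompt: Union[str, list[int]],
                     max_tokens: int, result) -> None:
        self.cached_results.setdefault(model, {}).setdefault(
            self._prompt_key(prompt), {})[str(max_tokens)] = result
        self.dirty = True

    def write_cache(self) -> None:
        # Only rewrite the JSON file if something new was added.
        if self.dirty:
            with open(self.cache_file, "w") as f:
                json.dump(self.cached_results, f)
            self.dirty = False

On a cache hit the tests never need to load or run the transformers model; on a miss they fall through to a real generation and write the new result back to the JSON file.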
LGTM. The failing main tests are due to a new upstream change fixed in #380
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Description
In an effort to reduce test runtime, this uses functools.lru_cache to cache text generation results from hf transformers. This should shave about 15% off the cpu runtime based on some quick measurements on an M3.

NB: This probably will not work when running with --forked, but we don't fork the tests on github actions runs.
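As a rough illustration of the lru_cache approach described above (the function name and body are made up for the example), note that lru_cache only accepts hashable arguments, which is why token-id prompts need the list -> tuple conversion mentioned earlier in the thread:

import functools

@functools.lru_cache
def cached_hf_generation(model: str, prompt: str, max_tokens: int) -> str:
    # In the real tests this is where transformers would load the model and
    # generate text; a plain string stands in here so the example runs.
    return f"<generated text for {prompt!r} on {model} ({max_tokens} tokens)>"

# Lists are unhashable, so token-id prompts have to be converted to tuples
# (or encoded as strings) before they can be used as cache keys.
token_prompt = tuple([101, 2023, 2003, 102])
print(cached_hf_generation("some-model", str(token_prompt), max_tokens=8))
print(cached_hf_generation("some-model", str(token_prompt), max_tokens=8))
print(cached_hf_generation.cache_info())  # hits=1 after the repeated call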