🐛 fix static scheduling issues with long prompts #206
Conversation
Signed-off-by: Joe Runde <[email protected]>
👋 Hi! Thank you for contributing to vLLM support on Spyre.
Or this can be done with
Now you are good to go 🚀
Looks good, I'll get this built and tested on Power. Thanks Joe.
Nice simple fix and well-written test. LGTM!
Very nice catch and fix, Joe. Left a minor comment :)
tests/e2e/test_spyre_basic.py (Outdated)
vllm_sampling_params = SamplingParams(max_tokens=20,
                                      temperature=0,
                                      stop="1",
I guess we don't need `stop="1"` here, looks like a copy-paste relic :)
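For illustration only, the suggested cleanup would look roughly like the sketch below. This is just an illustration of the reviewer's point, not the final committed test code; the variable name is taken from the snippet above and `SamplingParams` is vLLM's standard sampling-parameters class.

```python
from vllm import SamplingParams

# Greedy decoding with a fixed token budget; the stray stop="1" left over
# from the copied example is dropped, as suggested in the review comment.
vllm_sampling_params = SamplingParams(max_tokens=20, temperature=0)
```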
Signed-off-by: Joe Runde <[email protected]>
This fixes an issue where batches of requests with long prompts would not be properly scheduled. Even with a long queue of requests, smaller-than-full batches would be scheduled, because requests would be rejected from the schedule once the total number of prompt tokens was >= `--max-num-batched-tokens`.

The `--max-num-batched-tokens` config is designed for chunking up prompts for chunked prefill, and isn't relevant for static batching. This PR sets `--max-num-batched-tokens` to the maximum number of prompt tokens that could be in a full batch, so that the chunked prefill logic doesn't prevent us from creating large batches.

I missed this before because it requires a lower-level test that invokes the scheduler directly in order to ensure that the full batches are actually scheduled.
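For context, the idea behind the fix can be sketched roughly as follows. This is a hedged illustration rather than the PR's actual diff: the helper name `adjust_max_num_batched_tokens` and the exact place the override happens in the plugin are assumptions; only the attribute names (`scheduler_config.max_num_seqs`, `scheduler_config.max_num_batched_tokens`, `model_config.max_model_len`) come from vLLM's config objects.

```python
def adjust_max_num_batched_tokens(vllm_config) -> None:
    # vllm_config is assumed to be vLLM's VllmConfig, which carries both
    # scheduler_config and model_config; where the plugin actually applies
    # this override is not shown in the PR description above.
    scheduler_config = vllm_config.scheduler_config
    model_config = vllm_config.model_config

    # With static batching, a full batch holds at most max_num_seqs requests,
    # each contributing at most max_model_len prompt tokens.
    max_prompt_tokens_per_batch = (
        scheduler_config.max_num_seqs * model_config.max_model_len
    )

    # Raise the token budget so the chunked-prefill check can never reject
    # a request that would otherwise fit into a full static batch.
    scheduler_config.max_num_batched_tokens = max_prompt_tokens_per_batch
```

A scheduler-level test, as the description mentions, can then queue more long-prompt requests than `max_num_seqs` and assert that each scheduling step actually produces a full batch.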