[cb] scheduler heuristic 2: unblock long prompts #440
Conversation
Signed-off-by: Yannick Schnider <[email protected]>

bot:test
vllm_spyre/v1/core/scheduler.py (outdated)

```python
if not self.batch_is_locked and self.can_schedule(
        self.holdback_queue[0]):
```
shouldn't this be tested directly in the can_schedule() function? Maybe it could be the first condition checked, returning False directly if it fails.
I guess it could also go at the top of can_schedule(), true. Having it here is less code and avoids jumping into can_schedule() when we already know it is going to return False. The way I interpret can_schedule(req) is as a check of whether request req could be scheduled with the current decode batch. The flag batch_is_locked was set by yet another request (not by req, nor by any request in self.running), so this case can be treated outside of can_schedule(). But my opinion is not very strong here.
you decide; my thought was that the decision of scheduling or not should be entirely in one place. But I see your point as well.
> Having it here is less code and avoids jumping into can_schedule() if we already know it is gonna return False

Couldn't it just be the first thing we check in can_schedule()?
moved it
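The refactor agreed on above might look roughly like this (a minimal sketch with a stub class, not the actual vllm-spyre code — the real can_schedule() performs additional per-request checks):

```python
class Scheduler:
    """Minimal stub illustrating the lock-first check discussed above."""

    def __init__(self):
        self.batch_is_locked = False
        self.running = []

    def can_schedule(self, request) -> bool:
        # Lock check first: if another request has locked the batch,
        # nothing new can be scheduled, regardless of `request` itself.
        if self.batch_is_locked:
            return False
        # ... remaining checks against the current decode batch go here ...
        return True
```

This keeps the whole scheduling decision in one place, at the cost of always entering can_schedule() even when the lock would short-circuit it.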
Signed-off-by: Yannick Schnider <[email protected]>
vllm_spyre/envs.py (outdated)

```python
# Prefills waiting longer than VLLM_SPYRE_MAX_WAITING_TIME_PREFILL
# seconds will have priority after the current decode batch has finished.
"VLLM_SPYRE_MAX_WAITING_TIME_PREFILL":
lambda: int(os.getenv("VLLM_SPYRE_MAX_WAITING_TIME_PREFILL", "-1")),
```
Could this also be a float so that the user can specify 0.5 for 500ms?
good point, I just changed that
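The resulting change amounts to parsing the variable as a float instead of an int. A sketch of the idea (the function name here is hypothetical; the real code uses a lambda in the environment-variable table):

```python
import os

# Read the variable as a float so users can specify sub-second
# values like 0.5 for 500 ms. "-1" keeps the heuristic disabled
# by default.
def max_waiting_time_prefill() -> float:
    return float(os.getenv("VLLM_SPYRE_MAX_WAITING_TIME_PREFILL", "-1"))
```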
Signed-off-by: Yannick Schnider <[email protected]>
bot:test
A couple of minor comments, but looks clean to me.
tests/spyre_util.py (outdated)

```diff
 sampling_params=sampling_params,
 eos_token_id=None,
-arrival_time=0,
+arrival_time=time.time(),
```
I would suggest using time.monotonic() instead, to avoid issues with clock adjustments (NTP, daylight saving time, etc.).
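The distinction the reviewer is pointing at: time.time() follows the wall clock and can jump backwards or forwards, while time.monotonic() only ever moves forward, so elapsed-time measurements stay non-negative. A small illustration:

```python
import time

# time.time() can jump (NTP corrections, DST-adjusted local clocks),
# which could make a "waiting time" negative or wildly wrong.
# time.monotonic() is guaranteed to never go backwards, so it is the
# safer choice for measuring how long a request has been waiting.
arrival = time.monotonic()
# ... request waits in the queue ...
waited = time.monotonic() - arrival  # always >= 0
```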
vllm_spyre/envs.py (outdated)

```python
# scheduling heuristic: maximal waiting (blocking) time for prefill
# Prefills waiting longer than VLLM_SPYRE_MAX_WAITING_TIME_PREFILL
# seconds will have priority after the current decode batch has finished.
"VLLM_SPYRE_MAX_WAITING_TIME_PREFILL":
```
The name should reflect the units of time being used (e.g., VLLM_SPYRE_MAX_WAITING_TIME_SECONDS or similar). Should we also consider using an integer instead of a float?
I see that int vs float has already been considered - please ignore that part.
vllm_spyre/v1/core/scheduler.py (outdated)

```python
if not self.batch_is_locked and self.can_schedule(
        self.holdback_queue[0]):
```
> Having it here is less code and avoids jumping into can_schedule() if we already know it is gonna return False

Couldn't it just be the first thing we check in can_schedule()?
Signed-off-by: Yannick Schnider <[email protected]>
LGTM
[don't merge yet, I found something...]

bot:test

hey @joerunde, since spyre-ci is currently also failing on main, I need you again to force merge this (and maybe have a look first :)

bot:test
```python
# longer than VLLM_SPYRE_MAX_WAITING_TIME_SECONDS, we cannot
# schedule the current sequence until we have served this request
if self.batch_is_locked:
    return False
```
instead of locking the batch entirely, shouldn't we just disallow any skipping of requests in the queue until the request at the head of the waiting queue schedules?
I haven't followed super closely but my assumption is that the blocked request may be able to be scheduled before the full batch finishes. E.g. with the 128k limit, a 64k request could potentially schedule once the batch has drained down to a single other request, so we wouldn't need to wait for the last one to finish.
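The suggested refinement could be sketched as retrying the head-of-queue request on every scheduling step instead of waiting for the full batch to drain. This is a hypothetical illustration, not vllm-spyre code: `fits(request, running)` stands in for the real capacity check (e.g. combined context length under the 128k limit).

```python
def try_schedule_blocked(head_request, running, fits):
    """Return True if the blocked head-of-queue request fits now.

    Called on every scheduling step: a 64k request may fit once the
    batch has drained down to a single other request, so we need not
    wait for the last running request to finish.
    """
    if fits(head_request, running):
        return True  # schedule now; no full drain required
    return False
```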
great idea! I will certainly address that in a follow up. We wanted to keep the first version as simple and fail-proof as possible.
### [CB] 🧹 moving VLLM_SPYRE_MAX_WAITING_TIME_SECONDS to dev branch

To fully benefit from this feature, we have to enable skipping sequences in the waiting queue (breaking the FIFO order). This will be explored on the feature branch [dev-scheduler-allow-skip](https://github.com/vllm-project/vllm-spyre/tree/dev-scheduler-allow-skip). Therefore, cleaning up main by reverting PR #440 here.

Signed-off-by: Yannick Schnider <[email protected]>
[cb] scheduler heuristic 2: unblock long prompts

Introducing VLLM_SPYRE_MAX_WAITING_TIME_PREFILL, an upper bound on the waiting time [sec] of any request. After a request has waited longer than this, the current decode batch is locked and will finish decoding. The request will then either be added to that locked batch or prefilled into a new, exclusive locked batch.
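The heuristic described above can be sketched as follows. This is a simplified illustration under assumed names (`check_waiting_queue`, the dict-based request representation, and the hard-coded threshold are all hypothetical), not the actual scheduler code:

```python
import time

# Assumed threshold in seconds; in the real code this comes from the
# VLLM_SPYRE_MAX_WAITING_TIME_PREFILL environment variable, where -1
# disables the heuristic.
MAX_WAITING_TIME_PREFILL = 10.0

def check_waiting_queue(waiting, now=None):
    """Lock the decode batch if any waiting request has waited too long.

    Once locked, no other waiting request is scheduled: the batch
    finishes decoding, then the long-waiting request is served.
    """
    now = time.monotonic() if now is None else now
    for req in waiting:
        if now - req["arrival_time"] > MAX_WAITING_TIME_PREFILL:
            return True  # batch_is_locked
    return False
```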