
Conversation

@yannicks1
Collaborator

@yannicks1 yannicks1 commented Sep 4, 2025

[cb] scheduler heuristic 2: unblock long prompts

Introducing VLLM_SPYRE_MAX_WAITING_TIME_PREFILL, an upper bound on the waiting time [sec] of any request. After a request has waited longer than this bound, the current decode batch is locked and allowed to finish decoding. The request is then either added to that locked batch or prefilled into a new, exclusive locked batch.
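
A minimal sketch of the heuristic as described, under stated assumptions: MAX_WAIT, SchedulerSketch, check_waiting_times, and arrival_time are illustrative names, while holdback_queue and batch_is_locked follow the diff snippets quoted later in this thread.

import os
import time
from collections import deque

# A negative bound disables the heuristic (matching the "-1" default).
MAX_WAIT = float(os.getenv("VLLM_SPYRE_MAX_WAITING_TIME_PREFILL", "-1"))

class SchedulerSketch:

    def __init__(self):
        self.holdback_queue = deque()  # requests waiting to be prefilled
        self.batch_is_locked = False  # a locked batch accepts no new requests

    def check_waiting_times(self):
        if MAX_WAIT < 0 or not self.holdback_queue:
            return
        oldest = self.holdback_queue[0]
        if time.monotonic() - oldest.arrival_time > MAX_WAIT:
            # Lock the current decode batch: it finishes decoding, then the
            # starved request is prefilled with priority (either joining the
            # locked batch or getting a new exclusive one).
            self.batch_is_locked = True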

@github-actions

github-actions bot commented Sep 4, 2025

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure that your code passes all the linting checks, otherwise your PR can't be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

Signed-off-by: Yannick Schnider <[email protected]>
@yannicks1 yannicks1 self-assigned this Sep 5, 2025
Signed-off-by: Yannick Schnider <[email protected]>
@yannicks1 yannicks1 changed the title [WIP][cb] scheduler heuristic 2: unblock long prompts [cb] scheduler heuristic 2: unblock long prompts Sep 5, 2025
@yannicks1 yannicks1 marked this pull request as ready for review September 5, 2025 15:32
@yannicks1
Collaborator Author

bot:test

Comment on lines 201 to 202
if not self.batch_is_locked and self.can_schedule(
self.holdback_queue[0]):
Collaborator

shouldn't this be tested directly in the can_schedule() function? Maybe it could be the first condition checked, returning False directly if it fails

Collaborator Author

I guess it could also go at the top of can_schedule(), true. Having it here is less code and avoids jumping into can_schedule() when we already know it is going to return False. The way I interpret can_schedule(req) is as a check of whether request req could be scheduled with the current decode batch. The flag batch_is_locked was set by yet another request (not by req nor by any request in self.running), so this case can be treated outside of can_schedule(). But my opinion is not very strong here.

Collaborator

You decide; my thought was that the decision of whether or not to schedule should live entirely in one place. But I see your point also.

Member

Having it here is less code and avoids jumping into can_schedule() when we already know it is going to return False

Couldn't it just be the first thing we check in can_schedule ?

Collaborator Author

moved it
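
For reference, a hedged sketch of the resolution, with the lock check as the first condition inside can_schedule(); the surrounding class and the remaining checks are elided and illustrative.

def can_schedule(self, request) -> bool:
    # A locked batch accepts no new requests, regardless of capacity.
    if self.batch_is_locked:
        return False
    # ... the existing capacity / token-budget checks follow here ...
    return True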

@yannicks1 yannicks1 requested a review from tdoublep September 8, 2025 17:49
# Prefills waiting longer than VLLM_SPYRE_MAX_WAITING_TIME_PREFILL
# seconds will have priority after the current decode batch has finished.
"VLLM_SPYRE_MAX_WAITING_TIME_PREFILL":
lambda: int(os.getenv("VLLM_SPYRE_MAX_WAITING_TIME_PREFILL", "-1")),
Collaborator

Could this also be a float so that the user can specify 0.5 for 500ms?

Collaborator Author

Good point, I just changed that.
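
A sketch of that change, assuming the same env-handling pattern as the quoted diff; int becomes float so sub-second values like 0.5 (500 ms) are accepted:

"VLLM_SPYRE_MAX_WAITING_TIME_PREFILL":
lambda: float(os.getenv("VLLM_SPYRE_MAX_WAITING_TIME_PREFILL", "-1")),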

@sducouedic
Collaborator

bot:test

Member

@tdoublep tdoublep left a comment


Couple of minor comments but looks clean to me

sampling_params=sampling_params,
eos_token_id=None,
arrival_time=0,
arrival_time=time.time(),
Member

I would suggest using time.monotonic() instead, to avoid issues with daylight saving time, clock adjustments, etc.
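
For illustration (not from the PR): time.monotonic() is a clock that can never jump backwards due to DST or NTP adjustments, which makes it the safer choice for measuring elapsed waiting time.

import time

arrival = time.monotonic()  # record when the request arrived
# ... request waits in the queue ...
waited = time.monotonic() - arrival  # elapsed seconds, immune to clock jumps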

# scheduling heuristic: maximal waiting (blocking) time for prefill
# Prefills waiting longer than VLLM_SPYRE_MAX_WAITING_TIME_PREFILL
# seconds will have priority after the current decode batch has finished.
"VLLM_SPYRE_MAX_WAITING_TIME_PREFILL":
Member

The name should reflect the units of time that are being used (e.g., VLLM_SPYRE_MAX_WAITING_TIME_SECONDS) or something. Should we also consider using an integer instead of a float?

Member

I see that int vs float has already been considered - please ignore that part.


Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
@yannicks1 yannicks1 requested a review from tdoublep September 9, 2025 08:41
Member

@tdoublep tdoublep left a comment

LGTM

@yannicks1
Collaborator Author

yannicks1 commented Sep 9, 2025

[don't merge yet, I found something...]
False alarm, all behaves as intended. Ready to merge.

@yannicks1 yannicks1 enabled auto-merge (squash) September 9, 2025 09:03
@yannicks1
Collaborator Author

bot:test

@github-actions github-actions bot added the ready label Sep 9, 2025
@yannicks1
Collaborator Author

Hey @joerunde, since spyre-ci is currently failing on main as well, I need you again to force-merge this (and maybe have a look first :)

@yannicks1
Collaborator Author

bot:test

@yannicks1
Collaborator Author

bot:test

# longer than VLLM_SPYRE_MAX_WAITING_TIME_SECONDS, we cannot
# schedule the current sequence until we have served this request
if self.batch_is_locked:
    return False
Collaborator

instead of locking the batch entirely, shouldn't we just disallow any skipping of requests in the queue until the request at the head of the waiting queue schedules?

I haven't followed super closely but my assumption is that the blocked request may be able to be scheduled before the full batch finishes. E.g. with the 128k limit, a 64k request could potentially schedule once the batch has drained down to a single other request, so we wouldn't need to wait for the last one to finish.
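
A hedged sketch of this alternative (schedulable_candidates and request_is_starved are hypothetical names): instead of locking the batch, skipping past a starved head-of-queue request is forbidden until it schedules.

def schedulable_candidates(self):
    head = self.holdback_queue[0]
    if self.request_is_starved(head):  # hypothetical starvation check
        # Strict FIFO while starved: only the head may be considered; it
        # schedules as soon as the running batch has drained enough.
        return [head] if self.can_schedule(head) else []
    # Otherwise requests may be scheduled out of order as usual.
    return [r for r in self.holdback_queue if self.can_schedule(r)]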

Collaborator Author

Great idea! I will certainly address that in a follow-up. We wanted to keep the first version as simple and fail-proof as possible.

@yannicks1 yannicks1 disabled auto-merge September 10, 2025 20:44
@yannicks1 yannicks1 merged commit 2dcb70a into main Sep 10, 2025
16 of 25 checks passed
@yannicks1 yannicks1 deleted the ysc-unblock-long-prompts branch September 10, 2025 20:45
yannicks1 added a commit that referenced this pull request Sep 12, 2025
### [CB] 🧹 moving VLLM_SPYRE_MAX_WAITING_TIME_SECONDS to dev branch

To fully benefit from this feature, we have to enable skipping sequences
in the waiting queue (breaking the FIFO order). This will be explored on
the feature branch
[dev-scheduler-allow-skip](https://github.com/vllm-project/vllm-spyre/tree/dev-scheduler-allow-skip).
Therefore, cleaning up main by reverting PR #440 here.

Signed-off-by: Yannick Schnider <[email protected]>