[cb] scheduler heuristic 2: unblock long prompts #440
Conversation
Signed-off-by: Yannick Schnider <[email protected]>

bot:test
vllm_spyre/v1/core/scheduler.py (outdated)

```python
if not self.batch_is_locked and self.can_schedule(
        self.holdback_queue[0]):
```
shouldn't this be tested directly in the can_schedule() function? Maybe it could be the first condition checked, returning False directly if it fails.
I guess it could also go at the top of can_schedule(), true. Having it here is less code and avoids jumping into can_schedule() when we already know it is going to return False. The way I interpret can_schedule(req) is as a check of whether request req could be scheduled with the current decode batch. The flag batch_is_locked was set by yet another request (not by req, nor by any request in self.running), so this case can be treated outside of can_schedule(). But my opinion is not very strong here.
you decide; my thought was that the decision of scheduling or not should be entirely in one place. But I see your point as well.
> Having it here is less code and avoids jumping into can_schedule() if we already know it is gonna return False

Couldn't it just be the first thing we check in can_schedule()?
moved it
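The refactor agreed on above might look roughly like this (a minimal sketch with a stub class, not the actual vllm-spyre code — the real can_schedule() performs additional per-request checks):

```python
class Scheduler:
    """Minimal stub illustrating the lock-first check discussed above."""

    def __init__(self):
        self.batch_is_locked = False
        self.running = []

    def can_schedule(self, request) -> bool:
        # Lock check first: if another request has locked the batch,
        # nothing new can be scheduled, regardless of `request` itself.
        if self.batch_is_locked:
            return False
        # ... remaining checks against the current decode batch go here ...
        return True
```

This keeps the whole scheduling decision in one place, at the cost of always entering can_schedule() even when the lock would short-circuit it.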
Signed-off-by: Yannick Schnider <[email protected]>
vllm_spyre/envs.py (outdated)

```python
# Prefills waiting longer than VLLM_SPYRE_MAX_WAITING_TIME_PREFILL
# seconds will have priority after the current decode batch has finished.
"VLLM_SPYRE_MAX_WAITING_TIME_PREFILL":
lambda: int(os.getenv("VLLM_SPYRE_MAX_WAITING_TIME_PREFILL", "-1")),
```
Could this also be a float so that the user can specify 0.5 for 500ms?
good point, I just changed that
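The resulting change amounts to parsing the variable as a float instead of an int. A sketch of the idea (the function name here is hypothetical; the real code uses a lambda in the environment-variable table):

```python
import os

# Read the variable as a float so users can specify sub-second
# values like 0.5 for 500 ms. "-1" keeps the heuristic disabled
# by default.
def max_waiting_time_prefill() -> float:
    return float(os.getenv("VLLM_SPYRE_MAX_WAITING_TIME_PREFILL", "-1"))
```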
Signed-off-by: Yannick Schnider <[email protected]>
bot:test
A couple of minor comments, but looks clean to me.
tests/spyre_util.py (outdated)

```diff
 sampling_params=sampling_params,
 eos_token_id=None,
-arrival_time=0,
+arrival_time=time.time(),
```
I would suggest using time.monotonic() instead, to avoid issues with clock adjustments (NTP, daylight saving time, etc.).
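The distinction the reviewer is pointing at: time.time() follows the wall clock and can jump backwards or forwards, while time.monotonic() only ever moves forward, so elapsed-time measurements stay non-negative. A small illustration:

```python
import time

# time.time() can jump (NTP corrections, DST-adjusted local clocks),
# which could make a "waiting time" negative or wildly wrong.
# time.monotonic() is guaranteed to never go backwards, so it is the
# safer choice for measuring how long a request has been waiting.
arrival = time.monotonic()
# ... request waits in the queue ...
waited = time.monotonic() - arrival  # always >= 0
```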
vllm_spyre/envs.py (outdated)

```python
# scheduling heuristic: maximal waiting (blocking) time for prefill
# Prefills waiting longer than VLLM_SPYRE_MAX_WAITING_TIME_PREFILL
# seconds will have priority after the current decode batch has finished.
"VLLM_SPYRE_MAX_WAITING_TIME_PREFILL":
```
The name should reflect the units of time being used (e.g., VLLM_SPYRE_MAX_WAITING_TIME_SECONDS or similar). Should we also consider using an integer instead of a float?
I see that int vs float has already been considered - please ignore that part.
vllm_spyre/v1/core/scheduler.py (outdated)

```python
if not self.batch_is_locked and self.can_schedule(
        self.holdback_queue[0]):
```
> Having it here is less code and avoids jumping into can_schedule() if we already know it is gonna return False

Couldn't it just be the first thing we check in can_schedule()?
Signed-off-by: Yannick Schnider <[email protected]>
LGTM
[don't merge yet, I found something...]

bot:test

hey @joerunde, since spyre-ci is currently also failing on main, I need you again to force merge this (and maybe have a look first :)

bot:test
```python
# longer than VLLM_SPYRE_MAX_WAITING_TIME_SECONDS, we cannot
# schedule the current sequence until we have served this request
if self.batch_is_locked:
    return False
```
instead of locking the batch entirely, shouldn't we just disallow any skipping of requests in the queue until the request at the head of the waiting queue schedules?
I haven't followed super closely but my assumption is that the blocked request may be able to be scheduled before the full batch finishes. E.g. with the 128k limit, a 64k request could potentially schedule once the batch has drained down to a single other request, so we wouldn't need to wait for the last one to finish.
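The suggested refinement could be sketched as retrying the head-of-queue request on every scheduling step instead of waiting for the full batch to drain. This is a hypothetical illustration, not vllm-spyre code: `fits(request, running)` stands in for the real capacity check (e.g. combined context length under the 128k limit).

```python
def try_schedule_blocked(head_request, running, fits):
    """Return True if the blocked head-of-queue request fits now.

    Called on every scheduling step: a 64k request may fit once the
    batch has drained down to a single other request, so we need not
    wait for the last running request to finish.
    """
    if fits(head_request, running):
        return True  # schedule now; no full drain required
    return False
```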
great idea! I will certainly address that in a follow up. We wanted to keep the first version as simple and fail-proof as possible.
### [CB] 🧹 moving VLLM_SPYRE_MAX_WAITING_TIME_SECONDS to dev branch

To fully benefit from this feature, we have to enable skipping sequences in the waiting queue (breaking the FIFO order). This will be explored on the feature branch [dev-scheduler-allow-skip](https://github.com/vllm-project/vllm-spyre/tree/dev-scheduler-allow-skip). Therefore, cleaning up main by reverting PR #440 here.

Signed-off-by: Yannick Schnider <[email protected]>
[cb] scheduler heuristic 2: unblock long prompts

Introducing VLLM_SPYRE_MAX_WAITING_TIME_PREFILL, an upper bound on the waiting time [sec] of any request. After a request has waited longer than this, the current decode batch is locked and will finish decoding. The request will then either be added to that locked batch or prefilled into a new, exclusive locked batch.
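The heuristic described above can be sketched as follows. This is a simplified illustration under assumed names (`check_waiting_queue`, the dict-based request representation, and the hard-coded threshold are all hypothetical), not the actual scheduler code:

```python
import time

# Assumed threshold in seconds; in the real code this comes from the
# VLLM_SPYRE_MAX_WAITING_TIME_PREFILL environment variable, where -1
# disables the heuristic.
MAX_WAITING_TIME_PREFILL = 10.0

def check_waiting_queue(waiting, now=None):
    """Lock the decode batch if any waiting request has waited too long.

    Once locked, no other waiting request is scheduled: the batch
    finishes decoding, then the long-waiting request is served.
    """
    now = time.monotonic() if now is None else now
    for req in waiting:
        if now - req["arrival_time"] > MAX_WAITING_TIME_PREFILL:
            return True  # batch_is_locked
    return False
```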