[Bugfix][V0][V1] Fix crashes from cancelling requests #64

tjohnson31415 · 2025-03-28T20:46:59Z

Some fixes from testing handling of request cancellation:

In V0, guard against a KeyError in _req_ids2idx.
In V1, specialize the Scheduler's finish_requests() to handle the holdback_queue.
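
For readers unfamiliar with the two failure modes, here is a minimal sketch of the guard patterns described above. It is not the vllm-spyre code: `ToyV0Runner`, `ToyV1Scheduler`, and their methods are hypothetical stand-ins, and only the names `_req_ids2idx`, `waiting`, and `holdback_queue` are borrowed from the PR. It assumes the V0 crash came from looking up a request id that was already removed, and that the V1 crash came from removing a request from a queue that did not contain it.

```python
from collections import deque


class ToyV0Runner:
    """Hypothetical stand-in for code that maps request ids to batch slots."""

    def __init__(self):
        self._req_ids2idx = {}

    def add(self, req_id, idx):
        self._req_ids2idx[req_id] = idx

    def remove(self, req_id):
        # dict.pop with a default returns None instead of raising KeyError
        # when the id is already gone (for example, cancelled twice).
        return self._req_ids2idx.pop(req_id, None)


class ToyV1Scheduler:
    """Hypothetical scheduler holding requests in two queues."""

    def __init__(self):
        self.waiting = deque()
        self.holdback_queue = deque()

    def finish_request(self, request):
        # deque.remove raises ValueError when the item is absent, so try
        # each queue in turn and report whether the request was found.
        for queue in (self.waiting, self.holdback_queue):
            try:
                queue.remove(request)
                return True
            except ValueError:
                continue
        return False


runner = ToyV0Runner()
runner.add("req-a", 0)
first = runner.remove("req-a")   # the slot index
second = runner.remove("req-a")  # None, no KeyError

sched = ToyV1Scheduler()
sched.waiting.append("req-1")
sched.holdback_queue.append("req-2")
found_held = sched.finish_request("req-2")
found_missing = sched.finish_request("req-3")
```

The second half shows why a request sitting in a side queue needs special handling: a cancellation path that only looks in `waiting` would raise on `req-2`.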

github-actions · 2025-03-28T20:47:08Z

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes:

pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

joerunde · 2025-03-28T23:05:36Z

vllm_spyre/v1/core/scheduler.py

+            else:
+                # this try-except is the specialization for Spyre
+                try:
+                    self.holdback_queue.remove(request)


Ah dang, this is unfortunate.

I think we can fix this more simply by removing self.holdback_queue as an instance attribute and making it a local variable inside self.schedule() instead. After we schedule a new batch, we can take all the requests that we held back and put them back in self.waiting. Then we won't need to worry about breaking assumptions that the V1 scheduler makes.

That would let us get rid of the override on get_num_unfinished_requests as well
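
The suggestion above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged change: `ToyScheduler`, `can_schedule`, and the equal-length admission rule are hypothetical, and requests are plain strings. The point is that when held-back requests only live in a temporary queue for the duration of `schedule()`, every unscheduled request is back in `waiting` by the time a cancellation can arrive.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical scheduler; requests are plain strings in this sketch."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.waiting = deque()

    def can_schedule(self, request, batch):
        # Hypothetical admission rule: only batch requests of equal length.
        return not batch or len(request) == len(batch[0])

    def schedule(self):
        holdback = deque()  # temporary, local to this call
        batch = []
        while self.waiting and len(batch) < self.max_batch_size:
            request = self.waiting.popleft()
            if self.can_schedule(request, batch):
                batch.append(request)
            else:
                holdback.append(request)
        # extendleft reverses its input, so reverse first to put the
        # held-back requests at the front of waiting in original order.
        self.waiting.extendleft(reversed(holdback))
        return batch

    def finish_request(self, request):
        # Outside schedule(), unscheduled requests are only ever in waiting.
        try:
            self.waiting.remove(request)
        except ValueError:
            pass  # already scheduled or finished


sched = ToyScheduler(max_batch_size=2)
sched.waiting.extend(["aa", "bbb", "cc", "dddd"])
batch = sched.schedule()
after_schedule = list(sched.waiting)
sched.finish_request("bbb")  # cancel a request that had been held back
after_cancel = list(sched.waiting)
```

A side effect of this design is that counting unfinished requests no longer needs to consider a separate queue, which matches the remark about dropping the get_num_unfinished_requests override.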

tjohnson31415 · 2025-04-07

Done!
I left it as an instance variable so that we don't recreate the deque on every call. It is also still used in _handle_rejects, though there probably can't be a rejection during scheduling?

joerunde · 2025-04-08

Right, I think it should be safe to stop using it in _handle_rejects as well, since this should all be synchronous. The output processing that calls _handle_rejects can't run concurrently with scheduling a new forward pass.

I'll leave it up to you on merging now vs. pulling out of _handle_rejects as well. I think we'll be getting rid of this rejected request business soon anyway

* fix: add optional arg to abort_seq_group for compat with v0.8
* fix: guard against KeyError with _req_ids2idx
* fix: specialize finish_requests in V1 scheduler
* fix: check against None...
* refactor: make holdback queue use more temporary

Signed-off-by: Travis Johnson <[email protected]>

tjohnson31415 mentioned this pull request Mar 28, 2025

Request cancellation can cause the server to crash #36

Closed

tjohnson31415 force-pushed the fix-keyerror branch from 2e0e31e to 71435ab Compare March 28, 2025 22:37

joerunde reviewed Mar 28, 2025

tjohnson31415 force-pushed the fix-keyerror branch from 71435ab to 1aabb0b Compare April 7, 2025 20:08

tjohnson31415 added 5 commits April 8, 2025 11:33

fix: add optional arg to abort_seq_group for compat with v0.8

e8f624d

Signed-off-by: Travis Johnson <[email protected]>

fix: guard against KeyError with _req_ids2idx

87e573e

Signed-off-by: Travis Johnson <[email protected]>

fix: specialize finish_requests in V1 scheduler

8cf95eb

Signed-off-by: Travis Johnson <[email protected]>

fix: check against None...

8b659ea

Signed-off-by: Travis Johnson <[email protected]>

refactor: make holdback queue use more temporary

d837329

Signed-off-by: Travis Johnson <[email protected]>

tjohnson31415 force-pushed the fix-keyerror branch from 1aabb0b to d837329 Compare April 8, 2025 17:33

joerunde approved these changes Apr 8, 2025

tjohnson31415 merged commit 84e01fa into main Apr 8, 2025
7 checks passed

tjohnson31415 deleted the fix-keyerror branch April 8, 2025 19:36

joerunde mentioned this pull request Apr 8, 2025

[Continuous Batching] Support for continuous batching on AIU Spyre #66

Merged
