
Conversation

wallashss
Collaborator

@wallashss wallashss commented Mar 28, 2025

This PR addresses #55

Comparing vLLM v0.8.0 with upstream main, I observed that upstream main makes an extra call to spyre_model_runner with a scheduler_output in which only finished_request_ids is filled, carrying the last request so it can be cleaned up. On v0.8.0 this call did not happen, which looks wrong because the worker never got a chance to clean up the request data. If I understood correctly, this behavior was fixed in vllm-project/vllm#14388, which was also the basis for writing this PR to solve the issue.
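A minimal sketch of the cleanup path described above, assuming a hypothetical model runner that keeps per-request state keyed by request id; the class name, the _req_states attribute, and the finished_req_ids field are illustrative, not the actual vllm-spyre or vLLM API:

class SketchModelRunner:
    """Hypothetical model runner; only the cleanup path is sketched."""

    def __init__(self) -> None:
        # per-request bookkeeping, keyed by request id
        self._req_states: dict[str, dict] = {}

    def execute_model(self, scheduler_output) -> None:
        # The extra call described above: even when scheduler_output carries
        # only finished request ids (nothing new to run), the worker gets a
        # chance to drop the state of requests that just finished.
        for req_id in scheduler_output.finished_req_ids:
            self._req_states.pop(req_id, None)
        # ... the normal forward pass for scheduled requests would follow.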


👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes:

pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@joerunde
Collaborator

joerunde commented Mar 28, 2025

Sweeet, thanks @wallashss!
Are we now fully working on v1 again on post-0.8.2 vllm? If so what do you think about updating the CI install to vllm@5f063a8 (or later) so that it tests your new logic?

@wallashss
Collaborator Author

Are we now fully working on v1 again on post-0.8.2 vllm?

Probably, yes? I think we just have to make sure to communicate to the team that we are moving forward.

If so what do you think about updating the CI install to vllm@5f063a8 (or later) so that it tests your new logic?

That means updating Dockerfile.spyre, right?

@joerunde
Collaborator

That means updating Dockerfile.spyre, right?

Yeah, that way the CI running here has the latest v1 engine and can actually test out these changes

@wallashss
Collaborator Author

wallashss commented Mar 28, 2025

Yeah, that way the CI running here has the latest v1 engine and can actually test out these changes

Thanks, I just wanted to make sure I was looking at the right spot.

@joerunde
Collaborator

@wallashss I think the --depth 1 on the git clone there is causing it to not have that commit in history, and the subsequent git fetch is only fetching the tags

Signed-off-by: Wallas Santos <[email protected]>
@wallashss
Collaborator Author

@wallashss I think the --depth 1 on the git clone there is causing it to not have that commit in history, and the subsequent git fetch is only fetching the tags

Yeah, I think it's good now.

@joerunde
Collaborator

Nice, it looks like the test actually caught a problem!

Signed-off-by: Wallas Santos <[email protected]>
@wallashss
Collaborator Author

Hey @joerunde,

I think I need an opinion on the error that is making the test fail.

Just to recap, here is the log from the test:

INFO 03-31 18:39:09 [spyre_model_runner.py:385] Padding request of length 1 tokens to 64 tokens.
ERROR 03-31 18:39:09 [async_llm.py:337] EngineCore output handler hit an error: 
ERROR 03-31 18:39:09 [async_llm.py:337] Traceback (most recent call last):
ERROR 03-31 18:39:09 [async_llm.py:337]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 315, in _run_output_handler
ERROR 03-31 18:39:09 [async_llm.py:337]     processed_outputs = self.output_processor.process_outputs(
ERROR 03-31 18:39:09 [async_llm.py:337]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-31 18:39:09 [async_llm.py:337]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 318, in process_outputs
ERROR 03-31 18:39:09 [async_llm.py:337]     self._update_stats_from_output(req_state, engine_core_output,
ERROR 03-31 18:39:09 [async_llm.py:337]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 382, in _update_stats_from_output
ERROR 03-31 18:39:09 [async_llm.py:337]     iteration_stats.update_from_output(engine_core_output,
ERROR 03-31 18:39:09 [async_llm.py:337]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/metrics/stats.py", line 104, in update_from_output
ERROR 03-31 18:39:09 [async_llm.py:337]     assert num_new_generation_tokens > 0
ERROR 03-31 18:39:09 [async_llm.py:337]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-31 18:39:09 [async_llm.py:337] AssertionError

And it is caused by a test with an interesting comment:

        # Short prompt under context length but requesting too many tokens for
        # the warmup shape should return an empty result
        completion = client.completions.create(model=model,
                                               prompt="Hello World!",
                                               max_tokens=25)

From this line in the vLLM code, I am pretty sure this is the only place where EngineCoreOutput is instantiated, and it is only created when there are new_token_ids:

            if new_token_ids:
                # Add EngineCoreOutput for this Request.
                outputs.append(
                    EngineCoreOutput(
                        request_id=req_id,
                        new_token_ids=new_token_ids,
                        finish_reason=request.get_finished_reason(),
                        new_logprobs=new_logprobs,
                        new_prompt_logprobs_tensors=prompt_logprobs_tensors,
                        stop_reason=request.stop_reason,
                        events=request.take_events()))

However, in the Scheduler specialization in the vllm-spyre plugin we have this snippet:

for request in rejected_requests:
    queue.remove(request)
    reject_outputs.append(
        EngineCoreOutput(request.request_id,
                         new_token_ids=[],
                         finish_reason=FinishReason.ABORT,
                         stop_reason="Request did not fit any warmup "
                         "shape"))
    request.status = RequestStatus.FINISHED_ABORTED
    self._free_request(request)
    self.rejected_requests.remove(request.request_id)

This makes me think we need to review this implementation, because it probably does not match the engine design, and we may hit issues like this again in the future. What do you think?
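
To make the mismatch concrete, here is a tiny illustration (paraphrasing the stats path from the traceback above, not quoting vLLM verbatim) of why an EngineCoreOutput emitted with an empty new_token_ids list trips the assertion in vllm/v1/metrics/stats.py:

# Illustration only: the stats update counts the new tokens of each output
# and asserts there is at least one, so a rejection emitted with
# new_token_ids=[] fails exactly like the traceback above.
new_token_ids: list[int] = []               # the rejected request's output
num_new_generation_tokens = len(new_token_ids)
assert num_new_generation_tokens > 0        # AssertionError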

@wallashss
Collaborator Author

cc @tjohnson31415, I think we might get some insight from you on this as well. 😄

@wallashss
Collaborator Author

I opened issue #68 to follow up on this PR. I just added a workaround to pass the tests and warn users about the limitation. Everything else should still work fine, but requests that do not fit any warmup shape will make vLLM crash. I did not find a way to disable the stats globally from the plugin. I hope we can get more updates from vLLM upstream later so we can make this work properly and remove these workarounds.

@wallashss wallashss mentioned this pull request Apr 1, 2025
@tjohnson31415
Collaborator

@wallashss I had another thought we could try: What happens if we just return a dummy token in the output to satisfy that assertion?

@wallashss
Collaborator Author

@wallashss I had another thought we could try: What happens if we just return a dummy token in the output to satisfy that assertion?

I already tested it... unfortunately, this assert fails. So I think it is a matter of which decision we should take.

@tjohnson31415
Collaborator

Ah, well if it is just that test assertion, I think we should update the test; it could assert on the finish_reason and stop_reason instead. Eventually we will need a better way to reject these requests without the empty response anyway.
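
For reference, a rough sketch of what the updated test could look like; the choices[0] fields and the expected "abort" finish reason are assumptions here, not verified against the actual response for a rejected request:

# Rough sketch only: field names and the expected finish_reason value are
# assumptions, not verified against the API for rejected requests.
completion = client.completions.create(model=model,
                                       prompt="Hello World!",
                                       max_tokens=25)
choice = completion.choices[0]
assert choice.text == ""                  # still no generated text
assert choice.finish_reason == "abort"    # assert on the reason instead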

@wallashss
Collaborator Author

Ah, well if it is just that test assertion, I think we should update the test; it could assert on the finish_reason and stop_reason instead.

Ok... but this also means changing the behavior of v0 as well, are you sure about that? My first impression of that was not so good 😬.

Revert "disable stats for test and warn users"

This reverts commit 03cc587.

Signed-off-by: Wallas Santos <[email protected]>
@wallashss wallashss force-pushed the fix-scheduler-regression branch from 40802cf to 84d1633 Compare April 4, 2025 13:31
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
@wallashss wallashss requested a review from joerunde April 4, 2025 19:44
@wallashss wallashss requested a review from tjohnson31415 April 4, 2025 19:44
@joerunde
Collaborator

joerunde commented Apr 7, 2025

I had a lil thread about this on the vllm slack here: https://vllm-dev.slack.com/archives/C087RA55P0D/p1742828751236509

We should push on this upstream to figure out if we can align with the same behavior that occurs when the sequence is too long for the model. But for now, this looks like a reasonable workaround to unblock dev.

result['token_ids'] = tuple(req_output.outputs[0].token_ids)
# TODO: Workaround for V1, if request does not fit in a warmup shape
# token_ids may be filled with -1.
token_ids = [t for t in req_output.outputs[0].token_ids if t >= 0]
Collaborator

This makes me sad D:
(But so does all the code in the scheduler that does this dummy scheduling anyway that I wrote)

Approved with sadness lol

Collaborator Author

Thank you with sadness 😅

Collaborator

@joerunde joerunde left a comment

lgtm

@joerunde joerunde merged commit 3ca4480 into vllm-project:main Apr 7, 2025
7 checks passed