
Conversation

wallashss
Collaborator

@wallashss wallashss commented Mar 28, 2025

This PR addresses #55

Comparing vLLM v0.8.0 with upstream main, I observed that upstream main makes an extra call to spyre_model_runner with a scheduler_output in which only finished_request_ids is filled, carrying the last request so it can be cleaned up. On v0.8.0 this call did not happen, which looks wrong because the worker never got a chance to clean up the request data. If I understood correctly, this behavior was fixed in vllm-project/vllm#14388, which was also the basis for writing this PR to solve the issue.
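A minimal sketch of the cleanup path described above, assuming a hypothetical model runner that keeps per-request state keyed by request id; the class name, the _req_states attribute, and the finished_req_ids field are illustrative, not the actual vllm-spyre or vLLM API:

class SketchModelRunner:
    """Hypothetical model runner; only the cleanup path is sketched."""

    def __init__(self) -> None:
        # per-request bookkeeping, keyed by request id
        self._req_states: dict[str, dict] = {}

    def execute_model(self, scheduler_output) -> None:
        # The extra call described above: even when scheduler_output carries
        # only finished request ids (nothing new to run), the worker gets a
        # chance to drop the state of requests that just finished.
        for req_id in scheduler_output.finished_req_ids:
            self._req_states.pop(req_id, None)
        # ... the normal forward pass for scheduled requests would follow.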


👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes:

pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@joerunde
Collaborator

joerunde commented Mar 28, 2025

Sweeet, thanks @wallashss!
Are we now fully working on v1 again on post-0.8.2 vllm? If so what do you think about updating the CI install to vllm@5f063a8 (or later) so that it tests your new logic?

@wallashss
Collaborator Author

Are we now fully working on v1 again on post-0.8.2 vllm?

Probably, yes? I think we just have to make sure to communicate to the team that we are moving forward.

If so what do you think about updating the CI install to vllm@5f063a8 (or later) so that it tests your new logic?

That means updating Dockerfile.spyre, right?

@joerunde
Collaborator

That means updating Dockerfile.spyre, right?

Yeah, that way the CI running here has the latest v1 engine and can actually test out these changes

@wallashss
Collaborator Author

wallashss commented Mar 28, 2025

Yeah, that way the CI running here has the latest v1 engine and can actually test out these changes

Thanks, I just wanted to make sure I was looking at the right spot.

@joerunde
Collaborator

@wallashss I think the --depth 1 on the git clone there is causing it to not have that commit in history, and the subsequent git fetch is only fetching the tags

Signed-off-by: Wallas Santos <[email protected]>
@wallashss
Collaborator Author

@wallashss I think the --depth 1 on the git clone there is causing it to not have that commit in history, and the subsequent git fetch is only fetching the tags

Yeah, I think it's good now.

@joerunde
Collaborator

Nice, it looks like the test actually caught a problem!

Signed-off-by: Wallas Santos <[email protected]>
@wallashss
Collaborator Author

Hey @joerunde,

I think I need an opinion on the error that is making the test fail.

Just to recap, here is the log from the test:

INFO 03-31 18:39:09 [spyre_model_runner.py:385] Padding request of length 1 tokens to 64 tokens.
ERROR 03-31 18:39:09 [async_llm.py:337] EngineCore output handler hit an error: 
ERROR 03-31 18:39:09 [async_llm.py:337] Traceback (most recent call last):
ERROR 03-31 18:39:09 [async_llm.py:337]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 315, in _run_output_handler
ERROR 03-31 18:39:09 [async_llm.py:337]     processed_outputs = self.output_processor.process_outputs(
ERROR 03-31 18:39:09 [async_llm.py:337]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-31 18:39:09 [async_llm.py:337]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 318, in process_outputs
ERROR 03-31 18:39:09 [async_llm.py:337]     self._update_stats_from_output(req_state, engine_core_output,
ERROR 03-31 18:39:09 [async_llm.py:337]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 382, in _update_stats_from_output
ERROR 03-31 18:39:09 [async_llm.py:337]     iteration_stats.update_from_output(engine_core_output,
ERROR 03-31 18:39:09 [async_llm.py:337]   File "/usr/local/lib/python3.12/site-packages/vllm/v1/metrics/stats.py", line 104, in update_from_output
ERROR 03-31 18:39:09 [async_llm.py:337]     assert num_new_generation_tokens > 0
ERROR 03-31 18:39:09 [async_llm.py:337]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-31 18:39:09 [async_llm.py:337] AssertionError

And it is caused by a test with an interesting comment:

        # Short prompt under context length but requesting too many tokens for
        # the warmup shape should return an empty result
        completion = client.completions.create(model=model,
                                               prompt="Hello World!",
                                               max_tokens=25)

From this line in the vLLM code, I am pretty sure this is the only place where EngineCoreOutput is instantiated, and it is only created when there are new_token_ids:

            if new_token_ids:
                # Add EngineCoreOutput for this Request.
                outputs.append(
                    EngineCoreOutput(
                        request_id=req_id,
                        new_token_ids=new_token_ids,
                        finish_reason=request.get_finished_reason(),
                        new_logprobs=new_logprobs,
                        new_prompt_logprobs_tensors=prompt_logprobs_tensors,
                        stop_reason=request.stop_reason,
                        events=request.take_events()))

However, in the Scheduler specialization in the vllm-spyre plugin we have this snippet:

for request in rejected_requests:
    queue.remove(request)
    reject_outputs.append(
        EngineCoreOutput(request.request_id,
                         new_token_ids=[],
                         finish_reason=FinishReason.ABORT,
                         stop_reason="Request did not fit any warmup "
                         "shape"))
    request.status = RequestStatus.FINISHED_ABORTED
    self._free_request(request)
    self.rejected_requests.remove(request.request_id)

This makes me think we need to review this implementation, because it probably does not match the engine design, and we may hit issues like this again in the future. What do you think?
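
To make the mismatch concrete, here is a tiny illustration (paraphrasing the stats path from the traceback above, not quoting vLLM verbatim) of why an EngineCoreOutput emitted with an empty new_token_ids list trips the assertion in vllm/v1/metrics/stats.py:

# Illustration only: the stats update counts the new tokens of each output
# and asserts there is at least one, so a rejection emitted with
# new_token_ids=[] fails exactly like the traceback above.
new_token_ids: list[int] = []               # the rejected request's output
num_new_generation_tokens = len(new_token_ids)
assert num_new_generation_tokens > 0        # AssertionError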

@wallashss
Collaborator Author

cc @tjohnson31415, I think we might get some insight from you on this as well. 😄

@wallashss
Collaborator Author

I opened issue #68 to follow up on this PR. I just added a workaround to pass the tests and warn users about the limitation. Everything else should still work fine, but requests that do not fit any warmup shape will make vLLM crash. I did not find a way to disable the stats globally from the plugin. I hope we can get more updates from vLLM upstream later so we can make this work properly and remove these workarounds.

@wallashss wallashss mentioned this pull request Apr 1, 2025
@tjohnson31415
Collaborator

@wallashss I had another thought we could try: What happens if we just return a dummy token in the output to satisfy that assertion?

@wallashss
Collaborator Author

@wallashss I had another thought we could try: What happens if we just return a dummy token in the output to satisfy that assertion?

I already tested it... unfortunately, this assert fails. So I think it is a matter of which decision we should take.

@tjohnson31415
Collaborator

Ah, well if it is just that test assertion, I think we should update the test; it could assert on the finish_reason and stop_reason instead. Eventually we will need a better way to reject these requests without the empty response anyway.
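
For reference, a rough sketch of what the updated test could look like; the choices[0] fields and the expected "abort" finish reason are assumptions here, not verified against the actual response for a rejected request:

# Rough sketch only: field names and the expected finish_reason value are
# assumptions, not verified against the API for rejected requests.
completion = client.completions.create(model=model,
                                       prompt="Hello World!",
                                       max_tokens=25)
choice = completion.choices[0]
assert choice.text == ""                  # still no generated text
assert choice.finish_reason == "abort"    # assert on the reason instead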

@wallashss
Collaborator Author

Ah, well if it is just that test assertion, I think we should update the test; it could assert on the finish_reason and stop_reason instead.

Ok... but this also means changing the behavior of v0 as well, are you sure about that? My first impression of that was not so good 😬.

Revert "disable stats for test and warn users"

This reverts commit 03cc587.

Signed-off-by: Wallas Santos <[email protected]>
@wallashss wallashss force-pushed the fix-scheduler-regression branch from 40802cf to 84d1633 Compare April 4, 2025 13:31
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
@wallashss wallashss requested a review from joerunde April 4, 2025 19:44
@wallashss wallashss requested a review from tjohnson31415 April 4, 2025 19:44
@joerunde
Collaborator

joerunde commented Apr 7, 2025

I had a lil thread about this on the vllm slack here: https://vllm-dev.slack.com/archives/C087RA55P0D/p1742828751236509

We should push on this upstream to figure out if we can align with the same behavior that occurs when the sequence is too long for the model. But for now, this looks like a reasonable workaround to unblock dev.

result['token_ids'] = tuple(req_output.outputs[0].token_ids)
# TODO: Workaround for V1, if request does not fit in a warmup shape
# token_ids may be filled with -1.
token_ids = [t for t in req_output.outputs[0].token_ids if t >= 0]
Collaborator

This makes me sad D:
(But so does all the code in the scheduler that does this dummy scheduling anyway that I wrote)

Approved with sadness lol

Collaborator Author

Thank you with sadness 😅

Collaborator

@joerunde joerunde left a comment

lgtm

@joerunde joerunde merged commit 3ca4480 into vllm-project:main Apr 7, 2025
7 checks passed