[V1] Fix assert failure when finishing a batch #62
Conversation
Signed-off-by: Wallas Santos <[email protected]>
👋 Hi! Thank you for contributing to vLLM support on Spyre.
Now you are good to go 🚀
Sweet, thanks @wallashss!
Probably, yes? I think we just have to make sure to communicate to the team that we are moving forward.
That means update
Yeah, that way the CI running here has the latest v1 engine and can actually test out these changes
Signed-off-by: Wallas Santos <[email protected]>
Thanks, I just wanted to make sure I am seeing the right spot
@wallashss I think the
Signed-off-by: Wallas Santos <[email protected]>
Yeah, I think it's good now.
Nice, it looks like the test actually caught a problem!
Signed-off-by: Wallas Santos <[email protected]>
Hey @joerunde, I think I need an opinion on the error that is failing. Just to recap, the log from the test:
And it is caused by a test with an interesting comment:
From this line in the vLLM code, I am pretty sure this is the only place where `EngineCoreOutput` is instantiated, and it is only created when `new_token_ids` is non-empty:

```python
if new_token_ids:
    # Add EngineCoreOutput for this Request.
    outputs.append(
        EngineCoreOutput(
            request_id=req_id,
            new_token_ids=new_token_ids,
            finish_reason=request.get_finished_reason(),
            new_logprobs=new_logprobs,
            new_prompt_logprobs_tensors=prompt_logprobs_tensors,
            stop_reason=request.stop_reason,
            events=request.take_events()))
```

However, in the specialization of the `Scheduler` in the vllm-spyre plugin we have this snippet:

```python
for request in rejected_requests:
    queue.remove(request)
    reject_outputs.append(
        EngineCoreOutput(request.request_id,
                         new_token_ids=[],
                         finish_reason=FinishReason.ABORT,
                         stop_reason="Request did not fit any warmup "
                         "shape"))
    request.status = RequestStatus.FINISHED_ABORTED
    self._free_request(request)
    self.rejected_requests.remove(request.request_id)
```

This makes me think we need to review this implementation, because it probably does not match the engine design, and we may have issues like this in the future. What do you think?
cc @tjohnson31415 I think we might get some insight from you as well. 😄
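To make the mismatch concrete, here is a hypothetical consumer (an illustration only, not the actual vLLM stats code) that assumes every `EngineCoreOutput` carries at least one new token; the rejection path above, which emits `new_token_ids=[]`, would trip exactly this kind of assert:

```python
# Hypothetical illustration; the real assert lives somewhere in vLLM's
# stats/output handling and may look different.
def record_finished_output(engine_core_output):
    num_new_tokens = len(engine_core_output.new_token_ids)
    # The Spyre rejection path emits new_token_ids=[], so this would fire.
    assert num_new_tokens > 0, "finished request reported no new tokens"
```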
Signed-off-by: Wallas Santos <[email protected]>
I opened issue #68 to follow up on this PR. I just added a workaround to pass the tests and warn users about the limitation. Everything else should still work fine, but requests that do not fit in the warmup shapes will make vLLM crash. I did not find a way to disable the stats globally from the plugin. I hope we can get more updates from vLLM upstream later and make this work properly so we can remove these workarounds.
@wallashss I had another thought we could try: what happens if we just return a dummy token in the output to satisfy that assertion?
I already tested it... Unfortunately this assert fails... So I think it is a matter of which decision we should take.
Ah, well if it is just that test assertion, I think we should update the test; it could assert on the
Ok... but this also means changing the behavior of v0 as well, are you sure about that? My first impression was not so good 😬.
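For context, a minimal sketch of the dummy-token idea being discussed (an illustration, not the actual patch), reusing the names from the rejection snippet quoted earlier; a `-1` placeholder would match the filtering that later shows up in the test workaround:

```python
# Illustration only: same rejection path as the snippet quoted above, but with
# a -1 placeholder so new_token_ids is never empty; callers would then have to
# filter the placeholder back out.
reject_outputs.append(
    EngineCoreOutput(request.request_id,
                     new_token_ids=[-1],  # dummy placeholder token
                     finish_reason=FinishReason.ABORT,
                     stop_reason="Request did not fit any warmup shape"))
```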
Revert "disable stats for test and warn users" This reverts commit 03cc587. Signed-off-by: Wallas Santos <[email protected]>
Signed-off-by: Wallas Santos <[email protected]>
Force-pushed from 40802cf to 84d1633
Signed-off-by: Wallas Santos <[email protected]>
I had a little thread about this on the vLLM Slack here: https://vllm-dev.slack.com/archives/C087RA55P0D/p1742828751236509. We should push on this upstream to figure out if we can align with the same behavior that occurs when the sequence is too long for the model. But for now, this looks like a reasonable workaround to unblock dev.
```python
result['token_ids'] = tuple(req_output.outputs[0].token_ids)
# TODO: Workaround for V1, if request does not fit in a warmup shape
# token_ids may be filled with -1.
token_ids = [t for t in req_output.outputs[0].token_ids if t >= 0]
```
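For illustration, with made-up values, the filter above simply drops the `-1` placeholders that the rejection workaround can produce:

```python
# Hypothetical token ids for a rejected request, padded with -1 placeholders.
raw_token_ids = (13, 27, -1, -1)
token_ids = [t for t in raw_token_ids if t >= 0]
assert token_ids == [13, 27]
```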
This makes me sad D:
(But so does all the code I wrote in the scheduler that does this dummy scheduling anyway)
Approved with sadness lol
Thank you with sadness 😅
lgtm
This PR addresses #55.
What I observed using vLLM v0.8.0 vs. upstream main: on upstream main there was an extra call to `spyre_model_runner` receiving a `scheduler_output` with only `finished_request_ids` filled, containing the last request to clean up. However, on v0.8.0 this call did not happen, which looks wrong because the worker did not get a chance to clean up the data. If I understood correctly, this behavior was fixed in vllm-project/vllm#14388, which was also the basis for writing this PR to solve the issue.
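A rough sketch, with assumed names rather than the actual vllm-spyre model runner code, of the extra cleanup call described above: the runner receives a `scheduler_output` whose only useful content is the finished request ids and uses it to drop per-request state for the batch that just finished.

```python
class SketchSpyreModelRunner:
    """Illustration only; attribute and field names are assumptions."""

    def __init__(self):
        self.req_state = {}  # request_id -> cached per-request state

    def execute_model(self, scheduler_output):
        # Free state for requests the scheduler reports as finished.
        for req_id in scheduler_output.finished_request_ids:
            self.req_state.pop(req_id, None)
        # When nothing new is scheduled, the call exists only for this cleanup.
        if not scheduler_output.scheduled_requests:
            return None
        # ... otherwise, run the forward pass for the scheduled requests ...
```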