
[V1] Regression on scheduler #55

@wallashss

Description

We found a regression during the development of #54: an assert in spyre_model_runner.py fails, probably due to a behavior change in the vLLM engine's scheduler.

Note: #54 is required to reproduce the issue because of broken imports caused by a recent refactoring of vLLM.

Repro on:

commit 27df5199d99627e1eb101071c2155f888181bd64 (HEAD -> main, origin/main, origin/HEAD)

This simple offline script consistently reproduces the issue.

from vllm import LLM, SamplingParams

# Define prompts and their corresponding sampling parameters
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is"
]
sampling_params_list = [
    SamplingParams(seed=123, temperature=0.8, top_p=0.95),
    SamplingParams(seed=123, temperature=0.5, top_p=0.9),
    SamplingParams(seed=123, temperature=0.7, top_p=0.85)
]

model = "/models/llama-194m/"
llm = LLM(model=model, enforce_eager=False)

# Generate texts for each prompt with its sampling parameters
outputs = llm.generate(prompts, sampling_params_list)

# Print the outputs
for response in outputs:
    print(f"Prompt: {response.prompt!r}, Generated text: {response.outputs[0].text!r}")

Outputs:

Prompt: (response.prompt!r), Generated text: " 5c1. I'm a teacher, and you teach me how to"
Prompt: (response.prompt!r), Generated text: ' Paris. It is located in the center of France and is the largest city in'
Prompt: (response.prompt!r), Generated text: ' in the hands of machine learning, which is the process of a computer learning to'
ERROR 03-26 20:20:17 [core.py:344] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-26 20:20:17 [core.py:344]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 337, in run_engine_core
ERROR 03-26 20:20:17 [core.py:344]     engine_core.run_busy_loop()
ERROR 03-26 20:20:17 [core.py:344]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 371, in run_busy_loop
ERROR 03-26 20:20:17 [core.py:344]     outputs = step_fn()
ERROR 03-26 20:20:17 [core.py:344]               ^^^^^^^^^
ERROR 03-26 20:20:17 [core.py:344]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/v1/engine/core.py", line 196, in step
ERROR 03-26 20:20:17 [core.py:344]     output = self.model_executor.execute_model(scheduler_output)
ERROR 03-26 20:20:17 [core.py:344]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 20:20:17 [core.py:344]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/v1/executor/abstract.py", line 77, in execute_model
ERROR 03-26 20:20:17 [core.py:344]     output = self.collective_rpc("execute_model",
ERROR 03-26 20:20:17 [core.py:344]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 20:20:17 [core.py:344]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 03-26 20:20:17 [core.py:344]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 03-26 20:20:17 [core.py:344]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 20:20:17 [core.py:344]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/utils.py", line 2257, in run_method
ERROR 03-26 20:20:17 [core.py:344]     return func(*args, **kwargs)
ERROR 03-26 20:20:17 [core.py:344]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 20:20:17 [core.py:344]   File "/usr/local/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-26 20:20:17 [core.py:344]     return func(*args, **kwargs)
ERROR 03-26 20:20:17 [core.py:344]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 20:20:17 [core.py:344]   File "/opt/vllm/lib64/python3.11/site-packages/vllm_spyre/v1/worker/spyre_worker.py", line 370, in execute_model
ERROR 03-26 20:20:17 [core.py:344]     output = self.model_runner.execute_model(scheduler_output)
ERROR 03-26 20:20:17 [core.py:344]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 20:20:17 [core.py:344]   File "/usr/local/lib64/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 03-26 20:20:17 [core.py:344]     return func(*args, **kwargs)
ERROR 03-26 20:20:17 [core.py:344]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 20:20:17 [core.py:344]   File "/opt/vllm/lib64/python3.11/site-packages/vllm_spyre/v1/worker/spyre_model_runner.py", line 304, in execute_model
ERROR 03-26 20:20:17 [core.py:344]     model_input = self.prepare_model_input(scheduler_output)
ERROR 03-26 20:20:17 [core.py:344]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 20:20:17 [core.py:344]   File "/opt/vllm/lib64/python3.11/site-packages/vllm_spyre/v1/worker/spyre_model_runner.py", line 283, in prepare_model_input
ERROR 03-26 20:20:17 [core.py:344]     self._prepare_decode(scheduler_output.scheduled_cached_reqs)
ERROR 03-26 20:20:17 [core.py:344]   File "/opt/vllm/lib64/python3.11/site-packages/vllm_spyre/v1/worker/spyre_model_runner.py", line 203, in _prepare_decode
ERROR 03-26 20:20:17 [core.py:344]     assert len(cached_requests) > 0
ERROR 03-26 20:20:17 [core.py:344]            ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 20:20:17 [core.py:344] AssertionError
ERROR 03-26 20:20:17 [core.py:344] 
CRITICAL 03-26 20:20:17 [core_client.py:269] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
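The failing assert is `assert len(cached_requests) > 0` in `_prepare_decode`, which assumes the scheduler never emits a step with an empty `scheduled_cached_reqs`. A minimal, untested sketch of a possible mitigation (function name and placeholder body are hypothetical; only the assert and the field name come from the traceback) would treat such a step as a no-op instead of crashing:

```python
# Hypothetical guard, assuming the newer V1 scheduler may emit a step
# whose scheduled_cached_reqs list is empty.

def prepare_decode_safe(cached_requests):
    """Return decode inputs, or None when there is nothing to decode."""
    if not cached_requests:
        # Previously this tripped `assert len(cached_requests) > 0`;
        # here an empty step is simply skipped.
        return None
    # Placeholder for the real per-request decode preparation.
    return list(cached_requests)

print(prepare_decode_safe([]))      # None
print(prepare_decode_safe(["r1"]))  # ['r1']
```

Whether skipping the step is actually safe depends on why the scheduler now produces an empty cached-request list, so the root cause in the scheduler still needs to be confirmed.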

Labels: bug (Something isn't working)