
Conversation

@joerunde joerunde commented Jun 3, 2025

This fixes an issue where batches of requests with long prompts were not scheduled properly. Even with a long queue of waiting requests, smaller-than-full batches would be scheduled, because requests were rejected from the schedule once the total number of prompt tokens reached --max-num-batched-tokens.

The --max-num-batched-tokens config is designed for chunking up prompts for chunked prefill, and isn't relevant for static batching. This PR sets --max-num-batched-tokens to the maximum number of prompt tokens that a full batch could contain, so that the chunked-prefill logic doesn't prevent us from creating large batches.
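
A minimal sketch of the idea, assuming vLLM's usual config names (max_num_seqs, max_model_len); this is illustrative only, not the actual Spyre platform code:

def resolve_max_num_batched_tokens(max_num_seqs: int,
                                   max_model_len: int) -> int:
    # With static batching, the worst case is every batch slot holding a
    # maximum-length prompt, so use that as the token budget.
    return max_num_seqs * max_model_len

# e.g. 4 slots x a 2048-token context -> an 8192-token budget, so the
# chunked-prefill token check can never reject a full batch.
assert resolve_max_num_batched_tokens(4, 2048) == 8192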

I missed this before because catching it requires a lower-level test that invokes the scheduler directly, in order to verify that full batches are actually scheduled.
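
A self-contained sketch of what such a test checks, using a toy stand-in for the scheduler's token-budget loop (Request, schedule_one_batch, and the specific numbers here are illustrative assumptions, not the real scheduler API):

from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int

def schedule_one_batch(waiting: list[Request], max_num_seqs: int,
                       max_num_batched_tokens: int) -> list[Request]:
    # Greedily pack a batch, rejecting requests past the token budget.
    batch: list[Request] = []
    tokens = 0
    for req in waiting:
        if len(batch) == max_num_seqs:
            break
        if tokens + req.prompt_len > max_num_batched_tokens:
            break  # the check that was producing partial batches
        batch.append(req)
        tokens += req.prompt_len
    return batch

def test_full_batch_of_long_prompts_is_scheduled():
    max_num_seqs, max_model_len = 4, 2048
    waiting = [Request(prompt_len=2000) for _ in range(8)]

    # Old behavior: a 2048-token budget admits only one long prompt.
    assert len(schedule_one_batch(waiting, max_num_seqs, 2048)) == 1

    # Fixed behavior: a budget of max_num_seqs * max_model_len always
    # admits a completely full batch.
    full = schedule_one_batch(waiting, max_num_seqs,
                              max_num_seqs * max_model_len)
    assert len(full) == max_num_seqs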

github-actions bot commented Jun 3, 2025

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure your code passes all the linting checks, otherwise your PR can't be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

@Daniel-Schenker
Collaborator

Looks good, I'll get this built and tested on Power. Thanks, Joe.

Member

@tdoublep tdoublep left a comment

Nice simple fix and well-written test. LGTM!

Collaborator

@yannicks1 yannicks1 left a comment

Very nice catch and fix, Joe. Left a minor comment:)


vllm_sampling_params = SamplingParams(max_tokens=20,
                                      temperature=0,
                                      stop="1",

I guess we don't need stop="1" here, looks like a copy-paste relic :)

@joerunde joerunde merged commit 7fca47e into main Jun 4, 2025
21 checks passed
@joerunde joerunde deleted the block-problems branch June 4, 2025 18:12