bucket: add query len 1 to prefill bucket #645
base: main
Pull request overview
This PR changes the bucketing configuration so that prefill supports a query length of 1, which avoids padding the query up to the block size in prefill/decode (PD) disaggregated scenarios.
- Changes the minimum query bucket size from block_size to 1
- Updates dummy prefill batch generation to use query_len=1 and context_len=127
- Adds support for bucket value 1 in the exponential bucketing warmup logic
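The effect of lowering the minimum query bucket can be illustrated with a minimal sketch. It assumes that a query is padded up to the smallest bucket that fits it; the helper `pad_to_bucket`, the bucket lists, and the block size of 128 are hypothetical values chosen for illustration, not code or defaults taken from this PR.

```python
import bisect

BLOCK_SIZE = 128  # assumed block size, for illustration only


def pad_to_bucket(query_len: int, buckets: list[int]) -> int:
    """Return the smallest bucket >= query_len (buckets must be sorted ascending)."""
    idx = bisect.bisect_left(buckets, query_len)
    if idx == len(buckets):
        raise ValueError(f"query_len {query_len} exceeds the largest bucket")
    return buckets[idx]


# Hypothetical bucket lists: before the change the smallest bucket equals the
# block size; after the change a bucket of 1 is also available.
old_buckets = [BLOCK_SIZE, 2 * BLOCK_SIZE, 3 * BLOCK_SIZE, 4 * BLOCK_SIZE]
new_buckets = [1] + old_buckets

padded_before = pad_to_bucket(1, old_buckets)  # padded up to the block size
padded_after = pad_to_bucket(1, new_buckets)   # stays at 1

print(padded_before, padded_after)
```

With the smallest bucket at the block size, a single-token query carries 127 tokens of padding in this sketch; with a bucket of 1 it carries none, while longer queries still land in the same buckets as before.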
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_gaudi/extension/bucketing/linear.py | Sets minimum prompt query bucket to 1 instead of block_size |
| vllm_gaudi/extension/bucketing/exponential.py | Updates exponential bucketing to start from 1 and adds logic to handle bucket value 1 |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Adjusts dummy prefill batch to use query_len=1 and context_len=127 |
| tests/unit_tests/test_bucketing.py | Adds test case for warmup_range starting with 1 |
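To show why a range that starts at 1 is a natural fit for exponential spacing, here is a simplified sketch that doubles the bucket value from a minimum up to a maximum. The function `doubling_buckets` is hypothetical and is not the algorithm in `vllm_gaudi/extension/bucketing/exponential.py`; it only illustrates the general idea of an exponentially spaced range whose first value is 1.

```python
def doubling_buckets(bmin: int, bmax: int) -> list[int]:
    """Return bmin, 2*bmin, 4*bmin, ... below bmax, followed by bmax itself."""
    if bmin < 1 or bmax < bmin:
        raise ValueError("expected 1 <= bmin <= bmax")
    buckets = []
    value = bmin
    while value < bmax:
        buckets.append(value)
        value *= 2
    buckets.append(bmax)  # always include the upper limit as the last bucket
    return buckets


print(doubling_buckets(1, 1024))
print(doubling_buckets(1, 100))
```

In this sketch, starting from 1 needs no special case because doubling 1 yields the usual powers of two; a range starting from 0 would never advance, which is one reason a minimum of at least 1 is assumed here.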
500c177 to 37b3e7d
✅ CI Passed: All checks passed successfully against the following vllm commit:
32c4ddc to 01a12b7
Signed-off-by: Xinyu Chen <[email protected]>
Signed-off-by: Wuxun Zhang <[email protected]>
Signed-off-by: Xinyu Chen <[email protected]>
Co-authored-by: Wuxun Zhang <[email protected]>
Signed-off-by: Xinyu Chen <[email protected]>
Avoid padding the query length (1) of the prefix prefill on the decode side up to the block size in the PD+DP scenario.
Use case:
Set VLLM_PROMPT_QUERY_BUCKET_MIN=1 on the decode side, with VLLM_EXPONENTIAL_BUCKETING set to either false or true.