Documents a bit CB script and tests #300
Merged
Commits (22, all by sducouedic):
- b7ed084 first draft
- f02b62d first draft
- 065b059 apply comments - 1
- 147c5b9 apply comment - 2
- 2d1de90 apply comment - 3
- 120cd73 address lint code issues
- b2fb7f7 Move to Developer Guide section
- f8c53d6 adhere to current 'notes' formatting
- 08d59ee fix merge conflicts
- 3c48c7f put date in admonition
- 01ad048 Merge remote-tracking branch 'origin/main' into doc_cb_script_and_tests
- 5f2581f add summary table
- 58e0f52 add unit test parametrization caution message
- aa5baec remove html table coloring
- 62956bc add short descriptions of all the tests running with CB
- d199be2 different command for different granularity of testing
- 73cf362 update short descriptions
- 5bd96eb fix indentation
- 97ad5c0 fix conflicts
- 7c6efb3 refactor given latest changes
- 6c48227 remove table and short desc
- b10b8c9 bash format
# Continuous Batching tests / inference scripts in vLLM

Brief overview of what has been implemented so far in vLLM to test and debug continuous batching.

Date: 10th July 2025

## Inference script

* **File path:** `examples/offline_inference/cb_spyre_inference.py`
* **Purpose:** Debugging (i.e. via manual execution)
### Description
* Runs inference on a set of prompts with continuous batching enabled (the number of prompts is parametrizable).
* Prints the generated text for each sequence.
* All requested sequences are defined at the beginning; no request joins the waiting queue while the decoding of another request is already in progress.
* The exact sequence of prefill and decode steps depends on the values of the parameters `max_num_seqs`, `num-prompts` and `max-tokens`.
* If `--compare-with-CPU` is set, the output text is compared to that of Hugging Face running on CPU. Note that only the tokens are compared here, not the logprobs.
### Parametrization
* `--model`: the model
* `--max_model_len`: maximum length of a sequence (padded prompt plus decoded tokens); cannot exceed the model's maximum context length
* `--max_num_seqs`: maximum number of sequences processed in a single iteration (decode batch size)
* `--tp`: tensor parallelism (number of Spyre cards)
* `--num-prompts`: total number of requested prompts
* `--max-tokens`: number of tokens generated for each requested sequence
* `--compare-with-CPU`: if set, compare the text output with a CPU version running with Hugging Face instead of vLLM

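For orientation, a typical invocation might look like the sketch below. This is only illustrative: the flag spellings are taken from the list above, the parameter values are arbitrary, the model name is a placeholder, and it assumes the script is launched with `python` from the vllm-spyre repository root.

```bash
# Illustrative sketch: placeholder model name, arbitrary parameter values
python examples/offline_inference/cb_spyre_inference.py \
    --model <model-name-or-path> \
    --max_model_len 2048 \
    --max_num_seqs 4 \
    --tp 1 \
    --num-prompts 8 \
    --max-tokens 20 \
    --compare-with-CPU
```
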
## CB tests through unit tests

* **File path (tests targeting CB specifically):** `vllm-spyre/tests/e2e/test_spyre_cb.py`
* **Purpose:** Automated execution to verify that a specific behaviour acts as expected (passing/failing)
* **Usage (when running locally):** `python -m pytest -sv -m "spyre and cb" --forked tests`
    * `-s` option: show all the print statements in the code
    * `-v` option: verbose mode, makes the test output more detailed: shows the name of each test function and whether it passed, failed or was skipped
    * `--forked` option: isolates the tests, so that one crashing test does not impact the others
    * `-m "spyre and cb"`: runs only the tests with configurations marked as both "spyre" and "cb"

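The command above runs the whole CB-marked test suite. For different granularity, the usual pytest selection mechanisms apply; the sketch below assumes the vllm-spyre repository root as working directory.

```bash
# All tests marked as both "spyre" and "cb" (command from the usage line above)
python -m pytest -sv -m "spyre and cb" --forked tests

# Only the file that targets CB specifically
python -m pytest -sv --forked tests/e2e/test_spyre_cb.py
```
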
### Description

Unit tests are designed for automated and systematic execution to verify that CB behaves as expected in different scenarios. For each scenario (i.e. configuration of parameters), the test either passes or fails. When a test suite fails, identifying which specific test case failed is often more informative than the failure message itself. Below is a brief description of the different unit tests targeting CB. The descriptions can also be found in the docstrings of the test functions:

> All the applicable unit tests in vLLM will eventually also execute with CB enabled in addition to SB, but two test functions specifically target continuous batching correctness: `test_cb_output` and `test_scheduler_cb_steps_tkv`. The other functions found in that file are mostly helper methods, or functions that test CB in aspects more specific to vLLM (such as scheduling constraints). They can still be worth a look in the code, but their description is skipped here.
#### `test_cb_output`
`test_cb_output` checks the correctness of the output of CB on a set of prompts (4 hardcoded prompts for that test). The output from vLLM is compared to that of Hugging Face on CPU.

* **The test passes if:** the pairwise relative differences between the logprobs of HF on CPU and of vLLM (on Spyre or CPU depending on the backend) are all below a threshold: `math.isclose(hf_logprob, vllm_logprob, rel_tol=0.35)`. Otherwise it fails.
    > The above applies to the `sendnn` backend; on CPU the tokens additionally need to be exactly the same for the test to pass.

* **Parametrization:**
    * `model`: the model
    * `backend`: whether the test runs with the `eager` backend for CPU or `sendnn` for Spyre
    * `max_num_seqs`: maximum number of sequences processed in a single iteration (decode batch size)

* **Note on parametrization of prompts and max_tokens:** so far both the prompts and `max_tokens` are hardcoded in that test. There are 4 prompts about chicken soup, and `max_tokens` is set to 20 for all prompts. It could be a good idea to add `max-tokens` and `num-prompts` parametrization to create prompts in the same way as the inference script does, or even to create prompts of a specified length by parametrizing the number of tokens in the prompt, as is done in `test_scheduler_cb_steps_tkv`, where artificial prompts are created via the `prompts_lengths` parameter. To be discussed.

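When debugging this test in isolation, standard pytest node-id and keyword selection can narrow the run further; again assuming the vllm-spyre repository root as working directory, and, for the second command, that the parametrized backend value shows up in the generated test id (typical for string parameters).

```bash
# Run only test_cb_output
python -m pytest -sv --forked tests/e2e/test_spyre_cb.py::test_cb_output

# Or select by keyword, which also allows filtering specific parametrizations
python -m pytest -sv --forked tests/e2e/test_spyre_cb.py -k "test_cb_output and eager"
```
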
#### `test_scheduler_cb_steps_tkv` (for this test the final output is not checked)

> **Note 1:** since we are now testing more than only the tkv value at each step, I plan to rename this test, because the name is now a bit misleading.

> **Note 2:** the final output is not checked because many of the parametrized scenarios are only relevant for testing the scheduling implementation; skipping the output check saves some computation time.

Checking the correctness of the final output alone is not enough to ensure that CB is correctly implemented (otherwise, how could we differentiate it from static batching, for example). So `test_scheduler_cb_steps_tkv` is meant to check the correctness of the step-by-step execution of continuous batching. It does so by comparing a set of attributes at every engine step (i.e. prefill or decode iteration). This allows finer-grained testing of the padding and scheduling implementation.
* **Checked attributes at each step:**
    * `tkv`: after each step, the tkv is compared against the expected tkv value for that step
    * `waiting`, `running`, `request_outputs`, `finished_requests`: not really relevant from a compiler point of view, but after each iteration we check that the lists of running, waiting and finished requests are correct. This tests the scheduler's correctness.
    * (waiting to be merged, PR #261): `n_reserved_blocks` and `n_used_blocks`

* **Parametrization:**
    * `model`: the model
    * `backend`: whether the test runs with the `eager` backend for CPU or `sendnn` for Spyre
    * `seqs_max_tokens`: number of tokens generated for each requested sequence prompt
    * `prompts_lengths`: number of tokens in each prompt; prompts are artificially generated from this parameter
    * `steps_add_reqs`: the steps at which the request prompts join, which helps simulate an online setup where the server receives requests from users at different times
    * `checked_steps`: a list containing the reference values of the attributes described above at each step
* **Parametrization functions:** Because there are a lot of different scenarios and edge cases that we want to test through `test_scheduler_cb_steps_tkv`, the sets of parameters described in the previous point are provided through different functions. Each function provides a different set of parameters and expected values at each step for the case it is testing. This improves readability because the function names give a brief idea of the tested scenario. For example:
    * `get_params_test_blocks_borders_aligned_prompts`: parametrization for the situation where the prompts happen to already be aligned with the block boundaries (no **right** padding required)
    * `get_params_test_blocks_borders_misaligned_prompts`: parametrization for the situation where the prompts are misaligned with the block boundaries, and thus **right** padding is required
    * ... additional special cases
Conversations
I wonder if all this can be added as a docstring at the top of the respective test files themselves instead of having it sit separately (which is prone to going stale and is an extra step for anyone trying to understand the code).
Having one doc file encompassing the different CB testing/debugging functions is a request from the compiler team actually
Makes sense. I wonder if we can add all the text from this PR into the test files and have some sort of table/visual representation encompassing the various configurations tested within the docs?
I worry that when we change something in the CB tests (which happens so frequently these days), we will forget to update the docs accordingly, which will lead to outdated content pretty fast.
The table/visual is a good idea, and actually also a request :)
But do you have in mind some automation that would update the table/visuals in the docs from the parameters found in the test functions? Also, in your mind, would the table/visual be in addition to the text in the docs, or replace it?
Ideally the visuals would be in addition to the docstrings, but I don't know if this can be automated; it seems complex.
Just to jump in, I also agree that a lot of the info here should probably just live in the files too, and this CB developer doc could point to where the script and tools are, with high-level descriptions, along with debugging notes similar to the ones here. Including that table you both mentioned.
I also noticed some "TODO"-type notes in this guide. If they're important, those should probably be turned into issues for tracking rather than left as notes in the docs.