81 changes: 81 additions & 0 deletions docs/user_guide/cb_testing_in_vllm.md
Collaborator:

I wonder if all this can be added as docstring to the top of the respective test files themselves instead of having it sit separately (which is prone to being stale and an extra step for anyone trying to understand the code)

Collaborator (Author):

Having one doc file encompassing the different CB testing/debugging functions is a request from the compiler team actually

Collaborator:

Makes sense. I wonder if we can add all the text from this PR into the test files and have some sort of table/visual representation encompassing the various configurations tested within the docs?

I worry that when we change something for the CB tests (which happens so frequently these days), we will forget to update the docs for the same which will lead to outdated stuff pretty fast.

@sducouedic (Collaborator, Author), Jul 14, 2025:

The table/visual is a good idea, and actually also a request :)
But do you have in mind automation code that updates the table/visuals in the docs given the parameters found in the test functions? Also, in your mind, would the table/visual be in addition to the text in the docs, or replace it?

@prashantgupta24 (Collaborator), Jul 14, 2025:

Ideally the visuals would be in addition to the docstring but I don't know if this can be automated, seems complex.

@rafvasq (Collaborator), Jul 14, 2025:

Just to jump in, I also agree that a lot of the info here probably should be just contained in the files too and this CB developer doc could point to where the script and tools are, with high-level descriptions, along with debugging notes similar to the ones here. Including that table you both mentioned.

I also noticed some "TODO"-type notes in this guide. Those should probably be made into issues for tracking rather than notes in the docs, if they're important.

# Continuous Batching tests / inference scripts in vLLM

Brief overview of what has been implemented so far in vLLM to test and debug continuous batching.

Date: 10th July 2025

## Inference script

* **File path:** `examples/offline_inference/cb_spyre_inference.py`
* **Purpose:** Debugging (i.e. manual execution)

### Description
* Runs inference on a set of prompts with continuous batching enabled (number of prompts is parametrizable)
* Prints the generated text for each sequence.
* All the requested sequences are defined at the beginning; no requests join the waiting queue while the decoding of other requests has already started.
* The exact sequence of prefill and decode steps depends on the values of the parameters `max_num_seqs`, `num-prompts` and `max-tokens`.
* If `--compare-with-CPU` is set, the generated text is compared to that of Hugging Face running on CPU. Note that only the tokens are compared here, not the logprobs (a minimal sketch of this flow is shown after the parametrization list below).

### Parametrization
* `--model`: the model
* `--max_model_len`: maximum length of a sequence (padded prompt plus decoded tokens); cannot exceed the model's maximum context length
* `--max_num_seqs`: Max number of sequences processed in a single iteration (decode batch size)
* `--tp`: Tensor parallelism (number of Spyre cards)
* `--num-prompts`: Total number of requested prompts
* `--max-tokens`: Number of tokens generated for each requested sequence
* `--compare-with-CPU`: If set, compare the text output with a CPU reference generated with Hugging Face instead of vLLM
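
The following is a hypothetical sketch of the flow described above, assuming the standard vLLM offline API (`LLM`, `SamplingParams`) and the Hugging Face `transformers` API. It is not the actual content of `cb_spyre_inference.py`; the hardcoded values stand in for the command-line flags above and the model name is a placeholder.

```python
# Hypothetical sketch only -- not the actual cb_spyre_inference.py.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "my-model"                              # --model (placeholder)
prompts = [f"Example prompt {i}" for i in range(4)]  # --num-prompts
max_tokens = 20                                      # --max-tokens

# Continuous batching: max_num_seqs bounds the decode batch size,
# max_model_len bounds padded prompt length plus decoded tokens.
llm = LLM(model=model_name, max_model_len=256, max_num_seqs=2,
          tensor_parallel_size=1)                    # --tp
outputs = llm.generate(prompts, SamplingParams(max_tokens=max_tokens,
                                               temperature=0.0))
for out in outputs:
    print(out.outputs[0].text)

# --compare-with-CPU: greedy-generate the same prompts with Hugging Face on
# CPU and compare the generated tokens (logprobs are not compared here).
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)
for prompt, out in zip(prompts, outputs):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    hf_ids = hf_model.generate(input_ids, max_new_tokens=max_tokens,
                               do_sample=False)
    hf_tokens = hf_ids[0, input_ids.shape[1]:].tolist()
    assert hf_tokens == list(out.outputs[0].token_ids)
```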

## CB tests through unit tests

* **File path (tests targeting CB specifically):** `vllm-spyre/tests/e2e/test_spyre_cb.py`
* **Purpose:** Automated execution to verify that a specific behaviour works as expected (pass/fail)
* **Usage (when running locally):** `python -m pytest -sv -m "spyre and cb" --forked tests`
* `-s` option: show all the print statements in the code
* `-v` option: verbose mode, makes the test output more detailed: shows the name of each test function and whether it passed, failed or was skipped
* `--forked` option: isolates the tests so that one test crashing does not impact the other tests
* `-m "spyre and cb"`: runs the tests with configurations marked as "spyre" and "cb" only

### Description

Unit tests are designed for automated and systematic execution to verify that CB behaves as expected for different scenarios. For each scenario (i.e. configuration of parameters), the test either passes or fails. When a test suite fails, identifying which specific test case failed is often more informative than the failure message itself. Below is a brief description of the different unit tests targeting CB. The description can also be found in the docstring of the different test functions:

> All the applicable unit tests in vLLM will eventually also execute with CB enabled in addition to SB, but two test functions specifically target continuous batching correctness: `test_cb_output` and `test_scheduler_cb_steps_tkv`. The other functions found in that file are mostly helper methods, or functions that test CB in aspects more specific to vLLM (such as scheduling constraints). Still, it can be interesting to have a look at the code, but their description is skipped here.

#### `test_cb_output`
`test_cb_output` checks the correctness of the output of CB on a set of prompts (4 hardcoded prompts for that test). The output from vLLM is compared to that of Hugging Face on CPU.

* **The test passes if:** the pairwise relative differences between the logprobs of HF on CPU and of vLLM (on Spyre or CPU depending on the backend) are all below a threshold: `math.isclose(hf_logprob, vllm_logprob, rel_tol=0.35)`. Otherwise it fails (a minimal sketch of this check follows the parametrization list below).
> The above applies to the `sendnn` backend; on CPU the tokens additionally need to match exactly for the test to pass

* **parametrization:**
* `model`: the model
* `backend`: whether the test runs with the `eager` backend (CPU) or `sendnn` (Spyre)
* `max_num_seqs`: Max number of sequences processed in a single iteration (decode batch size)

* **Note on parametrization of prompts and max_tokens**: so far both the prompts and the max tokens are hardcoded in this test. There are 4 prompts about chicken soup, and `max_tokens` is set to 20 for all prompts. It could be a good idea to add `max-tokens` and `num-prompts` parametrization to create prompts in the same way as the inference script does. We could even create prompts of a specified length by parametrizing the number of tokens in the prompt, as is done in `test_scheduler_cb_steps_tkv`, where artificial prompts are created via the `prompts_lengths` parameter. To be discussed.
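
A minimal sketch of the logprob comparison criterion mentioned above; `hf_results` and `vllm_results` are illustrative names (not the test's actual fixtures), each assumed to be a list of per-sequence lists of token logprobs.

```python
import math

def logprobs_close(hf_results, vllm_results, rel_tol=0.35):
    """Return True only if every pairwise relative difference is within rel_tol."""
    for hf_seq, vllm_seq in zip(hf_results, vllm_results):
        for hf_logprob, vllm_logprob in zip(hf_seq, vllm_seq):
            if not math.isclose(hf_logprob, vllm_logprob, rel_tol=rel_tol):
                return False
    return True
```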

#### `test_scheduler_cb_steps_tkv` (For this test the final output is not checked)

> **Note 1: since we now test more than just the tkv value at each step, I plan to rename this test, because the name has become a bit misleading**

> **Note 2: the final output is not checked because many of the parametrized scenarios are only relevant for testing the scheduling implementation; skipping the output check saves some computation time.**

Checking the correctness of the final output alone is not enough to ensure that CB is correctly implemented (otherwise, how could we differentiate it from static batching, for example?). So `test_scheduler_cb_steps_tkv` checks the correctness of the step-by-step execution of continuous batching. It does so by comparing a set of attributes at every engine step (i.e. prefill or decode iteration). This allows finer testing of the padding and scheduling implementation (a minimal sketch of the per-step check follows the attribute list below).

* **Checked attributes at each step:**
* `tkv`: after each step, the tkv is compared against the expected tkv value for that step
* `waiting`, `running`, `request_outputs`, `finished_requests`: not really relevant from a compiler point of view, but after each iteration we check that the lists of running and waiting requests, and of those that have finished, are correct. This tests the scheduler correctness.
* (waiting to be merged, PR #261): `n_reserved_blocks` and `n_used_blocks`
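
A hedged sketch of what the per-step verification could look like; the helper name and the exact data shapes are assumptions for illustration, not the test's actual code.

```python
def assert_step_matches(observed: dict, reference: dict) -> None:
    """Compare the state observed after one engine step (prefill or decode)
    with the reference values recorded for that step in `checked_steps`."""
    for key in ("tkv", "waiting", "running", "request_outputs",
                "finished_requests"):
        assert observed[key] == reference[key], (
            f"step {reference.get('step')}: mismatch on '{key}': "
            f"{observed[key]} != {reference[key]}")
```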

* **Parametrization:**
* `model`: the model
* `backend`: whether the test runs with the `eager` backend (CPU) or `sendnn` (Spyre)
* `seqs_max_tokens`: Number of tokens generated for each requested sequence
* `prompts_lengths`: Number of tokens for each prompt. Prompts are artificially generated from that parameter
* `steps_add_reqs`: Steps at which the request prompts join the queue; this helps simulate an online setup where the server receives requests from users at different times
* `checked_steps`: a list containing, for each step, the reference values for the attributes described above

* **Parametrization functions:** Because there are a lot of different scenarios and edge cases that we want to test through `test_scheduler_cb_steps_tkv`, the sets of parameters described in the previous point are provided through different functions (a sketch of their typical shape follows this list). Each function provides a different set of parameters and expected values at each step for the case it is testing. This improves readability because the function name gives a brief idea of the tested scenario. For example:
* `get_params_test_blocks_borders_aligned_prompts`: parametrization for the situation where the prompts are by chance already aligned with the block boundaries (no **right** padding required)
* `get_params_test_blocks_borders_misaligned_prompts`: parametrization for the situation where the prompts are misaligned with the block boundaries, and thus **right** padding is required
* ... additional special cases
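
For illustration, a parametrization function of this kind could look roughly as follows; the function name and every value below are made up and do not correspond to the real reference values used in the tests.

```python
def get_params_test_example_scenario():
    seqs_max_tokens = [10, 10]   # tokens to generate per sequence
    prompts_lengths = [49, 64]   # artificial prompt lengths, in tokens
    steps_add_reqs = [0, 0]      # both requests join before the first step
    checked_steps = [
        # one reference entry per engine step; values here are placeholders
        {"step": 1, "tkv": 64, "waiting": ["1"], "running": ["0"],
         "request_outputs": ["0"], "finished_requests": []},
        # ...
    ]
    return seqs_max_tokens, prompts_lengths, steps_add_reqs, checked_steps
```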