docs/user_guide/configuration.md: 6 additions & 1 deletion
@@ -40,10 +40,15 @@ export VLLM_SPYRE_WARMUP_NEW_TOKENS=1024,256
### Continuous Batching

!!! attention
Continuous batching is not currently supported on IBM Spyre Accelerators. A CPU-only implementation is available by setting `VLLM_SPYRE_DYNAMO_BACKEND=eager`. Continuous batching can be enabled with `VLLM_SPYRE_USE_CB=1`.
Continuous batching can be enabled with `VLLM_SPYRE_USE_CB=1`.

Continuous batching works much more like other accelerator implementations in vLLM. Requests can be continually appended to a running batch, and requests that finish generating are evicted from the batch to make room for new ones. However, neither chunked prefill nor prefix caching is currently supported, so when a request is added to the running batch, the batch must first be paused while the incoming prompt goes through a full prefill.

Unlike static batching, continuous batching requires no warmup shapes. The prompt lengths do not need to be specified (see `VLLM_SPYRE_WARMUP_PROMPT_LENS` for static batching); instead, the vLLM arguments `max_num_seqs` and `max_tokens` control the maximum batch size and the upper limit on the number of generated tokens (analogous to `VLLM_SPYRE_WARMUP_BATCH_SIZES` and `VLLM_SPYRE_WARMUP_NEW_TOKENS` for static batching).
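
A minimal sketch of how these options might fit together for offline inference, assuming the Python `LLM` entrypoint; the model name and the concrete limits below are illustrative, not recommendations:

```python
import os

# Enable continuous batching (from this guide); set before importing vLLM.
os.environ["VLLM_SPYRE_USE_CB"] = "1"

from vllm import LLM, SamplingParams

# Illustrative model and limits -- substitute your own.
llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",
    max_num_seqs=4,  # maximum batch size (cf. VLLM_SPYRE_WARMUP_BATCH_SIZES)
)

# Upper limit on generated tokens (cf. VLLM_SPYRE_WARMUP_NEW_TOKENS).
params = SamplingParams(max_tokens=256)

outputs = llm.generate(["Tell me about IBM Spyre Accelerators."], params)
print(outputs[0].outputs[0].text)
```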

!!! attention
Currently, the maximum context length supported for continuous batching on IBM Spyre Accelerators is 32K (32,768) tokens, so the length of a submitted prompt plus the number of requested output tokens must stay below 32K. We strongly recommend not setting `max_tokens` too high, so that prompt length plus output tokens remains well below 32K; otherwise there is a risk of performance degradation due to scheduling constraints. One way to budget this is sketched below.
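
The following is one possible way to stay within that budget, assuming a Hugging Face tokenizer for the served model; the model name and the headroom value are arbitrary illustrations:

```python
from transformers import AutoTokenizer

CONTEXT_LIMIT = 32_768   # maximum supported context length for continuous batching
HEADROOM = 4_096         # arbitrary safety margin; tune for your workload

# Hypothetical model name, used only to measure the prompt length in tokens.
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.3-8b-instruct")
prompt = "Summarize the configuration options for continuous batching."
prompt_len = len(tokenizer(prompt).input_ids)

# Cap max_tokens so that prompt length + output tokens stays well below 32K.
max_tokens = min(1024, CONTEXT_LIMIT - HEADROOM - prompt_len)
```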

## Caching Compiled Graphs

`torch_sendnn` supports caching compiled model graphs, which can vastly speed up warmup time when loading models in a distributed setting.