From 860ffad269f4f2e7a54e810adac2a51fda511b8e Mon Sep 17 00:00:00 2001
From: Yannick Schnider
Date: Fri, 26 Sep 2025 20:00:06 +0200
Subject: [PATCH 1/4] update docs continuous batching

Signed-off-by: Yannick Schnider
---
 docs/user_guide/configuration.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/user_guide/configuration.md b/docs/user_guide/configuration.md
index fb4a5d805..82672ae30 100644
--- a/docs/user_guide/configuration.md
+++ b/docs/user_guide/configuration.md
@@ -40,10 +40,15 @@ export VLLM_SPYRE_WARMUP_NEW_TOKENS=1024,256
 ### Continuous Batching
 
 !!! attention
-    Continuous batching is not currently supported on IBM Spyre Accelerators. A CPU-only implementation is available by setting `VLLM_SPYRE_DYNAMO_BACKEND=eager`. Continuous batching can be enabled with `VLLM_SPYRE_USE_CB=1`.
+    Continuous batching can be enabled with `VLLM_SPYRE_USE_CB=1`.
 
 Continuous batching works much more like other accelerator implementations on vLLM. Requests can be continually appended to a running batch, and requests that finish generating can be evicted from the batch to make room for more requests. Neither chunked prefill nor prefix caching are currently supported though, so when a request is added to the running batch it must first be paused for a full prefill of the incoming prompt.
 
+Unlike static batching, no warmup shapes need to be provided for continuous batching. While the user does not have to specify the prompt lengths (see `VLLM_SPYRE_WARMUP_PROMPT_LENS` for static batching), the vLLM arguments `max_num_seqs` and `max_tokens` can be used to control the maximum batch size and the upper limit on the number of generated tokens (analogous to `VLLM_SPYRE_WARMUP_BATCH_SIZES` and `VLLM_SPYRE_WARMUP_NEW_TOKENS` for static batching).
+
+!!! attention
+    Currently the maximal context length for which continuos batching is supported on IBM Spyre Accelerators is 32K (32'768). Therefore the length of the submitted prompts plus the number of requested output tokens should be less than 32K. We strongly recommend not setting the `max_tokens` too high, such that prompt lengths plus output tokens are well below 32K. Otherwise there is a risk of performance degradation due to scheuduling constraints.
+
 ## Caching Compiled Graphs
 
 `torch_sendnn` supports caching compiled model graphs, which can vastly speed up warmup time when loading models in a distributed setting.

From 26500be16637c629ce4456be68ddc5bbb4344e46 Mon Sep 17 00:00:00 2001
From: Yannick Schnider
Date: Fri, 26 Sep 2025 20:09:09 +0200
Subject: [PATCH 2/4] fix spelling

Signed-off-by: Yannick Schnider
---
 docs/user_guide/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_guide/configuration.md b/docs/user_guide/configuration.md
index 82672ae30..a83d7eddc 100644
--- a/docs/user_guide/configuration.md
+++ b/docs/user_guide/configuration.md
@@ -47,7 +47,7 @@ Continuous batching works much more like other accelerator implementations on vL
 Unlike static batching, no warmup shapes need to be provided for continuous batching. While the user does not have to specify the prompt lengths (see `VLLM_SPYRE_WARMUP_PROMPT_LENS` for static batching), the vLLM arguments `max_num_seqs` and `max_tokens` can be used to control the maximum batch size and the upper limit on the number of generated tokens (analogous to `VLLM_SPYRE_WARMUP_BATCH_SIZES` and `VLLM_SPYRE_WARMUP_NEW_TOKENS` for static batching).
 
 !!! attention
-    Currently the maximal context length for which continuos batching is supported on IBM Spyre Accelerators is 32K (32'768). Therefore the length of the submitted prompts plus the number of requested output tokens should be less than 32K. We strongly recommend not setting the `max_tokens` too high, such that prompt lengths plus output tokens are well below 32K. Otherwise there is a risk of performance degradation due to scheuduling constraints.
+    Currently the maximal context length for which continuous batching is supported on IBM Spyre Accelerators is 32K (32'768). Therefore the length of the submitted prompts plus the number of requested output tokens should be less than 32K. We strongly recommend not setting the `max_tokens` too high, such that prompt lengths plus output tokens are well below 32K. Otherwise there is a risk of performance degradation due to scheuduling constraints.
 
 ## Caching Compiled Graphs
 

From 25696cc76543e5c93ff544efd3cdf86ead78e391 Mon Sep 17 00:00:00 2001
From: Yannick Schnider
Date: Mon, 29 Sep 2025 16:25:55 +0200
Subject: [PATCH 3/4] Update docs/user_guide/configuration.md

fix typo

Co-authored-by: Rafael Vasquez
Signed-off-by: Yannick Schnider
---
 docs/user_guide/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_guide/configuration.md b/docs/user_guide/configuration.md
index a83d7eddc..75baf485d 100644
--- a/docs/user_guide/configuration.md
+++ b/docs/user_guide/configuration.md
@@ -47,7 +47,7 @@ Continuous batching works much more like other accelerator implementations on vL
 Unlike static batching, no warmup shapes need to be provided for continuous batching. While the user does not have to specify the prompt lengths (see `VLLM_SPYRE_WARMUP_PROMPT_LENS` for static batching), the vLLM arguments `max_num_seqs` and `max_tokens` can be used to control the maximum batch size and the upper limit on the number of generated tokens (analogous to `VLLM_SPYRE_WARMUP_BATCH_SIZES` and `VLLM_SPYRE_WARMUP_NEW_TOKENS` for static batching).
 
 !!! attention
-    Currently the maximal context length for which continuous batching is supported on IBM Spyre Accelerators is 32K (32'768). Therefore the length of the submitted prompts plus the number of requested output tokens should be less than 32K. We strongly recommend not setting the `max_tokens` too high, such that prompt lengths plus output tokens are well below 32K. Otherwise there is a risk of performance degradation due to scheuduling constraints.
+    Currently the maximal context length for which continuous batching is supported on IBM Spyre Accelerators is 32K (32'768). Therefore the length of the submitted prompts plus the number of requested output tokens should be less than 32K. We strongly recommend not setting the `max_tokens` too high, such that prompt lengths plus output tokens are well below 32K. Otherwise there is a risk of performance degradation due to scheduling constraints.
 
 ## Caching Compiled Graphs
 

From a9fed162ba28f48dce966afe359544215704e6a4 Mon Sep 17 00:00:00 2001
From: Yannick Schnider
Date: Mon, 29 Sep 2025 22:19:18 +0200
Subject: [PATCH 4/4] Update docs/user_guide/configuration.md

use existing convention

Co-authored-by: Prashant Gupta
Signed-off-by: Yannick Schnider
---
 docs/user_guide/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user_guide/configuration.md b/docs/user_guide/configuration.md
index 75baf485d..843aafa67 100644
--- a/docs/user_guide/configuration.md
+++ b/docs/user_guide/configuration.md
@@ -47,7 +47,7 @@ Continuous batching works much more like other accelerator implementations on vL
 Unlike static batching, no warmup shapes need to be provided for continuous batching. While the user does not have to specify the prompt lengths (see `VLLM_SPYRE_WARMUP_PROMPT_LENS` for static batching), the vLLM arguments `max_num_seqs` and `max_tokens` can be used to control the maximum batch size and the upper limit on the number of generated tokens (analogous to `VLLM_SPYRE_WARMUP_BATCH_SIZES` and `VLLM_SPYRE_WARMUP_NEW_TOKENS` for static batching).
 
 !!! attention
-    Currently the maximal context length for which continuous batching is supported on IBM Spyre Accelerators is 32K (32'768). Therefore the length of the submitted prompts plus the number of requested output tokens should be less than 32K. We strongly recommend not setting the `max_tokens` too high, such that prompt lengths plus output tokens are well below 32K. Otherwise there is a risk of performance degradation due to scheduling constraints.
+    Currently the maximal context length for which continuous batching is supported on IBM Spyre Accelerators is 32K (32,768). Therefore the length of the submitted prompts plus the number of requested output tokens should be less than 32K. We strongly recommend not setting the `max_tokens` too high, such that prompt lengths plus output tokens are well below 32K. Otherwise there is a risk of performance degradation due to scheduling constraints.
 
 ## Caching Compiled Graphs
 
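
For reference alongside the patches above, here is a minimal sketch of the continuous batching configuration they document, using vLLM's offline API. It is not part of the patch series; the model name, prompt, and the limit values chosen below are illustrative assumptions, not values taken from the docs.

```python
# Minimal sketch: continuous batching on the vLLM Spyre backend.
# Assumes the vllm-spyre plugin is installed; model and limits are placeholders.
import os

# Enable continuous batching before importing vLLM so the platform plugin
# picks it up (see the docs added in PATCH 1/4).
os.environ["VLLM_SPYRE_USE_CB"] = "1"

from vllm import LLM, SamplingParams

# max_num_seqs bounds the running batch size; max_model_len caps prompt plus
# generated tokens per request, keeping it well below the 32K (32,768) limit.
llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",  # placeholder model
    max_model_len=8192,
    max_num_seqs=4,
)

# Keep max_tokens modest so prompt length + output length stays far from 32K.
params = SamplingParams(max_tokens=256, temperature=0.0)

outputs = llm.generate(["Briefly explain continuous batching."], params)
print(outputs[0].outputs[0].text)
```

When serving instead of running offline, the same limits map to the `--max-num-seqs` and `--max-model-len` engine arguments, with `max_tokens` set per request.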