Merged
2 changes: 1 addition & 1 deletion docker/Dockerfile.amd64
@@ -1,6 +1,6 @@
# This is a reference dockerfile for vLLM Spyre support on an x86 host
ARG BASE_IMAGE_URL="quay.io/ibm-aiu/base"
-ARG BASE_IMAGE_TAG="2025_05_15.amd64"
+ARG BASE_IMAGE_TAG="2025_05_29.amd64"

##############################################
# Base
4 changes: 2 additions & 2 deletions docs/user_guide/configuration.md
@@ -24,7 +24,7 @@ When running decoder models, vLLM Spyre supports a static batching mode and a co

With static batching, graphs are pre-compiled for the configured batch shapes and each batch must finish processing before a new batch can be scheduled. This adds extra constraints on the sizes of inputs and outputs for each request, and requests that do not fit the precompiled graphs will be rejected.

-Static batching mode is enabled by default, and can be explicitly enabled by setting `VLLM_USE_CB=0`.
+Static batching mode is enabled by default, and can be explicitly enabled by setting `VLLM_SPYRE_USE_CB=0`.

!!! caution
There are no up-front checks that the compiled graphs will fit into the available memory on the Spyre cards. If the graphs are too large for the available memory, vLLM will crash during model warmup.
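As a sketch, the static-batching setup described above might be configured like this. Only `VLLM_SPYRE_USE_CB` and `VLLM_SPYRE_WARMUP_NEW_TOKENS` appear in this diff; the other warmup variable names are assumptions about the vLLM Spyre configuration surface.

```shell
# Static batching sketch. VLLM_SPYRE_USE_CB and
# VLLM_SPYRE_WARMUP_NEW_TOKENS appear in this PR; the other
# variable names are assumed for illustration.
export VLLM_SPYRE_USE_CB=0                    # static batching (the default)
export VLLM_SPYRE_WARMUP_PROMPT_LENS=64,1024  # assumed: precompiled prompt lengths
export VLLM_SPYRE_WARMUP_NEW_TOKENS=1024,256  # max new tokens per precompiled graph
export VLLM_SPYRE_WARMUP_BATCH_SIZES=1,4      # assumed: precompiled batch sizes
```

Requests that do not fit one of these precompiled shapes would be rejected, per the constraint described above.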
@@ -40,7 +40,7 @@ export VLLM_SPYRE_WARMUP_NEW_TOKENS=1024,256
### Continuous Batching

!!! attention
-Continuous batching is not currently supported on IBM Spyre Accelerators. A CPU-only implementation is available by setting `VLLM_SPYRE_DYNAMO_BACKEND=eager`. Continuous batching can be enabled with `VLLM_USE_CB=1`.
+Continuous batching is not currently supported on IBM Spyre Accelerators. A CPU-only implementation is available by setting `VLLM_SPYRE_DYNAMO_BACKEND=eager`. Continuous batching can be enabled with `VLLM_SPYRE_USE_CB=1`.

Continuous batching works much more like other accelerator implementations on vLLM. Requests can be continually appended to a running batch, and requests that finish generating can be evicted from the batch to make room for more requests. However, neither chunked prefill nor prefix caching is currently supported, so when a request is added to the running batch, the batch must first be paused for a full prefill of the incoming prompt.
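Combining the two settings named in this diff, the CPU-only continuous batching path described above would be enabled like so:

```shell
# Enable the CPU-only continuous batching path; both variables
# are taken directly from the documentation changed in this PR.
export VLLM_SPYRE_USE_CB=1              # enable continuous batching
export VLLM_SPYRE_DYNAMO_BACKEND=eager  # CPU-only implementation
```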
