Merged
2 changes: 1 addition & 1 deletion docker/Dockerfile.amd64
@@ -1,6 +1,6 @@
# This is a reference dockerfile for vLLM Spyre support on an x86 host
ARG BASE_IMAGE_URL="quay.io/ibm-aiu/base"
-ARG BASE_IMAGE_TAG="2025_05_15.amd64"
+ARG BASE_IMAGE_TAG="2025_05_29.amd64"

##############################################
# Base
4 changes: 2 additions & 2 deletions docs/user_guide/configuration.md
@@ -24,7 +24,7 @@ When running decoder models, vLLM Spyre supports a static batching mode and a co

With static batching, graphs are pre-compiled for the configured batch shapes and each batch must finish processing before a new batch can be scheduled. This adds extra constraints on the sizes of inputs and outputs for each request, and requests that do not fit the precompiled graphs will be rejected.

-Static batching mode is enabled by default, and can be explicitly enabled by setting `VLLM_USE_CB=0`.
+Static batching mode is enabled by default, and can be explicitly enabled by setting `VLLM_SPYRE_USE_CB=0`.

!!! caution
There are no up-front checks that the compiled graphs will fit into the available memory on the Spyre cards. If the graphs are too large for the available memory, vLLM will crash during model warmup.
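As a sketch, the static-batching setup described above might be configured like this. Only `VLLM_SPYRE_USE_CB` and `VLLM_SPYRE_WARMUP_NEW_TOKENS` appear in this diff; the other warmup variable names are assumptions about the vLLM Spyre configuration surface.

```shell
# Static batching sketch. VLLM_SPYRE_USE_CB and
# VLLM_SPYRE_WARMUP_NEW_TOKENS appear in this PR; the other
# variable names are assumed for illustration.
export VLLM_SPYRE_USE_CB=0                    # static batching (the default)
export VLLM_SPYRE_WARMUP_PROMPT_LENS=64,1024  # assumed: precompiled prompt lengths
export VLLM_SPYRE_WARMUP_NEW_TOKENS=1024,256  # max new tokens per precompiled graph
export VLLM_SPYRE_WARMUP_BATCH_SIZES=1,4      # assumed: precompiled batch sizes
```

Requests that do not fit one of these precompiled shapes would be rejected, per the constraint described above.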
@@ -40,7 +40,7 @@ export VLLM_SPYRE_WARMUP_NEW_TOKENS=1024,256
### Continuous Batching

!!! attention
-Continuous batching is not currently supported on IBM Spyre Accelerators. A CPU-only implementation is available by setting `VLLM_SPYRE_DYNAMO_BACKEND=eager`. Continuous batching can be enabled with `VLLM_USE_CB=1`.
+Continuous batching is not currently supported on IBM Spyre Accelerators. A CPU-only implementation is available by setting `VLLM_SPYRE_DYNAMO_BACKEND=eager`. Continuous batching can be enabled with `VLLM_SPYRE_USE_CB=1`.

Continuous batching works much more like other accelerator implementations on vLLM. Requests can be continually appended to a running batch, and requests that finish generating can be evicted from the batch to make room for more requests. However, neither chunked prefill nor prefix caching is currently supported, so when a request is added to the running batch, the batch must first be paused for a full prefill of the incoming prompt.
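Combining the two settings named in this diff, the CPU-only continuous batching path described above would be enabled like so:

```shell
# Enable the CPU-only continuous batching path; both variables
# are taken directly from the documentation changed in this PR.
export VLLM_SPYRE_USE_CB=1              # enable continuous batching
export VLLM_SPYRE_DYNAMO_BACKEND=eager  # CPU-only implementation
```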
