Conversation

yannicks1
Collaborator

@yannicks1 yannicks1 commented Mar 31, 2025

Support for continuous batching

This is a first working implementation of continuous batching!

Changes:

  • Introduces env variables: VLLM_SPYRE_USE_CB (True for continuous / False for static batching), VLLM_SPYRE_MAX_BATCH_SIZE (maximum batch size supported by the AIU Spyre), and VLLM_SPYRE_MAX_CONTEXT_LENGTH (maximum context length supported for the model on the AIU Spyre)
  • Introduces FmsModelWrapper (continuous batching) and FmsModelPseudoWrapper (static batching) to emulate KV cache handling on the AIU Spyre.
  • Introduces continuous-batching-specific classes: ContinuousBatchingSpyreScheduler and ContinuousBatchingSpyreModelRunner (and StaticBatchingSpyreModelRunner for static batching)

Example code demonstrating continuous batching:
examples/offline_inference_spyre_cb_test.py

yannicks1 and others added 5 commits March 19, 2025 17:26
* fms wrapper dummy for continuous batching implementation, gating via env var VLLM_SPYRE_USE_CB

Signed-off-by: Yannick Schnider <[email protected]>

* implementing fms wrapper with correct KV cache management

Signed-off-by: Yannick Schnider <[email protected]>

* disable prints by default

Signed-off-by: Yannick Schnider <[email protected]>

* code refactoring fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* fix default path not using CB/ fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* correct print when TESTING_CB

Signed-off-by: Yannick Schnider <[email protected]>

* remove self.past_key_value_states when KV cache is managed by FMS wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* read-out only active pages of KV cache (covers when curr batch size < max batch size)

Signed-off-by: Yannick Schnider <[email protected]>

* uniquely distinguishing prefills and decodes

Signed-off-by: Yannick Schnider <[email protected]>

* reading kv cache dimension from model config

Signed-off-by: Yannick Schnider <[email protected]>

* cosmetics and comments

Signed-off-by: Yannick Schnider <[email protected]>

* support for gpt big code models

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix hard coded test mask

Signed-off-by: Yannick Schnider <[email protected]>

* change KV cache type for prefill

Signed-off-by: Yannick Schnider <[email protected]>

* update tkv in fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* moving fms wrapper to own class

Signed-off-by: Yannick Schnider <[email protected]>

* reset tkv for new prompt

Signed-off-by: Yannick Schnider <[email protected]>

* ignoring test_spyre_tensor_parallel.py, since FMS wrapper does not support it

Signed-off-by: Yannick Schnider <[email protected]>

* removing VLLM_SPYRE_USE_CB, since FMS wrapper is now used by default

Signed-off-by: Yannick Schnider <[email protected]>

* typing fms wrapper class

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
* introducing pseudo fms wrapper for static batching

Signed-off-by: Yannick Schnider <[email protected]>

* small bug fix

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes:

pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

yannicks1 and others added 3 commits March 31, 2025 15:08
* introducing env variables for AIU Spyre KV cache dimensions

Signed-off-by: Yannick Schnider <[email protected]>

* removing prints

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
@yannicks1 yannicks1 changed the title [Draft][Continuous Batching] Support for continuous batching [Continuous Batching] Support for continuous batching on AIU Spyre Apr 7, 2025
@yannicks1 yannicks1 marked this pull request as ready for review April 7, 2025 15:59
nikolaospapandreou and others added 2 commits April 7, 2025 20:37
* initial cb test

Signed-off-by: Nikolaos Papandreou <[email protected]>

* make tkv, active_pages optional in SpyreCausalLM class for the V0 tests

Signed-off-by: Nikolaos Papandreou <[email protected]>

* format

Signed-off-by: Nikolaos Papandreou <[email protected]>

* remove manual testing and fix formatting

Signed-off-by: Yannick Schnider <[email protected]>

* remove tkv2fms

Signed-off-by: Yannick Schnider <[email protected]>

* remove unnecessary class variables

Signed-off-by: Yannick Schnider <[email protected]>

* tidy up class variables

Signed-off-by: Yannick Schnider <[email protected]>

* simplify code: req_ids2idx and active_pages will be reset in prepare input anyway...

Signed-off-by: Yannick Schnider <[email protected]>

* renaming variable

Signed-off-by: Yannick Schnider <[email protected]>

* removing batch padding in prefill stage

Signed-off-by: Yannick Schnider <[email protected]>

* indices always list of Trues since no padding or removed sequences...

Signed-off-by: Yannick Schnider <[email protected]>

* fix active/free page handling

Signed-off-by: Yannick Schnider <[email protected]>

* avoiding unnecessary tensor construction

Signed-off-by: Yannick Schnider <[email protected]>

* fix sorting indifference token/position_ids vs masks

Signed-off-by: Yannick Schnider <[email protected]>

* refactoring not requiring req_ids2idx

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused class variables, simplifying code

Signed-off-by: Yannick Schnider <[email protected]>

* use VLLM_SPYRE_MAX_BATCH_SIZE to control (decoding) batch size on AIU Spyre

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary helper functions for schedule and add_request

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused argument

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
@tdoublep tdoublep force-pushed the dev-continuous-batching branch from ea6076c to 1cbcfaf Compare April 7, 2025 18:38
Member

@tdoublep tdoublep left a comment

Some minor comments, but looks basically fine imo

Comment on lines +10 to +11
VLLM_SPYRE_MAX_BATCH_SIZE: int = 0
VLLM_SPYRE_MAX_CONTEXT_LENGTH: int = 0
Member

Why do we need these environment variables? Can't we use max-num-seqs and max-model-len directly?

Collaborator Author

I believe it was agreed on in the meeting with the compiler team that these should be env variables. They will be used on their end too...

Collaborator

🤔 🤔 🤔
But the compiler shouldn't be looking at vLLM-specific environment variables, right? That seems like coupling in the wrong way since vllm is a consumer of the compiler, not the other way around. What I would naively expect is that if the compiler requires some env vars to be set, then we would take care of setting them in the plugin code based on vLLM's configuration.

Also, IIUC these values are all currently derivable from the provided warmup shapes, right? So requiring users to configure them here is confusing, and can lead to broken configurations like

VLLM_SPYRE_WARMUP_BATCH_SIZES=1,2,3
VLLM_SPYRE_MAX_BATCH_SIZE=2

Collaborator

Ah, after looking at the scheduler I see that it looks like we're no longer using the static warmup shapes for scheduling with continuous batching. Are those now going to be a relic of the past?

That would be super nice, though I would still say we should be using vllm's existing --max-model-len and --max-num-seqs to keep a single source of configuration for these values

Collaborator Author

Yes, warmup shapes will be a relic of the past as we move towards supporting dynamic dimensions. AFAIK, the way of communicating between the compiler and vLLM is not yet fully determined, and it was decided in one of the meetings with the compiler team that (for the time being) there will be two env variables used for sharing information between vLLM and the compiler. I do agree that they eventually will be set by the compiler, but as we emulate on CPU here (hence no AIU Spyre compiler involved), we simply set them ourselves.
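The direction suggested above (the plugin deriving the env variables from vLLM's own configuration rather than users setting them separately) could look roughly like this sketch. The function name is hypothetical and this is not code from the PR; it only illustrates keeping --max-num-seqs / --max-model-len as the single source of configuration.

```python
import os


def sync_spyre_env_from_vllm_config(max_num_seqs: int, max_model_len: int) -> None:
    """Hypothetical plugin-side hook: derive the Spyre env variables from
    vLLM's configuration so users never set them by hand (and cannot set
    them inconsistently with the warmup shapes)."""
    # setdefault leaves an explicitly provided value untouched, e.g. when
    # emulating on CPU without the AIU Spyre compiler involved.
    os.environ.setdefault("VLLM_SPYRE_MAX_BATCH_SIZE", str(max_num_seqs))
    os.environ.setdefault("VLLM_SPYRE_MAX_CONTEXT_LENGTH", str(max_model_len))
```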

Collaborator Author

Would it be okay to address the proper args handling in another PR? To me it is not straightforward to see why we have/need two calls to check_and_update_config in platform.py and why scheduler_config.max_num_seqs varies between the two. Also, this is not specific to this branch (it happens on main too). Of course, if anyone has an immediate solution, I am happy to include it here :)

Member

@tdoublep tdoublep Apr 9, 2025

Yes, we can address it as follow up, fine with me.

Are those now going to be a relic of the past?

And to address @joerunde's question here: yes, the warmup shapes will be a relic of the past. Things start to become much more similar to how it works on GPU.

Collaborator

yes, the warmup shapes will be a relic of the past

nice!

Collaborator

I have found the issue as to why there are two calls to check_and_update_config in platform.py - will update shortly!

Collaborator

Check out #114

len(self.running))

outputs = super(SpyreScheduler, self).schedule()
return outputs
Collaborator

@tjohnson31415 just pushed a bugfix to the SpyreScheduler, we need to put the holdback queue back into waiting between scheduling iterations: #64

Collaborator Author

I merged main into this branch. It should be fine, but maybe @tjohnson31415 can still quickly check that the merge didn't break any of his code.

Collaborator

The merged code brought in the change for the holdback queue, so that looks good!

I attempted to run my load-then-cancel test on the CB impl as well, though it crashes after a few requests with:

ERROR 04-10 00:08:57 [core.py:340]   File "/.../v1/worker/spyre_model_runner.py", line 571, in _prepare_prompt
ERROR 04-10 00:08:57 [core.py:340]     free_page_idx = self.free_pages.pop(0)
ERROR 04-10 00:08:57 [core.py:340]                     ^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-10 00:08:57 [core.py:340] IndexError: pop from empty list
ERROR 04-10 00:08:57 [core.py:340] 

It seems that the max batch size limits the number of requests the server can process 🤔
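The crash above comes from allocating a KV cache page with an unconditional free_pages.pop(0). A toy sketch of the page bookkeeping (class and method names hypothetical, not the PR's implementation) shows both the failure mode and why releasing pages on finish *and* cancellation matters:

```python
class PagePool:
    """Toy model of per-sequence KV cache page bookkeeping on AIU Spyre."""

    def __init__(self, max_batch_size: int):
        self.free_pages = list(range(max_batch_size))
        self.active_pages: dict[str, int] = {}

    def allocate(self, req_id: str) -> int:
        if not self.free_pages:
            # Without this guard, pop(0) raises IndexError: pop from empty list,
            # which is exactly the crash seen in the load-then-cancel test.
            raise RuntimeError("no free KV cache page; request must wait")
        page = self.free_pages.pop(0)
        self.active_pages[req_id] = page
        return page

    def release(self, req_id: str) -> None:
        # Must be called for finished AND cancelled requests,
        # otherwise pages leak and the pool eventually runs dry.
        page = self.active_pages.pop(req_id, None)
        if page is not None:
            self.free_pages.append(page)
```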

request.sampling_params = SamplingParams(max_tokens=1)

# delegate to super
super(SpyreScheduler, self).add_request(request=request)
Collaborator

Ah, I see here that we're delegating to the super of SpyreScheduler, so the base v1 scheduler. Might be worth a comment explaining that for those not familiar with the exact behavior of super
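To make the delegation explicit, a minimal standalone demonstration of what super(SpyreScheduler, self) resolves to (BaseScheduler here is a stand-in for vLLM's V1 scheduler, not the real class):

```python
class BaseScheduler:
    """Stand-in for vLLM's base V1 scheduler."""

    def add_request(self, request):
        return f"base scheduler handled {request}"


class SpyreScheduler(BaseScheduler):
    def add_request(self, request):
        # super(SpyreScheduler, self) starts the method lookup *after*
        # SpyreScheduler in the MRO, i.e. on the base V1 scheduler.
        # In Python 3 this is equivalent to the zero-argument form super().
        return super(SpyreScheduler, self).add_request(request)
```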

self._prepare_decode(scheduler_output.scheduled_cached_reqs)
num_reqs = len(scheduler_output.scheduled_cached_reqs)

# TODO: Build the rest of the SamplingMetadata correctly
Collaborator

Nice, I smell some good work for sampling master @wallashss in the near future

Collaborator

Sure! Count on me! I am also reviewing this code; I can make a follow-up after we merge this PR to fix this.

@joerunde
Collaborator

joerunde commented Apr 8, 2025

+1 on Tom's review, this is great work: very nicely separated implementations that will let us iterate on this going forward. I care a bit more about making sure we agree on configuration going forward, just so we don't end up flip-flopping on it, but if y'all want to merge first and figure that out later, that's not a blocker.

Member

@tdoublep tdoublep left a comment

LGTM

@yannicks1 yannicks1 merged commit 592049d into main Apr 9, 2025
7 checks passed
@yannicks1 yannicks1 deleted the dev-continuous-batching branch April 9, 2025 13:20
rafvasq pushed a commit to rafvasq/vllm-spyre that referenced this pull request Apr 11, 2025
…llm-project#66)

* [Continuous batching] FMS model wrapper (vllm-project#18)

* moving model loading into FMS wrapper (vllm-project#35)

* bugfix idx kv cache update (vllm-project#40)

* FMS Wrapper for static batching (vllm-project#39)

* [Continuous Batching] Introducing new env variables (vllm-project#67)

* [Continuous batching] Initial cb test (vllm-project#52)

* re-enabling TP tests

Signed-off-by: Yannick Schnider <[email protected]>

* addressing feedback: renaming and removing unused stuff

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary getter function and other feedback

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Nikolaos Papandreou <[email protected]>
Co-authored-by: Nikolaos Papandreou <[email protected]>
Comment on lines +288 to +290
return len(self.total_running)+len(self.waiting) <\
self.max_num_running_reqs and\
len(self.waiting) < max_prompt_batch_size
Collaborator

Sorry for the belated question:

What's the purpose of len(self.waiting) < max_prompt_batch_size, and why is max_prompt_batch_size hardcoded to 1? The previous condition already guarantees that we have at least one slot left to schedule a new request; since we test this for each request, that should be enough, right?

I guess this way we will not be able to schedule more than one request in a step, even when self.max_num_running_reqs is a high number.
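Restating the quoted condition as a standalone function (argument names assumed from the snippet; this is a sketch, not the PR's code) makes the observed behavior easy to check: with max_prompt_batch_size hardcoded to 1, a second waiting request is rejected in the same step even when capacity remains.

```python
def can_schedule(n_running: int, n_waiting: int,
                 max_num_running_reqs: int,
                 max_prompt_batch_size: int = 1) -> bool:
    # First clause: total requests must stay below the running-request cap.
    # Second clause: with max_prompt_batch_size == 1, at most one new
    # request (one prefill) can be admitted per scheduling step.
    return (n_running + n_waiting < max_num_running_reqs
            and n_waiting < max_prompt_batch_size)
```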
