Conversation

yannicks1
Collaborator

@yannicks1 yannicks1 commented Mar 31, 2025

Support for continuous batching

This is a first working implementation of continuous batching!

Changes:

  • Introduces env variables: VLLM_SPYRE_USE_CB (True for continuous / False for static batching), VLLM_SPYRE_MAX_BATCH_SIZE (maximum batch size supported by the AIU Spyre), and VLLM_SPYRE_MAX_CONTEXT_LENGTH (maximum context length supported for the model on the AIU Spyre)
  • Introduces FmsModelWrapper (continuous batching) and FmsModelPseudoWrapper (static batching) to emulate KV cache handling on the AIU Spyre.
  • Introduces continuous-batching-specific classes: ContinuousBatchingSpyreScheduler and ContinuousBatchingSpyreModelRunner (and StaticBatchingSpyreModelRunner for static batching)

Example code demonstrating continuous batching:
examples/offline_inference_spyre_cb_test.py

yannicks1 and others added 5 commits March 19, 2025 17:26
* fms wrapper dummy for continuous batching implementation, gating via env var VLLM_SPYRE_USE_CB

Signed-off-by: Yannick Schnider <[email protected]>

* implementing fms wrapper with correct KV cache management

Signed-off-by: Yannick Schnider <[email protected]>

* disable prints by default

Signed-off-by: Yannick Schnider <[email protected]>

* code refactoring fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* fix default path not using CB/ fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* correct print when TESTING_CB

Signed-off-by: Yannick Schnider <[email protected]>

* remove self.past_key_value_states when KV cache is managed by FMS wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* read-out only active pages of KV cache (covers when curr batch size < max batch size)

Signed-off-by: Yannick Schnider <[email protected]>

* uniquely distinguishing prefills and decodes

Signed-off-by: Yannick Schnider <[email protected]>

* reading kv cache dimension from model config

Signed-off-by: Yannick Schnider <[email protected]>

* cosmetics and comments

Signed-off-by: Yannick Schnider <[email protected]>

* support for gpt big code models

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix hard coded test mask

Signed-off-by: Yannick Schnider <[email protected]>

* change KV cache type for prefill

Signed-off-by: Yannick Schnider <[email protected]>

* update tkv in fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* moving fms wrapper to own class

Signed-off-by: Yannick Schnider <[email protected]>

* reset tkv for new prompt

Signed-off-by: Yannick Schnider <[email protected]>

* ignoring test_spyre_tensor_parallel.py, since FMS wrapper does not support it

Signed-off-by: Yannick Schnider <[email protected]>

* removing VLLM_SPYRE_USE_CB, since FMS wrapper is now used by default

Signed-off-by: Yannick Schnider <[email protected]>

* typing fms wrapper class

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
* introducing pseudo fms wrapper for static batching

Signed-off-by: Yannick Schnider <[email protected]>

* small bug fix

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes:

pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

yannicks1 and others added 3 commits March 31, 2025 15:08
* introducing env variables for AIU Spyre KV cache dimensions

Signed-off-by: Yannick Schnider <[email protected]>

* removing prints

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
@yannicks1 yannicks1 changed the title [Draft][Continuous Batching] Support for continuous batching [Continuous Batching] Support for continuous batching on AIU Spyre Apr 7, 2025
@yannicks1 yannicks1 marked this pull request as ready for review April 7, 2025 15:59
nikolaospapandreou and others added 2 commits April 7, 2025 20:37
* initial cb test

Signed-off-by: Nikolaos Papandreou <[email protected]>

* make tkv, active_pages optional in SpyreCausalLM class for the V0 tests

Signed-off-by: Nikolaos Papandreou <[email protected]>

* format

Signed-off-by: Nikolaos Papandreou <[email protected]>

* remove manual testing and fix formatting

Signed-off-by: Yannick Schnider <[email protected]>

* remove tkv2fms

Signed-off-by: Yannick Schnider <[email protected]>

* remove unnecessary class variables

Signed-off-by: Yannick Schnider <[email protected]>

* tidy up class variables

Signed-off-by: Yannick Schnider <[email protected]>

* simplify code: req_ids2idx and active_pages will be reset in prepare input anyway...

Signed-off-by: Yannick Schnider <[email protected]>

* renaming variable

Signed-off-by: Yannick Schnider <[email protected]>

* removing batch padding in prefill stage

Signed-off-by: Yannick Schnider <[email protected]>

* indices always list of Trues since no padding or removed sequences...

Signed-off-by: Yannick Schnider <[email protected]>

* fix active/free page handling

Signed-off-by: Yannick Schnider <[email protected]>

* avoiding unnecessary tensor construction

Signed-off-by: Yannick Schnider <[email protected]>

* fix sorting indifference token/position_ids vs masks

Signed-off-by: Yannick Schnider <[email protected]>

* refactoring not requiring req_ids2idx

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused class variables, simplifying code

Signed-off-by: Yannick Schnider <[email protected]>

* use VLLM_SPYRE_MAX_BATCH_SIZE to control (decoding) batch size on AIU Spyre

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary helper functions for schedule and add_request

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused argument

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
@tdoublep tdoublep force-pushed the dev-continuous-batching branch from ea6076c to 1cbcfaf Compare April 7, 2025 18:38
Member

@tdoublep tdoublep left a comment

Some minor comments, but looks basically fine imo

Comment on lines +10 to +11
VLLM_SPYRE_MAX_BATCH_SIZE: int = 0
VLLM_SPYRE_MAX_CONTEXT_LENGTH: int = 0
Member

Why do we need these environment variables? Can't we use max-num-seqs and max-model-len directly?

Collaborator Author

I believe it was agreed on in the meeting with the compiler team that these should be env variables. They will be used on their end too...

Collaborator

🤔 🤔 🤔
But the compiler shouldn't be looking at vLLM-specific environment variables, right? That seems like coupling in the wrong way since vllm is a consumer of the compiler, not the other way around. What I would naively expect is that if the compiler requires some env vars to be set, then we would take care of setting them in the plugin code based on vLLM's configuration.

Also, IIUC these values are all currently derivable from the provided warmup shapes, right? So requiring users to configure them here is confusing, and can lead to broken configurations like

VLLM_SPYRE_WARMUP_BATCH_SIZES=1,2,3
VLLM_SPYRE_MAX_BATCH_SIZE=2

Collaborator

Ah, after looking at the scheduler I see that it looks like we're no longer using the static warmup shapes for scheduling with continuous batching. Are those now going to be a relic of the past?

That would be super nice, though I would still say we should be using vllm's existing --max-model-len and --max-num-seqs to keep a single source of configuration for these values

Collaborator Author

Yes, warmup shapes will be a relic of the past as we move towards supporting dynamic dimensions. AFAIK, the way of communicating between the compiler and vLLM is not yet fully determined, and it was decided in one of the meetings with the compiler team that (for the time being) there will be two env variables used for sharing information between vLLM and the compiler. I do agree that they eventually will be set by the compiler, but as we emulate on CPU here (hence no AIU Spyre compiler involved), we simply set them ourselves.
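The direction suggested above (the plugin deriving the env variables from vLLM's own configuration rather than users setting them separately) could look roughly like this sketch. The function name is hypothetical and this is not code from the PR; it only illustrates keeping --max-num-seqs / --max-model-len as the single source of configuration.

```python
import os


def sync_spyre_env_from_vllm_config(max_num_seqs: int, max_model_len: int) -> None:
    """Hypothetical plugin-side hook: derive the Spyre env variables from
    vLLM's configuration so users never set them by hand (and cannot set
    them inconsistently with the warmup shapes)."""
    # setdefault leaves an explicitly provided value untouched, e.g. when
    # emulating on CPU without the AIU Spyre compiler involved.
    os.environ.setdefault("VLLM_SPYRE_MAX_BATCH_SIZE", str(max_num_seqs))
    os.environ.setdefault("VLLM_SPYRE_MAX_CONTEXT_LENGTH", str(max_model_len))
```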

Collaborator Author

Would it be okay to address the proper args handling in another PR? To me it is not straightforward to see why we have/need two calls to check_and_update_config in platform.py and why scheduler_config.max_num_seqs varies between the two. Also, this is not specific to this branch (it happens on main too). Of course, if anyone has an immediate solution, I am happy to include it here :)

Member

@tdoublep tdoublep Apr 9, 2025

Yes, we can address it as follow up, fine with me.

Are those now going to be a relic of the past?

And to address @joerunde's question here: yes, the warmup shapes will be a relic of the past. Things start to become much more similar to how it works on GPU.

Collaborator

yes, the warmup shapes will be a relic of the past

nice!

Collaborator

I have found the issue as to why there are two calls to check_and_update_config in platform.py - will update shortly!

Collaborator

Check out #114

len(self.running))

outputs = super(SpyreScheduler, self).schedule()
return outputs
Collaborator

@tjohnson31415 just pushed a bugfix to the SpyreScheduler, we need to put the holdback queue back into waiting between scheduling iterations: #64

Collaborator Author

I merged main into this branch. It should be fine, but maybe @tjohnson31415 can still quickly check that the merge didn't break any of his code.

Collaborator

The merged code brought in the change for the holdback queue, so that looks good!

I attempted to run my load-then-cancel test on the CB impl as well, though it crashes after a few requests with:

ERROR 04-10 00:08:57 [core.py:340]   File "/.../v1/worker/spyre_model_runner.py", line 571, in _prepare_prompt
ERROR 04-10 00:08:57 [core.py:340]     free_page_idx = self.free_pages.pop(0)
ERROR 04-10 00:08:57 [core.py:340]                     ^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-10 00:08:57 [core.py:340] IndexError: pop from empty list
ERROR 04-10 00:08:57 [core.py:340] 

It seems that the max batch size limits the number of requests the server can process 🤔
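The crash above comes from allocating a KV cache page with an unconditional free_pages.pop(0). A toy sketch of the page bookkeeping (class and method names hypothetical, not the PR's implementation) shows both the failure mode and why releasing pages on finish *and* cancellation matters:

```python
class PagePool:
    """Toy model of per-sequence KV cache page bookkeeping on AIU Spyre."""

    def __init__(self, max_batch_size: int):
        self.free_pages = list(range(max_batch_size))
        self.active_pages: dict[str, int] = {}

    def allocate(self, req_id: str) -> int:
        if not self.free_pages:
            # Without this guard, pop(0) raises IndexError: pop from empty list,
            # which is exactly the crash seen in the load-then-cancel test.
            raise RuntimeError("no free KV cache page; request must wait")
        page = self.free_pages.pop(0)
        self.active_pages[req_id] = page
        return page

    def release(self, req_id: str) -> None:
        # Must be called for finished AND cancelled requests,
        # otherwise pages leak and the pool eventually runs dry.
        page = self.active_pages.pop(req_id, None)
        if page is not None:
            self.free_pages.append(page)
```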

request.sampling_params = SamplingParams(max_tokens=1)

# delegate to super
super(SpyreScheduler, self).add_request(request=request)
Collaborator

Ah, I see here that we're delegating to the super of SpyreScheduler, so the base v1 scheduler. Might be worth a comment explaining that for those not familiar with the exact behavior of super
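To make the delegation explicit, a minimal standalone demonstration of what super(SpyreScheduler, self) resolves to (BaseScheduler here is a stand-in for vLLM's V1 scheduler, not the real class):

```python
class BaseScheduler:
    """Stand-in for vLLM's base V1 scheduler."""

    def add_request(self, request):
        return f"base scheduler handled {request}"


class SpyreScheduler(BaseScheduler):
    def add_request(self, request):
        # super(SpyreScheduler, self) starts the method lookup *after*
        # SpyreScheduler in the MRO, i.e. on the base V1 scheduler.
        # In Python 3 this is equivalent to the zero-argument form super().
        return super(SpyreScheduler, self).add_request(request)
```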

self._prepare_decode(scheduler_output.scheduled_cached_reqs)
num_reqs = len(scheduler_output.scheduled_cached_reqs)

# TODO: Build the rest of the SamplingMetadata correctly
Collaborator

Nice, I smell some good work for sampling master @wallashss in the near future

Collaborator

Sure! Count on me! I am also reviewing this code; I can make a follow-up after we merge this PR to fix this.

@joerunde
Collaborator

joerunde commented Apr 8, 2025

+1 on Tom's review, this is great work: very nicely separated implementations that will let us iterate on this going forward. I care a bit more about making sure we agree on configuration going forward, just so we don't end up flip-flopping on it, but if y'all want to merge first and figure that out later, that's not a blocker.

Member

@tdoublep tdoublep left a comment

LGTM

@yannicks1 yannicks1 merged commit 592049d into main Apr 9, 2025
7 checks passed
@yannicks1 yannicks1 deleted the dev-continuous-batching branch April 9, 2025 13:20
rafvasq pushed a commit to rafvasq/vllm-spyre that referenced this pull request Apr 11, 2025
…llm-project#66)

* [Continuous batching] FMS model wrapper (vllm-project#18)

* moving model loading into FMS wrapper (vllm-project#35)

* bugfix idx kv cache update (vllm-project#40)

* FMS Wrapper for static batching (vllm-project#39)

* [Continuous Batching] Introducing new env variables (vllm-project#67)

* [Continuous batching] Initial cb test (vllm-project#52)

* re-enabling TP tests

Signed-off-by: Yannick Schnider <[email protected]>

* addressing feedback: renaming and removing unused stuff

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary getter function and other feedback

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Nikolaos Papandreou <[email protected]>
Co-authored-by: Nikolaos Papandreou <[email protected]>
Comment on lines +288 to +290
return len(self.total_running)+len(self.waiting) <\
self.max_num_running_reqs and\
len(self.waiting) < max_prompt_batch_size
Collaborator

Sorry for the belated question:

What's the purpose of len(self.waiting) < max_prompt_batch_size, and why is max_prompt_batch_size hardcoded to 1? The previous condition already guarantees that we have at least one slot left to schedule a new request; since we test this for each request, that should be enough, right?

I guess this way we will not be able to schedule more than one request in a step, even when self.max_num_running_reqs is a high number.
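Restating the quoted condition as a standalone function (argument names assumed from the snippet; this is a sketch, not the PR's code) makes the observed behavior easy to check: with max_prompt_batch_size hardcoded to 1, a second waiting request is rejected in the same step even when capacity remains.

```python
def can_schedule(n_running: int, n_waiting: int,
                 max_num_running_reqs: int,
                 max_prompt_batch_size: int = 1) -> bool:
    # First clause: total requests must stay below the running-request cap.
    # Second clause: with max_prompt_batch_size == 1, at most one new
    # request (one prefill) can be admitted per scheduling step.
    return (n_running + n_waiting < max_num_running_reqs
            and n_waiting < max_prompt_batch_size)
```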
