[Continuous Batching] Support for continuous batching on AIU Spyre #66
Conversation
* fms wrapper dummy for continuous batching implementation, gating via env var VLLM_SPYRE_USE_CB
* implementing fms wrapper with correct KV cache management
* disable prints by default
* code refactoring fms wrapper
* fix default path not using CB / fms wrapper
* correct print when TESTING_CB
* remove self.past_key_value_states when KV cache is managed by FMS wrapper
* read out only active pages of KV cache (covers when curr batch size < max batch size)
* uniquely distinguishing prefills and decodes
* reading kv cache dimension from model config
* cosmetics and comments
* support for gpt big code models
* bugfix hard coded test mask
* change KV cache type for prefill
* update tkv in fms wrapper
* moving fms wrapper to own class
* reset tkv for new prompt
* ignoring test_spyre_tensor_parallel.py, since FMS wrapper does not support it
* removing VLLM_SPYRE_USE_CB, since FMS wrapper is now used by default
* typing fms wrapper class

Signed-off-by: Yannick Schnider <[email protected]>
* introducing pseudo fms wrapper for static batching
* small bug fix
* bugfix idx kv cache update

Signed-off-by: Yannick Schnider <[email protected]>
👋 Hi! Thank you for contributing to vLLM support on Spyre.
Now you are good to go 🚀
* introducing env variables for AIU Spyre KV cache dimensions
* removing prints

Signed-off-by: Yannick Schnider <[email protected]>
* initial cb test
* make tkv, active_pages optional in SpyreCausalLM class for the V0 tests
* format
* remove manual testing and fix formatting
* remove tkv2fms
* remove unnecessary class variables
* tidy up class variables
* simplify code: req_ids2idx and active_pages will be reset in prepare input anyway...
* renaming variable
* removing batch padding in prefill stage
* indices always list of Trues since no padding or removed sequences...
* fix active/free page handling
* avoiding unnecessary tensor construction
* fix sorting indifference token/position_ids vs masks
* refactoring not requiring req_ids2idx
* removing unused class variables, simplifying code
* use VLLM_SPYRE_MAX_BATCH_SIZE to control (decoding) batch size on AIU Spyre
* removing unnecessary helper functions for schedule and add_request
* removing unused argument

Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>
ea6076c to 1cbcfaf
Some minor comments, but looks basically fine imo
VLLM_SPYRE_MAX_BATCH_SIZE: int = 0
VLLM_SPYRE_MAX_CONTEXT_LENGTH: int = 0
Why do we need these environment variables? Can't we use `max-num-seqs` and `max-model-len` directly?
I believe it was agreed on in the meeting with the compiler team that these should be env variables. They will be used on their end too...
🤔 🤔 🤔
But the compiler shouldn't be looking at vLLM-specific environment variables, right? That seems like coupling in the wrong way since vllm is a consumer of the compiler, not the other way around. What I would naively expect is that if the compiler requires some env vars to be set, then we would take care of setting them in the plugin code based on vLLM's configuration.
Also, IIUC these values are all currently derivable from the provided warmup shapes, right? So requiring users to configure them here is confusing, and can lead to broken configurations like
VLLM_SPYRE_WARMUP_BATCH_SIZES=1,2,3
VLLM_SPYRE_MAX_BATCH_SIZE=2
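For illustration only, here is a minimal sketch (not this PR's actual code) of how the plugin could derive these values from vLLM's existing configuration inside its platform hook, assuming vLLM's `VllmConfig` with `scheduler_config`/`model_config` attributes:

```python
# Hypothetical sketch: populate the Spyre env vars from vLLM's own config so
# users only configure --max-num-seqs / --max-model-len in one place.
import os

from vllm.config import VllmConfig


def check_and_update_config(vllm_config: VllmConfig) -> None:
    # --max-num-seqs already bounds the decode batch size
    os.environ.setdefault("VLLM_SPYRE_MAX_BATCH_SIZE",
                          str(vllm_config.scheduler_config.max_num_seqs))
    # --max-model-len already bounds the context length
    os.environ.setdefault("VLLM_SPYRE_MAX_CONTEXT_LENGTH",
                          str(vllm_config.model_config.max_model_len))
```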
Ah, after looking at the scheduler, it looks like we're no longer using the static warmup shapes for scheduling with continuous batching. Are those now going to be a relic of the past?
That would be super nice, though I would still say we should be using vLLM's existing `--max-model-len` and `--max-num-seqs` to keep a single source of configuration for these values.
Yes, warmup shapes will be a relic of the past as we move towards supporting dynamic dimensions. AFAIK, the way of communication between the compiler and vLLM is not yet fully determined, and it was decided in one of the meetings with the compiler team that (for the time being) there will be two env variables used for sharing information between vLLM and the compiler. I do agree that they will eventually be set by the compiler, but as we emulate on CPU here (hence no AIU Spyre compiler involved), we simply set them ourselves.
Would it be okay to address the proper args handling in another PR? To me it is not straightforward to see why we have/need two calls to `check_and_update_config` in `platform.py`, and why `scheduler_config.max_num_seqs` varies between the two. Also, this is not specific to this branch (it happens on main too). Of course, if anyone has an immediate solution, I am happy to include it here :)
Yes, we can address it as a follow-up, fine with me.

> Are those now going to be a relic of the past?

And to address @joerunde's question here: yes, the warmup shapes will be a relic of the past. Things start to become much more similar to how it works on GPU.
> yes, the warmup shapes will be a relic of the past

nice!
I have found the issue as to why there are two calls to `check_and_update_config` in `platform.py` - will update shortly!
Check out #114
Signed-off-by: Yannick Schnider <[email protected]>
len(self.running))

outputs = super(SpyreScheduler, self).schedule()
return outputs
@tjohnson31415 just pushed a bugfix to the SpyreScheduler, we need to put the holdback queue back into waiting between scheduling iterations: #64
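For context, a minimal self-contained sketch of the holdback-queue pattern that fix restores (attribute names here are assumptions, not necessarily the exact code in #64):

```python
# Requests that cannot be scheduled yet are parked in a holdback queue; they
# must be moved back into `waiting` at the start of every scheduling iteration,
# otherwise the base scheduler never reconsiders them.
from collections import deque


class HoldbackQueueMixin:
    waiting: deque
    holdback_queue: deque

    def restore_holdback(self) -> None:
        while self.holdback_queue:
            self.waiting.append(self.holdback_queue.popleft())
```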
I merged main into this branch. Should be fine, but maybe @tjohnson31415 can still quickly check that the merge didn't break any of his code.
The merged code brought in the change for the holdback queue, so that looks good!
I attempted to run my load-then-cancel test on the CB impl as well, though it crashes after a few requests with:
ERROR 04-10 00:08:57 [core.py:340] File "/.../v1/worker/spyre_model_runner.py", line 571, in _prepare_prompt
ERROR 04-10 00:08:57 [core.py:340] free_page_idx = self.free_pages.pop(0)
ERROR 04-10 00:08:57 [core.py:340] ^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-10 00:08:57 [core.py:340] IndexError: pop from empty list
ERROR 04-10 00:08:57 [core.py:340]
It seems that the max batch size limits the total number of requests the server can process 🤔
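For illustration, a small self-contained sketch (not the PR's code) of the page bookkeeping the traceback points at: if a finished or cancelled request never returns its KV-cache page, `free_pages` eventually runs dry and `pop(0)` raises exactly this `IndexError`.

```python
class PagePool:
    # Toy model of the free/active KV-cache page lists (illustrative only).

    def __init__(self, max_batch_size: int):
        self.free_pages = list(range(max_batch_size))
        self.active_pages: dict[str, int] = {}

    def allocate(self, req_id: str) -> int:
        # Raises IndexError("pop from empty list") once all pages are in use,
        # which is the state the traceback above corresponds to.
        page = self.free_pages.pop(0)
        self.active_pages[req_id] = page
        return page

    def release(self, req_id: str) -> None:
        # Must run for finished *and* cancelled requests, or pages leak and the
        # server can only ever serve `max_batch_size` requests in total.
        self.free_pages.append(self.active_pages.pop(req_id))
```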
request.sampling_params = SamplingParams(max_tokens=1)

# delegate to super
super(SpyreScheduler, self).add_request(request=request)
Ah, I see here that we're delegating to the super of `SpyreScheduler`, so the base V1 scheduler. Might be worth a comment explaining that for those not familiar with the exact behavior of `super`.
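To illustrate the `super` behavior in question, a self-contained toy (classes here are stand-ins, not vLLM's real ones):

```python
class BaseV1Scheduler:
    def add_request(self, request) -> None:
        print("base scheduler queues", request)


class SpyreScheduler(BaseV1Scheduler):
    def add_request(self, request) -> None:
        # super(SpyreScheduler, self) resolves to the class *after*
        # SpyreScheduler in the MRO, i.e. the base V1 scheduler, so the call
        # bypasses any further Spyre-specific handling.
        super(SpyreScheduler, self).add_request(request=request)


SpyreScheduler().add_request("req-0")  # -> base scheduler queues req-0
```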
self._prepare_decode(scheduler_output.scheduled_cached_reqs)
num_reqs = len(scheduler_output.scheduled_cached_reqs)

# TODO: Build the rest of the SamplingMetadata correctly
Nice, I smell some good work for sampling master @wallashss in the near future
Sure! Count on me! I am also reviewing this code; I can make a follow-up after we merge this PR to fix this.
+1 on Tom's review, this is great work: very nicely separated implementations that will let us iterate on this going forward. I care a bit more about making sure we agree on configuration going forward, just so we don't end up flip-flopping on it, but if y'all want to merge first and figure that out later, that's not a blocker.
Signed-off-by: Yannick Schnider <[email protected]>
LGTM
[Continuous Batching] Support for continuous batching on AIU Spyre (vllm-project#66)

* [Continuous batching] FMS model wrapper (vllm-project#18)
* moving model loading into FMS wrapper (vllm-project#35)
* bugfix idx kv cache update (vllm-project#40)
* FMS Wrapper for static batching (vllm-project#39)
* [Continuous Batching] Introducing new env variables (vllm-project#67)
* [Continuous batching] Initial cb test (vllm-project#52)
* re-enabling TP tests
* addressing feedback: renaming and removing unused stuff
* removing unnecessary getter function and other feedback

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Nikolaos Papandreou <[email protected]>
Co-authored-by: Nikolaos Papandreou <[email protected]>
return len(self.total_running) + len(self.waiting) < \
       self.max_num_running_reqs and \
       len(self.waiting) < max_prompt_batch_size
Sorry for the belated question:
What's the purpose of `len(self.waiting) < max_prompt_batch_size`, and why is `max_prompt_batch_size` hardcoded to 1? The previous condition already guarantees that we have at least one slot left to schedule a new request, and since we test this for each request, that should be enough, right?
I guess this way we will not be able to schedule more than one request in a step, even though `self.max_num_running_reqs` is a high number.
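A toy version of the condition under discussion makes the effect concrete (values below are made up for illustration):

```python
def can_schedule(total_running: int, waiting: int,
                 max_num_running_reqs: int,
                 max_prompt_batch_size: int = 1) -> bool:
    return (total_running + waiting < max_num_running_reqs
            and waiting < max_prompt_batch_size)


# With max_prompt_batch_size hardcoded to 1, a second new request in the same
# step is rejected even though max_num_running_reqs still has room:
print(can_schedule(total_running=2, waiting=0, max_num_running_reqs=8))  # True
print(can_schedule(total_running=2, waiting=1, max_num_running_reqs=8))  # False
```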
Support for continuous batching
This is a first working implementation of continuous batching!
changes:

* new env variables: `VLLM_SPYRE_USE_CB` (`True` for continuous / `False` for static batching), `VLLM_SPYRE_MAX_BATCH_SIZE` (max batch size supported by AIU Spyre), `VLLM_SPYRE_MAX_CONTEXT_LENGTH` (max context length supported for the model on AIU Spyre)
* `FmsModelWrapper` (continuous batching) and `FmsModelPseudoWrapper` (static batching) to emulate KV cache handling on AIU Spyre
* `ContinuousBatchingSpyreScheduler` and `ContinuousBatchingSpyreModelRunner` (and `StaticBatchingSpyreModelRunner`)
* running example code demonstrating continuous batching: `examples/offline_inference_spyre_cb_test.py`
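As a usage sketch (the env var names are the ones this PR introduces; the model name and limit values below are placeholders, and the full example lives in `examples/offline_inference_spyre_cb_test.py`):

```python
import os

# enable continuous batching and set the AIU Spyre limits
os.environ["VLLM_SPYRE_USE_CB"] = "True"
os.environ["VLLM_SPYRE_MAX_BATCH_SIZE"] = "4"
os.environ["VLLM_SPYRE_MAX_CONTEXT_LENGTH"] = "2048"

from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")  # placeholder model
outputs = llm.generate(["Hello, Spyre!"] * 8,
                       SamplingParams(max_tokens=20, temperature=0.0))
for out in outputs:
    print(out.outputs[0].text)
```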