[Continuous batching] Initial cb test #52
Signed-off-by: Nikolaos Papandreou <[email protected]>
self.tkv = 0
if not envs_spyre.VLLM_SPYRE_USE_CB:
    self.model.past_key_value_states = None
self.tkv = tkv
Why do we set self.tkv here? It looks like it is not used.
only_last_token=True,
tkv=self.tkv,
active_pages=[i for i in range(input_ids.shape[0])],
#active_pages=[i for i in range(input_ids.shape[0])],
nit: remove commented out lines
vllm_spyre/v1/core/scheduler.py
Outdated
outputs = super().schedule()
return outputs

def schedule_cb(self) -> "SchedulerOutput":
I would propose having separate classes, StaticBatchingSpyreScheduler and ContinuousBatchingSpyreScheduler, that each implement the schedule function differently, rather than having two functions in one class. This could be addressed in the PR to main, though, rather than in this one.
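A rough sketch of what that split could look like. Only the two scheduler class names come from the comment above; the base class, the toy Request type, and the method names other than schedule are placeholders, not code from this PR or from vLLM:

```python
from abc import ABC, abstractmethod
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Request:
    """Toy stand-in for a vLLM request (hypothetical, not the real class)."""
    request_id: str


class BaseSpyreScheduler(ABC):
    """Hypothetical shared base: queue handling is written once here."""

    def __init__(self, max_batch_size: int) -> None:
        self.max_batch_size = max_batch_size
        self.waiting: Deque[Request] = deque()
        self.running: List[Request] = []

    def add_request(self, request: Request) -> None:
        self.waiting.append(request)

    def finish_request(self, request_id: str) -> None:
        self.running = [r for r in self.running if r.request_id != request_id]

    def _fill_free_slots(self) -> None:
        # Move waiting requests into the running batch until it is full.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

    @abstractmethod
    def schedule(self) -> List[Request]:
        """Return the requests to run in the next step."""


class StaticBatchingSpyreScheduler(BaseSpyreScheduler):
    def schedule(self) -> List[Request]:
        # Static batching: start a new batch only once the old one has drained.
        if not self.running:
            self._fill_free_slots()
        return list(self.running)


class ContinuousBatchingSpyreScheduler(BaseSpyreScheduler):
    def schedule(self) -> List[Request]:
        # Continuous batching: top up free slots on every step.
        self._fill_free_slots()
        return list(self.running)
```

With this shape the caller picks one class at construction time and always calls schedule, so no schedule_cb variant is needed.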
vllm_spyre/v1/core/scheduler.py
Outdated
available_warmup_shapes = [
    shape for shape in available_warmup_shapes
    if request.num_prompt_tokens <= shape['prompt_length']
    and max_tokens <= shape['new_tokens']
    and len(self.waiting) < shape['batch_size']
]
I think that the continuous batching logic should not depend on the warmup shapes in this way?
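One possible reading of this comment is that admission under continuous batching could be gated by a few fixed limits instead of by filtering warmup shapes. The sketch below illustrates that idea only; CBLimits, can_admit, and the max_model_len limit are assumptions, not part of this PR:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CBLimits:
    """Hypothetical fixed limits for continuous batching."""
    max_batch_size: int  # e.g. sourced from VLLM_SPYRE_MAX_BATCH_SIZE
    max_model_len: int   # assumed per-sequence context limit


def can_admit(num_running: int, num_prompt_tokens: int, max_tokens: int,
              limits: CBLimits) -> bool:
    """Decide whether one more request fits, without consulting warmup shapes."""
    fits_batch = num_running < limits.max_batch_size
    fits_context = num_prompt_tokens + max_tokens <= limits.max_model_len
    return fits_batch and fits_context
```

The check depends only on the current batch occupancy and the request's own length, so it does not change when the set of warmup shapes changes.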
self._req_ids2idx_prompt: dict = {}
self._req_ids2idx_decode: dict = {}
self._decode_batch_size = 0
self._active_pages = []
self._free_page_idxs = []
self._position_ids_prompt: torch.Tensor = None
self._mask_prompt: torch.Tensor = None
self._tkv: int = 0
self._tkv2fms: int = 0
self._prev_step_dec = False
In my opinion, the management of pages here is implemented at the wrong level. We should not be trying to maintain all of this state in the model runner itself. We should be looking at how vLLM (V1) is implemented on GPU as a guide (e.g., we should be using something like the InputBatch class to maintain this state). I think as a first attempt it is fine, and we can iteratively improve it from here.
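To make the suggestion concrete, here is a minimal sketch of moving the page bookkeeping out of the model runner into one small object. PagePool and its methods are hypothetical and are not vLLM's InputBatch API; the sketch also assumes one KV-cache page per request, which this early implementation appears to use:

```python
import heapq
from typing import Dict, List


class PagePool:
    """Hypothetical owner of KV-cache page bookkeeping (one page per request)."""

    def __init__(self, num_pages: int) -> None:
        # A sorted list is already a valid min-heap, so no heapify is needed.
        self._free: List[int] = list(range(num_pages))
        self._page_of: Dict[str, int] = {}

    def allocate(self, req_id: str) -> int:
        """Assign the lowest free page index to a new request."""
        if req_id in self._page_of:
            raise ValueError(f"request {req_id!r} already holds a page")
        if not self._free:
            raise RuntimeError("no free KV-cache page")
        page = heapq.heappop(self._free)
        self._page_of[req_id] = page
        return page

    def release(self, req_id: str) -> None:
        """Return a finished request's page to the free pool."""
        heapq.heappush(self._free, self._page_of.pop(req_id))

    @property
    def active_pages(self) -> List[int]:
        return sorted(self._page_of.values())

    @property
    def num_free(self) -> int:
        return len(self._free)
```

The model runner would then hold a single PagePool instead of separate _active_pages, _free_page_idxs, and request-to-index dictionaries.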
warmup_shapes = current_platform.get_warmup_shapes()
max_prompt_length = max(shape["prompt_length"]
                        for shape in warmup_shapes)
max_batch_size = max(shape["batch_size"] for shape in warmup_shapes)
Continuous batching should not use warmup shapes
def _prepare_prompt_cb(
    self,
    new_requests: List[NewRequestData],
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, List[int]]:
Similar to the scheduler, I would suggest splitting this into two model runner classes, SpyreModelRunner and ContinuousBatchingSpyreModelRunner, or similar.
positions=model_input.input_positions,
masks=model_input.input_masks,
is_prompt=model_input.is_prompt,
tkv = self._tkv2fms,
Why is self._tkv2fms needed? Can't we just use self.tkv?
Signed-off-by: Yannick Schnider <[email protected]>
# For continuous batching we use max_num_seqs to control
# the max batch size respecting AIU Spyre KV cache size
scheduler_config.max_num_seqs =\
    envs_spyre.VLLM_SPYRE_MAX_BATCH_SIZE
# ToDo: this function check_and_update_config is called twice:
# 1st time scheduler_config.max_num_seqs is what user sets
# 2nd time scheduler_config.max_num_seqs is 128
I don't think we need to override max-num-seqs at all?
LGTM
* initial cb test Signed-off-by: Nikolaos Papandreou <[email protected]>
* make tkv, active_pages optional in SpyreCausalLM class for the V0 tests Signed-off-by: Nikolaos Papandreou <[email protected]>
* format Signed-off-by: Nikolaos Papandreou <[email protected]>
* remove manual testing and fix formatting Signed-off-by: Yannick Schnider <[email protected]>
* remove tkv2fms Signed-off-by: Yannick Schnider <[email protected]>
* remove unnecessary class variables Signed-off-by: Yannick Schnider <[email protected]>
* tidy up class variables Signed-off-by: Yannick Schnider <[email protected]>
* simplify code: req_ids2idx and active_pages will be reset in prepare input anyway... Signed-off-by: Yannick Schnider <[email protected]>
* renaming variable Signed-off-by: Yannick Schnider <[email protected]>
* removing batch padding in prefil stage Signed-off-by: Yannick Schnider <[email protected]>
* indices always list of Trues since no padding or removed sequences... Signed-off-by: Yannick Schnider <[email protected]>
* fix active/free page handling Signed-off-by: Yannick Schnider <[email protected]>
* avoiding unnecessary tensor construction Signed-off-by: Yannick Schnider <[email protected]>
* fix sorting indifference token/position_ids vs masks Signed-off-by: Yannick Schnider <[email protected]>
* refactoring not requiring req_ids2idx Signed-off-by: Yannick Schnider <[email protected]>
* removing unsused class variables, simplifying code Signed-off-by: Yannick Schnider <[email protected]>
* use VLLM_SPYRE_MAX_BATCH_SIZE to control (decoding) batch size on AIU Spyre Signed-off-by: Yannick Schnider <[email protected]>
* removing unnecessary helper functions for schedule and add_request Signed-off-by: Yannick Schnider <[email protected]>
* removing unused argument Signed-off-by: Yannick Schnider <[email protected]>
---------
Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>
* [Continuous batching] FMS model wrapper (#18)
  * fms wrapper dummy for continuous batching implementation, gating via env var VLLM_SPYRE_USE_CB Signed-off-by: Yannick Schnider <[email protected]>
  * implementing fms wrapper with correct KV cache managment Signed-off-by: Yannick Schnider <[email protected]>
  * disable prints by default Signed-off-by: Yannick Schnider <[email protected]>
  * code refactoring fms wrapper Signed-off-by: Yannick Schnider <[email protected]>
  * fix default path not using CB/ fms wrapper Signed-off-by: Yannick Schnider <[email protected]>
  * correct print when TESTING_CB Signed-off-by: Yannick Schnider <[email protected]>
  * remove self.past_key_value_states when KV cache is managed by FMS wrapper Signed-off-by: Yannick Schnider <[email protected]>
  * read-out only active pages of KV cache (covers when curr batch size < max batch size) Signed-off-by: Yannick Schnider <[email protected]>
  * uniquely distinguishing prefills and decodes Signed-off-by: Yannick Schnider <[email protected]>
  * reading kv cache dimension from model config Signed-off-by: Yannick Schnider <[email protected]>
  * cosmetics and comments Signed-off-by: Yannick Schnider <[email protected]>
  * support for gpt big code models Signed-off-by: Yannick Schnider <[email protected]>
  * bugfix hard coded test mask Signed-off-by: Yannick Schnider <[email protected]>
  * change KV cache type for prefill Signed-off-by: Yannick Schnider <[email protected]>
  * update tkv in fms wrapper Signed-off-by: Yannick Schnider <[email protected]>
  * moving fms wrapper to own class Signed-off-by: Yannick Schnider <[email protected]>
  * reset tkv for new prompt Signed-off-by: Yannick Schnider <[email protected]>
  * ignoring test_spyre_tensor_parallel.py, since FMS wrapper does not support it Signed-off-by: Yannick Schnider <[email protected]>
  * removing VLLM_SPYRE_USE_CB, since FMS wrapper is now used by default Signed-off-by: Yannick Schnider <[email protected]>
  * typing fms wrapper class Signed-off-by: Yannick Schnider <[email protected]>
  ---------
  Signed-off-by: Yannick Schnider <[email protected]>
* moving model loading into FMS wrapper (#35) Signed-off-by: Yannick Schnider <[email protected]>
* bugfix idx kv cache update (#40) Signed-off-by: Yannick Schnider <[email protected]>
* FMS Wrapper for static batching (#39)
  * introducing pseudo fms wrapper for static batching Signed-off-by: Yannick Schnider <[email protected]>
  * small bug fix Signed-off-by: Yannick Schnider <[email protected]>
  * bugfix idx kv cache update Signed-off-by: Yannick Schnider <[email protected]>
  ---------
  Signed-off-by: Yannick Schnider <[email protected]>
  Signed-off-by: Yannick Schnider <[email protected]>
* [Continuous Batching] Introducing new env variables (#67)
  * introducing env variables for AIU Spyre KV cache dimensions Signed-off-by: Yannick Schnider <[email protected]>
  * removing prints Signed-off-by: Yannick Schnider <[email protected]>
  ---------
  Signed-off-by: Yannick Schnider <[email protected]>
* [Continuous batching] Initial cb test (#52)
  * initial cb test Signed-off-by: Nikolaos Papandreou <[email protected]>
  * make tkv, active_pages optional in SpyreCausalLM class for the V0 tests Signed-off-by: Nikolaos Papandreou <[email protected]>
  * format Signed-off-by: Nikolaos Papandreou <[email protected]>
  * remove manual testing and fix formatting Signed-off-by: Yannick Schnider <[email protected]>
  * remove tkv2fms Signed-off-by: Yannick Schnider <[email protected]>
  * remove unnecessary class variables Signed-off-by: Yannick Schnider <[email protected]>
  * tidy up class variables Signed-off-by: Yannick Schnider <[email protected]>
  * simplify code: req_ids2idx and active_pages will be reset in prepare input anyway... Signed-off-by: Yannick Schnider <[email protected]>
  * renaming variable Signed-off-by: Yannick Schnider <[email protected]>
  * removing batch padding in prefil stage Signed-off-by: Yannick Schnider <[email protected]>
  * indices always list of Trues since no padding or removed sequences... Signed-off-by: Yannick Schnider <[email protected]>
  * fix active/free page handling Signed-off-by: Yannick Schnider <[email protected]>
  * avoiding unnecessary tensor construction Signed-off-by: Yannick Schnider <[email protected]>
  * fix sorting indifference token/position_ids vs masks Signed-off-by: Yannick Schnider <[email protected]>
  * refactoring not requiring req_ids2idx Signed-off-by: Yannick Schnider <[email protected]>
  * removing unsused class variables, simplifying code Signed-off-by: Yannick Schnider <[email protected]>
  * use VLLM_SPYRE_MAX_BATCH_SIZE to control (decoding) batch size on AIU Spyre Signed-off-by: Yannick Schnider <[email protected]>
  * removing unnecessary helper functions for schedule and add_request Signed-off-by: Yannick Schnider <[email protected]>
  * removing unused argument Signed-off-by: Yannick Schnider <[email protected]>
  ---------
  Signed-off-by: Nikolaos Papandreou <[email protected]>
  Signed-off-by: Yannick Schnider <[email protected]>
  Co-authored-by: Yannick Schnider <[email protected]>
* re-enabling TP tests Signed-off-by: Yannick Schnider <[email protected]>
* addressing feedback: renaming and removing unused stuff Signed-off-by: Yannick Schnider <[email protected]>
* removing unnecessary getter function and other feedback Signed-off-by: Yannick Schnider <[email protected]>
* integrating new FMS API on branch 'paged_attn_mock' Signed-off-by: Yannick Schnider <[email protected]>
* torch dynamo: mark dynamic/static shapes Signed-off-by: Yannick Schnider <[email protected]>
* bugfix key_value_states name Signed-off-by: Nikolaos Papandreou <[email protected]>
* making block_table and slot_mapping args, not class vars Signed-off-by: Yannick Schnider <[email protected]>
* formatting after browser merge... Signed-off-by: Yannick Schnider <[email protected]>
* nicer handling of arguments continuous vs static batching Signed-off-by: Yannick Schnider <[email protected]>
* Implement warmup for continuous batching (#83)
  * Implement warmup for continuous batching Signed-off-by: Thomas Parnell <[email protected]>
  * fmt Signed-off-by: Thomas Parnell <[email protected]>
  * freeing block directly and small things Signed-off-by: Yannick Schnider <[email protected]>
  ---------
  Signed-off-by: Thomas Parnell <[email protected]>
  Signed-off-by: Yannick Schnider <[email protected]>
  Co-authored-by: Yannick Schnider <[email protected]>
* initialize tkv Signed-off-by: Nikolaos Papandreou <[email protected]>
* Return empty ModelRunnerOuptut if no work Signed-off-by: Nikolaos Papandreou <[email protected]>
* update mask for decode Signed-off-by: Nikolaos Papandreou <[email protected]>
* Fix copy/paste error Signed-off-by: Thomas Parnell <[email protected]>
* adaptive loging (thx joerunde) Co-authored-by: Joe Runde <[email protected]> Signed-off-by: Yannick Schnider <[email protected]>
* remove warmup shapes for continuous batching Signed-off-by: Yannick Schnider <[email protected]>
* assuring prefil lengths are multiples of block size 64 in example script Signed-off-by: Yannick Schnider <[email protected]>
* revert change to warmup shape Signed-off-by: Thomas Parnell <[email protected]>
* 🎨 fmt Signed-off-by: Joe Runde <[email protected]>
* Added call to update_lazyhandle Signed-off-by: Thomas Parnell <[email protected]>
* Right padding of prompts (#95)
  * right padding initial implementation Signed-off-by: Yannick Schnider <[email protected]>
  * fix right padding: remove the right padded logits before sampling Signed-off-by: Yannick Schnider <[email protected]>
  * fix typing Signed-off-by: Yannick Schnider <[email protected]>
  ---------
  Signed-off-by: Yannick Schnider <[email protected]>
* [CB] Fix Tensor Parallelism Error (#103)
  * divide tensor third dimension by number of TP Signed-off-by: Sophie du Couédic <[email protected]>
  * Use existing method from vllm to get 'num_kv_heads' (works also for TP>1) Signed-off-by: Sophie du Couédic <[email protected]>
  ---------
  Signed-off-by: Sophie du Couédic <[email protected]>
* support granite-3.2-8b-instruct (#106) Signed-off-by: Yannick Schnider <[email protected]> Signed-off-by: Yannick Schnider <[email protected]>
* comments Signed-off-by: Yannick Schnider <[email protected]>
* adapt to change of arguments in fms Signed-off-by: Yannick Schnider <[email protected]>
* fix mypy issue Signed-off-by: Yannick Schnider <[email protected]>
* revising continuous batching scheduler Signed-off-by: Yannick Schnider <[email protected]>
* [V1] Decoupling static and continuous batching (#116)
  * decoupling static and continuous batching scheduler Signed-off-by: Yannick Schnider <[email protected]>
  * fix dynamo cache for continuous batching Signed-off-by: Yannick Schnider <[email protected]>
  * removing warmup shape dependency for continuous batching! Signed-off-by: Yannick Schnider <[email protected]>
  ---------
  Signed-off-by: Yannick Schnider <[email protected]>
* addressing review cosmetics Signed-off-by: Yannick Schnider <[email protected]>
* fix/refactor: remove last_running and total_running (#112) Signed-off-by: Travis Johnson <[email protected]> Signed-off-by: Yannick Schnider <[email protected]> Co-authored-by: Yannick Schnider <[email protected]>
* fix comment kv cache tensor initialization Signed-off-by: Yannick Schnider <[email protected]>
---------
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Sophie du Couédic <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Co-authored-by: Nikolaos Papandreou <[email protected]>
Co-authored-by: Thomas Parnell <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Sophie du Couédic <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Initial continuous batching (CB) implementation for vLLM V1. Works with the FMS model wrapper.
Test with offline_inference_spyre_cb_test.py; set VLLM_SPYRE_USE_CB to 1 for continuous batching or 0 for static batching.
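For illustration, the VLLM_SPYRE_USE_CB toggle could be read along these lines. This is a sketch only: use_continuous_batching is a hypothetical helper, and the default of "0" (static batching) when the variable is unset is an assumption rather than the plugin's documented behavior:

```python
import os
from typing import Mapping, Optional


def use_continuous_batching(
        environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when VLLM_SPYRE_USE_CB is set to "1".

    Hypothetical helper; treating an unset variable as "0" is an assumption.
    """
    env = os.environ if environ is None else environ
    return env.get("VLLM_SPYRE_USE_CB", "0").strip() == "1"


if __name__ == "__main__":
    mode = "continuous" if use_continuous_batching() else "static"
    print(f"batching mode: {mode}")
```

From a shell, selecting continuous batching for the test script would then look like `VLLM_SPYRE_USE_CB=1 python offline_inference_spyre_cb_test.py`, assuming the script is in the current directory.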