nikolaospapandreou
Collaborator

Initial continuous batching (CB) implementation for vLLM V1. It works with the FMS model wrapper.
To test, run offline_inference_spyre_cb_test.py with VLLM_SPYRE_USE_CB set to 1 for continuous batching or 0 for static batching.
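
As a rough illustration of the toggle described above, the sketch below reads VLLM_SPYRE_USE_CB and picks a batching mode. The helper name `use_continuous_batching` and the choice to treat an unset variable as 0 are assumptions made for this example; the plugin's own environment handling may differ.

```python
import os
from typing import Mapping, Optional


def use_continuous_batching(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when VLLM_SPYRE_USE_CB is set to 1 (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # Assumption: an unset variable selects static batching.
    return bool(int(env.get("VLLM_SPYRE_USE_CB", "0")))


if __name__ == "__main__":
    mode = "continuous" if use_continuous_batching() else "static"
    print(f"batching mode: {mode}")
```

Passing a mapping instead of relying on `os.environ` keeps the helper easy to exercise without touching the process environment.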

Signed-off-by: Nikolaos Papandreou <[email protected]>

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, first install the linting requirements, then run format.sh and commit the changes:

pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

self.tkv = 0
if not envs_spyre.VLLM_SPYRE_USE_CB:
self.model.past_key_value_states = None
self.tkv = tkv
Member

Why do we set self.tkv here? It looks like it is not used.

only_last_token=True,
tkv=self.tkv,
active_pages=[i for i in range(input_ids.shape[0])],
#active_pages=[i for i in range(input_ids.shape[0])],
Member

nit: remove commented-out lines

outputs = super().schedule()
return outputs

def schedule_cb(self) -> "SchedulerOutput":
Member

I would propose having separate classes, StaticBatchingSpyreScheduler and ContinuousBatchingSpyreScheduler, each implementing the schedule function differently, rather than having two functions in one class. This could be addressed in the PR to main though, rather than this one.
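
A minimal sketch of the proposed split, assuming a toy request queue rather than the real vLLM scheduler. The two class names come from the comment above; `BaseSpyreScheduler`, `add_request`, `finish`, and `max_batch_size` are hypothetical names introduced only for illustration.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Deque, List


class BaseSpyreScheduler(ABC):
    """Hypothetical base class: shared queue handling, no batching policy."""

    def __init__(self, max_batch_size: int) -> None:
        self.max_batch_size = max_batch_size
        self.waiting: Deque[str] = deque()
        self.running: List[str] = []

    def add_request(self, req_id: str) -> None:
        self.waiting.append(req_id)

    def finish(self, req_id: str) -> None:
        self.running.remove(req_id)

    def _admit(self) -> None:
        # Move waiting requests into the running batch while slots remain.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

    @abstractmethod
    def schedule(self) -> List[str]:
        """Return the request ids to run in the next step."""


class StaticBatchingSpyreScheduler(BaseSpyreScheduler):
    def schedule(self) -> List[str]:
        # Static batching: form a new batch only after the old one drained.
        if not self.running:
            self._admit()
        return list(self.running)


class ContinuousBatchingSpyreScheduler(BaseSpyreScheduler):
    def schedule(self) -> List[str]:
        # Continuous batching: refill free slots on every step.
        self._admit()
        return list(self.running)
```

With this shape, the choice between the two policies becomes a matter of which class is instantiated, and the model-facing code only ever calls `schedule()`.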

Comment on lines 143 to 148
available_warmup_shapes = [
    shape for shape in available_warmup_shapes
    if request.num_prompt_tokens <= shape['prompt_length']
    and max_tokens <= shape['new_tokens']
    and len(self.waiting) < shape['batch_size']
]
Member

I think that the continuous batching logic should not depend on the warmup shapes in this way?
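
To make the concern concrete, here is a standalone rendition of the filter quoted above, assuming warmup shapes are plain dictionaries with the three keys used in the snippet. The function name `filter_warmup_shapes` and the example shape values are invented for this sketch.

```python
from typing import Dict, List

WarmupShape = Dict[str, int]


def filter_warmup_shapes(
    shapes: List[WarmupShape],
    num_prompt_tokens: int,
    max_tokens: int,
    num_waiting: int,
) -> List[WarmupShape]:
    """Keep only the shapes that can still accommodate the new request."""
    return [
        shape for shape in shapes
        if num_prompt_tokens <= shape["prompt_length"]
        and max_tokens <= shape["new_tokens"]
        and num_waiting < shape["batch_size"]
    ]


# Made-up shapes for illustration only.
shapes = [
    {"prompt_length": 64, "new_tokens": 20, "batch_size": 4},
    {"prompt_length": 128, "new_tokens": 20, "batch_size": 2},
]

# A 100-token prompt fits only the second shape.
fits = filter_warmup_shapes(shapes, num_prompt_tokens=100, max_tokens=10, num_waiting=1)

# With two requests already waiting, no shape accepts the same prompt.
rejected = filter_warmup_shapes(shapes, num_prompt_tokens=100, max_tokens=10, num_waiting=2)
```

The second call shows how admission is tied to the precompiled shapes: the request is turned away because of a static batch-size limit, not because of the current state of the batch.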

Comment on lines 115 to 124
self._req_ids2idx_prompt: dict = {}
self._req_ids2idx_decode: dict = {}
self._decode_batch_size = 0
self._active_pages = []
self._free_page_idxs = []
self._position_ids_prompt: torch.Tensor = None
self._mask_prompt: torch.Tensor = None
self._tkv: int = 0
self._tkv2fms: int = 0
self._prev_step_dec = False
Member

In my opinion, the management of pages here is implemented at the wrong level. We should not be trying to maintain all of this state in the model runner itself. We should be looking at how vLLM (V1) is implemented on GPU as a guide (e.g., we should be using something like the InputBatch class to maintain this state). I think as a first attempt it is fine, and we can iteratively improve it from here.
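
One way to picture moving this bookkeeping out of the model runner is a small dedicated state object. The sketch below is purely hypothetical: `PageState` and its methods are invented for this example and are not vLLM's InputBatch API; it only shows the kind of state (request-to-page mapping, free pages) that could live outside the runner.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageState:
    """Hypothetical container for per-request KV cache page bookkeeping."""

    max_batch_size: int
    req_id_to_page: Dict[str, int] = field(default_factory=dict)
    free_pages: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # All pages start out free.
        self.free_pages = list(range(self.max_batch_size))

    def add_request(self, req_id: str) -> int:
        """Assign the lowest free page to a new request and return its index."""
        if not self.free_pages:
            raise RuntimeError("no free KV cache page available")
        page = self.free_pages.pop(0)
        self.req_id_to_page[req_id] = page
        return page

    def remove_request(self, req_id: str) -> None:
        """Release the page of a finished request."""
        page = self.req_id_to_page.pop(req_id)
        self.free_pages.append(page)
        self.free_pages.sort()

    @property
    def active_pages(self) -> List[int]:
        return sorted(self.req_id_to_page.values())
```

The model runner would then query `active_pages` when preparing inputs instead of maintaining several private fields of its own.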

Comment on lines 126 to 129
warmup_shapes = current_platform.get_warmup_shapes()
max_prompt_length = max(shape["prompt_length"]
                        for shape in warmup_shapes)
max_batch_size = max(shape["batch_size"] for shape in warmup_shapes)
Member

Continuous batching should not use warmup shapes

Comment on lines 149 to 152
def _prepare_prompt_cb(
    self,
    new_requests: List[NewRequestData],
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, List[int]]:
Member

Similar to the scheduler, I would suggest splitting this into two model runner classes, SpyreModelRunner and ContinuousBatchingSpyreModelRunner, or similar.

positions=model_input.input_positions,
masks=model_input.input_masks,
is_prompt=model_input.is_prompt,
tkv = self._tkv2fms,
Member

Why is self._tkv2fms needed? Can't we just use self.tkv?

yannicks1 added 14 commits April 4, 2025 09:56
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Comment on lines +82 to +88
# For continuous batching we use max_num_seqs to control
# the max batch size respecting AIU Spyre KV cache size
scheduler_config.max_num_seqs =\
    envs_spyre.VLLM_SPYRE_MAX_BATCH_SIZE
# ToDo: this function check_and_update_config is called twice:
# 1st time scheduler_config.max_num_seqs is what user sets
# 2nd time scheduler_config.max_num_seqs is 128
Member

I don't think we need to override max-num-seqs at all?

@yannicks1
Collaborator

LGTM

@yannicks1 yannicks1 merged commit 3ab164d into dev-continuous-batching Apr 7, 2025
2 checks passed
@yannicks1 yannicks1 deleted the npo-cb-test branch April 7, 2025 13:21
tdoublep pushed a commit that referenced this pull request Apr 7, 2025
* initial cb test

Signed-off-by: Nikolaos Papandreou <[email protected]>

* make tkv, active_pages optional in SpyreCausalLM class for the V0 tests

Signed-off-by: Nikolaos Papandreou <[email protected]>

* format

Signed-off-by: Nikolaos Papandreou <[email protected]>

* remove manual testing and fix formatting

Signed-off-by: Yannick Schnider <[email protected]>

* remove tkv2fms

Signed-off-by: Yannick Schnider <[email protected]>

* remove unnecessary class variables

Signed-off-by: Yannick Schnider <[email protected]>

* tidy up class variables

Signed-off-by: Yannick Schnider <[email protected]>

* simplify code: req_ids2idx and active_pages will be reset in prepare input anyway...

Signed-off-by: Yannick Schnider <[email protected]>

* renaming variable

Signed-off-by: Yannick Schnider <[email protected]>

* removing batch padding in prefill stage

Signed-off-by: Yannick Schnider <[email protected]>

* indices always list of Trues since no padding or removed sequences...

Signed-off-by: Yannick Schnider <[email protected]>

* fix active/free page handling

Signed-off-by: Yannick Schnider <[email protected]>

* avoiding unnecessary tensor construction

Signed-off-by: Yannick Schnider <[email protected]>

* fix sorting indifference token/position_ids vs masks

Signed-off-by: Yannick Schnider <[email protected]>

* refactoring not requiring req_ids2idx

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused class variables, simplifying code

Signed-off-by: Yannick Schnider <[email protected]>

* use VLLM_SPYRE_MAX_BATCH_SIZE to control (decoding) batch size on AIU Spyre

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary helper functions for schedule and add_request

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused argument

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>
yannicks1 added a commit that referenced this pull request Apr 9, 2025
* [Continuous batching] FMS model wrapper (#18)

* fms wrapper dummy for continuous batching implementation, gating via env var VLLM_SPYRE_USE_CB

Signed-off-by: Yannick Schnider <[email protected]>

* implementing fms wrapper with correct KV cache management

Signed-off-by: Yannick Schnider <[email protected]>

* disable prints by default

Signed-off-by: Yannick Schnider <[email protected]>

* code refactoring fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* fix default path not using CB/ fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* correct print when TESTING_CB

Signed-off-by: Yannick Schnider <[email protected]>

* remove self.past_key_value_states when KV cache is managed by FMS wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* read-out only active pages of KV cache (covers when curr batch size < max batch size)

Signed-off-by: Yannick Schnider <[email protected]>

* uniquely distinguishing prefills and decodes

Signed-off-by: Yannick Schnider <[email protected]>

* reading kv cache dimension from model config

Signed-off-by: Yannick Schnider <[email protected]>

* cosmetics and comments

Signed-off-by: Yannick Schnider <[email protected]>

* support for gpt big code models

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix hard coded test mask

Signed-off-by: Yannick Schnider <[email protected]>

* change KV cache type for prefill

Signed-off-by: Yannick Schnider <[email protected]>

* update tkv in fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* moving fms wrapper to own class

Signed-off-by: Yannick Schnider <[email protected]>

* reset tkv for new prompt

Signed-off-by: Yannick Schnider <[email protected]>

* ignoring test_spyre_tensor_parallel.py, since FMS wrapper does not support it

Signed-off-by: Yannick Schnider <[email protected]>

* removing VLLM_SPYRE_USE_CB, since FMS wrapper is now used by default

Signed-off-by: Yannick Schnider <[email protected]>

* typing fms wrapper class

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* moving model loading into FMS wrapper (#35)

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update (#40)

Signed-off-by: Yannick Schnider <[email protected]>

* FMS Wrapper for static batching (#39)

* introducing pseudo fms wrapper for static batching

Signed-off-by: Yannick Schnider <[email protected]>

* small bug fix

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>

* [Continuous Batching] Introducing new env variables (#67)

* introducing env variables for AIU Spyre KV cache dimensions

Signed-off-by: Yannick Schnider <[email protected]>

* removing prints

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* [Continuous batching] Initial cb test (#52)

* initial cb test

Signed-off-by: Nikolaos Papandreou <[email protected]>

* make tkv, active_pages optional in SpyreCausalLM class for the V0 tests

Signed-off-by: Nikolaos Papandreou <[email protected]>

* format

Signed-off-by: Nikolaos Papandreou <[email protected]>

* remove manual testing and fix formatting

Signed-off-by: Yannick Schnider <[email protected]>

* remove tkv2fms

Signed-off-by: Yannick Schnider <[email protected]>

* remove unnecessary class variables

Signed-off-by: Yannick Schnider <[email protected]>

* tidy up class variables

Signed-off-by: Yannick Schnider <[email protected]>

* simplify code: req_ids2idx and active_pages will be reset in prepare input anyway...

Signed-off-by: Yannick Schnider <[email protected]>

* renaming variable

Signed-off-by: Yannick Schnider <[email protected]>

* removing batch padding in prefill stage

Signed-off-by: Yannick Schnider <[email protected]>

* indices always list of Trues since no padding or removed sequences...

Signed-off-by: Yannick Schnider <[email protected]>

* fix active/free page handling

Signed-off-by: Yannick Schnider <[email protected]>

* avoiding unnecessary tensor construction

Signed-off-by: Yannick Schnider <[email protected]>

* fix sorting indifference token/position_ids vs masks

Signed-off-by: Yannick Schnider <[email protected]>

* refactoring not requiring req_ids2idx

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused class variables, simplifying code

Signed-off-by: Yannick Schnider <[email protected]>

* use VLLM_SPYRE_MAX_BATCH_SIZE to control (decoding) batch size on AIU Spyre

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary helper functions for schedule and add_request

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused argument

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>

* re-enabling TP tests

Signed-off-by: Yannick Schnider <[email protected]>

* addressing feedback: renaming and removing unused stuff

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary getter function and other feedback

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Nikolaos Papandreou <[email protected]>
Co-authored-by: Nikolaos Papandreou <[email protected]>
rafvasq pushed a commit to rafvasq/vllm-spyre that referenced this pull request Apr 11, 2025
…llm-project#66)

* [Continuous batching] FMS model wrapper (vllm-project#18)

* fms wrapper dummy for continuous batching implementation, gating via env var VLLM_SPYRE_USE_CB

Signed-off-by: Yannick Schnider <[email protected]>

* implementing fms wrapper with correct KV cache management

Signed-off-by: Yannick Schnider <[email protected]>

* disable prints by default

Signed-off-by: Yannick Schnider <[email protected]>

* code refactoring fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* fix default path not using CB/ fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* correct print when TESTING_CB

Signed-off-by: Yannick Schnider <[email protected]>

* remove self.past_key_value_states when KV cache is managed by FMS wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* read-out only active pages of KV cache (covers when curr batch size < max batch size)

Signed-off-by: Yannick Schnider <[email protected]>

* uniquely distinguishing prefills and decodes

Signed-off-by: Yannick Schnider <[email protected]>

* reading kv cache dimension from model config

Signed-off-by: Yannick Schnider <[email protected]>

* cosmetics and comments

Signed-off-by: Yannick Schnider <[email protected]>

* support for gpt big code models

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix hard coded test mask

Signed-off-by: Yannick Schnider <[email protected]>

* change KV cache type for prefill

Signed-off-by: Yannick Schnider <[email protected]>

* update tkv in fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* moving fms wrapper to own class

Signed-off-by: Yannick Schnider <[email protected]>

* reset tkv for new prompt

Signed-off-by: Yannick Schnider <[email protected]>

* ignoring test_spyre_tensor_parallel.py, since FMS wrapper does not support it

Signed-off-by: Yannick Schnider <[email protected]>

* removing VLLM_SPYRE_USE_CB, since FMS wrapper is now used by default

Signed-off-by: Yannick Schnider <[email protected]>

* typing fms wrapper class

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* moving model loading into FMS wrapper (vllm-project#35)

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update (vllm-project#40)

Signed-off-by: Yannick Schnider <[email protected]>

* FMS Wrapper for static batching (vllm-project#39)

* introducing pseudo fms wrapper for static batching

Signed-off-by: Yannick Schnider <[email protected]>

* small bug fix

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>

* [Continuous Batching] Introducing new env variables (vllm-project#67)

* introducing env variables for AIU Spyre KV cache dimensions

Signed-off-by: Yannick Schnider <[email protected]>

* removing prints

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* [Continuous batching] Initial cb test (vllm-project#52)

* initial cb test

Signed-off-by: Nikolaos Papandreou <[email protected]>

* make tkv, active_pages optional in SpyreCausalLM class for the V0 tests

Signed-off-by: Nikolaos Papandreou <[email protected]>

* format

Signed-off-by: Nikolaos Papandreou <[email protected]>

* remove manual testing and fix formatting

Signed-off-by: Yannick Schnider <[email protected]>

* remove tkv2fms

Signed-off-by: Yannick Schnider <[email protected]>

* remove unnecessary class variables

Signed-off-by: Yannick Schnider <[email protected]>

* tidy up class variables

Signed-off-by: Yannick Schnider <[email protected]>

* simplify code: req_ids2idx and active_pages will be reset in prepare input anyway...

Signed-off-by: Yannick Schnider <[email protected]>

* renaming variable

Signed-off-by: Yannick Schnider <[email protected]>

* removing batch padding in prefill stage

Signed-off-by: Yannick Schnider <[email protected]>

* indices always list of Trues since no padding or removed sequences...

Signed-off-by: Yannick Schnider <[email protected]>

* fix active/free page handling

Signed-off-by: Yannick Schnider <[email protected]>

* avoiding unnecessary tensor construction

Signed-off-by: Yannick Schnider <[email protected]>

* fix sorting indifference token/position_ids vs masks

Signed-off-by: Yannick Schnider <[email protected]>

* refactoring not requiring req_ids2idx

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused class variables, simplifying code

Signed-off-by: Yannick Schnider <[email protected]>

* use VLLM_SPYRE_MAX_BATCH_SIZE to control (decoding) batch size on AIU Spyre

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary helper functions for schedule and add_request

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused argument

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>

* re-enabling TP tests

Signed-off-by: Yannick Schnider <[email protected]>

* addressing feedback: renaming and removing unused stuff

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary getter function and other feedback

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Nikolaos Papandreou <[email protected]>
Co-authored-by: Nikolaos Papandreou <[email protected]>
yannicks1 added a commit that referenced this pull request Apr 28, 2025
* [Continuous batching] FMS model wrapper (#18)

* fms wrapper dummy for continuous batching implementation, gating via env var VLLM_SPYRE_USE_CB

Signed-off-by: Yannick Schnider <[email protected]>

* implementing fms wrapper with correct KV cache management

Signed-off-by: Yannick Schnider <[email protected]>

* disable prints by default

Signed-off-by: Yannick Schnider <[email protected]>

* code refactoring fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* fix default path not using CB/ fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* correct print when TESTING_CB

Signed-off-by: Yannick Schnider <[email protected]>

* remove self.past_key_value_states when KV cache is managed by FMS wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* read-out only active pages of KV cache (covers when curr batch size < max batch size)

Signed-off-by: Yannick Schnider <[email protected]>

* uniquely distinguishing prefills and decodes

Signed-off-by: Yannick Schnider <[email protected]>

* reading kv cache dimension from model config

Signed-off-by: Yannick Schnider <[email protected]>

* cosmetics and comments

Signed-off-by: Yannick Schnider <[email protected]>

* support for gpt big code models

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix hard coded test mask

Signed-off-by: Yannick Schnider <[email protected]>

* change KV cache type for prefill

Signed-off-by: Yannick Schnider <[email protected]>

* update tkv in fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* moving fms wrapper to own class

Signed-off-by: Yannick Schnider <[email protected]>

* reset tkv for new prompt

Signed-off-by: Yannick Schnider <[email protected]>

* ignoring test_spyre_tensor_parallel.py, since FMS wrapper does not support it

Signed-off-by: Yannick Schnider <[email protected]>

* removing VLLM_SPYRE_USE_CB, since FMS wrapper is now used by default

Signed-off-by: Yannick Schnider <[email protected]>

* typing fms wrapper class

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* moving model loading into FMS wrapper (#35)

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update (#40)

Signed-off-by: Yannick Schnider <[email protected]>

* FMS Wrapper for static batching (#39)

* introducing pseudo fms wrapper for static batching

Signed-off-by: Yannick Schnider <[email protected]>

* small bug fix

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>

* [Continuous Batching] Introducing new env variables (#67)

* introducing env variables for AIU Spyre KV cache dimensions

Signed-off-by: Yannick Schnider <[email protected]>

* removing prints

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* [Continuous batching] Initial cb test (#52)

* initial cb test

Signed-off-by: Nikolaos Papandreou <[email protected]>

* make tkv, active_pages optional in SpyreCausalLM class for the V0 tests

Signed-off-by: Nikolaos Papandreou <[email protected]>

* format

Signed-off-by: Nikolaos Papandreou <[email protected]>

* remove manual testing and fix formatting

Signed-off-by: Yannick Schnider <[email protected]>

* remove tkv2fms

Signed-off-by: Yannick Schnider <[email protected]>

* remove unnecessary class variables

Signed-off-by: Yannick Schnider <[email protected]>

* tidy up class variables

Signed-off-by: Yannick Schnider <[email protected]>

* simplify code: req_ids2idx and active_pages will be reset in prepare input anyway...

Signed-off-by: Yannick Schnider <[email protected]>

* renaming variable

Signed-off-by: Yannick Schnider <[email protected]>

* removing batch padding in prefill stage

Signed-off-by: Yannick Schnider <[email protected]>

* indices always list of Trues since no padding or removed sequences...

Signed-off-by: Yannick Schnider <[email protected]>

* fix active/free page handling

Signed-off-by: Yannick Schnider <[email protected]>

* avoiding unnecessary tensor construction

Signed-off-by: Yannick Schnider <[email protected]>

* fix sorting indifference token/position_ids vs masks

Signed-off-by: Yannick Schnider <[email protected]>

* refactoring not requiring req_ids2idx

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused class variables, simplifying code

Signed-off-by: Yannick Schnider <[email protected]>

* use VLLM_SPYRE_MAX_BATCH_SIZE to control (decoding) batch size on AIU Spyre

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary helper functions for schedule and add_request

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused argument

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>

* re-enabling TP tests

Signed-off-by: Yannick Schnider <[email protected]>

* addressing feedback: renaming and removing unused stuff

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary getter function and other feedback

Signed-off-by: Yannick Schnider <[email protected]>

* integrating new FMS API on branch 'paged_attn_mock'

Signed-off-by: Yannick Schnider <[email protected]>

* torch dynamo: mark dynamic/static shapes

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix key_value_states name

Signed-off-by: Nikolaos Papandreou <[email protected]>

* making block_table and slot_mapping args, not class vars

Signed-off-by: Yannick Schnider <[email protected]>

* formatting after browser merge...

Signed-off-by: Yannick Schnider <[email protected]>

* nicer handling of arguments continuous vs static batching

Signed-off-by: Yannick Schnider <[email protected]>

* Implement warmup for continuous batching (#83)

* Implement warmup for continuous batching

Signed-off-by: Thomas Parnell <[email protected]>

* fmt

Signed-off-by: Thomas Parnell <[email protected]>

* freeing block directly and small things

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>

* initialize tkv

Signed-off-by: Nikolaos Papandreou <[email protected]>

* Return empty ModelRunnerOutput if no work

Signed-off-by: Nikolaos Papandreou <[email protected]>

* update mask for decode

Signed-off-by: Nikolaos Papandreou <[email protected]>

* Fix copy/paste error

Signed-off-by: Thomas Parnell <[email protected]>

* adaptive logging (thx joerunde)

Co-authored-by: Joe Runde <[email protected]>

Signed-off-by: Yannick Schnider <[email protected]>

* remove warmup shapes for continuous batching

Signed-off-by: Yannick Schnider <[email protected]>

* assuring prefill lengths are multiples of block size 64 in example script

Signed-off-by: Yannick Schnider <[email protected]>

* revert change to warmup shape

Signed-off-by: Thomas Parnell <[email protected]>

* 🎨 fmt

Signed-off-by: Joe Runde <[email protected]>

* Added call to update_lazyhandle

Signed-off-by: Thomas Parnell <[email protected]>

* Right padding of prompts (#95)

* right padding initial implementation

Signed-off-by: Yannick Schnider <[email protected]>

* fix right padding: remove the right padded logits before sampling

Signed-off-by: Yannick Schnider <[email protected]>

* fix typing

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* [CB] Fix Tensor Parallelism Error (#103)

* divide tensor third dimension by number of TP

Signed-off-by: Sophie du Couédic <[email protected]>

* Use existing method from vllm to get 'num_kv_heads' (works also for TP>1)

Signed-off-by: Sophie du Couédic <[email protected]>

---------

Signed-off-by: Sophie du Couédic <[email protected]>

* support granite-3.2-8b-instruct (#106)

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>

* comments

Signed-off-by: Yannick Schnider <[email protected]>

* adapt to change of arguments in fms

Signed-off-by: Yannick Schnider <[email protected]>

* fix mypy issue

Signed-off-by: Yannick Schnider <[email protected]>

* revising continuous batching scheduler

Signed-off-by: Yannick Schnider <[email protected]>

* [V1] Decoupling static and continuous batching  (#116)

* decoupling static and continuous batching scheduler

Signed-off-by: Yannick Schnider <[email protected]>

* fix dynamo cache for continuous batching

Signed-off-by: Yannick Schnider <[email protected]>

* removing warmup shape dependency for continuous batching!

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* addressing review cosmetics

Signed-off-by: Yannick Schnider <[email protected]>

* fix/refactor: remove last_running and total_running (#112)

Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>

* fix comment kv cache tensor initialization

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Sophie du Couédic <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Co-authored-by: Nikolaos Papandreou <[email protected]>
Co-authored-by: Thomas Parnell <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Sophie du Couédic <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
3 participants