yannicks1
Collaborator

Introducing new env variables

Adds two new environment variables that determine the dimensions of the KV cache on AIU Spyre:

  • VLLM_SPYRE_MAX_BATCH_SIZE
  • VLLM_SPYRE_MAX_CONTEXT_LENGTH

Note: these variables will interact with the AIU Spyre compiler.
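As a rough illustration of how two such variables could feed the KV cache dimensions, here is a minimal Python sketch. The default values, the helper `_read_positive_int`, the `KVCacheDims` class, and the `(batch, context, kv_heads, head_dim)` tensor layout are all hypothetical assumptions chosen for illustration; this is not the vllm-spyre implementation, and it does not model the interaction with the AIU Spyre compiler.

```python
import os
from dataclasses import dataclass


def _read_positive_int(name: str, default: int) -> int:
    """Hypothetical helper: read an integer env var, falling back to a default."""
    raw = os.getenv(name)
    value = default if raw is None or raw == "" else int(raw)
    if value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value}")
    return value


@dataclass(frozen=True)
class KVCacheDims:
    """Hypothetical container for the two KV cache dimensions set via env vars."""

    max_batch_size: int
    max_context_length: int

    @classmethod
    def from_env(cls) -> "KVCacheDims":
        # The defaults (1 and 2048) are assumptions; the PR does not state them.
        return cls(
            max_batch_size=_read_positive_int("VLLM_SPYRE_MAX_BATCH_SIZE", 1),
            max_context_length=_read_positive_int(
                "VLLM_SPYRE_MAX_CONTEXT_LENGTH", 2048
            ),
        )

    def kv_cache_shape(self, num_kv_heads: int, head_dim: int) -> tuple:
        # Assumed per-layer layout for each of K and V:
        # (batch, context, kv_heads, head_dim).
        return (self.max_batch_size, self.max_context_length, num_kv_heads, head_dim)


if __name__ == "__main__":
    os.environ["VLLM_SPYRE_MAX_BATCH_SIZE"] = "4"
    os.environ["VLLM_SPYRE_MAX_CONTEXT_LENGTH"] = "2048"
    dims = KVCacheDims.from_env()
    print(dims.kv_cache_shape(num_kv_heads=8, head_dim=128))
```

Validating the values when they are read means a misconfigured variable (for example `0`) fails immediately instead of silently producing an empty cache.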

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure that your code passes all the linting checks; otherwise, your PR cannot be merged. To do so, first install the linting requirements, then run format.sh and commit the changes:

pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

Collaborator

nikolaospapandreou left a comment

LGTM

yannicks1 merged commit 2134317 into dev-continuous-batching on Apr 1, 2025
3 checks passed
yannicks1 deleted the ysc-env-vars branch on April 1, 2025 at 16:55
yannicks1 added a commit that referenced this pull request Apr 9, 2025
* [Continuous batching] FMS model wrapper (#18)

* fms wrapper dummy for continuous batching implementation, gating via env var VLLM_SPYRE_USE_CB

Signed-off-by: Yannick Schnider <[email protected]>

* implementing fms wrapper with correct KV cache management

Signed-off-by: Yannick Schnider <[email protected]>

* disable prints by default

Signed-off-by: Yannick Schnider <[email protected]>

* code refactoring fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* fix default path not using CB/ fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* correct print when TESTING_CB

Signed-off-by: Yannick Schnider <[email protected]>

* remove self.past_key_value_states when KV cache is managed by FMS wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* read-out only active pages of KV cache (covers when curr batch size < max batch size)

Signed-off-by: Yannick Schnider <[email protected]>

* uniquely distinguishing prefills and decodes

Signed-off-by: Yannick Schnider <[email protected]>

* reading kv cache dimension from model config

Signed-off-by: Yannick Schnider <[email protected]>

* cosmetics and comments

Signed-off-by: Yannick Schnider <[email protected]>

* support for gpt big code models

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix hard coded test mask

Signed-off-by: Yannick Schnider <[email protected]>

* change KV cache type for prefill

Signed-off-by: Yannick Schnider <[email protected]>

* update tkv in fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* moving fms wrapper to own class

Signed-off-by: Yannick Schnider <[email protected]>

* reset tkv for new prompt

Signed-off-by: Yannick Schnider <[email protected]>

* ignoring test_spyre_tensor_parallel.py, since FMS wrapper does not support it

Signed-off-by: Yannick Schnider <[email protected]>

* removing VLLM_SPYRE_USE_CB, since FMS wrapper is now used by default

Signed-off-by: Yannick Schnider <[email protected]>

* typing fms wrapper class

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* moving model loading into FMS wrapper (#35)

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update (#40)

Signed-off-by: Yannick Schnider <[email protected]>

* FMS Wrapper for static batching (#39)

* introducing pseudo fms wrapper for static batching

Signed-off-by: Yannick Schnider <[email protected]>

* small bug fix

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>

* [Continuous Batching] Introducing new env variables (#67)

* introducing env variables for AIU Spyre KV cache dimensions

Signed-off-by: Yannick Schnider <[email protected]>

* removing prints

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* [Continuous batching] Initial cb test (#52)

* initial cb test

Signed-off-by: Nikolaos Papandreou <[email protected]>

* make tkv, active_pages optional in SpyreCausalLM class for the V0 tests

Signed-off-by: Nikolaos Papandreou <[email protected]>

* format

Signed-off-by: Nikolaos Papandreou <[email protected]>

* remove manual testing and fix formatting

Signed-off-by: Yannick Schnider <[email protected]>

* remove tkv2fms

Signed-off-by: Yannick Schnider <[email protected]>

* remove unnecessary class variables

Signed-off-by: Yannick Schnider <[email protected]>

* tidy up class variables

Signed-off-by: Yannick Schnider <[email protected]>

* simplify code: req_ids2idx and active_pages will be reset in prepare input anyway...

Signed-off-by: Yannick Schnider <[email protected]>

* renaming variable

Signed-off-by: Yannick Schnider <[email protected]>

* removing batch padding in prefill stage

Signed-off-by: Yannick Schnider <[email protected]>

* indices always list of Trues since no padding or removed sequences...

Signed-off-by: Yannick Schnider <[email protected]>

* fix active/free page handling

Signed-off-by: Yannick Schnider <[email protected]>

* avoiding unnecessary tensor construction

Signed-off-by: Yannick Schnider <[email protected]>

* fix sorting indifference token/position_ids vs masks

Signed-off-by: Yannick Schnider <[email protected]>

* refactoring not requiring req_ids2idx

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused class variables, simplifying code

Signed-off-by: Yannick Schnider <[email protected]>

* use VLLM_SPYRE_MAX_BATCH_SIZE to control (decoding) batch size on AIU Spyre

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary helper functions for schedule and add_request

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused argument

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>

* re-enabling TP tests

Signed-off-by: Yannick Schnider <[email protected]>

* addressing feedback: renaming and removing unused stuff

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary getter function and other feedback

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Nikolaos Papandreou <[email protected]>
Co-authored-by: Nikolaos Papandreou <[email protected]>
rafvasq pushed a commit to rafvasq/vllm-spyre that referenced this pull request Apr 11, 2025
…llm-project#66)
yannicks1 added a commit that referenced this pull request Apr 28, 2025
* [Continuous batching] FMS model wrapper (#18)

* fms wrapper dummy for continuous batching implementation, gating via env var VLLM_SPYRE_USE_CB

Signed-off-by: Yannick Schnider <[email protected]>

* implementing fms wrapper with correct KV cache management

Signed-off-by: Yannick Schnider <[email protected]>

* disable prints by default

Signed-off-by: Yannick Schnider <[email protected]>

* code refactoring fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* fix default path not using CB/ fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* correct print when TESTING_CB

Signed-off-by: Yannick Schnider <[email protected]>

* remove self.past_key_value_states when KV cache is managed by FMS wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* read-out only active pages of KV cache (covers when curr batch size < max batch size)

Signed-off-by: Yannick Schnider <[email protected]>

* uniquely distinguishing prefills and decodes

Signed-off-by: Yannick Schnider <[email protected]>

* reading kv cache dimension from model config

Signed-off-by: Yannick Schnider <[email protected]>

* cosmetics and comments

Signed-off-by: Yannick Schnider <[email protected]>

* support for gpt big code models

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix hard coded test mask

Signed-off-by: Yannick Schnider <[email protected]>

* change KV cache type for prefill

Signed-off-by: Yannick Schnider <[email protected]>

* update tkv in fms wrapper

Signed-off-by: Yannick Schnider <[email protected]>

* moving fms wrapper to own class

Signed-off-by: Yannick Schnider <[email protected]>

* reset tkv for new prompt

Signed-off-by: Yannick Schnider <[email protected]>

* ignoring test_spyre_tensor_parallel.py, since FMS wrapper does not support it

Signed-off-by: Yannick Schnider <[email protected]>

* removing VLLM_SPYRE_USE_CB, since FMS wrapper is now used by default

Signed-off-by: Yannick Schnider <[email protected]>

* typing fms wrapper class

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* moving model loading into FMS wrapper (#35)

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update (#40)

Signed-off-by: Yannick Schnider <[email protected]>

* FMS Wrapper for static batching (#39)

* introducing pseudo fms wrapper for static batching

Signed-off-by: Yannick Schnider <[email protected]>

* small bug fix

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix idx kv cache update

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>

* [Continuous Batching] Introducing new env variables (#67)

* introducing env variables for AIU Spyre KV cache dimensions

Signed-off-by: Yannick Schnider <[email protected]>

* removing prints

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* [Continuous batching] Initial cb test (#52)

* initial cb test

Signed-off-by: Nikolaos Papandreou <[email protected]>

* make tkv, active_pages optional in SpyreCausalLM class for the V0 tests

Signed-off-by: Nikolaos Papandreou <[email protected]>

* format

Signed-off-by: Nikolaos Papandreou <[email protected]>

* remove manual testing and fix formatting

Signed-off-by: Yannick Schnider <[email protected]>

* remove tkv2fms

Signed-off-by: Yannick Schnider <[email protected]>

* remove unnecessary class variables

Signed-off-by: Yannick Schnider <[email protected]>

* tidy up class variables

Signed-off-by: Yannick Schnider <[email protected]>

* simplify code: req_ids2idx and active_pages will be reset in prepare input anyway...

Signed-off-by: Yannick Schnider <[email protected]>

* renaming variable

Signed-off-by: Yannick Schnider <[email protected]>

* removing batch padding in prefill stage

Signed-off-by: Yannick Schnider <[email protected]>

* indices always list of Trues since no padding or removed sequences...

Signed-off-by: Yannick Schnider <[email protected]>

* fix active/free page handling

Signed-off-by: Yannick Schnider <[email protected]>

* avoiding unnecessary tensor construction

Signed-off-by: Yannick Schnider <[email protected]>

* fix sorting indifference token/position_ids vs masks

Signed-off-by: Yannick Schnider <[email protected]>

* refactoring not requiring req_ids2idx

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused class variables, simplifying code

Signed-off-by: Yannick Schnider <[email protected]>

* use VLLM_SPYRE_MAX_BATCH_SIZE to control (decoding) batch size on AIU Spyre

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary helper functions for schedule and add_request

Signed-off-by: Yannick Schnider <[email protected]>

* removing unused argument

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>

* re-enabling TP tests

Signed-off-by: Yannick Schnider <[email protected]>

* addressing feedback: renaming and removing unused stuff

Signed-off-by: Yannick Schnider <[email protected]>

* removing unnecessary getter function and other feedback

Signed-off-by: Yannick Schnider <[email protected]>

* integrating new FMS API on branch 'paged_attn_mock'

Signed-off-by: Yannick Schnider <[email protected]>

* torch dynamo: mark dynamic/static shapes

Signed-off-by: Yannick Schnider <[email protected]>

* bugfix key_value_states name

Signed-off-by: Nikolaos Papandreou <[email protected]>

* making block_table and slot_mapping args, not class vars

Signed-off-by: Yannick Schnider <[email protected]>

* formatting after browser merge...

Signed-off-by: Yannick Schnider <[email protected]>

* nicer handling of arguments continuous vs static batching

Signed-off-by: Yannick Schnider <[email protected]>

* Implement warmup for continuous batching (#83)

* Implement warmup for continuous batching

Signed-off-by: Thomas Parnell <[email protected]>

* fmt

Signed-off-by: Thomas Parnell <[email protected]>

* freeing block directly and small things

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>

* initialize tkv

Signed-off-by: Nikolaos Papandreou <[email protected]>

* Return empty ModelRunnerOutput if no work

Signed-off-by: Nikolaos Papandreou <[email protected]>

* update mask for decode

Signed-off-by: Nikolaos Papandreou <[email protected]>

* Fix copy/paste error

Signed-off-by: Thomas Parnell <[email protected]>

* adaptive logging (thx joerunde)

Co-authored-by: Joe Runde <[email protected]>

Signed-off-by: Yannick Schnider <[email protected]>

* remove warmup shapes for continuous batching

Signed-off-by: Yannick Schnider <[email protected]>

* ensuring prefill lengths are multiples of block size 64 in example script

Signed-off-by: Yannick Schnider <[email protected]>

* revert change to warmup shape

Signed-off-by: Thomas Parnell <[email protected]>

* 🎨 fmt

Signed-off-by: Joe Runde <[email protected]>

* Added call to update_lazyhandle

Signed-off-by: Thomas Parnell <[email protected]>

* Right padding of prompts (#95)

* right padding initial implementation

Signed-off-by: Yannick Schnider <[email protected]>

* fix right padding: remove the right padded logits before sampling

Signed-off-by: Yannick Schnider <[email protected]>

* fix typing

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* [CB] Fix Tensor Parallelism Error (#103)

* divide tensor third dimension by number of TP

Signed-off-by: Sophie du Couédic <[email protected]>

* Use existing method from vllm to get 'num_kv_heads' (works also for TP>1)

Signed-off-by: Sophie du Couédic <[email protected]>

---------

Signed-off-by: Sophie du Couédic <[email protected]>

* support granite-3.2-8b-instruct (#106)

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>

* comments

Signed-off-by: Yannick Schnider <[email protected]>

* adapt to change of arguments in fms

Signed-off-by: Yannick Schnider <[email protected]>

* fix mypy issue

Signed-off-by: Yannick Schnider <[email protected]>

* revising continuous batching scheduler

Signed-off-by: Yannick Schnider <[email protected]>

* [V1] Decoupling static and continuous batching  (#116)

* decoupling static and continuous batching scheduler

Signed-off-by: Yannick Schnider <[email protected]>

* fix dynamo cache for continuous batching

Signed-off-by: Yannick Schnider <[email protected]>

* removing warmup shape dependency for continuous batching!

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>

* addressing review cosmetics

Signed-off-by: Yannick Schnider <[email protected]>

* fix/refactor: remove last_running and total_running (#112)

Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Yannick Schnider <[email protected]>

* fix comment kv cache tensor initialization

Signed-off-by: Yannick Schnider <[email protected]>

---------

Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Nikolaos Papandreou <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Sophie du Couédic <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Co-authored-by: Nikolaos Papandreou <[email protected]>
Co-authored-by: Thomas Parnell <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Sophie du Couédic <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>