
Conversation

yannicks1 (Collaborator) commented Jun 3, 2025

[CB] get number of blocks from compiler mock implementation

This is a first draft of how the message passing from the Spyre compiler to vLLM could work on the vLLM side.

The process consists of the following steps (a rough sketch in code follows the list):

  • For warmup, vLLM reserves the required minimum of 4 pages/blocks (num_blocks=4).
  • The num_blocks (4) dimension is marked as dynamic (torch._dynamo.mark_dynamic()) for warmup forward calls only.
  • The Spyre compiler calculates the maximum number of pages/blocks it can accommodate and writes this to a .json file.
  • torch_sendnn reads the value from the .json file and can return it via a function.
  • vLLM calls the above-mentioned torch_sendnn function to get the number of available blocks/pages and sets num_blocks=N.
  • For actual inference, vLLM then adjusts the list of free blocks/pages and the KV cache size accordingly (num_blocks=N).
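
As a rough illustration of the above on the vLLM side, here is a minimal sketch of marking the block dimension dynamic during warmup and later querying the compiler-reported block count. The torch_sendnn accessor name and the KV cache layout (block dimension first) are assumptions, not the actual API:

import torch

WARMUP_NUM_BLOCKS = 4  # minimum number of pages/blocks reserved for warmup


def warmup_forward(model, input_ids, kv_cache):
    # kv_cache is assumed to carry a leading num_blocks dimension (= 4 during
    # warmup); mark it dynamic so the compiled graph does not bake in the 4.
    torch._dynamo.mark_dynamic(kv_cache, 0)
    return model(input_ids, kv_cache=kv_cache)


def get_runtime_num_blocks() -> int:
    # After compilation, the Spyre compiler writes the maximum number of
    # blocks to a .json file; torch_sendnn is expected to expose that value
    # via a function (the name below is hypothetical).
    from torch_sendnn import get_num_available_blocks
    return get_num_available_blocks()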

github-actions bot commented Jun 3, 2025

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: make sure that your code passes all the linting checks, otherwise your PR cannot be merged. To do so, first install the linting requirements, then run format.sh and commit the changes. This can be done with uv directly:

uv sync --frozen --group lint --active --inexact

Or this can be done with pip:

uv pip compile --group lint > requirements-lint.txt
pip install -r requirements-lint.txt
bash format.sh

Now you are good to go 🚀

sducouedic (Collaborator) commented Jun 4, 2025

I realised all the changes in this PR only apply to the sendnn_decoder backend, and we still need the old logic for the other backends.

Maybe everything works by renaming the function get_num_blocks_from_compiler_mock to get_num_blocks, and in that function either return max_batch_size * max_model_len // self.BLOCK_SIZE for the cpu backend or call the torch_sendnn function for the torch_sendnn backend.
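
Roughly, that dispatch could look like the sketch below (written as a free function for clarity; the torch_sendnn accessor name is an assumption):

def get_num_blocks(backend: str, max_batch_size: int, max_model_len: int,
                   block_size: int) -> int:
    if backend == 'sendnn_decoder':
        # value computed by the Spyre compiler and exposed by torch_sendnn
        # (the accessor name below is hypothetical)
        from torch_sendnn import get_num_available_blocks
        return get_num_available_blocks()
    # cpu backend: keep the old sizing logic
    return max_batch_size * max_model_len // block_size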

sducouedic (Collaborator) commented Jun 4, 2025

I guess we will probably need to adapt max_model_len or BLOCK_SIZE to be consistent with the number of blocks? To be confirmed with the compiler team.

Rephrasing and update after today's meeting: in the previous implementation, the number of blocks was set based on max_model_len, max_batch_size and BLOCK_SIZE, and the input requests were thoroughly checked against those values. There are probably new checks to be done in order to avoid obscure errors (e.g. checking the max_model_len input).

yannicks1 and others added 4 commits June 4, 2025 12:43
yannicks1 (Author) commented:

> Rephrasing and update after today's meeting: in the previous implementation, the number of blocks was set based on max_model_len, max_batch_size and BLOCK_SIZE, and the input requests were thoroughly checked against those values. There are probably new checks to be done in order to avoid obscure errors (e.g. checking the max_model_len input).

Not entirely sure what you mean here. This code does not change anything except for setting n_blocks = 4 for the warmup. This should always be enough, since we only do 2 prompts (block size is fixed to 64, a Spyre constraint). After warmup it is set to what it was before.

yannicks1 (Author) commented:

Thanks for the great feedback @sducouedic, I addressed it all :)

sducouedic (Collaborator) commented Jun 4, 2025

> Not entirely sure what you mean here. This code does not change anything except for setting n_blocks = 4 for the warmup. This should always be enough, since we only do 2 prompts (block size is fixed to 64, a Spyre constraint). After warmup it is set to what it was before.

The idea is that there is a dependency between max_model_len, max_batch_size, and the number of blocks. Depending on the values of max_model_len and max_num_seqs set by the user, we might not have enough blocks to serve a batch of sequences. This issue couldn't arise before because the number of blocks was set based on these values, but that is no longer the case. Somehow we need to enforce that max_model_len and max_num_seqs can be served. Tom suggested doing a check and raising an error if that is not the case.
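
A minimal sketch of such a check, assuming the available block count comes from the compiler and the other values from the vLLM config (names are illustrative):

def check_block_budget(num_blocks_available: int, max_num_seqs: int,
                       max_model_len: int, block_size: int) -> None:
    # blocks needed to serve a full batch of full-length sequences
    min_req_num_blocks = max_num_seqs * max_model_len // block_size
    if num_blocks_available < min_req_num_blocks:
        raise ValueError(
            f"Only {num_blocks_available} KV cache blocks are available, but "
            f"{min_req_num_blocks} are needed to serve max_num_seqs="
            f"{max_num_seqs} at max_model_len={max_model_len} with "
            f"block_size={block_size}.")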

sducouedic (Collaborator) commented:

But you're right: as long as your temporary function is used, that error won't happen. The comment is about when we start using the torch_sendnn function.

yannicks1 (Author) commented:

I think I get you now: you mean to check something like
num_blocks_spyre >= max_batch_size * max_model_len // block_size, right?
We can already incorporate this for the future, yes!

sducouedic (Collaborator) commented:

> I think I get you now: you mean to check something like num_blocks_spyre >= max_batch_size * max_model_len // block_size, right? We can already incorporate this for the future, yes!

Yes, correct.

yannicks1 (Author) commented:

@sducouedic I have already incorporated your suggested check here:

max_model_len = \
    self.model_runner.vllm_config.scheduler_config.max_model_len
block_size = self.model_runner.BLOCK_SIZE  # type: ignore[union-attr]

min_req_num_blocks = max_batch_size * max_model_len // block_size
Collaborator:

Upstream vLLM has this check as:

min_req_num_blocks = max_model_len // block_size

I think it's more correct to only ensure you have enough blocks to run a single full-size request, because ideally you want to be able to set a high batch size to run many smaller requests on a long-context model. For example, say a model has a context length of 1m tokens, and with your current hardware you can only deploy it in a way where you have enough kv-cache available for 1m tokens. You wouldn't want to be forced to set the max batch size to only 1, because in practice very few requests will use the full context length. Ideally, you'd be able to set the max batch size much higher, like 256, to still run many smaller requests in parallel.

That requires scheduling with preemption, though, to kick out request(s) from the batch when you run out of kv-cache blocks. I think we'd need to hook up our kv cache management with the scheduler and implement preemption in the model runner to make that happen. I haven't looked at how hard that would be to do yet.
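
So the check would only guarantee room for a single full-length request, something like the sketch below (illustrative names again):

def check_min_blocks(num_blocks_available: int, max_model_len: int,
                     block_size: int) -> None:
    # upstream-style check: only require room for one full-length request;
    # serving more requests in parallel is left to the scheduler, eventually
    # with preemption when blocks run out
    min_req_num_blocks = max_model_len // block_size
    if num_blocks_available < min_req_num_blocks:
        raise ValueError(
            f"{num_blocks_available} KV cache blocks cannot hold even a "
            f"single request of max_model_len={max_model_len} tokens "
            f"(block_size={block_size}).")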

Collaborator (Author):

Thanks Joe, this makes a lot of sense. Changed it.

Collaborator (Author):

FYI: we need more than the above min_req_num_blocks for one of the test cases...


min_req_num_blocks = max_batch_size * max_model_len // block_size

if envs_spyre.VLLM_SPYRE_DYNAMO_BACKEND == 'sendnn_decoder':
Collaborator:

sendnn_decoder no longer exists since #186

max_model_len = \
    self.model_runner.vllm_config.scheduler_config.max_model_len
block_size = self.model_runner.BLOCK_SIZE  # type: ignore[union-attr]

min_req_num_blocks = max_model_len // block_size
Collaborator:

Is this always what the min_req_num_blocks will be? Is there a case where we will not handle full max_model_len requests?

Collaborator (Author):

Joe's comment here says that this is how it is handled upstream. More logic in the scheduler will follow to handle e.g. running out of blocks...

# TODO: replace num_blocks_spyre by calling a function in
# torch_sendnn which returns the value set by the Spyre compiler
num_blocks_spyre = max_batch_size * min_req_num_blocks
assert num_blocks_spyre >= min_req_num_blocks, (
Collaborator:

If min_req_num_blocks stays same as above, we may hit a case where we can handle only a portion of the max_model_len. Should we fail in this case, or just let the user know we may not be able to support up to the full max_model_len for each request?

Collaborator (Author):

I think Joe's comment covers this as well. Preemption will hopefully soon be implemented (or reused from upstream)...

return num_blocks_spyre
else: # dynamo backend 'eager'
# TODO: how do we get a meaningful value for CPU here
num_blocks_cpu = max_batch_size * min_req_num_blocks
Collaborator:

At least with CUDA, I believe you would profile a run and see what the peak memory usage was, then use some percentage of whatever was left over to determine the number of blocks that could fit.

https://github.com/vllm-project/vllm/blob/ace5cdaff0cf021ff02ddbe39ea814f2ed2e56b7/vllm/worker/worker.py#L232

There is also a version of this for the CPU worker that calculates it based on:

num_cpu_blocks = int(self.cache_config.cpu_kvcache_space_bytes //
                     cache_block_size)

https://github.com/vllm-project/vllm/blob/ace5cdaff0cf021ff02ddbe39ea814f2ed2e56b7/vllm/worker/cpu_worker.py#L239
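
For reference, a simplified sketch of how that CPU block count falls out of the configured KV cache space; the per-block byte size here is an approximation of what upstream derives from the model shape and KV cache dtype:

def num_cpu_blocks(cpu_kvcache_space_bytes: int, block_size: int,
                   num_layers: int, num_kv_heads: int, head_size: int,
                   dtype_bytes: int) -> int:
    # one block stores keys and values for `block_size` tokens in every layer
    cache_block_size = (2 * num_layers * block_size
                        * num_kv_heads * head_size * dtype_bytes)
    return int(cpu_kvcache_space_bytes // cache_block_size)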

Collaborator (Author):

@JRosenkranz as far as I can tell self.cache_config.cpu_kvcache_space_bytes is set by the user here:
https://github.com/vllm-project/vllm/blob/7e8d97dd3f0aaf05265f947997310ca3827d3c06/vllm/platforms/cpu.py#L128

I also think it is not critical to have a super meaningful value here obtained with profiling, as this is merely the CPU path to test/validate the Spyre plugin code, not an actual CPU worker.

Thanks for the review!

yannicks1 (Author) commented:

Do you guys think we can merge this PR now and insert the torch_sendnn function once it is available?

Note: the PR in its current form does not change any behavior, but as it involves quite some refactoring, it couldn't hurt to merge instead of waiting for the one-line change inserting the function...

@JRosenkranz @tdoublep @joerunde @sducouedic @nikolaospapandreou

yannicks1 marked this pull request as ready for review on June 13, 2025, 11:12
sducouedic (Collaborator) commented:

I agree this should be merged

JRosenkranz (Collaborator) commented:

> Do you guys think we can merge this PR now and insert the torch_sendnn function once it is available?
>
> Note: the PR in its current form does not change any behavior, but as it involves quite some refactoring, it couldn't hurt to merge instead of waiting for the one-line change inserting the function...
>
> @JRosenkranz @tdoublep @joerunde @sducouedic @nikolaospapandreou

Yes, this looks good to me to be merged

yannicks1 merged commit a2c68c3 into main on June 13, 2025
20 checks passed
yannicks1 deleted the ysc-mock-read-n-pages branch on June 13, 2025, 14:39