[do not merge][CB] get number of blocks from compiler mock implementation #205
Conversation
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
👋 Hi! Thank you for contributing to vLLM support on Spyre.
Or this can be done with
Now you are good to go 🚀
I realised all the changes in this PR only apply to the sendnn_decoder backend, and we still need the old logic for the other backends. Maybe everything works by renaming the function.
I guess we will probably need to adapt this. Rephrasing and update after the meeting today: in the previous implementation, the number of blocks was set based on
Signed-off-by: Yannick Schnider <[email protected]>
Co-authored-by: Sophie du Couédic <[email protected]> Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Not entirely sure what you mean here. This code does not change anything, except setting n_blocks = 4 for the warmup. This should always be enough, since we only do 2 prompts (block size is fixed to 64, a Spyre constraint). After warmup it is set back to what it was before.
Thanks for the great feedback @sducouedic, I addressed all of it :)
The idea is that there is a dependency between the
But you're right, as long as your temporary function is used that error won't happen; the comment is about when we start using the function from torch_sendnn.
I think I get you now: you mean to check something like
Yes, correct.
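A rough back-of-the-envelope check of why 4 blocks suffice for the warmup; the token count per prompt is an illustrative assumption, not a value from this PR:

import math

BLOCK_SIZE = 64            # fixed Spyre constraint mentioned above
num_warmup_prompts = 2     # the warmup only runs two prompts
tokens_per_prompt = 128    # assumption: prompt plus decode tokens fit within two blocks
blocks_needed = num_warmup_prompts * math.ceil(tokens_per_prompt / BLOCK_SIZE)
assert blocks_needed <= 4  # so n_blocks = 4 is enough for the warmup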
Signed-off-by: Yannick Schnider <[email protected]>
@sducouedic I have incorporated your suggested check already here
vllm_spyre/v1/worker/spyre_worker.py
Outdated
self.model_runner.vllm_config.scheduler_config.max_model_len
block_size = self.model_runner.BLOCK_SIZE  # type: ignore[union-attr]

min_req_num_blocks = max_batch_size * max_model_len // block_size
Upstream vllm has this check as:
min_req_num_blocks = max_model_len // block_size
I think it's more correct to only ensure you have enough blocks to run a single full-size request, because ideally you want to be able to set a high batch size and run many smaller requests on a long-context model. For example, say a model has a context length of 1m tokens, and with your current hardware you can only deploy it in a way where you have enough kv-cache available for 1m tokens. You wouldn't want to be forced to set the max batch size to only 1, because in practice very few requests will use the full context length. Ideally, you'd be able to set the max batch size much higher, like 256, to still run many smaller requests in parallel.
That requires scheduling with preemption though, to kick out request(s) from the batch when you run out of kv-cache blocks. I think we'd need to hook up our kv cache management with the scheduler and implement preemption in the model runner to make that happen. I haven't looked at how hard that would be to do yet.
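As a rough illustration of the two checks (the numbers below are made up for the example, not taken from the PR):

block_size = 64
max_model_len = 1_000_000        # the 1m-token context example above
max_batch_size = 256
available_blocks = max_model_len // block_size   # kv-cache for exactly one full-length request

# Upstream-style check: a single full-size request must fit.
min_req_num_blocks = max_model_len // block_size
print(available_blocks >= min_req_num_blocks)    # True -> max_batch_size = 256 stays allowed

# Stricter check from the original diff: every request at full length must fit.
strict_min_blocks = max_batch_size * max_model_len // block_size
print(available_blocks >= strict_min_blocks)     # False -> would force max_batch_size down to 1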
Thanks Joe, this makes a lot of sense. Changed it.
FYI: we need more than the above min_req_num_blocks for one of the test cases...
vllm_spyre/v1/worker/spyre_worker.py
Outdated

min_req_num_blocks = max_batch_size * max_model_len // block_size

if envs_spyre.VLLM_SPYRE_DYNAMO_BACKEND == 'sendnn_decoder':
sendnn_decoder no longer exists since #186.
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
Signed-off-by: Yannick Schnider <[email protected]>
self.model_runner.vllm_config.scheduler_config.max_model_len
block_size = self.model_runner.BLOCK_SIZE  # type: ignore[union-attr]

min_req_num_blocks = max_model_len // block_size
Is this always what min_req_num_blocks will be? Is there a case where we will not handle full max_model_len requests?
Joe's comment here says that this is how it is handled upstream. More logic in the scheduler will follow to handle e.g. running out of blocks...
# TODO: replace num_blocks_spyre by calling a function in
# torch_sendnn which returns the value set by the Spyre compiler
num_blocks_spyre = max_batch_size * min_req_num_blocks
assert num_blocks_spyre >= min_req_num_blocks, (
If min_req_num_blocks stays the same as above, we may hit a case where we can handle only a portion of the max_model_len. Should we fail in this case, or just let the user know we may not be able to support up to the full max_model_len for each request?
I think Joe's comment covers this as well. Preemption will hopefully soon be implemented (or reused from upstream)...
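A minimal sketch of the "just let the user know" option discussed above; the function name, message wording, and threshold handling are assumptions for illustration, not code from this PR:

import logging

logger = logging.getLogger(__name__)

def check_kv_cache_capacity(num_blocks: int, max_model_len: int,
                            block_size: int) -> None:
    # Blocks needed to serve one request at the full context length.
    min_req_num_blocks = max_model_len // block_size
    if num_blocks < min_req_num_blocks:
        # Warn instead of asserting, so deployment still starts but the limit is visible.
        logger.warning(
            "Only %d KV cache blocks available (%d tokens); requests may not "
            "reach the configured max_model_len of %d tokens.",
            num_blocks, num_blocks * block_size, max_model_len)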
    return num_blocks_spyre
else:  # dynamo backend 'eager'
    # TODO: how do we get a meaningful value for CPU here
    num_blocks_cpu = max_batch_size * min_req_num_blocks
At least with cuda, I believe you would profile a run and see what the peak memory usage was, then use some percentage of whatever was left over to determine the number of blocks that could fit. There is also a method for CPU that calculates it based on this:

num_cpu_blocks = int(self.cache_config.cpu_kvcache_space_bytes //
                     cache_block_size)
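For reference, a self-contained sketch of that CPU calculation; the per-block byte formula and the model dimensions below are illustrative assumptions, not values from vLLM or this PR:

def num_cpu_blocks(cpu_kvcache_space_bytes: int, block_size: int,
                   num_layers: int, num_kv_heads: int, head_dim: int,
                   dtype_bytes: int = 2) -> int:
    # One block holds key + value tensors for block_size tokens across all layers.
    cache_block_size = 2 * block_size * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(cpu_kvcache_space_bytes // cache_block_size)

# Example: 4 GiB of CPU KV cache space for a hypothetical small model -> 512 blocks.
print(num_cpu_blocks(4 * 1024**3, block_size=64, num_layers=32,
                     num_kv_heads=8, head_dim=128, dtype_bytes=2))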
@JRosenkranz as far as I can tell self.cache_config.cpu_kvcache_space_bytes is set by the user here:
https://github.com/vllm-project/vllm/blob/7e8d97dd3f0aaf05265f947997310ca3827d3c06/vllm/platforms/cpu.py#L128
I also think it is not critical to have a super meaningful value here obtained with profiling as this is merely the CPU path to test/validate the Spyre plugin code, and not an actual CPU worker.
Thanks for the review!
Do you guys think we can merge this PR now and insert the function later? Note: the PR in its current form does not change any behavior, but as it is quite some refactoring, it couldn't hurt to merge it instead of waiting for the one-line change inserting the function... @JRosenkranz @tdoublep @joerunde @sducouedic @nikolaospapandreou
I agree this should be merged.
Yes, this looks good to me to be merged.
[CB] get number of blocks from compiler mock implementation
This is a first draft of how the message passing from the Spyre compiler to vLLM could work on the vLLM side.
The process consists of the following steps:
1. The warmup is performed with num_blocks=4.
2. The num_blocks (4) dimension is marked as dynamic (torch._dynamo.mark_dynamic()) for warmup forward calls only.
3. The Spyre compiler writes the number of available blocks/pages to a .json file.
4. torch_sendnn reads the value from the .json file and can return it via a function.
5. vLLM calls the torch_sendnn function to get the number of available blocks/pages and sets num_blocks=N.
6. The actual KV cache is then allocated with that value (num_blocks=N).
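A hypothetical sketch of what the mock side of steps 4 and 5 could look like; the function name, JSON path, and key are placeholders, not the actual torch_sendnn API:

import json
from pathlib import Path

def get_num_blocks_from_compiler(json_path: str = "spyre_compiler_output.json",
                                 fallback: int = 4) -> int:
    """Mock: return the number of KV cache blocks reported by the Spyre compiler.

    The real getter is expected to live in torch_sendnn; this placeholder just
    reads the value from the .json file written during compilation.
    """
    path = Path(json_path)
    if not path.exists():
        return fallback  # e.g. during warmup, before the compiler has reported anything
    with path.open() as f:
        return int(json.load(f)["num_blocks"])

# vLLM side (step 5): allocate the real KV cache with the reported value.
num_blocks = get_num_blocks_from_compiler()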