[V1] Add BlockTable class #11693
Conversation
Signed-off-by: Woosuk Kwon <[email protected]>
vllm/v1/worker/gpu_block_table.py (Outdated)

        self.block_table.fill_(0)
        self.block_table_cpu.fill_(0)

    def cuda(self) -> torch.Tensor:
It might make sense to call this something other than cuda(), since ideally this class can be shared across all backends.
Maybe like to_device()
I was about to ask a similar question lol. The file name is "gpu"_block_table.py. Is this BlockTable supposed to be used only by GPUs, or is it actually general purpose?
@robertgshaw2-neuralmagic @comaniac Good point. I renamed gpu_block_table.py to block_table.py and cuda to to_device as you suggested.

That being said, I plan to add a GPU-specific optimization to speed up the block table copy from CPU to GPU. Since that optimization will involve a CUDA kernel, it will not be shared with other hardware.

Also, please note that the shape of the block table actually depends on the attention kernel. For example, FlashInfer requires a different layout than the one in this PR. Likewise, other hardware might want different layouts and therefore possibly different implementations of append_row and move_row.
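To make the layout point concrete, below is a rough, hypothetical sketch of a row-major block table with per-row bookkeeping; the class, attribute, and method names are illustrative and this is not the PR's actual code. A backend whose attention kernel expects a different layout would need different append_row / move_row implementations.

```python
import numpy as np
import torch


class BlockTableSketch:
    """Illustrative sketch only: one row of block IDs per request."""

    def __init__(self, max_num_reqs: int, max_num_blocks_per_req: int) -> None:
        shape = (max_num_reqs, max_num_blocks_per_req)
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.block_table = torch.zeros(shape, dtype=torch.int32, device=device)
        self.block_table_cpu = torch.zeros(shape, dtype=torch.int32)
        # Number of valid block IDs currently stored in each row.
        self.num_blocks_per_row = np.zeros(max_num_reqs, dtype=np.int32)

    def append_row(self, row_idx: int, block_ids: list[int]) -> None:
        # Append new block IDs after the row's current valid prefix.
        start = int(self.num_blocks_per_row[row_idx])
        end = start + len(block_ids)
        self.block_table_cpu[row_idx, start:end] = torch.tensor(
            block_ids, dtype=torch.int32)
        self.num_blocks_per_row[row_idx] = end

    def move_row(self, src: int, tgt: int) -> None:
        # Copy only the valid prefix of the source row rather than the
        # whole (potentially mostly empty) row.
        num_blocks = int(self.num_blocks_per_row[src])
        self.block_table_cpu[tgt, :num_blocks] = \
            self.block_table_cpu[src, :num_blocks]
        self.num_blocks_per_row[tgt] = num_blocks
```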
nice!
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Woosuk Kwon <[email protected]>
    def to_device(self) -> torch.Tensor:
        """Returns the device tensor of the block table."""
        return self.block_table
I looked at this API again and found it's a bit weird to call it to_device, because we are not actually transferring tensors in this call (unlike torch_tensor.to("cuda")). Following the naming convention of .cpu(), this API should be named .device(), but I'm not sure if this makes sense to others.
Good point. I actually named it device first, and then found that the class already had a device attribute 😂 and PyTorch's convention is that x.device returns the device x lives on.
Ah that's true... then another way is to name everything with a verb, like to_device, to_cpu, to_numpy. Although we don't actually do any transfer in these calls, this may be less confusing. @robertgshaw2-neuralmagic WDYT?
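To illustrate the distinction being debated, here is a minimal standalone sketch (the names and shapes are made up, not from this PR): the accessors simply hand back tensors that were allocated on their device up front, whereas torch.Tensor.to("cuda") would allocate and copy.

```python
import numpy as np
import torch


class TwinTables:
    """Illustrative only: one tensor per device, allocated once."""

    def __init__(self) -> None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.block_table = torch.zeros(4, 8, dtype=torch.int32, device=device)
        self.block_table_cpu = torch.zeros(4, 8, dtype=torch.int32)

    def to_device(self) -> torch.Tensor:
        # No data movement: the device tensor already exists.
        return self.block_table

    def to_cpu(self) -> torch.Tensor:
        # Likewise, just returns the pre-allocated CPU tensor.
        return self.block_table_cpu

    def to_numpy(self) -> np.ndarray:
        # Zero-copy NumPy view of the CPU tensor's storage.
        return self.block_table_cpu.numpy()
```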
This PR adds the BlockTable class, a thin wrapper around the GPU & CPU block table tensors. It will help reduce the complexity of the input preparation logic.

Also, BlockTable optimizes the memory movement when switching rows of the CPU block table tensor, by tracking the actual number of blocks per row and doing only the necessary copies (instead of blindly copying entire rows).

NOTE: This PR is a precursor to #11401, which optimizes the block table copy from CPU to GPU.
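As a rough illustration of the CPU-to-GPU step that #11401 then optimizes (the function and argument names below are hypothetical, not taken from either PR): only the rows belonging to active requests need to be synchronized to the device tensor.

```python
import torch


def copy_block_table_to_device(block_table: torch.Tensor,
                               block_table_cpu: torch.Tensor,
                               num_reqs: int) -> None:
    # Copy only the first num_reqs rows; the remaining rows of the device
    # tensor are stale but unused. non_blocking=True lets the copy overlap
    # with other work when the CPU tensor is in pinned memory.
    block_table[:num_reqs].copy_(block_table_cpu[:num_reqs], non_blocking=True)
```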