[V1][Model] Add V1 support for Qwen2-VL #11668
Conversation
Signed-off-by: imkero <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
@@ -791,6 +791,7 @@ def _parse_video_data(

class Qwen2VLMultiModalProcessor(BaseMultiModalProcessor):
    _placeholder_map: Optional[dict[str, list[int]]] = None
I think we should initialize this in the init method to avoid confusing it with a static class variable.
Apart from this, the processor-related changes in the model file LGTM.
Hello @imkero! Much appreciated that you made this PR! The reason why I haven't spent too much time on Qwen2-VL is that I want to see if there's a way to move M-RoPE inside the model file for Qwen2-VL, since it is so specific to this model. You would also need to change the implementation of
Feel free to take changes from here into this PR.
if not self._placeholder_map:
    # NOTE: Only Qwen2VLProcessor in transformers 4.47.0 has
    # image_token and video_token registered
    encode_fn = hf_processor.tokenizer.encode
    self._placeholder_map = {
        "image": encode_fn(hf_processor.image_token),
        "video": encode_fn(hf_processor.video_token),
    }
placeholder = self._placeholder_map
Also, we can set this at initialization time.
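A minimal sketch of that suggestion, building on the class quoted above; the _get_hf_processor helper and the __init__ signature are assumptions for illustration, not the PR's actual code:

class Qwen2VLMultiModalProcessor(BaseMultiModalProcessor):

    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        # Assumed helper for fetching the cached HF processor. Building the
        # map once per instance avoids confusing it with a shared
        # class-level attribute and avoids re-encoding on every call.
        hf_processor = self._get_hf_processor()
        encode_fn = hf_processor.tokenizer.encode
        self._placeholder_map: dict[str, list[int]] = {
            "image": encode_fn(hf_processor.image_token),
            "video": encode_fn(hf_processor.video_token),
        }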
encoder_outputs.append((
    encoder_output[0][start_idx:end_idx],  # embedding tensor
    encoder_output[1],  # modality
))
My thought is we don't necessarily need to have the modality key here. We can leverage the fact that any two mm_position's from any modalities cannot possibly have overlaps, and now that

vllm/vllm/model_executor/models/utils.py
Lines 408 to 423 in 11d8a09
def merge_multimodal_embeddings(
    input_ids: torch.Tensor,
    inputs_embeds: torch.Tensor,
    multimodal_embeddings: NestedTensors,
    placeholder_token_id: Union[int, List[int]],
) -> torch.Tensor:
    """
    Merge ``multimodal_embeddings`` into ``inputs_embeds`` by overwriting the
    positions in ``inputs_embeds`` corresponding to placeholder tokens in
    ``input_ids``.

    ``placeholder_token_id`` can be a list of token ids (e.g, token ids
    of img_start, img_break, and img_end tokens) when needed: This means
    the order of these tokens in the ``input_ids`` MUST MATCH the order of
    their embeddings in ``multimodal_embeddings`` since we need to
    slice-merge instead of individually scattering.
can apply the embedding replacement based on a list of token ids (so we can simply have [self.config.image_token_id, self.config.video_token_id] here).

Therefore, all we need to do should be just sorting mm_position's and their corresponding mm_inputs in the following code (which also needs to be modified to support the video modality for Qwen2-VL in this PR):
Lines 51 to 59 in 11d8a09
# Multi-modal input metadata.
mm_positions = self.inputs.multi_modal_placeholders
if mm_positions:
    # FIXME(woosuk): Support other modalities.
    self.mm_positions = mm_positions.get("image", [])
else:
    self.mm_positions = []
# Output of the mm input mapper (e.g., image tensors).
self.mm_inputs: List[MultiModalKwargs] = []
WDYT?
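For concreteness, a sketch of what the model-side merge could then look like; the surrounding variables (input_ids, inputs_embeds, multimodal_embeddings) are assumed to exist as in a typical embedding path and are not copied from the PR:

# One merge call covers both modalities: placeholder ranges from different
# modalities never overlap, and merge_multimodal_embeddings accepts a list
# of placeholder token ids. The entries of multimodal_embeddings must
# already be ordered by their position in input_ids.
inputs_embeds = merge_multimodal_embeddings(
    input_ids,
    inputs_embeds,
    multimodal_embeddings,
    [self.config.image_token_id, self.config.video_token_id],
)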
On second thought, let me actually work on this design for llava-onevision too.
Hello @imkero! Please feel free to take a look at the updated code in https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_onevision.py for dealing with multiple modalities. In particular, I think you can pretty much adopt the same code below for Qwen2-VL without changing the interface for the model runner and encoder cache. Let me know if you need any help, and I'm happy to work on this PR as well if you don't have the bandwidth!

vllm/vllm/model_executor/models/llava_onevision.py Lines 547 to 560 in cf5f000
vllm/vllm/model_executor/models/llava_onevision.py Lines 812 to 834 in cf5f000
vllm/vllm/model_executor/models/llava_onevision.py Lines 844 to 846 in cf5f000
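On the model-runner side, a rough sketch of how the request metadata could collect and order placeholders from all modalities; this assumes dict-style placeholder ranges with an offset key, which may differ from the actual PR:

# Gather placeholder ranges from every modality and sort them by their
# offset in the prompt, so their order matches the order of the
# corresponding encoder outputs.
mm_positions = self.inputs.multi_modal_placeholders or {}
self.mm_positions = sorted(
    (pos for ranges in mm_positions.values() for pos in ranges),
    key=lambda pos: pos["offset"],  # assumption: ranges behave like dicts
)
# Output of the mm input mapper (e.g., image/video tensors), kept in the
# same order as self.mm_positions.
self.mm_inputs: List[MultiModalKwargs] = []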
@ywang96 Sorry for the late response. I'll continue working on this PR soon.
This pull request has merge conflicts that must be resolved before it can be merged.
Hi,

It seems the dummy data in the profile run is not correct. Then I printed some values.

Could you help me? I would appreciate it, and I hope that Qwen2-VL will be supported by V1 in time. Thank you. Best regards
@baifanxxx I'll start taking a look at this PR tomorrow. @imkero has already done a great job of adding M-RoPE in V1 with torch.compile support, so it shouldn't take us too long to get this PR into a functional stage!
dynamic_arg_dims={
    "input_ids": 0,
    # dim 1 for mrope in shape (3, seq_len), else dim 0 in shape (seq_len, )
    "positions": lambda tensor: tensor.ndim - 1,
Does -1 work here?
The value here will be passed through to PyTorch's impl torch._dynamo.mark_dynamic(tensor, dim), and it seems to assume that dim is a non-negative integer.
You can do the conversion here:

vllm/vllm/compilation/decorators.py
Line 177 in d06e824

for k, dims in dynamic_arg_dims.items():

Iterate over the dims and convert -1 to tensor.ndim - 1.
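A sketch of that conversion inside the loop referenced above; bound_args and the exact control flow of the decorator are assumptions here, and only the negative-dim handling is the point:

for k, dims in dynamic_arg_dims.items():
    arg = bound_args.arguments.get(k)  # assumed lookup of the bound argument
    if isinstance(arg, torch.Tensor):
        dims = [dims] if isinstance(dims, int) else dims
        # Convert negative dims (e.g. -1 meaning "last dim") into the
        # non-negative indices torch._dynamo.mark_dynamic expects; this
        # covers both (seq_len,) and (3, seq_len) position tensors.
        dims = [dim if dim >= 0 else arg.ndim + dim for dim in dims]
        for dim in dims:
            torch._dynamo.mark_dynamic(arg, dim)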
Hi, when I run Qwen2-VL (2B) using this PR, I get an error:

Is there a problem with profile_run?
@baifanxxx @Zhiy-Zhang Actually I commented this assertion (
Also I modified the value of encoder_budget, and Qwen2-VL's image processor's
Thank you very much for your reply. Has this change ("modified the value of encoder_budget, and Qwen2-VL's image processor's
What's changed:
- Support M-RoPE in V1 and make it compatible with torch.compile (M-RoPE uses a 2D position tensor which differs from common RoPE, and they share the same impl in Qwen2 LM's forward fn; see the small shape illustration after this list)
- Make profile_run for Qwen2-VL launch
- Pass encoder outputs as (embeddings: torch.Tensor, modality: str) in gpu_model_runner for Qwen2-VL
- Cache the encoded image_token and video_token in Qwen2-VL's preprocessing for better performance

This PR should make Qwen2-VL work in V1 with chunked prefill and prefix caching enabled.
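For context on the first item, a tiny standalone illustration of the two position-tensor layouts (illustrative only, not code from the PR):

import torch

seq_len = 8
# Common RoPE: one position id per token.
rope_positions = torch.arange(seq_len)                # shape (seq_len,)
# M-RoPE: three rows of position ids (temporal/height/width); for a
# text-only prompt the rows coincide, while for vision tokens they differ.
mrope_positions = torch.arange(seq_len).repeat(3, 1)  # shape (3, seq_len)

# The token dimension is the last one in both layouts, which is why the
# dynamic-shape dim above is expressed as tensor.ndim - 1.
assert rope_positions.shape[-1] == seq_len
assert mrope_positions.shape == (3, seq_len)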