fix: ensure vision tokens present in text for Qwen2-VL pretraining #3

Open
vra wants to merge 1 commit into main from fix/issue-1-vision-token-mismatch

vra (Owner) commented Jun 11, 2026

Problem

Fixes #1. Pretraining crashes with `ValueError: Image features and image tokens do not match: tokens: 0, features 288`.

Root Cause

Qwen2-VL requires `<|vision_start|><|image_pad|><|vision_end|>` tokens in the input text so that the placeholder positions line up with the image features produced by the vision encoder. `apply_chat_template` can fail to insert these tokens in two cases: the template does not handle multimodal content on certain model versions, or the fallback `_fallback_format` strips image references. In either case the processor still produces `pixel_values` and `image_grid_thw` from the images, but `input_ids` contains zero `<|image_pad|>` tokens, which triggers the mismatch in `get_placeholder_mask`.
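The mismatch described above can be detected before the forward pass. The following is a minimal sketch, not code from this PR: the helper names are hypothetical, `image_pad_id` must be looked up from the tokenizer, and the default `merge_size=2` is an assumption about the Qwen2-VL spatial merge setting. It compares the number of `<|image_pad|>` ids in `input_ids` with the feature count implied by `image_grid_thw`.

```python
from math import prod
from typing import Sequence, Tuple


def expected_image_tokens(
    image_grid_thw: Sequence[Tuple[int, int, int]], merge_size: int = 2
) -> int:
    """Feature count implied by the (t, h, w) patch grids.

    Assumption: each image contributes t * h * w // merge_size**2 features.
    """
    return sum(prod(grid) // (merge_size ** 2) for grid in image_grid_thw)


def check_alignment(
    input_ids: Sequence[int],
    image_grid_thw: Sequence[Tuple[int, int, int]],
    image_pad_id: int,
    merge_size: int = 2,
) -> Tuple[int, int, bool]:
    """Return (placeholder tokens, expected features, whether they match)."""
    tokens = sum(1 for tok in input_ids if tok == image_pad_id)
    features = expected_image_tokens(image_grid_thw, merge_size)
    return tokens, features, tokens == features
```

Under these assumptions, a single image with grid `(1, 32, 36)` and no placeholder ids in `input_ids` gives 0 tokens against 288 features, the same figures reported in the error.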

Fix

After `apply_chat_template` produces the text, check whether `<|vision_start|>` is present. If it is missing and the sample has a real image, insert the vision tokens manually, immediately after `<|im_start|>user\n`.
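A minimal sketch of this check, written as a standalone string helper rather than the actual diff: the function name `ensure_vision_tokens` and the `has_image` flag are hypothetical, and the behaviour when no user-turn marker is found (prepending the tokens) is an assumption that this PR does not specify.

```python
VISION_START = "<|vision_start|>"
IMAGE_PAD = "<|image_pad|>"
VISION_END = "<|vision_end|>"
USER_TURN = "<|im_start|>user\n"


def ensure_vision_tokens(text: str, has_image: bool) -> str:
    """Insert the vision token block if the chat template omitted it."""
    # Nothing to do for text-only samples or when the template already worked.
    if not has_image or VISION_START in text:
        return text

    block = VISION_START + IMAGE_PAD + VISION_END
    idx = text.find(USER_TURN)
    if idx == -1:
        # Assumption: with no user-turn marker, prepend the block.
        return block + text

    # Insert right after the first user-turn start marker.
    insert_at = idx + len(USER_TURN)
    return text[:insert_at] + block + text[insert_at:]
```

The helper is idempotent: once `<|vision_start|>` is present, a second call returns the text unchanged, so it is safe to run on samples where the template already inserted the tokens.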

Qwen2-VL requires <|vision_start|><|image_pad|><|vision_end|> tokens in
the input text to match image features from the vision encoder. When
apply_chat_template fails to insert these tokens (either because the
template doesn't handle multimodal content on certain model versions,
or because the fallback _fallback_format strips image references),
the processor produces pixel_values without corresponding image tokens
in input_ids, causing 'Image features and image tokens do not match:
tokens: 0, features 288'.

The fix checks for missing vision tokens after apply_chat_template and
inserts them manually after the user turn start marker.
vra mentioned this pull request Jun 11, 2026

Development

Successfully merging this pull request may close these issues.

[Pretraining ERROR]
