fix: ensure vision tokens present in text for Qwen2-VL pretraining by vra · Pull Request #3 · vra/Thinking-with-Visual-Primitives-pytorch

vra · 2026-06-11T11:21:02Z

Problem

Fixes #1 - Pretraining crashes with ValueError: Image features and image tokens do not match: tokens: 0, features 288.

Root Cause

Qwen2-VL requires <|vision_start|><|image_pad|><|vision_end|> tokens in the input text to match the image features produced by the vision encoder. apply_chat_template can fail to insert these tokens, either because the template doesn't handle multimodal content on certain model versions, or because the fallback _fallback_format strips image references. In that case the processor still produces pixel_values and image_grid_thw from the images, but input_ids contains no <|image_pad|> tokens, which triggers the mismatch error in get_placeholder_mask.
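To make the mismatch concrete, here is a minimal diagnostic sketch that compares the number of placeholder tokens in input_ids with the number of features implied by image_grid_thw. The function name, the placeholder id 99, and the example grid (1, 32, 36) are hypothetical, and the sketch assumes a spatial merge size of 2 (each 2x2 patch group becomes one feature); check the model config before relying on it.

```python
from typing import Iterable, Sequence, Tuple


def count_placeholder_mismatch(
    input_ids: Sequence[int],
    image_pad_id: int,
    grid_thw: Iterable[Tuple[int, int, int]],
    merge_size: int = 2,  # assumption: 2x2 spatial merge, verify against the config
) -> Tuple[int, int]:
    """Return (placeholder tokens found in the text, image features expected)."""
    # Number of <|image_pad|> ids actually present in the tokenized text.
    n_tokens = sum(1 for tok in input_ids if tok == image_pad_id)
    # Each image contributes t * h * w patches, merged in groups of merge_size**2.
    n_features = sum((t * h * w) // (merge_size ** 2) for t, h, w in grid_thw)
    return n_tokens, n_features


# Hypothetical sample: text without any placeholder, one image with a 1x32x36 grid.
tokens, features = count_placeholder_mismatch(
    input_ids=[1, 2, 3], image_pad_id=99, grid_thw=[(1, 32, 36)]
)
print(f"tokens: {tokens}, features {features}")
```

Running a check like this on a batch before the forward pass turns the crash into an early, readable warning about which samples lack vision tokens.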

Fix

After apply_chat_template produces the text, check whether <|vision_start|> is present. If missing and the sample has a real image, manually insert the vision tokens after <|im_start|>user\n.
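The check-and-insert step described above could look roughly like the following. This is a sketch, not the code in this PR: the function name ensure_vision_tokens and the has_image flag are hypothetical, and prepending the tokens when no user marker is found is an assumption made only for this example.

```python
VISION_BLOCK = "<|vision_start|><|image_pad|><|vision_end|>"
USER_MARKER = "<|im_start|>user\n"


def ensure_vision_tokens(text: str, has_image: bool) -> str:
    """Insert the vision tokens into templated text if they are missing."""
    # Nothing to do for text-only samples or when the template already worked.
    if not has_image or "<|vision_start|>" in text:
        return text
    idx = text.find(USER_MARKER)
    if idx == -1:
        # Assumption for this sketch: with no user marker, prepend the tokens.
        return VISION_BLOCK + text
    # Insert right after the first user-turn start marker.
    cut = idx + len(USER_MARKER)
    return text[:cut] + VISION_BLOCK + text[cut:]


# Hypothetical templated text that lost its image reference.
broken = "<|im_start|>user\nDescribe the image.<|im_end|>\n"
print(ensure_vision_tokens(broken, has_image=True))
```

Because the function returns the text unchanged when <|vision_start|> is already present, applying it twice does not duplicate the tokens.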


vra mentioned this pull request Jun 11, 2026

[Pretraining ERROR] #1

Closed


vra wants to merge 1 commit into main from fix/issue-1-vision-token-mismatch
