refactor(multimodal): Multimodal ABI generalization #1602

Open
yechank-nvidia wants to merge 13 commits into lightseekorg:main from yechank-nvidia:yechan/mm-abi-generalization

@yechank-nvidia yechank-nvidia commented Jun 5, 2026

Collaborator

Description

Problem

SMG multimodal support was still structured around an image-only payload shape. This makes it hard to extend the existing precomputed multimodal path to additional modalities such as video and audio without adding more image-specific fields and special cases.

The gateway also needed a cleaner foundation for passing preprocessed multimodal encoder inputs, model-specific metadata, placeholder ranges, and modality information to TokenSpeed.

Solution

This PR generalizes the TokenSpeed multimodal ABI from a single image payload into itemized multimodal inputs. Each multimodal item now carries its own modality, encoder input tensor, model-specific side tensors, placeholder metadata, placeholder token id, and content hash.

This keeps the current precomputed multimodal design, but removes the image-only payload assumption so later PRs can add video and transport optimizations on top of the same structure.
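
As a rough illustration of the itemized shape described above, the following Python sketch models one item per media input. The field names follow the per-item fields listed in this description, but the concrete types, the `Modality` values, the example token id, and the SHA-256 hashing are assumptions made for illustration, not the actual protobuf definitions.

```python
import hashlib
from dataclasses import dataclass, field
from enum import Enum


class Modality(Enum):
    # Assumed values; the real protobuf enum numbering may differ.
    IMAGE = 1
    VIDEO = 2
    AUDIO = 3


@dataclass
class MultimodalItem:
    modality: Modality
    encoder_input: bytes                  # serialized encoder input tensor
    placeholder_token_id: int
    placeholders: list[tuple[int, int]]   # assumed (offset, length) ranges in the prompt
    model_specific_tensors: dict[str, list[int]] = field(default_factory=dict)
    content_hash: str = ""

    def __post_init__(self) -> None:
        # Assumption: the content hash is derived from the item's own bytes.
        if not self.content_hash:
            self.content_hash = hashlib.sha256(self.encoder_input).hexdigest()


@dataclass
class MultimodalInputs:
    items: list[MultimodalItem] = field(default_factory=list)


# Two images, each carried as its own item (9999 is a hypothetical token id).
inputs = MultimodalInputs(items=[
    MultimodalItem(Modality.IMAGE, b"\x00\x01", 9999, [(4, 64)]),
    MultimodalItem(Modality.IMAGE, b"\x02\x03", 9999, [(70, 64)]),
])
```

Because every item carries its own modality and placeholder metadata, adding video or audio later would only require new `Modality` values rather than new top-level fields.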

Changes

  • Generalized TokenSpeed MultimodalInputs to contain repeated MultimodalItems.

  • Added modality metadata for image, video, and audio.

  • Replaced image-specific TokenSpeed fields with per-item:

    • encoder_input
    • model_specific_tensors
    • placeholders
    • placeholder_token_id
    • content_hash
  • Updated SMG TokenSpeed proto conversion to emit itemized multimodal payloads.

  • Updated the TokenSpeed Python servicer to consume the itemized multimodal payload.

  • Preserved existing image behavior on top of the new generalized payload structure.

  • Added model/tokenizer config loading support needed for multimodal preprocessor metadata.

  • Avoided warning spam when processor_config.json exists but does not contain video_processor.

  • Added regression coverage for video_processor config loading behavior.

Test Plan

  • Verified the stacked branch diff against the original implementation branch.

  • Verified whitespace/diff hygiene:

    git diff --check

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg (https://slack.lightseek.org) to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • Full video support alongside images: fetching, decoding, frame sampling, and preprocessing.
    • Modality-aware multimodal requests and placeholder handling (image vs video).
    • Itemized encoder-input payloads with flexible tensor transport (inline, shared memory, remote).
    • Model info now reports multimodal capability, supported modalities, and encoder dtypes.
    • Deterministic media content hashing and improved preprocessor config loading.
  • Bug Fixes

    • Rejects mixed image+video requests; stronger multimodal payload validation and decoding.

coderabbitai Bot commented Jun 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds itemized, typed TokenSpeed multimodal protobufs; first-class video ingestion/ffmpeg decoding; refactors vision preprocessing to encoder-inputs; makes registry and placeholder logic modality-aware; implements gateway TokenSpeed itemized assembly and servicer itemized decoding; updates tests and manifests.

Changes

Protobuf contracts and multimodal types

Layer: Protobuf schema and multimodal types
File(s): crates/grpc_client/proto/tokenspeed_scheduler.proto, crates/multimodal/src/types.rs
Summary: TensorData becomes a typed tensor descriptor with inline/shm/remote payloads; MultimodalInputs is itemized with MultimodalItem and Modality; GetModelInfoResponse adds multimodal capability fields; VideoClip, VideoSource, and video content-part variants added.

Vision preprocessing interface refactor

Layer: Vision processor and output contract
File(s): crates/multimodal/src/vision/processor.rs, crates/multimodal/src/vision/mod.rs, crates/multimodal/src/lib.rs
Summary: Renamed ImagePreProcessor to VisionPreProcessor and PreprocessedImages to PreprocessedEncoderInputs (fields: encoder_input, feature_token_counts, item_sizes); registry renamed to VisionProcessorRegistry; added default preprocess_video.

All vision processors updated

Layer: Processors and tests
File(s): crates/multimodal/src/vision/processors/*, crates/multimodal/tests/vision_golden_tests.rs
Summary: Processors (Kimi, LLaVA, Llama4, Phi3/Phi4, Pixtral, Qwen2/Qwen3, etc.) updated to VisionPreProcessor and to return PreprocessedEncoderInputs; unit tests adapted to the new contract, and video tests added.

Video ingestion and decoding

Layer: MediaConnector video fetch and decode
File(s): crates/multimodal/src/media.rs, crates/multimodal/src/hasher.rs, crates/multimodal/src/error.rs, crates/multimodal/Cargo.toml
Summary: Added MediaConnector::fetch_video, decode_video using ffmpeg/ffprobe with PNG stream splitting, VideoFetchConfig, hash_video, and MediaConnectorError::VideoDecode; moved tempfile to normal deps.

Tracker and multimodal extraction

Layer: Tracker enqueue and types
File(s): crates/multimodal/src/tracker.rs, crates/multimodal/src/types.rs
Summary: Tracker supports VideoUrl/VideoData and enqueues video fetch tasks; VideoClip is a concrete tracked-media variant; MediaContentPart extended for video.

Registry modality-aware APIs and Qwen video handling

Layer: Registry trait and Qwen-specific logic
File(s): crates/multimodal/src/registry/traits.rs, crates/multimodal/src/registry/qwen3_vl.rs
Summary: ModelProcessorSpec gains modality-gated defaults (placeholder_token_for, placeholder_token_id_for, prompt_replacements_for); UnsupportedModality error added; Qwen3 VL adds video pad/token handling including temporal-grid splitting.

Gateway multimodal orchestration and TokenSpeed assembly

Layer: Gateway multimodal flow and TokenSpeed proto wrapper
File(s): model_gateway/src/routers/grpc/multimodal.rs, model_gateway/src/routers/grpc/proto_wrapper.rs
Summary: Adds video_preprocessor_config, changes the intermediate to PrecomputedMultimodalIntermediate with modality and placeholder_token_id, selects a single modality, rejects mixed image+video, assembles itemized TokenSpeed multimodal items, serializes encoder inputs into the requested dtype (float16/bfloat16 handling), computes per-item content_hash, and emits tokenspeed::MultimodalInputs.items. Tests added.

Servicer decode, GetModelInfo, and CI

Layer: Servicer reconstruction and CI check
File(s): grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py, .github/workflows/pr-test-rust.yml
Summary: Servicer adds _mm_inputs_from_itemized_proto and _tensor_payload_bytes helpers, validates tensor payload shapes and dtypes, conditionally includes multimodal fields in GetModelInfo, and CI now asserts TensorData descriptor layout.

Request-stage wiring and utils

Layer: Message/chat stages and utils
File(s): model_gateway/src/routers/grpc/regular/stages/*, model_gateway/src/routers/grpc/utils/chat_utils.rs
Summary: ChatPreparation extracts modalities and rejects mixed requests; message/request stages pass modality and workers into assembly and map assembly errors to bad_request; transform_content_field treats video_url like image_url for placeholder replacement; tests added.

Supporting updates

Layer: Imports, manifests, and tokenizer preload
File(s): model_gateway/Cargo.toml, model_gateway/src/workflow/tokenizer_registration.rs, crates/multimodal/src/lib.rs, crates/multimodal/benches/*
Summary: Added ndarray to model_gateway deps; tokenizer preload loads video_preprocessor_config; lib re-exports video types; minor import/bench formatting changes.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • slin1237
  • key4ng
  • XinyueZhang369

"I hopped through frames, a whiskered guide,
I split the PNG stream and kept each stride,
I hashed the bytes and carried clips,
Assembled items, packed the ships.
A rabbit's nod — multimodal, wide."

@yechank-nvidia yechank-nvidia marked this pull request as ready for review June 5, 2026 02:41
@github-actions github-actions Bot added the dependencies, grpc, multimodal, and model-gateway labels Jun 5, 2026

mergify Bot commented Jun 5, 2026

Contributor

Hi @yechank-nvidia, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

    # Sign off the last N commits (replace N with the number of unsigned commits)
    git rebase HEAD~N --signoff
    git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window

@yechank-nvidia yechank-nvidia force-pushed the yechan/mm-abi-generalization branch from 40c0739 to 70a013f June 5, 2026 02:43

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces video processing support to the multimodal pipeline, specifically targeting Qwen3-VL models. It updates the gRPC protobuf definitions to support itemized multimodal inputs, implements video fetching and decoding using FFmpeg, and adds video preprocessing, smart resizing, and patchification. The model gateway and Python gRPC servicer are also enhanced to route and validate video payloads. The review feedback identifies a potential division-by-zero bug in smart_resize_video when num_frames is zero, an integer overflow vulnerability during PNG stream splitting, performance overhead from uncached environment variable lookups on the hot path, and a recommendation to use == instead of is for Python enum comparisons.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread crates/multimodal/src/vision/processors/qwen_vl_base.rs
Comment thread crates/multimodal/src/media.rs Outdated
Comment thread model_gateway/src/routers/grpc/multimodal.rs
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40c0739860

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/grpc_client/proto/tokenspeed_scheduler.proto
@yechank-nvidia yechank-nvidia changed the title Yechan/mm abi generalization Multimodal ABI generalization Jun 5, 2026
@yechank-nvidia yechank-nvidia changed the title Multimodal ABI generalization refactor(multimodal) Multimodal ABI generalization Jun 5, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs (1)

98-108: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Map placeholder-resolution failures to client error in multimodal flow.

For unsupported modality/model combinations, this path now returns internal_error (5xx). These are request-level failures and should be returned as bad_request (4xx), consistent with the rest of multimodal preparation failures.

💡 Suggested change
-                    error::internal_error(
-                        "multimodal_placeholder_resolution_failed",
-                        format!("Failed to resolve multimodal placeholder token: {e}"),
-                    )
+                    error::bad_request(
+                        "multimodal_not_supported",
+                        format!("Failed to resolve multimodal placeholder token: {e}"),
+                    )

Based on learnings: multimodal preparation failures are currently intentionally surfaced as 400 due to generic multimodal error typing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs` around
lines 98 - 108, The map_err branch in ChatPreparationStage::execute currently
converts placeholder-resolution failures into an internal_error (5xx); change
this to return a bad_request (4xx) so unsupported modality/model combos are
surfaced as request errors: in the map_err closure that logs "Failed to resolve
multimodal placeholder token" (inside ChatPreparationStage::execute), replace
error::internal_error(...) with error::bad_request(...) (keeping the same error
key/message formatting) so multimodal placeholder resolution failures are
returned as 400-level client errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/multimodal/src/media.rs`:
- Around line 373-387: The ffmpeg/ffprobe Command::output() calls can hang;
update the calls inside decode_video_with_ffmpeg, probe_video_duration_seconds,
and probe_video_duration_seconds_with_ffmpeg to spawn the child and wait with a
timeout (e.g., tokio::time::timeout) and if the timeout elapses kill the child
process and return a timed-out error; specifically replace the
Command::new(...).output() usage with child =
Command::new(...).stdout(Stdio::piped()).spawn(), await the child's output with
a timeout, on timeout call child.kill().await (or child.kill() and then await)
and then collect stdout/stderr into the same Output-like result or error path so
callers handle failures consistently.

In `@crates/multimodal/src/vision/processors/qwen_vl_base.rs`:
- Around line 235-245: resized_pixels is computed using the padded frame count
(t_bar) but beta is computed with the unpadded num_frames, causing inconsistent
scaling; update both beta calculations in the branches that set h_bar/w_bar to
use the padded frame count (t_bar) instead of num_frames so the pixel budget
math matches resized_pixels (affecting the blocks that compute beta for
max_pixels and min_pixels and the subsequent h_bar/w_bar adjustments).
- Around line 600-613: preprocess_video() is emitting the extra key
"patches_per_video" while Qwen3VLVisionSpec::field_layouts() expects the
flat-field name "patches_per_image", causing a layout mismatch; update the
PreprocessedImages builder in
PreprocessedImages::new_dynamic(...).with_extra(...) to use the same key
"patches_per_image" (or alternatively change Qwen3VLVisionSpec::field_layouts()
to reference "patches_per_video") so the
flat(FieldLayout::flat("patches_per_image")) contract is satisfied; ensure any
downstream consumers that read the extra use the chosen canonical name.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 374-383: GetModelInfo is advertising AUDIO while
_mm_inputs_from_itemized_proto does not carry an audio placeholder token
through, causing audio placeholder replacement to break; either stop
advertising/accepting AUDIO or thread an audio placeholder field into
MultimodalInputs. Fix by removing AUDIO from supported_modalities (the list
built from image_modality/video_modality/audio_modality) and any code paths that
accept it in _modality_from_proto/GetModelInfo until
_mm_inputs_from_itemized_proto is extended, or alternatively add plumbing:
extend _mm_inputs_from_itemized_proto to accept a new
audio_token_id/placeholder_token_id parameter and propagate that into the
MultimodalInputs construction and any call sites so audio placeholder_token_id
survives reconstruction. Ensure you update places that build
supported_modalities (the code block using supports_vision and hf_config) and
the functions _modality_from_proto and _mm_inputs_from_itemized_proto
consistently.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 681-686: The code clones every frame into a new Vec (frames)
before calling preprocess_video, causing unnecessary CPU/memory overhead;
instead, remove the intermediate clone and pass a slice of the existing frames
directly to processor.preprocess_video (use the existing videos_for_preprocess
-> video variable and pass &video.frames or &video.frames[..] as the first
argument), keeping pp_config as before so no data copy occurs.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 179-180: Add a regression test that verifies the template
substitution inserts the video placeholder string correctly when the template
input is formatted as a string: exercise the code path that maps "video_url" =>
image_placeholder.map(String::from) (the template substitution logic in
chat_utils.rs), supply a template containing the video_url placeholder and a
non-empty image_placeholder, then assert the rendered output contains the
expected video placeholder string (and that it matches the behavior for
"image_url"); place this test alongside the other unit tests for chat_utils
template handling.

---

Outside diff comments:
In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs`:
- Around line 98-108: The map_err branch in ChatPreparationStage::execute
currently converts placeholder-resolution failures into an internal_error (5xx);
change this to return a bad_request (4xx) so unsupported modality/model combos
are surfaced as request errors: in the map_err closure that logs "Failed to
resolve multimodal placeholder token" (inside ChatPreparationStage::execute),
replace error::internal_error(...) with error::bad_request(...) (keeping the
same error key/message formatting) so multimodal placeholder resolution failures
are returned as 400-level client errors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 786f49fa-f73d-4c2d-87a6-c808ddd70b91

📥 Commits

Reviewing files that changed from the base of the PR and between f21917b and 70a013f.

📒 Files selected for processing (24)
  • crates/grpc_client/proto/tokenspeed_scheduler.proto
  • crates/multimodal/Cargo.toml
  • crates/multimodal/src/error.rs
  • crates/multimodal/src/hasher.rs
  • crates/multimodal/src/hub.rs
  • crates/multimodal/src/lib.rs
  • crates/multimodal/src/media.rs
  • crates/multimodal/src/registry/qwen3_vl.rs
  • crates/multimodal/src/registry/traits.rs
  • crates/multimodal/src/tracker.rs
  • crates/multimodal/src/types.rs
  • crates/multimodal/src/vision/image_processor.rs
  • crates/multimodal/src/vision/preprocessor_config.rs
  • crates/multimodal/src/vision/processors/qwen3_vl.rs
  • crates/multimodal/src/vision/processors/qwen_vl_base.rs
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs
  • model_gateway/src/routers/grpc/utils/chat_utils.rs
  • model_gateway/src/workflow/tokenizer_registration.rs

Comment thread crates/multimodal/src/media.rs Outdated
Comment thread crates/multimodal/src/vision/processors/qwen_vl_base.rs
Comment thread crates/multimodal/src/vision/processors/qwen_vl_base.rs Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated
Comment thread model_gateway/src/routers/grpc/multimodal.rs Outdated
Comment thread model_gateway/src/routers/grpc/utils/chat_utils.rs
tokens.push(vision_end);
tokens.extend(Self::encode_plain_text(
metadata,
&format!("<{grid_idx}.0 seconds>"),

Contributor

These timestamps do not seem to match HF Qwen3-VL. HF emits real seconds (frame_idx / fps, averaged per temporal patch, formatted with :.1f), whereas we emit the integer grid_idx. HF also emits a timestamp before every group, including frame 0, whereas we skip the first one (grid_idx > 0).
For example, at the default 2 fps with uniform sampling, HF gives 0.2/1.2/2.2/..., but here we get (none)/1.0/2.0/....
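
The difference can be illustrated with a small sketch. It assumes the sampled frames are identified by their index in the source video, that the source frame rate is known, and that one temporal patch covers two consecutive sampled frames. The function name and signature are hypothetical, and this is not the HF implementation itself.

```python
def temporal_patch_timestamps(frame_indices, source_fps, temporal_patch_size=2):
    """Average the real timestamps (in seconds) of each temporal patch."""
    seconds = [idx / source_fps for idx in frame_indices]
    averaged = []
    for start in range(0, len(seconds), temporal_patch_size):
        group = seconds[start:start + temporal_patch_size]
        averaged.append(sum(group) / len(group))
    return averaged


# Assumed input: a 30 fps source sampled uniformly at 2 fps (every 15th frame).
frame_indices = [0, 15, 30, 45, 60, 75]
timestamps = temporal_patch_timestamps(frame_indices, source_fps=30.0)

# One label per temporal patch, including the first one.
labels = [f"<{t:.1f} seconds>" for t in timestamps]
```

With these assumed inputs the averaged timestamps are 0.25, 1.25, and 2.25 seconds, which format to the 0.2/1.2/2.2 pattern mentioned above, and a label is produced for the first group as well.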

Collaborator Author

Thx. Addressed

/// Implementations that support video should emit the same primary
/// `pixel_values` tensor shape used by the image path, plus video-specific
/// model metadata such as `video_grid_thw`.
fn preprocess_video(

Contributor

Now that this trait preprocesses video too, the image-centric names are misleading: ImagePreProcessor, the PreprocessedImages return type, and its num_img_tokens / image_sizes fields.
Also the file/module name image_processor.rs itself.

Collaborator Author

Changed into more general format.

placeholder_token_id: self.placeholder_token_id,
};

tokenspeed::MultimodalInputs { items: vec![item] }

Contributor

MultimodalInputs is repeated MultimodalItem, but the producer always emits exactly one item (all images/videos are folded into a single item, and mixed modality is rejected), so the repeated field isn't really used. It would be better to emit one MultimodalItem per image / per video.
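
A minimal sketch of the per-media split suggested here, assuming the folded encoder input is a list of patch rows and that the number of rows belonging to each image or video is known. The function name, the dictionary layout, and the token id are hypothetical stand-ins for the real producer code.

```python
def split_into_items(encoder_rows, rows_per_media, modality, placeholder_token_id):
    """Split a folded batch of encoder rows into one item per image or video."""
    if sum(rows_per_media) != len(encoder_rows):
        raise ValueError("rows_per_media does not cover encoder_rows exactly")
    items, start = [], 0
    for count in rows_per_media:
        items.append({
            "modality": modality,
            "encoder_input": encoder_rows[start:start + count],
            "placeholder_token_id": placeholder_token_id,
        })
        start += count
    return items


# Two images folded into five patch rows: 2 rows for the first, 3 for the second.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5]]
items = split_into_items(rows, [2, 3], "image", placeholder_token_id=9999)
```

Emitting one item per media input in this way would let the repeated field carry real per-item boundaries instead of a single folded item.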

Collaborator Author

Thx. Addressed

@yechank-nvidia yechank-nvidia changed the title refactor(multimodal) Multimodal ABI generalization refactor(multimodal): Multimodal ABI generalization Jun 9, 2026
@github-actions github-actions Bot added the ci and tests labels Jun 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 732f59bf83

Comment thread crates/grpc_client/proto/tokenspeed_scheduler.proto

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
crates/multimodal/src/media.rs (1)

572-587: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

PNG chunk length overflow risk on 32-bit targets.

On 32-bit and 64-bit targets a u32 chunk length always fits in usize, so the cast itself is lossless there. The risk on 32-bit platforms is that the subsequent cursor arithmetic (cursor + 12 + len) overflows usize for a large len and then behaves unexpectedly. This is unlikely in practice.

Consider adding an explicit check or using usize::try_from if 32-bit support is intended.
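
The bounds checks involved can be sketched in Python as follows. This is an illustrative splitter for a stream of concatenated PNG images, not the code in media.rs; the function and helper names are hypothetical, and CRCs are not verified. Python integers cannot overflow, so only the length validation carries over directly; a Rust version would also need the checked arithmetic mentioned above.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def split_png_stream(data: bytes) -> list[bytes]:
    """Split concatenated PNG images, validating every chunk length."""
    frames = []
    cursor = 0
    while cursor < len(data):
        start = cursor
        if data[cursor:cursor + 8] != PNG_SIGNATURE:
            raise ValueError(f"missing PNG signature at offset {cursor}")
        cursor += 8
        while True:
            # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
            if cursor + 8 > len(data):
                raise ValueError("truncated PNG chunk header")
            (length,) = struct.unpack(">I", data[cursor:cursor + 4])
            chunk_type = data[cursor + 4:cursor + 8]
            end = cursor + 12 + length
            if end > len(data):
                raise ValueError("PNG chunk length exceeds remaining bytes")
            cursor = end
            if chunk_type == b"IEND":
                break
        frames.append(data[start:cursor])
    return frames


def _chunk(chunk_type: bytes, payload: bytes = b"") -> bytes:
    # The CRC is zeroed because this sketch does not verify checksums.
    return struct.pack(">I", len(payload)) + chunk_type + payload + b"\x00" * 4


frame = (PNG_SIGNATURE + _chunk(b"IHDR", b"\x00" * 13)
         + _chunk(b"IDAT", b"abc") + _chunk(b"IEND"))
frames = split_png_stream(frame + frame)
```

A chunk that declares more bytes than remain in the buffer is rejected before the cursor advances, which is the behavior the review asks for.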

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/multimodal/src/media.rs` around lines 572 - 587, The cast from u32 to
usize is unsafe on 32-bit targets; replace the direct cast in the PNG parsing
block (the line using len_bytes -> len and subsequent cursor arithmetic) with a
fallible conversion (e.g., usize::try_from(u32::from_be_bytes(len_bytes))) and
return a MediaConnectorError::VideoDecode if the conversion fails or if len
exceeds the available remaining bytes; also use checked arithmetic
(checked_add/checked_sub or validate cursor + 12 + len won't overflow) before
advancing cursor to avoid overflow when computing cursor += 12 + len.
crates/multimodal/src/registry/qwen3_vl.rs (1)

66-74: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Accept video_grid_thw as UintTensor (not only IntTensor).

Line 66 currently reads only IntTensor, but the pipeline also uses UintTensor for video_grid_thw (see model_gateway/src/routers/grpc/multimodal.rs Line 1854-1957). That makes video_grid_t return None and Line 259 falls back to pad-only tokens, skipping the Qwen3.5 split path.

Proposed fix
     fn video_grid_t(preprocessed: &PreprocessedEncoderInputs) -> Option<usize> {
         match preprocessed.model_specific.get("video_grid_thw") {
             Some(ModelSpecificValue::IntTensor { data, shape })
-                if shape == &[1, 3] && !data.is_empty() =>
+                if shape.len() == 2 && shape[1] == 3 && !data.is_empty() =>
             {
                 usize::try_from(data[0]).ok()
             }
+            Some(ModelSpecificValue::UintTensor { data, shape })
+                if shape.len() == 2 && shape[1] == 3 && !data.is_empty() =>
+            {
+                Some(data[0] as usize)
+            }
             _ => None,
         }
     }

Also applies to: 244-259

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/multimodal/src/registry/qwen3_vl.rs` around lines 66 - 74, The pattern
match in video_grid_t currently accepts only ModelSpecificValue::IntTensor,
which causes video_grid_thw encoded as UintTensor to be ignored; update the
match in fn video_grid_t (and the similar match around the other occurrence at
the 244-259 region) to also accept ModelSpecificValue::UintTensor, extract the
first element from either IntTensor or UintTensor, and convert it to usize via
usize::try_from (or equivalent) so both signed and unsigned tensor
representations of video_grid_thw are handled consistently.
model_gateway/src/routers/grpc/multimodal.rs (1)

557-603: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject partial multimodal fetches before prompt expansion.

This path only rejects the all-failed case. If a request contains multiple media parts and one fetch/decode drops out while another survives, preprocessing continues with the surviving items only. prompt_replacements will then cover fewer items than the prompt still contains, leaving unresolved placeholder tokens in expanded_token_ids and sending a mismatched prompt/mm payload downstream. Cache the requested per-modality counts up front and fail when the fetched count is smaller.

💡 Suggested change
+    let requested_images = content_parts
+        .iter()
+        .filter(|part| matches!(part, MediaContentPart::ImageUrl { .. }))
+        .count();
+    let requested_videos = content_parts
+        .iter()
+        .filter(|part| matches!(part, MediaContentPart::VideoUrl { .. }))
+        .count();
+
     let mut tracker = AsyncMultiModalTracker::new(components.media_connector.clone());
@@
     let modality = match (images.is_empty(), videos.is_empty()) {
         (false, true) => Modality::Image,
         (true, false) => Modality::Video,
@@
         }
     };
+
+    match modality {
+        Modality::Image if images.len() != requested_images => {
+            anyhow::bail!("Failed to fetch every requested image for multimodal request");
+        }
+        Modality::Video if videos.len() != requested_videos => {
+            anyhow::bail!("Failed to fetch every requested video for multimodal request");
+        }
+        _ => {}
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model_gateway/src/routers/grpc/multimodal.rs` around lines 557 - 603, Record
the requested per-modality counts before calling tracker.finalize() (e.g.,
capture the expected number of image and video parts from the incoming
multimodal request or tracker initialization), then after building
`tracker_output` and extracting `images` and `videos` compare the fetched
lengths against those cached expected counts (using the `images.len()` and
`videos.len()` vs the stored expected image/video counts); if any fetched count
is smaller than requested, return an error (similar to the existing errors)
instead of continuing to prompt expansion so you never proceed with partial
multimodal fetches that would leave `prompt_replacements`/`expanded_token_ids`
mismatched. Ensure you reference `tracker.finalize()`, `tracker_output`,
`images`, `videos`, `prompt_replacements`, and `expanded_token_ids` when adding
this check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/multimodal/src/registry/llava.rs`:
- Around line 61-71: The code currently uses the first value from
preprocessed.feature_token_counts and replicates it for all images via
PromptReplacement::repeated, relying on debug_assert! to catch mismatches;
change this to a real runtime-safe implementation by either (a) mapping over
preprocessed.feature_token_counts and creating a
PromptReplacement::repeated(Modality::Image, &token, token_id, count) for each
count so each image uses its actual token count, or (b) if divergent counts are
invalid, replace the debug_assert! with a runtime check that returns an Err with
a clear message when any count differs; update the block around
preprocessed.feature_token_counts, PromptReplacement::repeated, token and
token_id accordingly.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 800-804: In _mm_inputs_from_itemized_proto(), the loop that
collapses per-item placeholder_token_id into im_token_id/video_token_id should
validate consistency instead of letting the last value win; when you see
item_proto.placeholder_token_id, if modality is Modality.IMAGE compare
int(item_proto.placeholder_token_id) against any existing im_token_id and if
im_token_id is unset assign it, otherwise if it differs raise an error (e.g.,
ValueError or appropriate RPC error); do the same for Modality.VIDEO with
video_token_id so conflicting placeholder IDs are rejected rather than silently
overwritten.

---

Outside diff comments:
In `@crates/multimodal/src/media.rs`:
- Around line 572-587: The cast from u32 to usize is unsafe on 32-bit targets;
replace the direct cast in the PNG parsing block (the line using len_bytes ->
len and subsequent cursor arithmetic) with a fallible conversion (e.g.,
usize::try_from(u32::from_be_bytes(len_bytes))) and return a
MediaConnectorError::VideoDecode if the conversion fails or if len exceeds the
available remaining bytes; also use checked arithmetic (checked_add/checked_sub
or validate cursor + 12 + len won't overflow) before advancing cursor to avoid
overflow when computing cursor += 12 + len.

In `@crates/multimodal/src/registry/qwen3_vl.rs`:
- Around line 66-74: The pattern match in video_grid_t currently accepts only
ModelSpecificValue::IntTensor, which causes video_grid_thw encoded as UintTensor
to be ignored; update the match in fn video_grid_t (and the similar match around
the other occurrence at the 244-259 region) to also accept
ModelSpecificValue::UintTensor, extract the first element from either IntTensor
or UintTensor, and convert it to usize via usize::try_from (or equivalent) so
both signed and unsigned tensor representations of video_grid_thw are handled
consistently.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 557-603: Record the requested per-modality counts before calling
tracker.finalize() (e.g., capture the expected number of image and video parts
from the incoming multimodal request or tracker initialization), then after
building `tracker_output` and extracting `images` and `videos` compare the
fetched lengths against those cached expected counts (using the `images.len()`
and `videos.len()` vs the stored expected image/video counts); if any fetched
count is smaller than requested, return an error (similar to the existing
errors) instead of continuing to prompt expansion so you never proceed with
partial multimodal fetches that would leave
`prompt_replacements`/`expanded_token_ids` mismatched. Ensure you reference
`tracker.finalize()`, `tracker_output`, `images`, `videos`,
`prompt_replacements`, and `expanded_token_ids` when adding this check.
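
The bounds-checking advice in the `media.rs` finding above (fallible `u32` to `usize` conversion plus checked cursor arithmetic) could look roughly like the sketch below. This is illustrative only: `DecodeError` and `skip_png_chunk` are hypothetical names standing in for `MediaConnectorError::VideoDecode` and the real parsing block, whose actual code is not shown in this thread.

```rust
use std::convert::TryFrom;

/// Hypothetical error type standing in for `MediaConnectorError::VideoDecode`.
#[derive(Debug, PartialEq)]
pub struct DecodeError(pub String);

/// Advance `cursor` past one PNG chunk: 4-byte big-endian length, 4-byte type,
/// `len` bytes of data, 4-byte CRC. Returns the cursor of the next chunk, or an
/// error if the chunk is truncated or the arithmetic would overflow.
pub fn skip_png_chunk(data: &[u8], cursor: usize) -> Result<usize, DecodeError> {
    let header_end = cursor
        .checked_add(4)
        .ok_or_else(|| DecodeError("cursor overflow".to_string()))?;
    // `get` returns None instead of panicking when the range is out of bounds.
    let len_slice = data
        .get(cursor..header_end)
        .ok_or_else(|| DecodeError("truncated chunk length".to_string()))?;
    let len_bytes = <[u8; 4]>::try_from(len_slice)
        .map_err(|_| DecodeError("truncated chunk length".to_string()))?;

    // Fallible conversion: a u32 may not fit in usize on narrow targets.
    let len = usize::try_from(u32::from_be_bytes(len_bytes))
        .map_err(|_| DecodeError("chunk length does not fit in usize".to_string()))?;

    // 12 = 4 (length) + 4 (type) + 4 (CRC); every addition is checked.
    let next = len
        .checked_add(12)
        .and_then(|chunk| cursor.checked_add(chunk))
        .ok_or_else(|| DecodeError("chunk size overflow".to_string()))?;
    if next > data.len() {
        return Err(DecodeError("chunk extends past end of buffer".to_string()));
    }
    Ok(next)
}
```

Returning the next cursor instead of mutating one in place keeps every failure path free of partially advanced state.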

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 695ee3ec-6c35-4e27-a461-1f43faa3f2c7

📥 Commits

Reviewing files that changed from the base of the PR and between 70a013f and 732f59b.

📒 Files selected for processing (34)
  • .github/workflows/pr-test-rust.yml
  • crates/grpc_client/proto/tokenspeed_scheduler.proto
  • crates/grpc_client/src/tokenspeed_scheduler.rs
  • crates/multimodal/benches/image_preprocess.rs
  • crates/multimodal/src/lib.rs
  • crates/multimodal/src/media.rs
  • crates/multimodal/src/registry/kimi_k25.rs
  • crates/multimodal/src/registry/llama4.rs
  • crates/multimodal/src/registry/llava.rs
  • crates/multimodal/src/registry/mod.rs
  • crates/multimodal/src/registry/phi3_v.rs
  • crates/multimodal/src/registry/qwen3_vl.rs
  • crates/multimodal/src/registry/qwen_vl.rs
  • crates/multimodal/src/registry/traits.rs
  • crates/multimodal/src/types.rs
  • crates/multimodal/src/vision/mod.rs
  • crates/multimodal/src/vision/preprocessor_config.rs
  • crates/multimodal/src/vision/processor.rs
  • crates/multimodal/src/vision/processors/kimi_k25.rs
  • crates/multimodal/src/vision/processors/llama4_vision.rs
  • crates/multimodal/src/vision/processors/llava.rs
  • crates/multimodal/src/vision/processors/mod.rs
  • crates/multimodal/src/vision/processors/phi3_vision.rs
  • crates/multimodal/src/vision/processors/phi4_vision.rs
  • crates/multimodal/src/vision/processors/pixtral.rs
  • crates/multimodal/src/vision/processors/qwen2_vl.rs
  • crates/multimodal/src/vision/processors/qwen3_vl.rs
  • crates/multimodal/src/vision/processors/qwen_vl_base.rs
  • crates/multimodal/tests/vision_golden_tests.rs
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • model_gateway/Cargo.toml
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/utils/chat_utils.rs

Comment thread crates/multimodal/src/registry/llava.rs Outdated
Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7864781423


Comment thread crates/multimodal/src/vision/processors/qwen_vl_base.rs
Ok(TokenSpeedMultimodalData { items })
}

fn tokenspeed_item_count(intermediate: &PrecomputedMultimodalIntermediate) -> Result<usize> {

Member

Is this specific to TokenSpeed, or is it universal to other engines?

Collaborator Author

The helper is currently only used by the TokenSpeed assembler because TokenSpeed is the backend that needs itemized multimodal payloads here, but the invariant itself is not TokenSpeed-specific. It validates that the precomputed media/token/placeholder counts agree before itemization.

I renamed it to precomputed_multimodal_item_count and made the error messages backend-neutral.
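
For readers following along, the invariant described in this reply (media, token, and placeholder counts must agree before itemization) might be sketched as follows. `PrecomputedCounts` is a hypothetical stand-in for `PrecomputedMultimodalIntermediate`, and the signature and error strings are assumptions; only the function name comes from the reply above.

```rust
/// Hypothetical stand-in for the per-request counts carried by the
/// precomputed multimodal intermediate.
pub struct PrecomputedCounts {
    pub media_items: usize,
    pub feature_token_counts: usize,
    pub placeholder_ranges: usize,
}

/// Return the number of items to emit, or a backend-neutral error when the
/// precomputed counts disagree.
pub fn precomputed_multimodal_item_count(c: &PrecomputedCounts) -> Result<usize, String> {
    if c.media_items == 0 {
        return Err("multimodal request contains no media items".to_string());
    }
    if c.feature_token_counts != c.media_items {
        return Err(format!(
            "feature token counts ({}) do not match media items ({})",
            c.feature_token_counts, c.media_items
        ));
    }
    if c.placeholder_ranges != c.media_items {
        return Err(format!(
            "placeholder ranges ({}) do not match media items ({})",
            c.placeholder_ranges, c.media_items
        ));
    }
    Ok(c.media_items)
}
```

Nothing in this check refers to a particular backend, which matches the point that the invariant is not TokenSpeed-specific.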

}
}

fn serialize_model_specific_for_item(

Member

Should this helper be in the multimodal crate instead?

Collaborator Author

Good point. The item slicing semantics are not gateway-specific, so I moved that part onto ModelSpecificValue in the multimodal crate via as_flat_sizes() and slice_first_dim().

The gateway now only performs transport serialization after slicing.
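
As an illustration of what slicing along the first dimension means for per-item payloads, here is a minimal sketch. It assumes a simplified `IntTensor` struct (flat row-major data plus a shape) in place of the real `ModelSpecificValue`, whose layout is not shown in this thread; the method name is borrowed from the reply above, but its real signature may differ, and `as_flat_sizes()` is not sketched because its semantics are not described here.

```rust
/// Hypothetical simplified tensor: row-major `data` with an explicit `shape`.
#[derive(Debug, Clone, PartialEq)]
pub struct IntTensor {
    pub data: Vec<i64>,
    pub shape: Vec<usize>,
}

impl IntTensor {
    /// Return rows `start..start + len` along the first dimension, or `None`
    /// if the tensor has no dimensions or the range is out of bounds.
    pub fn slice_first_dim(&self, start: usize, len: usize) -> Option<IntTensor> {
        let (&rows, rest) = self.shape.split_first()?;
        let end = start.checked_add(len)?;
        if end > rows {
            return None;
        }
        // Elements per row; the product of an empty shape tail is 1.
        let row_elems: usize = rest.iter().product();
        let lo = start.checked_mul(row_elems)?;
        let hi = end.checked_mul(row_elems)?;
        let data = self.data.get(lo..hi)?.to_vec();

        let mut shape = Vec::with_capacity(self.shape.len());
        shape.push(len);
        shape.extend_from_slice(rest);
        Some(IntTensor { data, shape })
    }
}
```

For example, a `[2, 3]` grid tensor holding one `(t, h, w)` row per image can be split into two `[1, 3]` tensors, one per multimodal item, before the gateway serializes each item for transport.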

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
@yechank-nvidia yechank-nvidia force-pushed the yechan/mm-abi-generalization branch from eded36c to 3ceda86 on June 11, 2026 05:18

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
crates/multimodal/src/vision/processors/qwen2_vl.rs (1)

209-214: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

preprocess() drops config-driven resize/token overrides.

This delegates straight to self.inner.preprocess(images, config), but QwenVLProcessorBase::preprocess() reads patch size / merge size / pixel limits from self.config. So Qwen2VLProcessor::new().preprocess(..., config_with_patch_or_pixel_overrides) still runs with the constructor defaults instead of the request config.

Suggested fix
     fn preprocess(
         &self,
         images: &[DynamicImage],
         config: &PreProcessorConfig,
     ) -> Result<PreprocessedEncoderInputs, TransformError> {
-        self.inner.preprocess(images, config)
+        let processor = if config.patch_size.is_some()
+            || config.merge_size.is_some()
+            || config.min_pixels.is_some()
+            || config.max_pixels.is_some()
+            || config.temporal_patch_size.is_some()
+        {
+            Self::from_preprocessor_config(config)
+        } else {
+            self.clone()
+        };
+
+        processor.inner.preprocess(images, config)
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/multimodal/src/vision/processors/qwen2_vl.rs` around lines 209 - 214,
preprocess currently delegates to self.inner.preprocess(images, config) but
QwenVLProcessorBase::preprocess reads from self.config so per-request overrides
in the passed config are ignored; fix by applying the incoming
PreProcessorConfig overrides onto the inner processor's config before calling
preprocess (e.g. clone or make a mutable copy of self.inner, copy/override
relevant fields such as patch_size, merge_size, max_pixels/min_pixels or any
pixel/resize/token overrides from the incoming config into inner.config), then
call that inner's preprocess (or adjust QwenVLProcessorBase::preprocess to honor
the passed config) so the request-level config takes effect.
crates/multimodal/src/vision/processors/phi4_vision.rs (1)

466-470: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor image_std-only overrides when selecting the processor.

This branch only rebuilds from PreProcessorConfig when dynamic_hd or image_mean is set. A request that overrides just image_std still takes self.clone(), so normalization uses the default std and produces the wrong encoder_input.

Suggested fix
-        let processor = if config.dynamic_hd.is_some() || config.image_mean.is_some() {
+        let processor = if config.dynamic_hd.is_some()
+            || config.image_mean.is_some()
+            || config.image_std.is_some()
+        {
             Self::from_preprocessor_config(config)
         } else {
             self.clone()
         };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/multimodal/src/vision/processors/phi4_vision.rs` around lines 466 -
470, The processor selection branch doesn't account for image_std-only
overrides, so when only config.image_std is set it wrongly uses self.clone();
update the conditional that builds processor (the let processor = ... block) to
also check config.image_std (i.e., rebuild via Self::from_preprocessor_config
when config.dynamic_hd.is_some() || config.image_mean.is_some() ||
config.image_std.is_some()) so normalization uses the overridden std instead of
the clone's defaults.
model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs (1)

79-111: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid hard-coding image modality in placeholder resolution.

resolve_placeholder_token is modality-specific, but Line 110 always passes Modality::Image. For video message requests, this can resolve the wrong placeholder token and break downstream token-expansion matching. Align this with the chat preparation flow by deriving request modality from message parts (and rejecting mixed modalities if still unsupported) before calling resolve_placeholder_token.
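
A minimal sketch of the suggested modality derivation, assuming a simplified `ContentPart` enum and a two-variant `Modality` (both hypothetical; the gateway's real message types are not shown in this comment):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Modality {
    Image,
    Video,
}

/// Hypothetical simplified message part.
pub enum ContentPart {
    Text(String),
    ImageUrl(String),
    VideoUrl(String),
}

/// Derive the single modality used by a request.
/// Returns `Ok(None)` for text-only input and an error for mixed modalities.
pub fn request_modality(parts: &[ContentPart]) -> Result<Option<Modality>, String> {
    let mut found: Option<Modality> = None;
    for part in parts {
        let modality = match part {
            ContentPart::Text(_) => continue,
            ContentPart::ImageUrl(_) => Modality::Image,
            ContentPart::VideoUrl(_) => Modality::Video,
        };
        match found {
            None => found = Some(modality),
            Some(prev) if prev != modality => {
                return Err(format!(
                    "mixed modalities are not supported: {:?} and {:?}",
                    prev, modality
                ));
            }
            Some(_) => {}
        }
    }
    Ok(found)
}
```

The resulting `Modality` would then be passed to `resolve_placeholder_token` in place of the hard-coded `Modality::Image`.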

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs` around
lines 79 - 111, The code currently always passes Modality::Image to
multimodal::resolve_placeholder_token in MessagePreparationStage::execute which
breaks non-image multimodal requests; update the logic to derive the modality
from the incoming request.messages (inspect each message part to determine
Modality enum values), ensure you reject mixed-modal requests if unsupported,
and pass the determined modality to multimodal::resolve_placeholder_token
instead of Modality::Image; use multimodal::has_multimodal_content_messages,
ctx.components.multimodal, and the tokenizer lookup
(tokenizer_registry/get_by_name) as-is and only change the modality selection
and validation before calling resolve_placeholder_token.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/multimodal/src/media.rs`:
- Around line 214-216: The code decodes base64 blobs and fully buffers them into
memory via BASE64_STANDARD.decode and then calls self.decode_video(...,
VideoSource::DataUrl) which can OOM on large inputs; before decoding or before
creating a VideoClip, enforce a maximum allowed payload size (e.g. reject or
truncate if decoded length > MAX_VIDEO_BYTES) by first checking the base64
length/estimate or streaming/limiting the decode, return an error on oversize
inputs, and document MAX_VIDEO_BYTES; apply the same bounding check to the other
call sites that create VideoClip (the calls into self.decode_video at the other
referenced locations) so no code path retains unbounded byte buffers.

In `@crates/multimodal/src/registry/qwen3_vl.rs`:
- Around line 365-400: The test qwen3_5_video_replacement_splits_temporal_grid
currently relies on TestTokenizer which returns empty IDs from
TestTokenizer::encode so timestamp tokens are never asserted; replace or extend
the tokenizer used in the test with a tokenizer test-double that maps timestamp
strings (e.g. "<0.0 seconds>", "<1.0 seconds>") to deterministic, unique token
IDs so encoding of timestamp text actually produces IDs, then update the
assertions on replacements (obtained via
ModelRegistry::lookup(...).prompt_replacements_for(...)) to check for those
timestamp token IDs at the expected temporal segment boundaries (use the same
test helper test_preprocessed_with_tokens and ModelSpecificValue::int_2d to
produce segments). Ensure the tokenizer double is referenced where
metadata.tokenizer is constructed so prompt_replacements_for sees the real
encoded timestamp IDs.

In `@crates/multimodal/src/vision/processors/llama4_vision.rs`:
- Around line 453-465: The code pushes image dimensions into item_sizes as
(height, width), which is inconsistent with other processors that use (width,
height); change the push in the loop that uses
processor.process_single_image(...) so item_sizes.push((image.width(),
image.height())) to match the ordering expected by PreprocessedEncoderInputs and
avoid flipping portrait/landscape metadata for Llama4Vision.

In `@crates/multimodal/src/vision/processors/qwen3_vl.rs`:
- Around line 124-131: from_preprocessor_config() maps
size.shortest_edge/longest_edge into min_pixels/max_pixels but preprocess() and
preprocess_video() still delegate to self.inner.*, so the new pixel limits are
never used; update the constructor/path so the inner processor uses the mapped
config — either initialize self.inner with the modified config (ensure
Qwen3VLProcessor::new or from_preprocessor_config creates the inner
QwenVLProcessorBase using the mapped min_pixels/max_pixels) or override
preprocess()/preprocess_video() in Qwen3VLProcessor to call the base
implementation that reads self.config (QwenVLProcessorBase) so the size-edge
fallback takes effect. Ensure the change is mirrored for the other occurrence
around lines 216-230.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 605-608: The code currently rejects multi-video requests in the
check using Modality::Video and by only preprocessing
videos_for_preprocess.first(), which prevents the new per-item contract from
working; remove or relax the single-video guard (the if checking modality ==
Modality::Video && videos.len() != 1) and change the preprocessing logic that
uses videos_for_preprocess.first() to iterate over all entries in
videos_for_preprocess (perform preprocessing per item and collect results),
ensuring downstream TokenSpeed assembly stays item-oriented and consumes the
per-video processed outputs; also update any error messages to reflect per-item
failures rather than a single-video-only error.

---

Outside diff comments:
In `@crates/multimodal/src/vision/processors/phi4_vision.rs`:
- Around line 466-470: The processor selection branch doesn't account for
image_std-only overrides, so when only config.image_std is set it wrongly uses
self.clone(); update the conditional that builds processor (the let processor =
... block) to also check config.image_std (i.e., rebuild via
Self::from_preprocessor_config when config.dynamic_hd.is_some() ||
config.image_mean.is_some() || config.image_std.is_some()) so normalization uses
the overridden std instead of the clone's defaults.

In `@crates/multimodal/src/vision/processors/qwen2_vl.rs`:
- Around line 209-214: preprocess currently delegates to
self.inner.preprocess(images, config) but QwenVLProcessorBase::preprocess reads
from self.config so per-request overrides in the passed config are ignored; fix
by applying the incoming PreProcessorConfig overrides onto the inner processor's
config before calling preprocess (e.g. clone or make a mutable copy of
self.inner, copy/override relevant fields such as patch_size, merge_size,
max_pixels/min_pixels or any pixel/resize/token overrides from the incoming
config into inner.config), then call that inner's preprocess (or adjust
QwenVLProcessorBase::preprocess to honor the passed config) so the request-level
config takes effect.

In `@model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs`:
- Around line 79-111: The code currently always passes Modality::Image to
multimodal::resolve_placeholder_token in MessagePreparationStage::execute which
breaks non-image multimodal requests; update the logic to derive the modality
from the incoming request.messages (inspect each message part to determine
Modality enum values), ensure you reject mixed-modal requests if unsupported,
and pass the determined modality to multimodal::resolve_placeholder_token
instead of Modality::Image; use multimodal::has_multimodal_content_messages,
ctx.components.multimodal, and the tokenizer lookup
(tokenizer_registry/get_by_name) as-is and only change the modality selection
and validation before calling resolve_placeholder_token.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a453fcdf-941a-409f-9be4-b2e6fe6f3b7f

📥 Commits

Reviewing files that changed from the base of the PR and between 7864781 and 3ceda86.

📒 Files selected for processing (44)
  • .github/workflows/pr-test-rust.yml
  • crates/grpc_client/proto/tokenspeed_scheduler.proto
  • crates/grpc_client/src/tokenspeed_scheduler.rs
  • crates/multimodal/Cargo.toml
  • crates/multimodal/benches/image_preprocess.rs
  • crates/multimodal/src/error.rs
  • crates/multimodal/src/hasher.rs
  • crates/multimodal/src/hub.rs
  • crates/multimodal/src/lib.rs
  • crates/multimodal/src/media.rs
  • crates/multimodal/src/registry/kimi_k25.rs
  • crates/multimodal/src/registry/llama4.rs
  • crates/multimodal/src/registry/llava.rs
  • crates/multimodal/src/registry/mod.rs
  • crates/multimodal/src/registry/phi3_v.rs
  • crates/multimodal/src/registry/qwen3_vl.rs
  • crates/multimodal/src/registry/qwen_vl.rs
  • crates/multimodal/src/registry/traits.rs
  • crates/multimodal/src/tracker.rs
  • crates/multimodal/src/types.rs
  • crates/multimodal/src/vision/mod.rs
  • crates/multimodal/src/vision/preprocessor_config.rs
  • crates/multimodal/src/vision/processor.rs
  • crates/multimodal/src/vision/processors/kimi_k25.rs
  • crates/multimodal/src/vision/processors/llama4_vision.rs
  • crates/multimodal/src/vision/processors/llava.rs
  • crates/multimodal/src/vision/processors/mod.rs
  • crates/multimodal/src/vision/processors/phi3_vision.rs
  • crates/multimodal/src/vision/processors/phi4_vision.rs
  • crates/multimodal/src/vision/processors/pixtral.rs
  • crates/multimodal/src/vision/processors/qwen2_vl.rs
  • crates/multimodal/src/vision/processors/qwen3_vl.rs
  • crates/multimodal/src/vision/processors/qwen_vl_base.rs
  • crates/multimodal/tests/vision_golden_tests.rs
  • grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
  • model_gateway/Cargo.toml
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs
  • model_gateway/src/routers/grpc/utils/chat_utils.rs
  • model_gateway/src/workflow/tokenizer_registration.rs

Comment thread crates/multimodal/src/media.rs
Comment thread crates/multimodal/src/registry/qwen3_vl.rs
Comment on lines 281 to 284
     /// Get the batch size.
     pub fn batch_size(&self) -> usize {
-        self.pixel_values.shape()[0]
+        self.encoder_input.shape()[0]
     }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

batch_size() no longer matches the generalized encoder-input contract.

After this refactor, some processors flatten encoder patches/tiles into axis 0 instead of keeping “one item per batch row”. For example, KimiK25Processor now emits [total_patches, 3, patch, patch] and Llama4VisionProcessor emits [total_tiles, 3, H, W], so batch_size() returns the patch/tile count there instead of the number of media items. Any generic caller using batch_size() will misalign metadata or placeholder expansion on those models.

Suggested fix
     /// Get the batch size.
     pub fn batch_size(&self) -> usize {
-        self.encoder_input.shape()[0]
+        self.item_sizes.len()
     }

Also applies to: 342-345
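
To make the mismatch concrete, the sketch below contrasts the leading tensor dimension with the number of media items. `EncoderInputs` is a hypothetical, stripped-down stand-in for `PreprocessedEncoderInputs` that stores only a shape and the per-item sizes.

```rust
/// Hypothetical stand-in for `PreprocessedEncoderInputs`.
pub struct EncoderInputs {
    /// Shape of the encoder input, e.g. `[total_patches, 3, patch, patch]`.
    pub encoder_input_shape: Vec<usize>,
    /// One size entry per media item.
    pub item_sizes: Vec<(u32, u32)>,
}

impl EncoderInputs {
    /// Size of axis 0; for flattened layouts this is a patch or tile count.
    pub fn leading_dim(&self) -> usize {
        self.encoder_input_shape.first().copied().unwrap_or(0)
    }

    /// Number of media items, independent of the tensor layout.
    pub fn item_count(&self) -> usize {
        self.item_sizes.len()
    }
}
```

With two images flattened into 96 patches, `leading_dim()` is 96 while `item_count()` is 2, which is why a generic caller that treats the leading dimension as the item count would misalign its metadata.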

Comment on lines 453 to +465
         let mut all_outputs = Vec::new();
         let mut all_aspect_ratios = Vec::new();
-        let mut image_sizes = Vec::new();
-        let mut num_img_tokens = Vec::new();
+        let mut item_sizes = Vec::new();
+        let mut feature_token_counts = Vec::new();

         for image in images {
             let (output, aspect_ratio) = processor.process_single_image(image);
             let tokens = processor.calculate_num_tokens_for_aspect_ratio(aspect_ratio);

             all_outputs.push(output);
             all_aspect_ratios.push(aspect_ratio);
-            image_sizes.push((image.height(), image.width()));
-            num_img_tokens.push(tokens);
+            item_sizes.push((image.height(), image.width()));
+            feature_token_counts.push(tokens);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep item_sizes in the same tuple order as the other processors.

This path pushes (height, width), but the other processors in this cohort populate item_sizes from DynamicImage::dimensions(), i.e. (width, height). Because PreprocessedEncoderInputs.item_sizes is shared across processors and carries no per-model ordering flag, this will flip portrait/landscape metadata for generic consumers on LLaMA 4 only.

Suggested fix
-            item_sizes.push((image.height(), image.width()));
+            item_sizes.push(image.dimensions());
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/multimodal/src/vision/processors/llama4_vision.rs` around lines 453 -
465, The code pushes image dimensions into item_sizes as (height, width), which
is inconsistent with other processors that use (width, height); change the push
in the loop that uses processor.process_single_image(...) so
item_sizes.push((image.width(), image.height())) to match the ordering expected
by PreprocessedEncoderInputs and avoid flipping portrait/landscape metadata for
Llama4Vision.

Comment on lines +124 to +131
min_pixels: config
.min_pixels
.or_else(|| config.get_shortest_edge())
.unwrap_or(DEFAULT_MIN_PIXELS),
max_pixels: config
.max_pixels
.or_else(|| config.get_longest_edge())
.unwrap_or(DEFAULT_MAX_PIXELS),


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The new size-edge fallback never reaches preprocess() / preprocess_video().

from_preprocessor_config() now maps size.shortest_edge / size.longest_edge into min_pixels / max_pixels, but both trait entrypoints still forward to self.inner.*. Since QwenVLProcessorBase computes image/video resize limits from self.config, Qwen3VLProcessor::new().preprocess_video(..., config_with_size) still uses the default budgets instead of the values you just loaded here.

Suggested fix
     fn preprocess(
         &self,
         images: &[DynamicImage],
         config: &PreProcessorConfig,
     ) -> Result<PreprocessedEncoderInputs, TransformError> {
-        self.inner.preprocess(images, config)
+        let processor = if config.patch_size.is_some()
+            || config.merge_size.is_some()
+            || config.min_pixels.is_some()
+            || config.max_pixels.is_some()
+            || config.temporal_patch_size.is_some()
+            || config.size.is_some()
+        {
+            Self::from_preprocessor_config(config)
+        } else {
+            self.clone()
+        };
+
+        processor.inner.preprocess(images, config)
     }

     fn preprocess_video(
         &self,
         frames: &[DynamicImage],
         config: &PreProcessorConfig,
     ) -> Result<PreprocessedEncoderInputs, TransformError> {
-        self.inner.preprocess_video(frames, config)
+        let processor = if config.patch_size.is_some()
+            || config.merge_size.is_some()
+            || config.min_pixels.is_some()
+            || config.max_pixels.is_some()
+            || config.temporal_patch_size.is_some()
+            || config.size.is_some()
+        {
+            Self::from_preprocessor_config(config)
+        } else {
+            self.clone()
+        };
+
+        processor.inner.preprocess_video(frames, config)
     }

Also applies to: 216-230

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/multimodal/src/vision/processors/qwen3_vl.rs` around lines 124 - 131,
from_preprocessor_config() maps size.shortest_edge/longest_edge into
min_pixels/max_pixels but preprocess() and preprocess_video() still delegate to
self.inner.*, so the new pixel limits are never used; update the
constructor/path so the inner processor uses the mapped config — either
initialize self.inner with the modified config (ensure Qwen3VLProcessor::new or
from_preprocessor_config creates the inner QwenVLProcessorBase using the mapped
min_pixels/max_pixels) or override preprocess()/preprocess_video() in
Qwen3VLProcessor to call the base implementation that reads self.config
(QwenVLProcessorBase) so the size-edge fallback takes effect. Ensure the change
is mirrored for the other occurrence around lines 216-230.

Comment on lines +605 to 608
+    if modality == Modality::Video && videos.len() != 1 {
         return Err(anyhow::anyhow!(
-            "No images were successfully fetched for multimodal request"
+            "Exactly one video is supported per request for the initial video path"
         ));


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The video path is still single-item despite the new itemized contract.

Lines 605-608 reject any request with more than one video, and Lines 684-689 only ever preprocess videos_for_preprocess.first(). That means multi-video prompts still fail here even though the downstream TokenSpeed assembly in this file is already item-oriented.

Also applies to: 684-689
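
One possible shape for the per-item video path is sketched below. `VideoClip`, `Processed`, and `preprocess_one` are hypothetical placeholders (the real preprocessing would go through the model's processor, for example its `preprocess_video`), and the token arithmetic is a dummy value for illustration only.

```rust
/// Hypothetical decoded video item.
pub struct VideoClip {
    pub frames: usize,
}

/// Hypothetical per-video preprocessing result.
#[derive(Debug, PartialEq)]
pub struct Processed {
    pub feature_tokens: usize,
}

/// Placeholder for the real per-video preprocessing call.
fn preprocess_one(index: usize, clip: &VideoClip) -> Result<Processed, String> {
    if clip.frames == 0 {
        return Err(format!("video item {} has no frames", index));
    }
    // Dummy arithmetic; the real token count depends on the model.
    Ok(Processed { feature_tokens: clip.frames * 4 })
}

/// Preprocess every video and keep the outputs item-oriented.
/// Collecting into `Result` stops at the first failing item and reports it.
pub fn preprocess_all(videos: &[VideoClip]) -> Result<Vec<Processed>, String> {
    if videos.is_empty() {
        return Err("no videos were fetched for multimodal request".to_string());
    }
    videos
        .iter()
        .enumerate()
        .map(|(index, clip)| preprocess_one(index, clip))
        .collect()
}
```

The output vector has one entry per input video, in order, so downstream item assembly can pair each result with its placeholder range.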

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model_gateway/src/routers/grpc/multimodal.rs` around lines 605 - 608, The
code currently rejects multi-video requests in the check using Modality::Video
and by only preprocessing videos_for_preprocess.first(), which prevents the new
per-item contract from working; remove or relax the single-video guard (the if
checking modality == Modality::Video && videos.len() != 1) and change the
preprocessing logic that uses videos_for_preprocess.first() to iterate over all
entries in videos_for_preprocess (perform preprocessing per item and collect
results), ensuring downstream TokenSpeed assembly stays item-oriented and
consumes the per-video processed outputs; also update any error messages to
reflect per-item failures rather than a single-video-only error.

@mergify

mergify Bot commented Jun 11, 2026

Contributor

Hi @yechank-nvidia, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

@mergify mergify Bot added the needs-rebase label (PR has merge conflicts that need to be resolved) on Jun 11, 2026

Labels

  • ci: CI/CD configuration changes
  • dependencies: Dependency updates
  • grpc: gRPC client and router changes
  • model-gateway: Model gateway crate changes
  • multimodal: Multimodal crate changes
  • needs-rebase: PR has merge conflicts that need to be resolved
  • tests: Test changes


3 participants