refactor(multimodal): Multimodal ABI generalization by yechank-nvidia · Pull Request #1602 · lightseekorg/smg

yechank-nvidia · 2026-06-05T02:41:28Z

Description

Problem

SMG multimodal support was still structured around an image-only payload shape. This makes it hard to extend the existing precomputed multimodal path to additional modalities such as video and audio without adding more image-specific fields and special cases.

The gateway also needed a cleaner foundation for passing preprocessed multimodal encoder inputs, model-specific metadata, placeholder ranges, and modality information to TokenSpeed.

Solution

This PR generalizes the TokenSpeed multimodal ABI from a single image payload into itemized multimodal inputs. Each multimodal item now carries its own modality, encoder input tensor, model-specific side tensors, placeholder metadata, placeholder token id, and content hash.

This keeps the current precomputed multimodal design, but removes the image-only payload assumption so later PRs can add video and transport optimizations on top of the same structure.

Changes

Generalized TokenSpeed MultimodalInputs to contain repeated MultimodalItems.
Added modality metadata for image, video, and audio.
Replaced image-specific TokenSpeed fields with per-item:
- encoder_input
- model_specific_tensors
- placeholders
- placeholder_token_id
- content_hash
Updated SMG TokenSpeed proto conversion to emit itemized multimodal payloads.
Updated the TokenSpeed Python servicer to consume the itemized multimodal payload.
Preserved existing image behavior on top of the new generalized payload structure.
Added model/tokenizer config loading support needed for multimodal preprocessor metadata.
Avoided warning spam when processor_config.json exists but does not contain video_processor.
Added regression coverage for video_processor config loading behavior.

Test Plan

Verified the stacked branch diff against the original implementation branch.
Verified whitespace/diff hygiene:

git diff --check

Checklist

cargo +nightly fmt passes
cargo clippy --all-targets --all-features -- -D warnings passes
(Optional) Documentation updated
(Optional) Please join us on Slack #sig-smg (https://slack.lightseek.org) to discuss, review, and merge PRs

Summary by CodeRabbit

New Features
- Full video support alongside images: fetching, decoding, frame sampling, and preprocessing.
- Modality-aware multimodal requests and placeholder handling (image vs video).
- Itemized encoder-input payloads with flexible tensor transport (inline, shared memory, remote).
- Model info now reports multimodal capability, supported modalities, and encoder dtypes.
- Deterministic media content hashing and improved preprocessor config loading.
Bug Fixes
- Rejects mixed image+video requests; stronger multimodal payload validation and decoding.

coderabbitai · 2026-06-05T02:41:36Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds itemized, typed TokenSpeed multimodal protobufs; first-class video ingestion/ffmpeg decoding; refactors vision preprocessing to encoder-inputs; makes registry and placeholder logic modality-aware; implements gateway TokenSpeed itemized assembly and servicer itemized decoding; updates tests and manifests.

Changes

Protobuf contracts and multimodal types

Layer / File(s)	Summary
Protobuf schema and multimodal types `crates/grpc_client/proto/tokenspeed_scheduler.proto`, `crates/multimodal/src/types.rs`	`TensorData` becomes a typed tensor descriptor with inline/shm/remote payloads; `MultimodalInputs` is itemized with `MultimodalItem` and `Modality`; `GetModelInfoResponse` adds multimodal capability fields; `VideoClip`, `VideoSource`, and video content-part variants added.

Vision preprocessing interface refactor

Layer / File(s)	Summary
Vision processor and output contract `crates/multimodal/src/vision/processor.rs`, `crates/multimodal/src/vision/mod.rs`, `crates/multimodal/src/lib.rs`	Renamed `ImagePreProcessor` → `VisionPreProcessor`; `PreprocessedImages` → `PreprocessedEncoderInputs` (fields: `encoder_input`, `feature_token_counts`, `item_sizes`); registry renamed to `VisionProcessorRegistry`; added default `preprocess_video`.

All vision processors updated

Layer / File(s)	Summary
Processors and tests `crates/multimodal/src/vision/processors/*`, `crates/multimodal/tests/vision_golden_tests.rs`	Processors (Kimi, LLaVA, Llama4, Phi3/Phi4, Pixtral, Qwen2/Qwen3, etc.) updated to `VisionPreProcessor` and to return `PreprocessedEncoderInputs`; unit tests adapted to new contract and added video tests.

Video ingestion and decoding

Layer / File(s)	Summary
MediaConnector video fetch and decode `crates/multimodal/src/media.rs`, `crates/multimodal/src/hasher.rs`, `crates/multimodal/src/error.rs`, `crates/multimodal/Cargo.toml`	Added `MediaConnector::fetch_video`, `decode_video` using ffmpeg/ffprobe with PNG stream splitting, `VideoFetchConfig`, `hash_video`, and `MediaConnectorError::VideoDecode`; moved `tempfile` to normal deps.

Tracker and multimodal extraction

Layer / File(s)	Summary
Tracker enqueue and types `crates/multimodal/src/tracker.rs`, `crates/multimodal/src/types.rs`	Tracker supports `VideoUrl`/`VideoData` and enqueues video fetch tasks; `VideoClip` is a concrete tracked-media variant; `MediaContentPart` extended for video.

Registry modality-aware APIs and Qwen video handling

Layer / File(s)	Summary
Registry trait and Qwen-specific logic `crates/multimodal/src/registry/traits.rs`, `crates/multimodal/src/registry/qwen3_vl.rs`	`ModelProcessorSpec` gains modality-gated defaults (`placeholder_token_for`, `placeholder_token_id_for`, `prompt_replacements_for`); `UnsupportedModality` error added; Qwen3 VL adds video pad/token handling including temporal-grid splitting.

Gateway multimodal orchestration and TokenSpeed assembly

Layer / File(s)	Summary
Gateway multimodal flow and TokenSpeed proto wrapper `model_gateway/src/routers/grpc/multimodal.rs`, `model_gateway/src/routers/grpc/proto_wrapper.rs`	Adds `video_preprocessor_config`, changes intermediate to `PrecomputedMultimodalIntermediate` with `modality` and `placeholder_token_id`, selects single modality, rejects mixed image+video, assembles itemized TokenSpeed multimodal items, serializes encoder inputs into requested dtype (float16/bfloat16 handling), computes per-item `content_hash`, and emits `tokenspeed::MultimodalInputs.items`. Tests added.

Servicer decode, GetModelInfo, and CI

Layer / File(s)	Summary
Servicer reconstruction and CI check `grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`, `.github/workflows/pr-test-rust.yml`	Servicer adds `_mm_inputs_from_itemized_proto` and `_tensor_payload_bytes` helpers, validates tensor payload shapes and dtypes, conditionally includes multimodal fields in `GetModelInfo`, and CI now asserts `TensorData` descriptor layout.

Request-stage wiring and utils

Layer / File(s)	Summary
Message/chat stages and utils `model_gateway/src/routers/grpc/regular/stages/*`, `model_gateway/src/routers/grpc/utils/chat_utils.rs`	ChatPreparation extracts modalities and rejects mixed requests; message/request stages pass `modality` and `workers` into assembly and map assembly errors to `bad_request`; `transform_content_field` treats `video_url` like `image_url` for placeholder replacement; tests added.

Supporting updates

Layer / File(s)	Summary
Imports, manifests, and tokenizer preload `model_gateway/Cargo.toml`, `model_gateway/src/workflow/tokenizer_registration.rs`, `crates/multimodal/src/lib.rs`, `crates/multimodal/benches/*`	Added `ndarray` to model_gateway deps; tokenizer preload loads `video_preprocessor_config`; lib re-exports video types; minor import/bench formatting changes.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

[Feature]: Eliminate redundant multimodal preprocessing for gRPC backends #566: Overlaps with multimodal assembly and TokenSpeed/tensor handling concerns addressed here.

Possibly related PRs

lightseekorg/smg#958: Related changes to Llava prompt_replacements and preprocessed token-count sourcing.
feat(grpc): add TokenSpeed multimodal (VLM) input support #1515: Earlier TokenSpeed multimodal protobuf work this PR extends with itemized typed-tensor contract and servicer updates.
feat(multimodal): add Kimi-K2.5 vision support for gRPC router #1044: Overlap on Kimi processor prompt-replacement adaptation to encoder-input API.

Suggested reviewers

slin1237
key4ng
XinyueZhang369

"I hopped through frames, a whiskered guide,
I split the PNG stream and kept each stride,
I hashed the bytes and carried clips,
Assembled items, packed the ships.
A rabbit's nod — multimodal, wide."

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

mergify · 2026-06-05T02:42:08Z

Hi @yechank-nvidia, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

Use git commit -s every time, or
VSCode: enable Git: Always Sign Off in Settings
PyCharm: enable Sign-off commit in the Commit tool window

gemini-code-assist

Code Review

This pull request introduces video processing support to the multimodal pipeline, specifically targeting Qwen3-VL models. It updates the gRPC protobuf definitions to support itemized multimodal inputs, implements video fetching and decoding using FFmpeg, and adds video preprocessing, smart resizing, and patchification. The model gateway and Python gRPC servicer are also enhanced to route and validate video payloads. The review feedback identifies a potential division-by-zero bug in smart_resize_video when num_frames is zero, an integer overflow vulnerability during PNG stream splitting, performance overhead from uncached environment variable lookups on the hot path, and a recommendation to use == instead of is for Python enum comparisons.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40c0739860

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs (1)

98-108: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Map placeholder-resolution failures to client error in multimodal flow.

For unsupported modality/model combinations, this path now returns internal_error (5xx). These are request-level failures and should be returned as bad_request (4xx), consistent with the rest of multimodal preparation failures.

💡 Suggested change

-                    error::internal_error(
-                        "multimodal_placeholder_resolution_failed",
-                        format!("Failed to resolve multimodal placeholder token: {e}"),
-                    )
+                    error::bad_request(
+                        "multimodal_not_supported",
+                        format!("Failed to resolve multimodal placeholder token: {e}"),
+                    )

Based on learnings: multimodal preparation failures are currently intentionally surfaced as 400 due generic multimodal error typing.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs` around
lines 98 - 108, The map_err branch in ChatPreparationStage::execute currently
converts placeholder-resolution failures into an internal_error (5xx); change
this to return a bad_request (4xx) so unsupported modality/model combos are
surfaced as request errors: in the map_err closure that logs "Failed to resolve
multimodal placeholder token" (inside ChatPreparationStage::execute), replace
error::internal_error(...) with error::bad_request(...) (keeping the same error
key/message formatting) so multimodal placeholder resolution failures are
returned as 400-level client errors.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/multimodal/src/media.rs`:
- Around line 373-387: The ffmpeg/ffprobe Command::output() calls can hang;
update the calls inside decode_video_with_ffmpeg, probe_video_duration_seconds,
and probe_video_duration_seconds_with_ffmpeg to spawn the child and wait with a
timeout (e.g., tokio::time::timeout) and if the timeout elapses kill the child
process and return a timed-out error; specifically replace the
Command::new(...).output() usage with child =
Command::new(...).stdout(Stdio::piped()).spawn(), await the child's output with
a timeout, on timeout call child.kill().await (or child.kill() and then await)
and then collect stdout/stderr into the same Output-like result or error path so
callers handle failures consistently.

In `@crates/multimodal/src/vision/processors/qwen_vl_base.rs`:
- Around line 235-245: resized_pixels is computed using the padded frame count
(t_bar) but beta is computed with the unpadded num_frames, causing inconsistent
scaling; update both beta calculations in the branches that set h_bar/w_bar to
use the padded frame count (t_bar) instead of num_frames so the pixel budget
math matches resized_pixels (affecting the blocks that compute beta for
max_pixels and min_pixels and the subsequent h_bar/w_bar adjustments).
- Around line 600-613: preprocess_video() is emitting the extra key
"patches_per_video" while Qwen3VLVisionSpec::field_layouts() expects the
flat-field name "patches_per_image", causing a layout mismatch; update the
PreprocessedImages builder in
PreprocessedImages::new_dynamic(...).with_extra(...) to use the same key
"patches_per_image" (or alternatively change Qwen3VLVisionSpec::field_layouts()
to reference "patches_per_video") so the
flat(FieldLayout::flat("patches_per_image")) contract is satisfied; ensure any
downstream consumers that read the extra use the chosen canonical name.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 374-383: GetModelInfo is advertising AUDIO while
_mm_inputs_from_itemized_proto does not carry an audio placeholder token
through, causing audio placeholder replacement to break; either stop
advertising/accepting AUDIO or thread an audio placeholder field into
MultimodalInputs. Fix by removing AUDIO from supported_modalities (the list
built from image_modality/video_modality/audio_modality) and any code paths that
accept it in _modality_from_proto/GetModelInfo until
_mm_inputs_from_itemized_proto is extended, or alternatively add plumbing:
extend _mm_inputs_from_itemized_proto to accept a new
audio_token_id/placeholder_token_id parameter and propagate that into the
MultimodalInputs construction and any call sites so audio placeholder_token_id
survives reconstruction. Ensure you update places that build
supported_modalities (the code block using supports_vision and hf_config) and
the functions _modality_from_proto and _mm_inputs_from_itemized_proto
consistently.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 681-686: The code clones every frame into a new Vec (frames)
before calling preprocess_video, causing unnecessary CPU/memory overhead;
instead, remove the intermediate clone and pass a slice of the existing frames
directly to processor.preprocess_video (use the existing videos_for_preprocess
-> video variable and pass &video.frames or &video.frames[..] as the first
argument), keeping pp_config as before so no data copy occurs.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 179-180: Add a regression test that verifies the template
substitution inserts the video placeholder string correctly when the template
input is formatted as a string: exercise the code path that maps "video_url" =>
image_placeholder.map(String::from) (the template substitution logic in
chat_utils.rs), supply a template containing the video_url placeholder and a
non-empty image_placeholder, then assert the rendered output contains the
expected video placeholder string (and that it matches the behavior for
"image_url"); place this test alongside the other unit tests for chat_utils
template handling.

---

Outside diff comments:
In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs`:
- Around line 98-108: The map_err branch in ChatPreparationStage::execute
currently converts placeholder-resolution failures into an internal_error (5xx);
change this to return a bad_request (4xx) so unsupported modality/model combos
are surfaced as request errors: in the map_err closure that logs "Failed to
resolve multimodal placeholder token" (inside ChatPreparationStage::execute),
replace error::internal_error(...) with error::bad_request(...) (keeping the
same error key/message formatting) so multimodal placeholder resolution failures
are returned as 400-level client errors.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 786f49fa-f73d-4c2d-87a6-c808ddd70b91

📥 Commits

Reviewing files that changed from the base of the PR and between f21917b and 70a013f.

📒 Files selected for processing (24)

crates/grpc_client/proto/tokenspeed_scheduler.proto
crates/multimodal/Cargo.toml
crates/multimodal/src/error.rs
crates/multimodal/src/hasher.rs
crates/multimodal/src/hub.rs
crates/multimodal/src/lib.rs
crates/multimodal/src/media.rs
crates/multimodal/src/registry/qwen3_vl.rs
crates/multimodal/src/registry/traits.rs
crates/multimodal/src/tracker.rs
crates/multimodal/src/types.rs
crates/multimodal/src/vision/image_processor.rs
crates/multimodal/src/vision/preprocessor_config.rs
crates/multimodal/src/vision/processors/qwen3_vl.rs
crates/multimodal/src/vision/processors/qwen_vl_base.rs
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
model_gateway/src/routers/grpc/multimodal.rs
model_gateway/src/routers/grpc/proto_wrapper.rs
model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs
model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs
model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs
model_gateway/src/routers/grpc/utils/chat_utils.rs
model_gateway/src/workflow/tokenizer_registration.rs

chenht2022 · 2026-06-08T16:52:26Z

+                tokens.push(vision_end);
+                tokens.extend(Self::encode_plain_text(
+                    metadata,
+                    &format!("<{grid_idx}.0 seconds>"),


These timestamps seem to not match HF Qwen3-VL. HF emits real seconds(frame_idx / fps, averaged per temporal patch, :.1f) but we emit the integer grid_idx; HF emits a timestamp before every group including frame 0 but we skip the first (grid_idx > 0).
e.g. at the default 2fps with uniform sampling, HF gives 0.2/1.2/2.2/... but here (none)/1.0/2.0/....

Thx. Addressed

chenht2022 · 2026-06-08T16:52:26Z

+    /// Implementations that support video should emit the same primary
+    /// `pixel_values` tensor shape used by the image path, plus video-specific
+    /// model metadata such as `video_grid_thw`.
+    fn preprocess_video(


Now that this trait preprocesses video too, the image-centric names are misleading: ImagePreProcessor, the PreprocessedImages return type, and its num_img_tokens / image_sizes fields.
Also the file/module name image_processor.rs itself.

Changed into more general format.

chenht2022 · 2026-06-08T16:52:26Z

+            placeholder_token_id: self.placeholder_token_id,
+        };
+
+        tokenspeed::MultimodalInputs { items: vec![item] }


MultimodalInputs is repeated MultimodalItem, but the producer always emits exactly one item (all images/videos are folded into a single item, and mixed modality is rejected), so the repeated field isn't really used. It would be better to emit one MultimodalItem per image / per video.

Thx. Addressed

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 732f59bf83

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

crates/multimodal/src/media.rs (1)

572-587: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

PNG chunk length overflow risk on 32-bit targets.

Casting u32 chunk length to usize is safe on 64-bit but could truncate on 32-bit platforms if len > usize::MAX. While unlikely in practice, the subsequent arithmetic could behave unexpectedly.

Consider adding an explicit check or using usize::try_from if 32-bit support is intended.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/multimodal/src/media.rs` around lines 572 - 587, The cast from u32 to
usize is unsafe on 32-bit targets; replace the direct cast in the PNG parsing
block (the line using len_bytes -> len and subsequent cursor arithmetic) with a
fallible conversion (e.g., usize::try_from(u32::from_be_bytes(len_bytes))) and
return a MediaConnectorError::VideoDecode if the conversion fails or if len
exceeds the available remaining bytes; also use checked arithmetic
(checked_add/checked_sub or validate cursor + 12 + len won't overflow) before
advancing cursor to avoid overflow when computing cursor += 12 + len.

crates/multimodal/src/registry/qwen3_vl.rs (1)

66-74: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Accept video_grid_thw as UintTensor (not only IntTensor).

Line 66 currently reads only IntTensor, but the pipeline also uses UintTensor for video_grid_thw (see model_gateway/src/routers/grpc/multimodal.rs Line 1854-1957). That makes video_grid_t return None and Line 259 falls back to pad-only tokens, skipping the Qwen3.5 split path.

Proposed fix

     fn video_grid_t(preprocessed: &PreprocessedEncoderInputs) -> Option<usize> {
         match preprocessed.model_specific.get("video_grid_thw") {
             Some(ModelSpecificValue::IntTensor { data, shape })
-                if shape == &[1, 3] && !data.is_empty() =>
+                if shape.len() == 2 && shape[1] == 3 && !data.is_empty() =>
             {
                 usize::try_from(data[0]).ok()
             }
+            Some(ModelSpecificValue::UintTensor { data, shape })
+                if shape.len() == 2 && shape[1] == 3 && !data.is_empty() =>
+            {
+                Some(data[0] as usize)
+            }
             _ => None,
         }
     }

Also applies to: 244-259

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/multimodal/src/registry/qwen3_vl.rs` around lines 66 - 74, The pattern
match in video_grid_t currently accepts only ModelSpecificValue::IntTensor,
which causes video_grid_thw encoded as UintTensor to be ignored; update the
match in fn video_grid_t (and the similar match around the other occurrence at
the 244-259 region) to also accept ModelSpecificValue::UintTensor, extract the
first element from either IntTensor or UintTensor, and convert it to usize via
usize::try_from (or equivalent) so both signed and unsigned tensor
representations of video_grid_thw are handled consistently.

model_gateway/src/routers/grpc/multimodal.rs (1)

557-603: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject partial multimodal fetches before prompt expansion.

This path only rejects the all-failed case. If a request contains multiple media parts and one fetch/decode drops out while another survives, preprocessing continues with the surviving items only. prompt_replacements will then cover fewer items than the prompt still contains, leaving unresolved placeholder tokens in expanded_token_ids and sending a mismatched prompt/mm payload downstream. Cache the requested per-modality counts up front and fail when the fetched count is smaller.

💡 Suggested change

+    let requested_images = content_parts
+        .iter()
+        .filter(|part| matches!(part, MediaContentPart::ImageUrl { .. }))
+        .count();
+    let requested_videos = content_parts
+        .iter()
+        .filter(|part| matches!(part, MediaContentPart::VideoUrl { .. }))
+        .count();
+
     let mut tracker = AsyncMultiModalTracker::new(components.media_connector.clone());
@@
     let modality = match (images.is_empty(), videos.is_empty()) {
         (false, true) => Modality::Image,
         (true, false) => Modality::Video,
@@
         }
     };
+
+    match modality {
+        Modality::Image if images.len() != requested_images => {
+            anyhow::bail!("Failed to fetch every requested image for multimodal request");
+        }
+        Modality::Video if videos.len() != requested_videos => {
+            anyhow::bail!("Failed to fetch every requested video for multimodal request");
+        }
+        _ => {}
+    }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model_gateway/src/routers/grpc/multimodal.rs` around lines 557 - 603, Record
the requested per-modality counts before calling tracker.finalize() (e.g.,
capture the expected number of image and video parts from the incoming
multimodal request or tracker initialization), then after building
`tracker_output` and extracting `images` and `videos` compare the fetched
lengths against those cached expected counts (using the `images.len()` and
`videos.len()` vs the stored expected image/video counts); if any fetched count
is smaller than requested, return an error (similar to the existing errors)
instead of continuing to prompt expansion so you never proceed with partial
multimodal fetches that would leave `prompt_replacements`/`expanded_token_ids`
mismatched. Ensure you reference `tracker.finalize()`, `tracker_output`,
`images`, `videos`, `prompt_replacements`, and `expanded_token_ids` when adding
this check.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/multimodal/src/registry/llava.rs`:
- Around line 61-71: The code currently uses the first value from
preprocessed.feature_token_counts and replicates it for all images via
PromptReplacement::repeated, relying on debug_assert! to catch mismatches;
change this to a real runtime-safe implementation by either (a) mapping over
preprocessed.feature_token_counts and creating a
PromptReplacement::repeated(Modality::Image, &token, token_id, count) for each
count so each image uses its actual token count, or (b) if divergent counts are
invalid, replace the debug_assert! with a runtime check that returns an Err with
a clear message when any count differs; update the block around
preprocessed.feature_token_counts, PromptReplacement::repeated, token and
token_id accordingly.

In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 800-804: In _mm_inputs_from_itemized_proto(), the loop that
collapses per-item placeholder_token_id into im_token_id/video_token_id should
validate consistency instead of letting the last value win; when you see
item_proto.placeholder_token_id, if modality is Modality.IMAGE compare
int(item_proto.placeholder_token_id) against any existing im_token_id and if
im_token_id is unset assign it, otherwise if it differs raise an error (e.g.,
ValueError or appropriate RPC error); do the same for Modality.VIDEO with
video_token_id so conflicting placeholder IDs are rejected rather than silently
overwritten.

---

Outside diff comments:
In `@crates/multimodal/src/media.rs`:
- Around line 572-587: The cast from u32 to usize is unsafe on 32-bit targets;
replace the direct cast in the PNG parsing block (the line using len_bytes ->
len and subsequent cursor arithmetic) with a fallible conversion (e.g.,
usize::try_from(u32::from_be_bytes(len_bytes))) and return a
MediaConnectorError::VideoDecode if the conversion fails or if len exceeds the
available remaining bytes; also use checked arithmetic (checked_add/checked_sub
or validate cursor + 12 + len won't overflow) before advancing cursor to avoid
overflow when computing cursor += 12 + len.

In `@crates/multimodal/src/registry/qwen3_vl.rs`:
- Around line 66-74: The pattern match in video_grid_t currently accepts only
ModelSpecificValue::IntTensor, which causes video_grid_thw encoded as UintTensor
to be ignored; update the match in fn video_grid_t (and the similar match around
the other occurrence at the 244-259 region) to also accept
ModelSpecificValue::UintTensor, extract the first element from either IntTensor
or UintTensor, and convert it to usize via usize::try_from (or equivalent) so
both signed and unsigned tensor representations of video_grid_thw are handled
consistently.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 557-603: Record the requested per-modality counts before calling
tracker.finalize() (e.g., capture the expected number of image and video parts
from the incoming multimodal request or tracker initialization), then after
building `tracker_output` and extracting `images` and `videos` compare the
fetched lengths against those cached expected counts (using the `images.len()`
and `videos.len()` vs the stored expected image/video counts); if any fetched
count is smaller than requested, return an error (similar to the existing
errors) instead of continuing to prompt expansion so you never proceed with
partial multimodal fetches that would leave
`prompt_replacements`/`expanded_token_ids` mismatched. Ensure you reference
`tracker.finalize()`, `tracker_output`, `images`, `videos`,
`prompt_replacements`, and `expanded_token_ids` when adding this check.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 695ee3ec-6c35-4e27-a461-1f43faa3f2c7

📥 Commits

Reviewing files that changed from the base of the PR and between 70a013f and 732f59b.

📒 Files selected for processing (34)

.github/workflows/pr-test-rust.yml
crates/grpc_client/proto/tokenspeed_scheduler.proto
crates/grpc_client/src/tokenspeed_scheduler.rs
crates/multimodal/benches/image_preprocess.rs
crates/multimodal/src/lib.rs
crates/multimodal/src/media.rs
crates/multimodal/src/registry/kimi_k25.rs
crates/multimodal/src/registry/llama4.rs
crates/multimodal/src/registry/llava.rs
crates/multimodal/src/registry/mod.rs
crates/multimodal/src/registry/phi3_v.rs
crates/multimodal/src/registry/qwen3_vl.rs
crates/multimodal/src/registry/qwen_vl.rs
crates/multimodal/src/registry/traits.rs
crates/multimodal/src/types.rs
crates/multimodal/src/vision/mod.rs
crates/multimodal/src/vision/preprocessor_config.rs
crates/multimodal/src/vision/processor.rs
crates/multimodal/src/vision/processors/kimi_k25.rs
crates/multimodal/src/vision/processors/llama4_vision.rs
crates/multimodal/src/vision/processors/llava.rs
crates/multimodal/src/vision/processors/mod.rs
crates/multimodal/src/vision/processors/phi3_vision.rs
crates/multimodal/src/vision/processors/phi4_vision.rs
crates/multimodal/src/vision/processors/pixtral.rs
crates/multimodal/src/vision/processors/qwen2_vl.rs
crates/multimodal/src/vision/processors/qwen3_vl.rs
crates/multimodal/src/vision/processors/qwen_vl_base.rs
crates/multimodal/tests/vision_golden_tests.rs
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
model_gateway/Cargo.toml
model_gateway/src/routers/grpc/multimodal.rs
model_gateway/src/routers/grpc/proto_wrapper.rs
model_gateway/src/routers/grpc/utils/chat_utils.rs

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7864781423

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

CatherineSue · 2026-06-11T04:48:31Z

+    Ok(TokenSpeedMultimodalData { items })
+}
+
+fn tokenspeed_item_count(intermediate: &PrecomputedMultimodalIntermediate) -> Result<usize> {


Is this specifically for tokenspeed? or it is universal to other engines?

The helper is currently only used by the TokenSpeed assembler because TokenSpeed is the backend that needs itemized multimodal payloads here, but the invariant itself is not TokenSpeed-specific. It validates that the precomputed media/token/placeholder counts agree before itemization.

I renamed it to precomputed_multimodal_item_count and made the error messages backend-neutral.

CatherineSue · 2026-06-11T04:49:15Z

+    }
+}
+
+fn serialize_model_specific_for_item(


Should this helper be in multimodal crate instead?

Good point.The item slicing semantics are not gateway-specific, so I moved that part onto ModelSpecificValue in the multimodal crate via as_flat_sizes() and slice_first_dim().

The gateway now only performs transport serialization after slicing.

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

crates/multimodal/src/vision/processors/qwen2_vl.rs (1)

209-214: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

preprocess() drops config-driven resize/token overrides.

This delegates straight to self.inner.preprocess(images, config), but QwenVLProcessorBase::preprocess() reads patch size / merge size / pixel limits from self.config. So Qwen2VLProcessor::new().preprocess(..., config_with_patch_or_pixel_overrides) still runs with the constructor defaults instead of the request config.

Suggested fix

     fn preprocess(
         &self,
         images: &[DynamicImage],
         config: &PreProcessorConfig,
     ) -> Result<PreprocessedEncoderInputs, TransformError> {
-        self.inner.preprocess(images, config)
+        let processor = if config.patch_size.is_some()
+            || config.merge_size.is_some()
+            || config.min_pixels.is_some()
+            || config.max_pixels.is_some()
+            || config.temporal_patch_size.is_some()
+        {
+            Self::from_preprocessor_config(config)
+        } else {
+            self.clone()
+        };
+
+        processor.inner.preprocess(images, config)
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/multimodal/src/vision/processors/qwen2_vl.rs` around lines 209 - 214,
preprocess currently delegates to self.inner.preprocess(images, config) but
QwenVLProcessorBase::preprocess reads from self.config so per-request overrides
in the passed config are ignored; fix by applying the incoming
PreProcessorConfig overrides onto the inner processor's config before calling
preprocess (e.g. clone or make a mutable copy of self.inner, copy/override
relevant fields such as patch_size, merge_size, max_pixels/min_pixels or any
pixel/resize/token overrides from the incoming config into inner.config), then
call that inner's preprocess (or adjust QwenVLProcessorBase::preprocess to honor
the passed config) so the request-level config takes effect.

crates/multimodal/src/vision/processors/phi4_vision.rs (1)

466-470: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor image_std-only overrides when selecting the processor.

This branch only rebuilds from PreProcessorConfig when dynamic_hd or image_mean is set. A request that overrides just image_std still takes self.clone(), so normalization uses the default std and produces the wrong encoder_input.

Suggested fix

-        let processor = if config.dynamic_hd.is_some() || config.image_mean.is_some() {
+        let processor = if config.dynamic_hd.is_some()
+            || config.image_mean.is_some()
+            || config.image_std.is_some()
+        {
             Self::from_preprocessor_config(config)
         } else {
             self.clone()
         };

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/multimodal/src/vision/processors/phi4_vision.rs` around lines 466 -
470, The processor selection branch doesn't account for image_std-only
overrides, so when only config.image_std is set it wrongly uses self.clone();
update the conditional that builds processor (the let processor = ... block) to
also check config.image_std (i.e., rebuild via Self::from_preprocessor_config
when config.dynamic_hd.is_some() || config.image_mean.is_some() ||
config.image_std.is_some()) so normalization uses the overridden std instead of
the clone's defaults.

model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs (1)

79-111: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid hard-coding image modality in placeholder resolution.

resolve_placeholder_token is modality-specific, but Line 110 always passes Modality::Image. For video message requests, this can resolve the wrong placeholder token and break downstream token-expansion matching. Align this with the chat preparation flow by deriving request modality from message parts (and rejecting mixed modalities if still unsupported) before calling resolve_placeholder_token.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs` around
lines 79 - 111, The code currently always passes Modality::Image to
multimodal::resolve_placeholder_token in MessagePreparationStage::execute which
breaks non-image multimodal requests; update the logic to derive the modality
from the incoming request.messages (inspect each message part to determine
Modality enum values), ensure you reject mixed-modal requests if unsupported,
and pass the determined modality to multimodal::resolve_placeholder_token
instead of Modality::Image; use multimodal::has_multimodal_content_messages,
ctx.components.multimodal, and the tokenizer lookup
(tokenizer_registry/get_by_name) as-is and only change the modality selection
and validation before calling resolve_placeholder_token.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/multimodal/src/media.rs`:
- Around line 214-216: The code decodes base64 blobs and fully buffers them into
memory via BASE64_STANDARD.decode and then calls self.decode_video(...,
VideoSource::DataUrl) which can OOM on large inputs; before decoding or before
creating a VideoClip, enforce a maximum allowed payload size (e.g. reject or
truncate if decoded length > MAX_VIDEO_BYTES) by first checking the base64
length/estimate or streaming/limiting the decode, return an error on oversize
inputs, and document MAX_VIDEO_BYTES; apply the same bounding check to the other
call sites that create VideoClip (the calls into self.decode_video at the other
referenced locations) so no code path retains unbounded byte buffers.

In `@crates/multimodal/src/registry/qwen3_vl.rs`:
- Around line 365-400: The test qwen3_5_video_replacement_splits_temporal_grid
currently relies on TestTokenizer which returns empty IDs from
TestTokenizer::encode so timestamp tokens are never asserted; replace or extend
the tokenizer used in the test with a tokenizer test-double that maps timestamp
strings (e.g. "<0.0 seconds>", "<1.0 seconds>") to deterministic, unique token
IDs so encoding of timestamp text actually produces IDs, then update the
assertions on replacements (obtained via
ModelRegistry::lookup(...).prompt_replacements_for(...)) to check for those
timestamp token IDs at the expected temporal segment boundaries (use the same
test helper test_preprocessed_with_tokens and ModelSpecificValue::int_2d to
produce segments). Ensure the tokenizer double is referenced where
metadata.tokenizer is constructed so prompt_replacements_for sees the real
encoded timestamp IDs.

In `@crates/multimodal/src/vision/processors/llama4_vision.rs`:
- Around line 453-465: The code pushes image dimensions into item_sizes as
(height, width), which is inconsistent with other processors that use (width,
height); change the push in the loop that uses
processor.process_single_image(...) so item_sizes.push((image.width(),
image.height())) to match the ordering expected by PreprocessedEncoderInputs and
avoid flipping portrait/landscape metadata for Llama4Vision.

In `@crates/multimodal/src/vision/processors/qwen3_vl.rs`:
- Around line 124-131: from_preprocessor_config() maps
size.shortest_edge/longest_edge into min_pixels/max_pixels but preprocess() and
preprocess_video() still delegate to self.inner.*, so the new pixel limits are
never used; update the constructor/path so the inner processor uses the mapped
config — either initialize self.inner with the modified config (ensure
Qwen3VLProcessor::new or from_preprocessor_config creates the inner
QwenVLProcessorBase using the mapped min_pixels/max_pixels) or override
preprocess()/preprocess_video() in Qwen3VLProcessor to call the base
implementation that reads self.config (QwenVLProcessorBase) so the size-edge
fallback takes effect. Ensure the change is mirrored for the other occurrence
around lines 216-230.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 605-608: The code currently rejects multi-video requests in the
check using Modality::Video and by only preprocessing
videos_for_preprocess.first(), which prevents the new per-item contract from
working; remove or relax the single-video guard (the if checking modality ==
Modality::Video && videos.len() != 1) and change the preprocessing logic that
uses videos_for_preprocess.first() to iterate over all entries in
videos_for_preprocess (perform preprocessing per item and collect results),
ensuring downstream TokenSpeed assembly stays item-oriented and consumes the
per-video processed outputs; also update any error messages to reflect per-item
failures rather than a single-video-only error.

---

Outside diff comments:
In `@crates/multimodal/src/vision/processors/phi4_vision.rs`:
- Around line 466-470: The processor selection branch doesn't account for
image_std-only overrides, so when only config.image_std is set it wrongly uses
self.clone(); update the conditional that builds processor (the let processor =
... block) to also check config.image_std (i.e., rebuild via
Self::from_preprocessor_config when config.dynamic_hd.is_some() ||
config.image_mean.is_some() || config.image_std.is_some()) so normalization uses
the overridden std instead of the clone's defaults.

In `@crates/multimodal/src/vision/processors/qwen2_vl.rs`:
- Around line 209-214: preprocess currently delegates to
self.inner.preprocess(images, config) but QwenVLProcessorBase::preprocess reads
from self.config so per-request overrides in the passed config are ignored; fix
by applying the incoming PreProcessorConfig overrides onto the inner processor's
config before calling preprocess (e.g. clone or make a mutable copy of
self.inner, copy/override relevant fields such as patch_size, merge_size,
max_pixels/min_pixels or any pixel/resize/token overrides from the incoming
config into inner.config), then call that inner's preprocess (or adjust
QwenVLProcessorBase::preprocess to honor the passed config) so the request-level
config takes effect.

In `@model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs`:
- Around line 79-111: The code currently always passes Modality::Image to
multimodal::resolve_placeholder_token in MessagePreparationStage::execute which
breaks non-image multimodal requests; update the logic to derive the modality
from the incoming request.messages (inspect each message part to determine
Modality enum values), ensure you reject mixed-modal requests if unsupported,
and pass the determined modality to multimodal::resolve_placeholder_token
instead of Modality::Image; use multimodal::has_multimodal_content_messages,
ctx.components.multimodal, and the tokenizer lookup
(tokenizer_registry/get_by_name) as-is and only change the modality selection
and validation before calling resolve_placeholder_token.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a453fcdf-941a-409f-9be4-b2e6fe6f3b7f

📥 Commits

Reviewing files that changed from the base of the PR and between 7864781 and 3ceda86.

📒 Files selected for processing (44)

.github/workflows/pr-test-rust.yml
crates/grpc_client/proto/tokenspeed_scheduler.proto
crates/grpc_client/src/tokenspeed_scheduler.rs
crates/multimodal/Cargo.toml
crates/multimodal/benches/image_preprocess.rs
crates/multimodal/src/error.rs
crates/multimodal/src/hasher.rs
crates/multimodal/src/hub.rs
crates/multimodal/src/lib.rs
crates/multimodal/src/media.rs
crates/multimodal/src/registry/kimi_k25.rs
crates/multimodal/src/registry/llama4.rs
crates/multimodal/src/registry/llava.rs
crates/multimodal/src/registry/mod.rs
crates/multimodal/src/registry/phi3_v.rs
crates/multimodal/src/registry/qwen3_vl.rs
crates/multimodal/src/registry/qwen_vl.rs
crates/multimodal/src/registry/traits.rs
crates/multimodal/src/tracker.rs
crates/multimodal/src/types.rs
crates/multimodal/src/vision/mod.rs
crates/multimodal/src/vision/preprocessor_config.rs
crates/multimodal/src/vision/processor.rs
crates/multimodal/src/vision/processors/kimi_k25.rs
crates/multimodal/src/vision/processors/llama4_vision.rs
crates/multimodal/src/vision/processors/llava.rs
crates/multimodal/src/vision/processors/mod.rs
crates/multimodal/src/vision/processors/phi3_vision.rs
crates/multimodal/src/vision/processors/phi4_vision.rs
crates/multimodal/src/vision/processors/pixtral.rs
crates/multimodal/src/vision/processors/qwen2_vl.rs
crates/multimodal/src/vision/processors/qwen3_vl.rs
crates/multimodal/src/vision/processors/qwen_vl_base.rs
crates/multimodal/tests/vision_golden_tests.rs
grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py
model_gateway/Cargo.toml
model_gateway/src/routers/grpc/multimodal.rs
model_gateway/src/routers/grpc/proto_wrapper.rs
model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs
model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs
model_gateway/src/routers/grpc/regular/stages/messages/request_building.rs
model_gateway/src/routers/grpc/utils/chat_utils.rs
model_gateway/src/workflow/tokenizer_registration.rs

coderabbitai · 2026-06-11T05:35:06Z

    /// Get the batch size.
    pub fn batch_size(&self) -> usize {
-        self.pixel_values.shape()[0]
+        self.encoder_input.shape()[0]
    }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

batch_size() no longer matches the generalized encoder-input contract.

After this refactor, some processors flatten encoder patches/tiles into axis 0 instead of keeping “one item per batch row”. For example, KimiK25Processor now emits [total_patches, 3, patch, patch] and Llama4VisionProcessor emits [total_tiles, 3, H, W], so batch_size() returns the patch/tile count there instead of the number of media items. Any generic caller using batch_size() will misalign metadata or placeholder expansion on those models.

Suggested fix

/// Get the batch size. pub fn batch_size(&self) -> usize { - self.encoder_input.shape()[0] + self.item_sizes.len() }

Also applies to: 342-345

coderabbitai · 2026-06-11T05:35:06Z

        let mut all_outputs = Vec::new();
        let mut all_aspect_ratios = Vec::new();
-        let mut image_sizes = Vec::new();
-        let mut num_img_tokens = Vec::new();
+        let mut item_sizes = Vec::new();
+        let mut feature_token_counts = Vec::new();

        for image in images {
            let (output, aspect_ratio) = processor.process_single_image(image);
            let tokens = processor.calculate_num_tokens_for_aspect_ratio(aspect_ratio);

            all_outputs.push(output);
            all_aspect_ratios.push(aspect_ratio);
-            image_sizes.push((image.height(), image.width()));
-            num_img_tokens.push(tokens);
+            item_sizes.push((image.height(), image.width()));
+            feature_token_counts.push(tokens);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep item_sizes in the same tuple order as the other processors.

This path pushes (height, width), but the other processors in this cohort populate item_sizes from DynamicImage::dimensions(), i.e. (width, height). Because PreprocessedEncoderInputs.item_sizes is shared across processors and carries no per-model ordering flag, this will flip portrait/landscape metadata for generic consumers on LLaMA 4 only.

Suggested fix

- item_sizes.push((image.height(), image.width())); + item_sizes.push(image.dimensions());

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/multimodal/src/vision/processors/llama4_vision.rs` around lines 453 - 465, The code pushes image dimensions into item_sizes as (height, width), which is inconsistent with other processors that use (width, height); change the push in the loop that uses processor.process_single_image(...) so item_sizes.push((image.width(), image.height())) to match the ordering expected by PreprocessedEncoderInputs and avoid flipping portrait/landscape metadata for Llama4Vision.

coderabbitai · 2026-06-11T05:35:06Z

+                min_pixels: config
+                    .min_pixels
+                    .or_else(|| config.get_shortest_edge())
+                    .unwrap_or(DEFAULT_MIN_PIXELS),
+                max_pixels: config
+                    .max_pixels
+                    .or_else(|| config.get_longest_edge())
+                    .unwrap_or(DEFAULT_MAX_PIXELS),


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The new size-edge fallback never reaches preprocess() / preprocess_video().

from_preprocessor_config() now maps size.shortest_edge / size.longest_edge into min_pixels / max_pixels, but both trait entrypoints still forward to self.inner.*. Since QwenVLProcessorBase computes image/video resize limits from self.config, Qwen3VLProcessor::new().preprocess_video(..., config_with_size) still uses the default budgets instead of the values you just loaded here.

Suggested fix

fn preprocess( &self, images: &[DynamicImage], config: &PreProcessorConfig, ) -> Result<PreprocessedEncoderInputs, TransformError> { - self.inner.preprocess(images, config) + let processor = if config.patch_size.is_some() + || config.merge_size.is_some() + || config.min_pixels.is_some() + || config.max_pixels.is_some() + || config.temporal_patch_size.is_some() + || config.size.is_some() + { + Self::from_preprocessor_config(config) + } else { + self.clone() + }; + + processor.inner.preprocess(images, config) } fn preprocess_video( &self, frames: &[DynamicImage], config: &PreProcessorConfig, ) -> Result<PreprocessedEncoderInputs, TransformError> { - self.inner.preprocess_video(frames, config) + let processor = if config.patch_size.is_some() + || config.merge_size.is_some() + || config.min_pixels.is_some() + || config.max_pixels.is_some() + || config.temporal_patch_size.is_some() + || config.size.is_some() + { + Self::from_preprocessor_config(config) + } else { + self.clone() + }; + + processor.inner.preprocess_video(frames, config) }

Also applies to: 216-230

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/multimodal/src/vision/processors/qwen3_vl.rs` around lines 124 - 131, from_preprocessor_config() maps size.shortest_edge/longest_edge into min_pixels/max_pixels but preprocess() and preprocess_video() still delegate to self.inner.*, so the new pixel limits are never used; update the constructor/path so the inner processor uses the mapped config — either initialize self.inner with the modified config (ensure Qwen3VLProcessor::new or from_preprocessor_config creates the inner QwenVLProcessorBase using the mapped min_pixels/max_pixels) or override preprocess()/preprocess_video() in Qwen3VLProcessor to call the base implementation that reads self.config (QwenVLProcessorBase) so the size-edge fallback takes effect. Ensure the change is mirrored for the other occurrence around lines 216-230.

coderabbitai · 2026-06-11T05:35:06Z

+    if modality == Modality::Video && videos.len() != 1 {
        return Err(anyhow::anyhow!(
-            "No images were successfully fetched for multimodal request"
+            "Exactly one video is supported per request for the initial video path"
        ));


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The video path is still single-item despite the new itemized contract.

Lines 605-608 reject any request with more than one video, and Lines 684-689 only ever preprocess videos_for_preprocess.first(). That means multi-video prompts still fail here even though the downstream TokenSpeed assembly in this file is already item-oriented.

Also applies to: 684-689

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@model_gateway/src/routers/grpc/multimodal.rs` around lines 605 - 608, The code currently rejects multi-video requests in the check using Modality::Video and by only preprocessing videos_for_preprocess.first(), which prevents the new per-item contract from working; remove or relax the single-video guard (the if checking modality == Modality::Video && videos.len() != 1) and change the preprocessing logic that uses videos_for_preprocess.first() to iterate over all entries in videos_for_preprocess (perform preprocessing per item and collect results), ensuring downstream TokenSpeed assembly stays item-oriented and consumes the per-video processed outputs; also update any error messages to reflect per-item failures rather than a single-video-only error.

mergify · 2026-06-11T07:13:43Z

Hi @yechank-nvidia, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

yechank-nvidia marked this pull request as ready for review June 5, 2026 02:41

yechank-nvidia requested review from CatherineSue, key4ng and slin1237 as code owners June 5, 2026 02:41

github-actions Bot added dependencies Dependency updates grpc gRPC client and router changes multimodal Multimodal crate changes model-gateway Model gateway crate changes labels Jun 5, 2026

yechank-nvidia force-pushed the yechan/mm-abi-generalization branch from 40c0739 to 70a013f Compare June 5, 2026 02:43

gemini-code-assist Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread crates/multimodal/src/vision/processors/qwen_vl_base.rs

Comment thread crates/multimodal/src/media.rs Outdated

Comment thread model_gateway/src/routers/grpc/multimodal.rs

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated

chatgpt-codex-connector Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread crates/grpc_client/proto/tokenspeed_scheduler.proto

yechank-nvidia changed the title ~~Yechan/mm abi generalization~~ Multimodal ABI generalization Jun 5, 2026

yechank-nvidia changed the title ~~Multimodal ABI generalization~~ refactor(multimodal) Multimodal ABI generalization Jun 5, 2026

This was referenced Jun 5, 2026

feat(multimodal): add video media and Qwen VL preprocessing #1603

Draft

perf(tokenspeed): optimize multimodal tensor transport #1604

Draft

coderabbitai Bot requested changes Jun 5, 2026

View reviewed changes

chenht2022 reviewed Jun 8, 2026

View reviewed changes

yechank-nvidia changed the title ~~refactor(multimodal) Multimodal ABI generalization~~ refactor(multimodal): Multimodal ABI generalization Jun 9, 2026

yechank-nvidia requested a review from XinyueZhang369 as a code owner June 9, 2026 15:53

github-actions Bot added ci CI/CD configuration changes tests Test changes labels Jun 9, 2026

chatgpt-codex-connector Bot reviewed Jun 9, 2026

View reviewed changes

Comment thread crates/grpc_client/proto/tokenspeed_scheduler.proto

coderabbitai Bot requested changes Jun 9, 2026

View reviewed changes

Comment thread crates/multimodal/src/registry/llava.rs Outdated

Comment thread grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py Outdated

coderabbitai Bot approved these changes Jun 9, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread crates/multimodal/src/vision/processors/qwen_vl_base.rs

CatherineSue reviewed Jun 11, 2026

View reviewed changes

yechank-nvidia added 2 commits June 10, 2026 22:15

Add precomputed video multimodal support

5c6f06e

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

Add itemized multimodal TokenSpeed payloads

9e40211

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

yechank-nvidia added 11 commits June 10, 2026 22:15

Address multimodal config review feedback

77c98b7

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

Add video processor config regression tests

d0a51b1

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

Address multimodal ABI review feedback

657e84c

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

refactor(multimodal): generalize vision encoder inputs

676c081

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

feat(tokenspeed): assemble itemized multimodal payloads

6c38d22

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

refactor(tokenspeed): clean tensor payload ABI

b28a62f

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

test(tokenspeed): cover multimodal ABI contracts

227ef8d

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

style(tokenspeed): format multimodal servicer path

c8cfa8a

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

fix(multimodal): preserve per-image Llava token counts

3a5bfa8

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

fix(tokenspeed): reject conflicting multimodal placeholder ids

d30ef36

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

refactor(multimodal): move model-specific slicing helpers

3ceda86

Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>

yechank-nvidia force-pushed the yechan/mm-abi-generalization branch from eded36c to 3ceda86 Compare June 11, 2026 05:18

coderabbitai Bot requested changes Jun 11, 2026

View reviewed changes

mergify Bot added the needs-rebase PR has merge conflicts that need to be resolved label Jun 11, 2026

Conversation

yechank-nvidia commented Jun 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Problem

Solution

Changes

Test Plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Uh oh!

mergify Bot commented Jun 5, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

yechank-nvidia commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading