refactor(multimodal): Multimodal ABI generalization#1602
refactor(multimodal): Multimodal ABI generalization#1602yechank-nvidia wants to merge 13 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds itemized, typed TokenSpeed multimodal protobufs; first-class video ingestion/ffmpeg decoding; refactors vision preprocessing to encoder-inputs; makes registry and placeholder logic modality-aware; implements gateway TokenSpeed itemized assembly and servicer itemized decoding; updates tests and manifests. ChangesProtobuf contracts and multimodal types
Vision preprocessing interface refactor
All vision processors updated
Video ingestion and decoding
Tracker and multimodal extraction
Registry modality-aware APIs and Qwen video handling
Gateway multimodal orchestration and TokenSpeed assembly
Servicer decode, GetModelInfo, and CI
Request-stage wiring and utils
Supporting updates
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
|
|
Hi @yechank-nvidia, the DCO sign-off check has failed. All commits must include a To fix existing commits: # Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-leaseTo sign off future commits automatically:
|
40c0739 to
70a013f
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces video processing support to the multimodal pipeline, specifically targeting Qwen3-VL models. It updates the gRPC protobuf definitions to support itemized multimodal inputs, implements video fetching and decoding using FFmpeg, and adds video preprocessing, smart resizing, and patchification. The model gateway and Python gRPC servicer are also enhanced to route and validate video payloads. The review feedback identifies a potential division-by-zero bug in smart_resize_video when num_frames is zero, an integer overflow vulnerability during PNG stream splitting, performance overhead from uncached environment variable lookups on the hot path, and a recommendation to use == instead of is for Python enum comparisons.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 40c0739860
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs (1)
98-108:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMap placeholder-resolution failures to client error in multimodal flow.
For unsupported modality/model combinations, this path now returns
internal_error(5xx). These are request-level failures and should be returned asbad_request(4xx), consistent with the rest of multimodal preparation failures.💡 Suggested change
- error::internal_error( - "multimodal_placeholder_resolution_failed", - format!("Failed to resolve multimodal placeholder token: {e}"), - ) + error::bad_request( + "multimodal_not_supported", + format!("Failed to resolve multimodal placeholder token: {e}"), + )Based on learnings: multimodal preparation failures are currently intentionally surfaced as 400 due generic multimodal error typing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs` around lines 98 - 108, The map_err branch in ChatPreparationStage::execute currently converts placeholder-resolution failures into an internal_error (5xx); change this to return a bad_request (4xx) so unsupported modality/model combos are surfaced as request errors: in the map_err closure that logs "Failed to resolve multimodal placeholder token" (inside ChatPreparationStage::execute), replace error::internal_error(...) with error::bad_request(...) (keeping the same error key/message formatting) so multimodal placeholder resolution failures are returned as 400-level client errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/multimodal/src/media.rs`:
- Around line 373-387: The ffmpeg/ffprobe Command::output() calls can hang;
update the calls inside decode_video_with_ffmpeg, probe_video_duration_seconds,
and probe_video_duration_seconds_with_ffmpeg to spawn the child and wait with a
timeout (e.g., tokio::time::timeout) and if the timeout elapses kill the child
process and return a timed-out error; specifically replace the
Command::new(...).output() usage with child =
Command::new(...).stdout(Stdio::piped()).spawn(), await the child's output with
a timeout, on timeout call child.kill().await (or child.kill() and then await)
and then collect stdout/stderr into the same Output-like result or error path so
callers handle failures consistently.
In `@crates/multimodal/src/vision/processors/qwen_vl_base.rs`:
- Around line 235-245: resized_pixels is computed using the padded frame count
(t_bar) but beta is computed with the unpadded num_frames, causing inconsistent
scaling; update both beta calculations in the branches that set h_bar/w_bar to
use the padded frame count (t_bar) instead of num_frames so the pixel budget
math matches resized_pixels (affecting the blocks that compute beta for
max_pixels and min_pixels and the subsequent h_bar/w_bar adjustments).
- Around line 600-613: preprocess_video() is emitting the extra key
"patches_per_video" while Qwen3VLVisionSpec::field_layouts() expects the
flat-field name "patches_per_image", causing a layout mismatch; update the
PreprocessedImages builder in
PreprocessedImages::new_dynamic(...).with_extra(...) to use the same key
"patches_per_image" (or alternatively change Qwen3VLVisionSpec::field_layouts()
to reference "patches_per_video") so the
flat(FieldLayout::flat("patches_per_image")) contract is satisfied; ensure any
downstream consumers that read the extra use the chosen canonical name.
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 374-383: GetModelInfo is advertising AUDIO while
_mm_inputs_from_itemized_proto does not carry an audio placeholder token
through, causing audio placeholder replacement to break; either stop
advertising/accepting AUDIO or thread an audio placeholder field into
MultimodalInputs. Fix by removing AUDIO from supported_modalities (the list
built from image_modality/video_modality/audio_modality) and any code paths that
accept it in _modality_from_proto/GetModelInfo until
_mm_inputs_from_itemized_proto is extended, or alternatively add plumbing:
extend _mm_inputs_from_itemized_proto to accept a new
audio_token_id/placeholder_token_id parameter and propagate that into the
MultimodalInputs construction and any call sites so audio placeholder_token_id
survives reconstruction. Ensure you update places that build
supported_modalities (the code block using supports_vision and hf_config) and
the functions _modality_from_proto and _mm_inputs_from_itemized_proto
consistently.
In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 681-686: The code clones every frame into a new Vec (frames)
before calling preprocess_video, causing unnecessary CPU/memory overhead;
instead, remove the intermediate clone and pass a slice of the existing frames
directly to processor.preprocess_video (use the existing videos_for_preprocess
-> video variable and pass &video.frames or &video.frames[..] as the first
argument), keeping pp_config as before so no data copy occurs.
In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 179-180: Add a regression test that verifies the template
substitution inserts the video placeholder string correctly when the template
input is formatted as a string: exercise the code path that maps "video_url" =>
image_placeholder.map(String::from) (the template substitution logic in
chat_utils.rs), supply a template containing the video_url placeholder and a
non-empty image_placeholder, then assert the rendered output contains the
expected video placeholder string (and that it matches the behavior for
"image_url"); place this test alongside the other unit tests for chat_utils
template handling.
---
Outside diff comments:
In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs`:
- Around line 98-108: The map_err branch in ChatPreparationStage::execute
currently converts placeholder-resolution failures into an internal_error (5xx);
change this to return a bad_request (4xx) so unsupported modality/model combos
are surfaced as request errors: in the map_err closure that logs "Failed to
resolve multimodal placeholder token" (inside ChatPreparationStage::execute),
replace error::internal_error(...) with error::bad_request(...) (keeping the
same error key/message formatting) so multimodal placeholder resolution failures
are returned as 400-level client errors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 786f49fa-f73d-4c2d-87a6-c808ddd70b91
📒 Files selected for processing (24)
crates/grpc_client/proto/tokenspeed_scheduler.protocrates/multimodal/Cargo.tomlcrates/multimodal/src/error.rscrates/multimodal/src/hasher.rscrates/multimodal/src/hub.rscrates/multimodal/src/lib.rscrates/multimodal/src/media.rscrates/multimodal/src/registry/qwen3_vl.rscrates/multimodal/src/registry/traits.rscrates/multimodal/src/tracker.rscrates/multimodal/src/types.rscrates/multimodal/src/vision/image_processor.rscrates/multimodal/src/vision/preprocessor_config.rscrates/multimodal/src/vision/processors/qwen3_vl.rscrates/multimodal/src/vision/processors/qwen_vl_base.rsgrpc_servicer/smg_grpc_servicer/tokenspeed/servicer.pymodel_gateway/src/routers/grpc/multimodal.rsmodel_gateway/src/routers/grpc/proto_wrapper.rsmodel_gateway/src/routers/grpc/regular/stages/chat/preparation.rsmodel_gateway/src/routers/grpc/regular/stages/chat/request_building.rsmodel_gateway/src/routers/grpc/regular/stages/messages/preparation.rsmodel_gateway/src/routers/grpc/regular/stages/messages/request_building.rsmodel_gateway/src/routers/grpc/utils/chat_utils.rsmodel_gateway/src/workflow/tokenizer_registration.rs
| tokens.push(vision_end); | ||
| tokens.extend(Self::encode_plain_text( | ||
| metadata, | ||
| &format!("<{grid_idx}.0 seconds>"), |
There was a problem hiding this comment.
These timestamps seem to not match HF Qwen3-VL. HF emits real seconds(frame_idx / fps, averaged per temporal patch, :.1f) but we emit the integer grid_idx; HF emits a timestamp before every group including frame 0 but we skip the first (grid_idx > 0).
e.g. at the default 2fps with uniform sampling, HF gives 0.2/1.2/2.2/... but here (none)/1.0/2.0/....
There was a problem hiding this comment.
Thx. Addressed
| /// Implementations that support video should emit the same primary | ||
| /// `pixel_values` tensor shape used by the image path, plus video-specific | ||
| /// model metadata such as `video_grid_thw`. | ||
| fn preprocess_video( |
There was a problem hiding this comment.
Now that this trait preprocesses video too, the image-centric names are misleading: ImagePreProcessor, the PreprocessedImages return type, and its num_img_tokens / image_sizes fields.
Also the file/module name image_processor.rs itself.
There was a problem hiding this comment.
Changed into more general format.
| placeholder_token_id: self.placeholder_token_id, | ||
| }; | ||
|
|
||
| tokenspeed::MultimodalInputs { items: vec![item] } |
There was a problem hiding this comment.
MultimodalInputs is repeated MultimodalItem, but the producer always emits exactly one item (all images/videos are folded into a single item, and mixed modality is rejected), so the repeated field isn't really used. It would be better to emit one MultimodalItem per image / per video.
There was a problem hiding this comment.
Thx. Addressed
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 732f59bf83
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
crates/multimodal/src/media.rs (1)
572-587: 🧹 Nitpick | 🔵 Trivial | 💤 Low valuePNG chunk length overflow risk on 32-bit targets.
Casting
u32chunk length tousizeis safe on 64-bit but could truncate on 32-bit platforms iflen > usize::MAX. While unlikely in practice, the subsequent arithmetic could behave unexpectedly.Consider adding an explicit check or using
usize::try_fromif 32-bit support is intended.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/multimodal/src/media.rs` around lines 572 - 587, The cast from u32 to usize is unsafe on 32-bit targets; replace the direct cast in the PNG parsing block (the line using len_bytes -> len and subsequent cursor arithmetic) with a fallible conversion (e.g., usize::try_from(u32::from_be_bytes(len_bytes))) and return a MediaConnectorError::VideoDecode if the conversion fails or if len exceeds the available remaining bytes; also use checked arithmetic (checked_add/checked_sub or validate cursor + 12 + len won't overflow) before advancing cursor to avoid overflow when computing cursor += 12 + len.crates/multimodal/src/registry/qwen3_vl.rs (1)
66-74:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAccept
video_grid_thwasUintTensor(not onlyIntTensor).Line 66 currently reads only
IntTensor, but the pipeline also usesUintTensorforvideo_grid_thw(seemodel_gateway/src/routers/grpc/multimodal.rsLine 1854-1957). That makesvideo_grid_treturnNoneand Line 259 falls back to pad-only tokens, skipping the Qwen3.5 split path.Proposed fix
fn video_grid_t(preprocessed: &PreprocessedEncoderInputs) -> Option<usize> { match preprocessed.model_specific.get("video_grid_thw") { Some(ModelSpecificValue::IntTensor { data, shape }) - if shape == &[1, 3] && !data.is_empty() => + if shape.len() == 2 && shape[1] == 3 && !data.is_empty() => { usize::try_from(data[0]).ok() } + Some(ModelSpecificValue::UintTensor { data, shape }) + if shape.len() == 2 && shape[1] == 3 && !data.is_empty() => + { + Some(data[0] as usize) + } _ => None, } }Also applies to: 244-259
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/multimodal/src/registry/qwen3_vl.rs` around lines 66 - 74, The pattern match in video_grid_t currently accepts only ModelSpecificValue::IntTensor, which causes video_grid_thw encoded as UintTensor to be ignored; update the match in fn video_grid_t (and the similar match around the other occurrence at the 244-259 region) to also accept ModelSpecificValue::UintTensor, extract the first element from either IntTensor or UintTensor, and convert it to usize via usize::try_from (or equivalent) so both signed and unsigned tensor representations of video_grid_thw are handled consistently.model_gateway/src/routers/grpc/multimodal.rs (1)
557-603:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReject partial multimodal fetches before prompt expansion.
This path only rejects the all-failed case. If a request contains multiple media parts and one fetch/decode drops out while another survives, preprocessing continues with the surviving items only.
prompt_replacementswill then cover fewer items than the prompt still contains, leaving unresolved placeholder tokens inexpanded_token_idsand sending a mismatched prompt/mm payload downstream. Cache the requested per-modality counts up front and fail when the fetched count is smaller.💡 Suggested change
+ let requested_images = content_parts + .iter() + .filter(|part| matches!(part, MediaContentPart::ImageUrl { .. })) + .count(); + let requested_videos = content_parts + .iter() + .filter(|part| matches!(part, MediaContentPart::VideoUrl { .. })) + .count(); + let mut tracker = AsyncMultiModalTracker::new(components.media_connector.clone()); @@ let modality = match (images.is_empty(), videos.is_empty()) { (false, true) => Modality::Image, (true, false) => Modality::Video, @@ } }; + + match modality { + Modality::Image if images.len() != requested_images => { + anyhow::bail!("Failed to fetch every requested image for multimodal request"); + } + Modality::Video if videos.len() != requested_videos => { + anyhow::bail!("Failed to fetch every requested video for multimodal request"); + } + _ => {} + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@model_gateway/src/routers/grpc/multimodal.rs` around lines 557 - 603, Record the requested per-modality counts before calling tracker.finalize() (e.g., capture the expected number of image and video parts from the incoming multimodal request or tracker initialization), then after building `tracker_output` and extracting `images` and `videos` compare the fetched lengths against those cached expected counts (using the `images.len()` and `videos.len()` vs the stored expected image/video counts); if any fetched count is smaller than requested, return an error (similar to the existing errors) instead of continuing to prompt expansion so you never proceed with partial multimodal fetches that would leave `prompt_replacements`/`expanded_token_ids` mismatched. Ensure you reference `tracker.finalize()`, `tracker_output`, `images`, `videos`, `prompt_replacements`, and `expanded_token_ids` when adding this check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/multimodal/src/registry/llava.rs`:
- Around line 61-71: The code currently uses the first value from
preprocessed.feature_token_counts and replicates it for all images via
PromptReplacement::repeated, relying on debug_assert! to catch mismatches;
change this to a real runtime-safe implementation by either (a) mapping over
preprocessed.feature_token_counts and creating a
PromptReplacement::repeated(Modality::Image, &token, token_id, count) for each
count so each image uses its actual token count, or (b) if divergent counts are
invalid, replace the debug_assert! with a runtime check that returns an Err with
a clear message when any count differs; update the block around
preprocessed.feature_token_counts, PromptReplacement::repeated, token and
token_id accordingly.
In `@grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py`:
- Around line 800-804: In _mm_inputs_from_itemized_proto(), the loop that
collapses per-item placeholder_token_id into im_token_id/video_token_id should
validate consistency instead of letting the last value win; when you see
item_proto.placeholder_token_id, if modality is Modality.IMAGE compare
int(item_proto.placeholder_token_id) against any existing im_token_id and if
im_token_id is unset assign it, otherwise if it differs raise an error (e.g.,
ValueError or appropriate RPC error); do the same for Modality.VIDEO with
video_token_id so conflicting placeholder IDs are rejected rather than silently
overwritten.
---
Outside diff comments:
In `@crates/multimodal/src/media.rs`:
- Around line 572-587: The cast from u32 to usize is unsafe on 32-bit targets;
replace the direct cast in the PNG parsing block (the line using len_bytes ->
len and subsequent cursor arithmetic) with a fallible conversion (e.g.,
usize::try_from(u32::from_be_bytes(len_bytes))) and return a
MediaConnectorError::VideoDecode if the conversion fails or if len exceeds the
available remaining bytes; also use checked arithmetic (checked_add/checked_sub
or validate cursor + 12 + len won't overflow) before advancing cursor to avoid
overflow when computing cursor += 12 + len.
In `@crates/multimodal/src/registry/qwen3_vl.rs`:
- Around line 66-74: The pattern match in video_grid_t currently accepts only
ModelSpecificValue::IntTensor, which causes video_grid_thw encoded as UintTensor
to be ignored; update the match in fn video_grid_t (and the similar match around
the other occurrence at the 244-259 region) to also accept
ModelSpecificValue::UintTensor, extract the first element from either IntTensor
or UintTensor, and convert it to usize via usize::try_from (or equivalent) so
both signed and unsigned tensor representations of video_grid_thw are handled
consistently.
In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 557-603: Record the requested per-modality counts before calling
tracker.finalize() (e.g., capture the expected number of image and video parts
from the incoming multimodal request or tracker initialization), then after
building `tracker_output` and extracting `images` and `videos` compare the
fetched lengths against those cached expected counts (using the `images.len()`
and `videos.len()` vs the stored expected image/video counts); if any fetched
count is smaller than requested, return an error (similar to the existing
errors) instead of continuing to prompt expansion so you never proceed with
partial multimodal fetches that would leave
`prompt_replacements`/`expanded_token_ids` mismatched. Ensure you reference
`tracker.finalize()`, `tracker_output`, `images`, `videos`,
`prompt_replacements`, and `expanded_token_ids` when adding this check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 695ee3ec-6c35-4e27-a461-1f43faa3f2c7
📒 Files selected for processing (34)
.github/workflows/pr-test-rust.ymlcrates/grpc_client/proto/tokenspeed_scheduler.protocrates/grpc_client/src/tokenspeed_scheduler.rscrates/multimodal/benches/image_preprocess.rscrates/multimodal/src/lib.rscrates/multimodal/src/media.rscrates/multimodal/src/registry/kimi_k25.rscrates/multimodal/src/registry/llama4.rscrates/multimodal/src/registry/llava.rscrates/multimodal/src/registry/mod.rscrates/multimodal/src/registry/phi3_v.rscrates/multimodal/src/registry/qwen3_vl.rscrates/multimodal/src/registry/qwen_vl.rscrates/multimodal/src/registry/traits.rscrates/multimodal/src/types.rscrates/multimodal/src/vision/mod.rscrates/multimodal/src/vision/preprocessor_config.rscrates/multimodal/src/vision/processor.rscrates/multimodal/src/vision/processors/kimi_k25.rscrates/multimodal/src/vision/processors/llama4_vision.rscrates/multimodal/src/vision/processors/llava.rscrates/multimodal/src/vision/processors/mod.rscrates/multimodal/src/vision/processors/phi3_vision.rscrates/multimodal/src/vision/processors/phi4_vision.rscrates/multimodal/src/vision/processors/pixtral.rscrates/multimodal/src/vision/processors/qwen2_vl.rscrates/multimodal/src/vision/processors/qwen3_vl.rscrates/multimodal/src/vision/processors/qwen_vl_base.rscrates/multimodal/tests/vision_golden_tests.rsgrpc_servicer/smg_grpc_servicer/tokenspeed/servicer.pymodel_gateway/Cargo.tomlmodel_gateway/src/routers/grpc/multimodal.rsmodel_gateway/src/routers/grpc/proto_wrapper.rsmodel_gateway/src/routers/grpc/utils/chat_utils.rs
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7864781423
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| Ok(TokenSpeedMultimodalData { items }) | ||
| } | ||
|
|
||
| fn tokenspeed_item_count(intermediate: &PrecomputedMultimodalIntermediate) -> Result<usize> { |
There was a problem hiding this comment.
Is this specifically for tokenspeed? or it is universal to other engines?
There was a problem hiding this comment.
The helper is currently only used by the TokenSpeed assembler because TokenSpeed is the backend that needs itemized multimodal payloads here, but the invariant itself is not TokenSpeed-specific. It validates that the precomputed media/token/placeholder counts agree before itemization.
I renamed it to precomputed_multimodal_item_count and made the error messages backend-neutral.
| } | ||
| } | ||
|
|
||
| fn serialize_model_specific_for_item( |
There was a problem hiding this comment.
Should this helper be in multimodal crate instead?
There was a problem hiding this comment.
Good point.The item slicing semantics are not gateway-specific, so I moved that part onto ModelSpecificValue in the multimodal crate via as_flat_sizes() and slice_first_dim().
The gateway now only performs transport serialization after slicing.
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank-nvidia <161688079+yechank-nvidia@users.noreply.github.com>
eded36c to
3ceda86
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
crates/multimodal/src/vision/processors/qwen2_vl.rs (1)
209-214:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
preprocess()drops config-driven resize/token overrides.This delegates straight to
self.inner.preprocess(images, config), butQwenVLProcessorBase::preprocess()reads patch size / merge size / pixel limits fromself.config. SoQwen2VLProcessor::new().preprocess(..., config_with_patch_or_pixel_overrides)still runs with the constructor defaults instead of the request config.Suggested fix
fn preprocess( &self, images: &[DynamicImage], config: &PreProcessorConfig, ) -> Result<PreprocessedEncoderInputs, TransformError> { - self.inner.preprocess(images, config) + let processor = if config.patch_size.is_some() + || config.merge_size.is_some() + || config.min_pixels.is_some() + || config.max_pixels.is_some() + || config.temporal_patch_size.is_some() + { + Self::from_preprocessor_config(config) + } else { + self.clone() + }; + + processor.inner.preprocess(images, config) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/multimodal/src/vision/processors/qwen2_vl.rs` around lines 209 - 214, preprocess currently delegates to self.inner.preprocess(images, config) but QwenVLProcessorBase::preprocess reads from self.config so per-request overrides in the passed config are ignored; fix by applying the incoming PreProcessorConfig overrides onto the inner processor's config before calling preprocess (e.g. clone or make a mutable copy of self.inner, copy/override relevant fields such as patch_size, merge_size, max_pixels/min_pixels or any pixel/resize/token overrides from the incoming config into inner.config), then call that inner's preprocess (or adjust QwenVLProcessorBase::preprocess to honor the passed config) so the request-level config takes effect.crates/multimodal/src/vision/processors/phi4_vision.rs (1)
466-470:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHonor
image_std-only overrides when selecting the processor.This branch only rebuilds from
PreProcessorConfigwhendynamic_hdorimage_meanis set. A request that overrides justimage_stdstill takesself.clone(), so normalization uses the default std and produces the wrongencoder_input.Suggested fix
- let processor = if config.dynamic_hd.is_some() || config.image_mean.is_some() { + let processor = if config.dynamic_hd.is_some() + || config.image_mean.is_some() + || config.image_std.is_some() + { Self::from_preprocessor_config(config) } else { self.clone() };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/multimodal/src/vision/processors/phi4_vision.rs` around lines 466 - 470, The processor selection branch doesn't account for image_std-only overrides, so when only config.image_std is set it wrongly uses self.clone(); update the conditional that builds processor (the let processor = ... block) to also check config.image_std (i.e., rebuild via Self::from_preprocessor_config when config.dynamic_hd.is_some() || config.image_mean.is_some() || config.image_std.is_some()) so normalization uses the overridden std instead of the clone's defaults.model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs (1)
79-111:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid hard-coding image modality in placeholder resolution.
resolve_placeholder_tokenis modality-specific, but Line 110 always passesModality::Image. For video message requests, this can resolve the wrong placeholder token and break downstream token-expansion matching. Align this with the chat preparation flow by deriving request modality from message parts (and rejecting mixed modalities if still unsupported) before callingresolve_placeholder_token.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs` around lines 79 - 111, The code currently always passes Modality::Image to multimodal::resolve_placeholder_token in MessagePreparationStage::execute which breaks non-image multimodal requests; update the logic to derive the modality from the incoming request.messages (inspect each message part to determine Modality enum values), ensure you reject mixed-modal requests if unsupported, and pass the determined modality to multimodal::resolve_placeholder_token instead of Modality::Image; use multimodal::has_multimodal_content_messages, ctx.components.multimodal, and the tokenizer lookup (tokenizer_registry/get_by_name) as-is and only change the modality selection and validation before calling resolve_placeholder_token.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/multimodal/src/media.rs`:
- Around line 214-216: The code decodes base64 blobs and fully buffers them into
memory via BASE64_STANDARD.decode and then calls self.decode_video(...,
VideoSource::DataUrl) which can OOM on large inputs; before decoding or before
creating a VideoClip, enforce a maximum allowed payload size (e.g. reject or
truncate if decoded length > MAX_VIDEO_BYTES) by first checking the base64
length/estimate or streaming/limiting the decode, return an error on oversize
inputs, and document MAX_VIDEO_BYTES; apply the same bounding check to the other
call sites that create VideoClip (the calls into self.decode_video at the other
referenced locations) so no code path retains unbounded byte buffers.
In `@crates/multimodal/src/registry/qwen3_vl.rs`:
- Around line 365-400: The test qwen3_5_video_replacement_splits_temporal_grid
currently relies on TestTokenizer which returns empty IDs from
TestTokenizer::encode so timestamp tokens are never asserted; replace or extend
the tokenizer used in the test with a tokenizer test-double that maps timestamp
strings (e.g. "<0.0 seconds>", "<1.0 seconds>") to deterministic, unique token
IDs so encoding of timestamp text actually produces IDs, then update the
assertions on replacements (obtained via
ModelRegistry::lookup(...).prompt_replacements_for(...)) to check for those
timestamp token IDs at the expected temporal segment boundaries (use the same
test helper test_preprocessed_with_tokens and ModelSpecificValue::int_2d to
produce segments). Ensure the tokenizer double is referenced where
metadata.tokenizer is constructed so prompt_replacements_for sees the real
encoded timestamp IDs.
In `@crates/multimodal/src/vision/processors/llama4_vision.rs`:
- Around line 453-465: The code pushes image dimensions into item_sizes as
(height, width), which is inconsistent with other processors that use (width,
height); change the push in the loop that uses
processor.process_single_image(...) so item_sizes.push((image.width(),
image.height())) to match the ordering expected by PreprocessedEncoderInputs and
avoid flipping portrait/landscape metadata for Llama4Vision.
In `@crates/multimodal/src/vision/processors/qwen3_vl.rs`:
- Around line 124-131: from_preprocessor_config() maps
size.shortest_edge/longest_edge into min_pixels/max_pixels but preprocess() and
preprocess_video() still delegate to self.inner.*, so the new pixel limits are
never used; update the constructor/path so the inner processor uses the mapped
config — either initialize self.inner with the modified config (ensure
Qwen3VLProcessor::new or from_preprocessor_config creates the inner
QwenVLProcessorBase using the mapped min_pixels/max_pixels) or override
preprocess()/preprocess_video() in Qwen3VLProcessor to call the base
implementation that reads self.config (QwenVLProcessorBase) so the size-edge
fallback takes effect. Ensure the change is mirrored for the other occurrence
around lines 216-230.
In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 605-608: The code currently rejects multi-video requests in the
check using Modality::Video and by only preprocessing
videos_for_preprocess.first(), which prevents the new per-item contract from
working; remove or relax the single-video guard (the if checking modality ==
Modality::Video && videos.len() != 1) and change the preprocessing logic that
uses videos_for_preprocess.first() to iterate over all entries in
videos_for_preprocess (perform preprocessing per item and collect results),
ensuring downstream TokenSpeed assembly stays item-oriented and consumes the
per-video processed outputs; also update any error messages to reflect per-item
failures rather than a single-video-only error.
---
Outside diff comments:
In `@crates/multimodal/src/vision/processors/phi4_vision.rs`:
- Around line 466-470: The processor selection branch doesn't account for
image_std-only overrides, so when only config.image_std is set it wrongly uses
self.clone(); update the conditional that builds processor (the let processor =
... block) to also check config.image_std (i.e., rebuild via
Self::from_preprocessor_config when config.dynamic_hd.is_some() ||
config.image_mean.is_some() || config.image_std.is_some()) so normalization uses
the overridden std instead of the clone's defaults.
In `@crates/multimodal/src/vision/processors/qwen2_vl.rs`:
- Around line 209-214: preprocess currently delegates to
self.inner.preprocess(images, config) but QwenVLProcessorBase::preprocess reads
from self.config so per-request overrides in the passed config are ignored; fix
by applying the incoming PreProcessorConfig overrides onto the inner processor's
config before calling preprocess (e.g. clone or make a mutable copy of
self.inner, copy/override relevant fields such as patch_size, merge_size,
max_pixels/min_pixels or any pixel/resize/token overrides from the incoming
config into inner.config), then call that inner's preprocess (or adjust
QwenVLProcessorBase::preprocess to honor the passed config) so the request-level
config takes effect.
In `@model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs`:
- Around line 79-111: The code currently always passes Modality::Image to
multimodal::resolve_placeholder_token in MessagePreparationStage::execute which
breaks non-image multimodal requests; update the logic to derive the modality
from the incoming request.messages (inspect each message part to determine
Modality enum values), ensure you reject mixed-modal requests if unsupported,
and pass the determined modality to multimodal::resolve_placeholder_token
instead of Modality::Image; use multimodal::has_multimodal_content_messages,
ctx.components.multimodal, and the tokenizer lookup
(tokenizer_registry/get_by_name) as-is and only change the modality selection
and validation before calling resolve_placeholder_token.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a453fcdf-941a-409f-9be4-b2e6fe6f3b7f
📒 Files selected for processing (44)
.github/workflows/pr-test-rust.ymlcrates/grpc_client/proto/tokenspeed_scheduler.protocrates/grpc_client/src/tokenspeed_scheduler.rscrates/multimodal/Cargo.tomlcrates/multimodal/benches/image_preprocess.rscrates/multimodal/src/error.rscrates/multimodal/src/hasher.rscrates/multimodal/src/hub.rscrates/multimodal/src/lib.rscrates/multimodal/src/media.rscrates/multimodal/src/registry/kimi_k25.rscrates/multimodal/src/registry/llama4.rscrates/multimodal/src/registry/llava.rscrates/multimodal/src/registry/mod.rscrates/multimodal/src/registry/phi3_v.rscrates/multimodal/src/registry/qwen3_vl.rscrates/multimodal/src/registry/qwen_vl.rscrates/multimodal/src/registry/traits.rscrates/multimodal/src/tracker.rscrates/multimodal/src/types.rscrates/multimodal/src/vision/mod.rscrates/multimodal/src/vision/preprocessor_config.rscrates/multimodal/src/vision/processor.rscrates/multimodal/src/vision/processors/kimi_k25.rscrates/multimodal/src/vision/processors/llama4_vision.rscrates/multimodal/src/vision/processors/llava.rscrates/multimodal/src/vision/processors/mod.rscrates/multimodal/src/vision/processors/phi3_vision.rscrates/multimodal/src/vision/processors/phi4_vision.rscrates/multimodal/src/vision/processors/pixtral.rscrates/multimodal/src/vision/processors/qwen2_vl.rscrates/multimodal/src/vision/processors/qwen3_vl.rscrates/multimodal/src/vision/processors/qwen_vl_base.rscrates/multimodal/tests/vision_golden_tests.rsgrpc_servicer/smg_grpc_servicer/tokenspeed/servicer.pymodel_gateway/Cargo.tomlmodel_gateway/src/routers/grpc/multimodal.rsmodel_gateway/src/routers/grpc/proto_wrapper.rsmodel_gateway/src/routers/grpc/regular/stages/chat/preparation.rsmodel_gateway/src/routers/grpc/regular/stages/chat/request_building.rsmodel_gateway/src/routers/grpc/regular/stages/messages/preparation.rsmodel_gateway/src/routers/grpc/regular/stages/messages/request_building.rsmodel_gateway/src/routers/grpc/utils/chat_utils.rsmodel_gateway/src/workflow/tokenizer_registration.rs
| /// Get the batch size. | ||
| pub fn batch_size(&self) -> usize { | ||
| self.pixel_values.shape()[0] | ||
| self.encoder_input.shape()[0] | ||
| } |
There was a problem hiding this comment.
batch_size() no longer matches the generalized encoder-input contract.
After this refactor, some processors flatten encoder patches/tiles into axis 0 instead of keeping “one item per batch row”. For example, KimiK25Processor now emits [total_patches, 3, patch, patch] and Llama4VisionProcessor emits [total_tiles, 3, H, W], so batch_size() returns the patch/tile count there instead of the number of media items. Any generic caller using batch_size() will misalign metadata or placeholder expansion on those models.
Suggested fix
/// Get the batch size.
pub fn batch_size(&self) -> usize {
- self.encoder_input.shape()[0]
+ self.item_sizes.len()
}Also applies to: 342-345
| let mut all_outputs = Vec::new(); | ||
| let mut all_aspect_ratios = Vec::new(); | ||
| let mut image_sizes = Vec::new(); | ||
| let mut num_img_tokens = Vec::new(); | ||
| let mut item_sizes = Vec::new(); | ||
| let mut feature_token_counts = Vec::new(); | ||
|
|
||
| for image in images { | ||
| let (output, aspect_ratio) = processor.process_single_image(image); | ||
| let tokens = processor.calculate_num_tokens_for_aspect_ratio(aspect_ratio); | ||
|
|
||
| all_outputs.push(output); | ||
| all_aspect_ratios.push(aspect_ratio); | ||
| image_sizes.push((image.height(), image.width())); | ||
| num_img_tokens.push(tokens); | ||
| item_sizes.push((image.height(), image.width())); | ||
| feature_token_counts.push(tokens); |
There was a problem hiding this comment.
Keep item_sizes in the same tuple order as the other processors.
This path pushes (height, width), but the other processors in this cohort populate item_sizes from DynamicImage::dimensions(), i.e. (width, height). Because PreprocessedEncoderInputs.item_sizes is shared across processors and carries no per-model ordering flag, this will flip portrait/landscape metadata for generic consumers on LLaMA 4 only.
Suggested fix
- item_sizes.push((image.height(), image.width()));
+ item_sizes.push(image.dimensions());🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/multimodal/src/vision/processors/llama4_vision.rs` around lines 453 -
465, The code pushes image dimensions into item_sizes as (height, width), which
is inconsistent with other processors that use (width, height); change the push
in the loop that uses processor.process_single_image(...) so
item_sizes.push((image.width(), image.height())) to match the ordering expected
by PreprocessedEncoderInputs and avoid flipping portrait/landscape metadata for
Llama4Vision.
| min_pixels: config | ||
| .min_pixels | ||
| .or_else(|| config.get_shortest_edge()) | ||
| .unwrap_or(DEFAULT_MIN_PIXELS), | ||
| max_pixels: config | ||
| .max_pixels | ||
| .or_else(|| config.get_longest_edge()) | ||
| .unwrap_or(DEFAULT_MAX_PIXELS), |
There was a problem hiding this comment.
The new size-edge fallback never reaches preprocess() / preprocess_video().
from_preprocessor_config() now maps size.shortest_edge / size.longest_edge into min_pixels / max_pixels, but both trait entrypoints still forward to self.inner.*. Since QwenVLProcessorBase computes image/video resize limits from self.config, Qwen3VLProcessor::new().preprocess_video(..., config_with_size) still uses the default budgets instead of the values you just loaded here.
Suggested fix
fn preprocess(
&self,
images: &[DynamicImage],
config: &PreProcessorConfig,
) -> Result<PreprocessedEncoderInputs, TransformError> {
- self.inner.preprocess(images, config)
+ let processor = if config.patch_size.is_some()
+ || config.merge_size.is_some()
+ || config.min_pixels.is_some()
+ || config.max_pixels.is_some()
+ || config.temporal_patch_size.is_some()
+ || config.size.is_some()
+ {
+ Self::from_preprocessor_config(config)
+ } else {
+ self.clone()
+ };
+
+ processor.inner.preprocess(images, config)
}
fn preprocess_video(
&self,
frames: &[DynamicImage],
config: &PreProcessorConfig,
) -> Result<PreprocessedEncoderInputs, TransformError> {
- self.inner.preprocess_video(frames, config)
+ let processor = if config.patch_size.is_some()
+ || config.merge_size.is_some()
+ || config.min_pixels.is_some()
+ || config.max_pixels.is_some()
+ || config.temporal_patch_size.is_some()
+ || config.size.is_some()
+ {
+ Self::from_preprocessor_config(config)
+ } else {
+ self.clone()
+ };
+
+ processor.inner.preprocess_video(frames, config)
}Also applies to: 216-230
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/multimodal/src/vision/processors/qwen3_vl.rs` around lines 124 - 131,
from_preprocessor_config() maps size.shortest_edge/longest_edge into
min_pixels/max_pixels but preprocess() and preprocess_video() still delegate to
self.inner.*, so the new pixel limits are never used; update the
constructor/path so the inner processor uses the mapped config — either
initialize self.inner with the modified config (ensure Qwen3VLProcessor::new or
from_preprocessor_config creates the inner QwenVLProcessorBase using the mapped
min_pixels/max_pixels) or override preprocess()/preprocess_video() in
Qwen3VLProcessor to call the base implementation that reads self.config
(QwenVLProcessorBase) so the size-edge fallback takes effect. Ensure the change
is mirrored for the other occurrence around lines 216-230.
| if modality == Modality::Video && videos.len() != 1 { | ||
| return Err(anyhow::anyhow!( | ||
| "No images were successfully fetched for multimodal request" | ||
| "Exactly one video is supported per request for the initial video path" | ||
| )); |
There was a problem hiding this comment.
The video path is still single-item despite the new itemized contract.
Lines 605-608 reject any request with more than one video, and Lines 684-689 only ever preprocess videos_for_preprocess.first(). That means multi-video prompts still fail here even though the downstream TokenSpeed assembly in this file is already item-oriented.
Also applies to: 684-689
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@model_gateway/src/routers/grpc/multimodal.rs` around lines 605 - 608, The
code currently rejects multi-video requests in the check using Modality::Video
and by only preprocessing videos_for_preprocess.first(), which prevents the new
per-item contract from working; remove or relax the single-video guard (the if
checking modality == Modality::Video && videos.len() != 1) and change the
preprocessing logic that uses videos_for_preprocess.first() to iterate over all
entries in videos_for_preprocess (perform preprocessing per item and collect
results), ensuring downstream TokenSpeed assembly stays item-oriented and
consumes the per-video processed outputs; also update any error messages to
reflect per-item failures rather than a single-video-only error.
|
Hi @yechank-nvidia, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch: git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease |
Description
Problem
SMG multimodal support was still structured around an image-only payload shape. This makes it hard to extend the existing precomputed multimodal path to additional modalities such as video and audio without adding more image-specific fields and special cases.
The gateway also needed a cleaner foundation for passing preprocessed multimodal encoder inputs, model-specific metadata, placeholder ranges, and modality information to TokenSpeed.
Solution
This PR generalizes the TokenSpeed multimodal ABI from a single image payload into itemized multimodal inputs. Each multimodal item now carries its own modality, encoder input tensor, model-specific side tensors, placeholder metadata, placeholder token id, and content hash.
This keeps the current precomputed multimodal design, but removes the image-only payload assumption so later PRs can add video and transport optimizations on top of the same structure.
Changes
Generalized TokenSpeed MultimodalInputs to contain repeated MultimodalItems.
Added modality metadata for image, video, and audio.
Replaced image-specific TokenSpeed fields with per-item:
Updated SMG TokenSpeed proto conversion to emit itemized multimodal payloads.
Updated the TokenSpeed Python servicer to consume the itemized multimodal payload.
Preserved existing image behavior on top of the new generalized payload structure.
Added model/tokenizer config loading support needed for multimodal preprocessor metadata.
Avoided warning spam when processor_config.json exists but does not contain video_processor.
Added regression coverage for video_processor config loading behavior.
Test Plan
Verified the stacked branch diff against the original implementation branch.
Verified whitespace/diff hygiene:
git diff --check
Checklist
Summary by CodeRabbit
New Features
Bug Fixes