feat(mllm): extract audio track from video inputs#352
feat(mllm): extract audio track from video inputs#352miguel-flowstate wants to merge 3 commits intowaybarrios:mainfrom
Conversation
The MLLM.chat() method parsed image_url and video_url content parts
but silently dropped audio_url. This meant audio data was never passed
to the model even when the underlying mlx_vlm.generate() fully
supports it.
Changes:
- Extract audio_url content parts alongside image_url in message parsing
- Insert {"type": "audio"} tokens into the chat template content list
- Pass collected audio paths to mlx_vlm.generate() via the audio kwarg
Tested with Gemma 4 E4B and a WAV file: the model now receives audio
features (input_features from the processor) and produces correct
transcriptions through the /v1/chat/completions endpoint.
When a video_url is provided, extract the audio track via ffmpeg and include it alongside the video frames so the model can describe both visual and audio content from a single video input. The extraction converts audio to 16kHz mono WAV, the format expected by audio feature extractors (e.g. Gemma4AudioFeatureExtractor). Videos without an audio track are handled gracefully. Depends on: waybarrios#351 (audio_url support in chat)
…ideo flag
Audio extraction from video is now opt-in rather than automatic:
- Add --extract-audio-from-video CLI flag to enable server-wide default
- Add extract_audio_from_video field to ChatCompletionRequest for
per-request control (overrides server default)
- Default is false so video-only models are unaffected
Usage:
vllm-mlx serve model --extract-audio-from-video
# or per-request: {"extract_audio_from_video": true, ...}
janhilgard
left a comment
There was a problem hiding this comment.
Review: feat(mllm): extract audio track from video inputs
Overall: Good feature with clean opt-in design. A few issues to address before merge.
Strengths
- Opt-in design is correct --
--extract-audio-from-videoCLI flag + per-request override. No unexpected behavior for existing users. - Graceful degradation -- videos without audio tracks or missing ffmpeg return
Nonesilently. - 30-second timeout on ffmpeg subprocess prevents hangs.
- WAV header size check (>44 bytes) to detect empty audio extraction is a nice touch.
Issues
1. Temp file leak (important)
_extract_audio_from_video() creates NamedTemporaryFile(delete=False) and returns the path on success. But there is no cleanup after mlx_vlm.generate() finishes processing the audio. These WAV files will accumulate in /tmp/ indefinitely.
Suggestion: collect extracted audio paths and clean them up after generation:
# After generate() completes:
for path in extracted_audio_paths:
try:
os.unlink(path)
except OSError:
pass2. msg_audio_count bug (inherited from #351)
Same bug as PR #351: msg_audio_count = len(all_audio_urls) uses accumulated total across all messages, not per-message count. See my review on #351 for the fix.
3. Duplicate import os
Two import os statements inside _extract_audio_from_video() -- one after the size check (success path) and one before os.unlink (failure path). Minor cleanup.
4. No CI checks have run
The statusCheckRollup is empty for this PR. This needs to be resolved before merge.
5. Security: ffmpeg with user-supplied paths
video_path from user input is passed to ffmpeg via subprocess.run() (list form, no shell=True -- good). However, ffmpeg supports network protocols (rtsp://, http://, etc.), so a user could trigger ffmpeg to fetch arbitrary URLs. The not video_path.startswith("data:") guard is good but doesn't cover this case.
Consider: only allow local file paths (already downloaded by the video frame extraction step) rather than raw user URLs.
Dependency note
This PR is stacked on #351. Please fix the msg_audio_count bug in #351 first, then rebase this one.
|
Hi @miguel-flowstate -- this PR has been open since April 15 with no activity and likely has merge conflicts with current main. Are you still planning to work on this? If so, a rebase and update would be great. If not, we can close it and you're welcome to refile when ready. Will check back in two weeks. |
Summary
video_urlis provided, automatically extract the audio track via ffmpeg and pass it to the model alongside the video framesNone, no error)How it works
_extract_audio_from_video()on the video fileall_audio_urls, which gets passed tomlx_vlm.generate(audio=...)data:video inputs (no local file to extract from)Dependencies
Depends on #351 (
audio_urlsupport inchat()) — this PR is stacked on top of that branch.Test plan