Skip to content

feat(mllm): extract audio track from video inputs#352

Open
miguel-flowstate wants to merge 3 commits intowaybarrios:mainfrom
miguel-flowstate:feat/video-audio-extraction
Open

feat(mllm): extract audio track from video inputs#352
miguel-flowstate wants to merge 3 commits intowaybarrios:mainfrom
miguel-flowstate:feat/video-audio-extraction

Conversation

@miguel-flowstate
Copy link
Copy Markdown
Contributor

Summary

  • When a video_url is provided, automatically extract the audio track via ffmpeg and pass it to the model alongside the video frames
  • Enables models to describe both visual and audio content from a single video input
  • Videos without audio tracks are handled gracefully (extraction returns None, no error)

How it works

  1. After extracting video frames in the fallback path, calls _extract_audio_from_video() on the video file
  2. Uses ffmpeg to convert the audio track to 16kHz mono WAV (format expected by audio feature extractors)
  3. Appends the extracted audio path to all_audio_urls, which gets passed to mlx_vlm.generate(audio=...)
  4. Skips base64 data: video inputs (no local file to extract from)

Dependencies

Depends on #351 (audio_url support in chat()) — this PR is stacked on top of that branch.

Test plan

  • Big Buck Bunny video with JFK speech audio baked in via ffmpeg
  • Model (Gemma 4 E2B) correctly described the forest scene visually AND transcribed "Ask not what your country can do for you. Ask what you can do for your country" from the audio track
  • Token count increased from 1636 (frames only) to 1736+ (frames + extracted audio)
  • Test with video that has no audio track (should work without error)

The MLLM.chat() method parsed image_url and video_url content parts
but silently dropped audio_url. This meant audio data was never passed
to the model even when the underlying mlx_vlm.generate() fully
supports it.

Changes:
- Extract audio_url content parts alongside image_url in message parsing
- Insert {"type": "audio"} tokens into the chat template content list
- Pass collected audio paths to mlx_vlm.generate() via the audio kwarg

Tested with Gemma 4 E4B and a WAV file: the model now receives audio
features (input_features from the processor) and produces correct
transcriptions through the /v1/chat/completions endpoint.
When a video_url is provided, extract the audio track via ffmpeg and
include it alongside the video frames so the model can describe both
visual and audio content from a single video input.

The extraction converts audio to 16kHz mono WAV, the format expected
by audio feature extractors (e.g. Gemma4AudioFeatureExtractor).
Videos without an audio track are handled gracefully.

Depends on: waybarrios#351 (audio_url support in chat)
…ideo flag

Audio extraction from video is now opt-in rather than automatic:

- Add --extract-audio-from-video CLI flag to enable server-wide default
- Add extract_audio_from_video field to ChatCompletionRequest for
  per-request control (overrides server default)
- Default is false so video-only models are unaffected

Usage:
  vllm-mlx serve model --extract-audio-from-video
  # or per-request: {"extract_audio_from_video": true, ...}
Copy link
Copy Markdown
Collaborator

@janhilgard janhilgard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: feat(mllm): extract audio track from video inputs

Overall: Good feature with clean opt-in design. A few issues to address before merge.

Strengths

  1. Opt-in design is correct -- --extract-audio-from-video CLI flag + per-request override. No unexpected behavior for existing users.
  2. Graceful degradation -- videos without audio tracks or missing ffmpeg return None silently.
  3. 30-second timeout on ffmpeg subprocess prevents hangs.
  4. WAV header size check (>44 bytes) to detect empty audio extraction is a nice touch.

Issues

1. Temp file leak (important)

_extract_audio_from_video() creates NamedTemporaryFile(delete=False) and returns the path on success. But there is no cleanup after mlx_vlm.generate() finishes processing the audio. These WAV files will accumulate in /tmp/ indefinitely.

Suggestion: collect extracted audio paths and clean them up after generation:

# After generate() completes:
for path in extracted_audio_paths:
    try:
        os.unlink(path)
    except OSError:
        pass

2. msg_audio_count bug (inherited from #351)

Same bug as PR #351: msg_audio_count = len(all_audio_urls) uses accumulated total across all messages, not per-message count. See my review on #351 for the fix.

3. Duplicate import os

Two import os statements inside _extract_audio_from_video() -- one after the size check (success path) and one before os.unlink (failure path). Minor cleanup.

4. No CI checks have run

The statusCheckRollup is empty for this PR. This needs to be resolved before merge.

5. Security: ffmpeg with user-supplied paths

video_path from user input is passed to ffmpeg via subprocess.run() (list form, no shell=True -- good). However, ffmpeg supports network protocols (rtsp://, http://, etc.), so a user could trigger ffmpeg to fetch arbitrary URLs. The not video_path.startswith("data:") guard is good but doesn't cover this case.

Consider: only allow local file paths (already downloaded by the video frame extraction step) rather than raw user URLs.

Dependency note

This PR is stacked on #351. Please fix the msg_audio_count bug in #351 first, then rebase this one.

@Thump604
Copy link
Copy Markdown
Collaborator

Hi @miguel-flowstate -- this PR has been open since April 15 with no activity and likely has merge conflicts with current main. Are you still planning to work on this? If so, a rebase and update would be great. If not, we can close it and you're welcome to refile when ready. Will check back in two weeks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants