feat(mllm): extract audio track from video inputs by miguel-flowstate · Pull Request #352 · waybarrios/vllm-mlx

miguel-flowstate · 2026-04-15T18:04:43Z

Summary

When a video_url is provided, automatically extract the audio track via ffmpeg and pass it to the model alongside the video frames
Enables models to describe both visual and audio content from a single video input
Videos without audio tracks are handled gracefully (extraction returns None, no error)

How it works

After extracting video frames in the fallback path, calls _extract_audio_from_video() on the video file
Uses ffmpeg to convert the audio track to 16kHz mono WAV (format expected by audio feature extractors)
Appends the extracted audio path to all_audio_urls, which gets passed to mlx_vlm.generate(audio=...)
Skips base64 data: video inputs (no local file to extract from)

Dependencies

Depends on #351 (audio_url support in chat()) — this PR is stacked on top of that branch.

Test plan

Big Buck Bunny video with JFK speech audio baked in via ffmpeg
Model (Gemma 4 E2B) correctly described the forest scene visually AND transcribed "Ask not what your country can do for you. Ask what you can do for your country" from the audio track
Token count increased from 1636 (frames only) to 1736+ (frames + extracted audio)
Test with video that has no audio track (should work without error)

The MLLM.chat() method parsed image_url and video_url content parts but silently dropped audio_url. This meant audio data was never passed to the model even when the underlying mlx_vlm.generate() fully supports it. Changes: - Extract audio_url content parts alongside image_url in message parsing - Insert {"type": "audio"} tokens into the chat template content list - Pass collected audio paths to mlx_vlm.generate() via the audio kwarg Tested with Gemma 4 E4B and a WAV file: the model now receives audio features (input_features from the processor) and produces correct transcriptions through the /v1/chat/completions endpoint.

When a video_url is provided, extract the audio track via ffmpeg and include it alongside the video frames so the model can describe both visual and audio content from a single video input. The extraction converts audio to 16kHz mono WAV, the format expected by audio feature extractors (e.g. Gemma4AudioFeatureExtractor). Videos without an audio track are handled gracefully. Depends on: waybarrios#351 (audio_url support in chat)

…ideo flag Audio extraction from video is now opt-in rather than automatic: - Add --extract-audio-from-video CLI flag to enable server-wide default - Add extract_audio_from_video field to ChatCompletionRequest for per-request control (overrides server default) - Default is false so video-only models are unaffected Usage: vllm-mlx serve model --extract-audio-from-video # or per-request: {"extract_audio_from_video": true, ...}

janhilgard

Review: feat(mllm): extract audio track from video inputs

Overall: Good feature with clean opt-in design. A few issues to address before merge.

Strengths

Opt-in design is correct -- --extract-audio-from-video CLI flag + per-request override. No unexpected behavior for existing users.
Graceful degradation -- videos without audio tracks or missing ffmpeg return None silently.
30-second timeout on ffmpeg subprocess prevents hangs.
WAV header size check (>44 bytes) to detect empty audio extraction is a nice touch.

Issues

1. Temp file leak (important)

_extract_audio_from_video() creates NamedTemporaryFile(delete=False) and returns the path on success. But there is no cleanup after mlx_vlm.generate() finishes processing the audio. These WAV files will accumulate in /tmp/ indefinitely.

Suggestion: collect extracted audio paths and clean them up after generation:

# After generate() completes:
for path in extracted_audio_paths:
    try:
        os.unlink(path)
    except OSError:
        pass

2. `msg_audio_count` bug (inherited from #351)

Same bug as PR #351: msg_audio_count = len(all_audio_urls) uses accumulated total across all messages, not per-message count. See my review on #351 for the fix.

3. Duplicate `import os`

Two import os statements inside _extract_audio_from_video() -- one after the size check (success path) and one before os.unlink (failure path). Minor cleanup.

4. No CI checks have run

The statusCheckRollup is empty for this PR. This needs to be resolved before merge.

5. Security: ffmpeg with user-supplied paths

video_path from user input is passed to ffmpeg via subprocess.run() (list form, no shell=True -- good). However, ffmpeg supports network protocols (rtsp://, http://, etc.), so a user could trigger ffmpeg to fetch arbitrary URLs. The not video_path.startswith("data:") guard is good but doesn't cover this case.

Consider: only allow local file paths (already downloaded by the video frame extraction step) rather than raw user URLs.

Dependency note

This PR is stacked on #351. Please fix the msg_audio_count bug in #351 first, then rebase this one.

Thump604 · 2026-04-29T22:32:16Z

Hi @miguel-flowstate -- this PR has been open since April 15 with no activity and likely has merge conflicts with current main. Are you still planning to work on this? If so, a rebase and update would be great. If not, we can close it and you're welcome to refile when ready. Will check back in two weeks.

miguel-flowstate added 3 commits April 15, 2026 10:25

janhilgard reviewed Apr 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mllm): extract audio track from video inputs#352

feat(mllm): extract audio track from video inputs#352
miguel-flowstate wants to merge 3 commits intowaybarrios:mainfrom
miguel-flowstate:feat/video-audio-extraction

miguel-flowstate commented Apr 15, 2026

Uh oh!

janhilgard left a comment

Uh oh!

Thump604 commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

miguel-flowstate commented Apr 15, 2026

Summary

How it works

Dependencies

Test plan

Uh oh!

janhilgard left a comment

Choose a reason for hiding this comment

Review: feat(mllm): extract audio track from video inputs

Strengths

Issues

1. Temp file leak (important)

2. msg_audio_count bug (inherited from #351)

3. Duplicate import os

4. No CI checks have run

5. Security: ffmpeg with user-supplied paths

Dependency note

Uh oh!

Thump604 commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

2. `msg_audio_count` bug (inherited from #351)

3. Duplicate `import os`