Skip to content

feat: add dspy.Video type adapter#9443

Open
ilia-iliev wants to merge 5 commits intostanfordnlp:mainfrom
ilia-iliev:main
Open

feat: add dspy.Video type adapter#9443
ilia-iliev wants to merge 5 commits intostanfordnlp:mainfrom
ilia-iliev:main

Conversation

@ilia-iliev
Copy link
Copy Markdown

@ilia-iliev ilia-iliev commented Mar 13, 2026

Summary

Adds a Video type to DSPy's adapter system, following the same pattern as the existing types.

Relevant issues #8507 #7847

Currently only Gemini works with video, so the PR is google-specific. Video adapter can be extended in the future when more providers allow for video.

I've been using similar adapter for video analysis for the past few months.

Supports passing video inline or referencing an already uploaded video to Google's Files API (their recommended approach)

Example:

import dspy
dspy.configure(lm=dspy.LM("gemini/gemini-3-flash-preview"))

class Describe(dspy.Signature):
    video: dspy.Video = dspy.InputField()
    description: str = dspy.OutputField()

program = dspy.Predict(Describe)
result = program(video=dspy.Video.from_file("video.mp4"))

@ilia-iliev ilia-iliev changed the title feat: add dspy.Video type for Gemini video inputs feat: add dspy.Video type adapter Mar 13, 2026
Copy link
Copy Markdown

@alvinttang alvinttang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Add dspy.Video type adapter

Clean implementation that follows the existing Audio and Image patterns well. Tests are thorough. A few issues worth addressing:

1. No file size guard for inline encoding

_encode_video_from_file reads the entire file into memory:

with open(file_path, "rb") as f:
    file_bytes = f.read()

For video files this is particularly dangerous — a 1 GB video will consume 1 GB of RAM plus ~1.33 GB for the base64 string. The docstring mentions "suitable for videos under 20 MB" but there is no enforcement. Consider adding a size check:

file_size = os.path.getsize(file_path)
if file_size > MAX_INLINE_SIZE:
    raise ValueError(f"File {file_path} is {file_size} bytes, exceeding the {MAX_INLINE_SIZE} byte limit for inline encoding. Use Video.from_url() with Google's Files API instead.")

2. The format() output structure may not match litellm's expected schema

The format method returns:

{"type": "file", "file": {"format": ..., "file_data": ...}}

But looking at litellm's Gemini integration and the OpenAI multimodal spec, video content parts typically use either:

  • {"type": "video_url", "video_url": {"url": ...}} (OpenAI-style), or
  • {"type": "image_url", ...} with a data URI (litellm converts these)

Has this been tested end-to-end with litellm/Gemini? The "type": "file" format may not be recognized by the chat adapter's message formatting. I'd suggest adding an integration test note or verifying against the actual litellm format.

3. String-based file path detection is fragile

In validate_input:

if isinstance(values, str):
    if os.path.isfile(values):
        return _encode_video_from_file(values)
    return {"url": values}

This means if a file happens to exist at a path that looks like a URL (e.g., someone has a file named https: in their cwd), it would be treated as a file. More critically, os.path.isfile() performs a filesystem call on every string input, including URLs. This is a side effect during validation that could be slow on network-mounted filesystems. Consider checking for URL schemes first:

if values.startswith(("http://", "https://", "gs://")):
    return {"url": values}
if os.path.isfile(values):
    return _encode_video_from_file(values)

4. validate_source allows both None

The validator:

if has_url == has_inline_data:
    raise ValueError("Exactly one of 'url' or 'inline_data' must be provided.")

This correctly rejects both-set and neither-set cases. But the error message for neither-set ("Exactly one of...") could be clearer — when a user passes Video(mime_type="video/mp4") they get this message, which doesn't tell them what to do. The test test_neither_url_nor_inline_data_raises expects match="url.*inline_data" which passes, but a user-facing message like "You must provide either 'url' or 'inline_data'" would be friendlier.

5. Missing __eq__ and __hash__

The model is frozen=True which makes it hashable via pydantic's default, but equality comparison uses pydantic's default which compares all fields. For inline videos with large base64 strings, this means equality checks will compare megabytes of string data. If Videos are ever used in sets or as dict keys (e.g., for caching), consider implementing __hash__ based on a content digest.

6. Test imports coupling

from tests.signatures.test_adapter_file import count_messages_with_file_pattern

Importing from another test file creates a fragile coupling. If test_adapter_file.py is refactored or the function is renamed, video tests break. Consider moving count_messages_with_file_pattern to a shared test utility module.

Overall, solid work. The main risk is point 1 (unbounded memory usage for large videos) and point 2 (format compatibility with actual LLM providers).

@ilia-iliev
Copy link
Copy Markdown
Author

Thanks, those are sound comments. Will address them

1. Added max size for inline
2. Not doing filesystem call for strings that look like URL
3. Improved error message
4. Designated count_messages_with_file_pattern to own module
@ilia-iliev
Copy link
Copy Markdown
Author

ilia-iliev commented Mar 16, 2026

2 Integration tests confirmed the formatting - both inline and Google Files API is handled correctly with {"type": "file", "file": {...}}

5 Didn't implement custom __hash__ - looks like it is already implemented by default in dspy/clients/cache.py:78-100

Addressed all other comments. Thanks!

Copy link
Copy Markdown

@alvinttang alvinttang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured PR — the Video type follows the established patterns from Image, Audio, and File closely. Thorough test coverage too. Some observations:

1. The format() method output structure

The method returns [{"type": "file", "file": {"file_data": "data:..."}}] for inline data and [{"type": "file", "file": {"file_id": url, "format": mime_type}}] for URL references. Two questions:

  • Is file_id the correct key for litellm's Gemini integration? The litellm docs for Gemini file references typically expect file_uri (matching the Gemini API's fileData.fileUri field). If file_id is used, litellm may not route it correctly to the Gemini Files API.
  • For inline data, using file_data with a data URI works for litellm, but it's worth verifying that videos encoded this way actually work end-to-end with Gemini, since the 20MB base64 limit maps to ~15MB of raw video (base64 has ~33% overhead), which is quite small for video.

2. The validate_input string handling has an implicit assumption

if values.startswith(("http://", "https://", "gs://")):
    return {"url": values}
if os.path.isfile(values):
    return _encode_video_from_file(values)
return {"url": values}  # fallback: treat as URL

The fallback return {"url": values} for strings that aren't URLs and aren't existing files silently treats any string as a URL. This could mask typos in file paths — e.g., Video("./videos/demo.mp4") where the path doesn't exist would silently become a URL reference instead of raising an error. Consider raising a ValueError for non-URL strings that don't point to an existing file, or at least logging a warning.

3. The MAX_INLINE_SIZE constant

20MB is documented as the limit, but the Gemini API's inline data limit is actually 20MB of base64-encoded data (not raw file size). Since base64 encoding increases size by ~33%, a 20MB raw file becomes ~27MB of base64, which may exceed the API limit. Consider either:

  • Lowering MAX_INLINE_SIZE to ~15MB (to stay under 20MB after encoding), or
  • Documenting that the limit refers to raw file size and the effective encoded size may be larger.

4. Extracting count_messages_with_file_pattern to tests/test_utils/messages.py is a good refactor

This avoids duplication between the file and video test modules. Clean.

5. Missing __init__.py in tests/test_utils/?

The new tests/test_utils/messages.py module is imported by both test files. Does tests/test_utils/ have an __init__.py? If not, the import may fail depending on how pytest is configured (though pytest usually handles this via rootdir discovery).

6. Test coverage is excellent

The test suite covers URL construction, file encoding, bytes, validation errors, frozen model, repr, signature integration, list fields, optional fields, and save/load. Very thorough.

Overall this is a solid addition. The main concerns are the format() output structure (verify file_id vs file_uri for litellm/Gemini compatibility) and the silent URL fallback for invalid file paths.

Copy link
Copy Markdown

@alvinttang alvinttang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured PR — the Video type follows the established patterns from Image, Audio, and File closely. Thorough test coverage too. Some observations:

1. The format() method output — verify file_id vs file_uri

The method returns {"type": "file", "file": {"file_id": url, "format": mime_type}} for URL references. Two questions:

  • Is file_id the correct key for litellm's Gemini integration? The litellm docs for Gemini file references typically expect file_uri (matching the Gemini API's fileData.fileUri field). If file_id is wrong, litellm may not route the video correctly to Gemini. Worth verifying against the existing Audio type's format() method to ensure consistency.
  • For inline data, the file_data key with a data URI (data:video/mp4;base64,...) matches litellm's expected format. This path should work.

2. Silent URL fallback for invalid file paths

In validate_input:

if values.startswith(("http://", "https://", "gs://")):
    return {"url": values}
if os.path.isfile(values):
    return _encode_video_from_file(values)
return {"url": values}  # fallback

The final fallback silently treats any non-URL, non-existing-file string as a URL. This could mask typos in file paths — e.g., Video("./videos/demo.mp4") where the path doesn't exist would silently become a URL reference rather than raising an error. Consider raising a ValueError for strings that don't match URL patterns and aren't existing files, or at least logging a warning. The Image type may have the same pattern, but it's worth reconsidering here since video URLs are less common than image URLs.

3. MAX_INLINE_SIZE may undercount

The 20MB limit checks raw file size, but base64 encoding increases size by ~33%. If the Gemini API's inline limit is 20MB of encoded data, a 20MB raw file becomes ~27MB encoded, which would exceed the limit. Consider either lowering to ~15MB or documenting that the limit refers to raw file size.

4. Missing __init__.py in tests/test_utils/

The new tests/test_utils/messages.py is imported by both test_adapter_file.py and test_adapter_video.py. If tests/test_utils/ doesn't have an __init__.py, the import may fail depending on pytest configuration. I don't see one being added in this diff.

5. validate_source error message could be clearer

if has_url == has_inline_data:
    raise ValueError("You must provide either 'url' or 'inline_data'.")

This fires when both are present AND when neither is present, but the error message is the same for both cases. Consider distinguishing: "Both 'url' and 'inline_data' were provided; use only one" vs "Neither 'url' nor 'inline_data' was provided; one is required."

6. Extracting count_messages_with_file_pattern to a shared utility is a good refactor

This avoids duplication between the file and video test modules.

7. Test coverage is excellent

URL construction, file encoding, bytes input, validation errors, frozen model enforcement, repr, signature integration, list fields, optional fields, and save/load are all covered. Very thorough.

Overall this is a solid addition to the type system. The main concerns are verifying the format() output matches what litellm/Gemini actually expects, and the silent URL fallback behavior.

@ilia-iliev
Copy link
Copy Markdown
Author

file_id is correct - confirmed by uploading a file to Files API and referencing it. This is also what I see in LilteLLm's docs -
image

Added a comment to clarify the 20MB limit. I'm not sure DSPy is the right place to check video sizes though I agree some limit is necessary to prevent accidental loading of huge video. Also, it appears that Gemini has recently increased their inline to 100MB.

Updated the error messages

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants