add vlm run integration #6353

wbrennan899 · 2025-09-23T22:33:13Z

What changes are proposed in this pull request?

This PR adds integration with VLM Run, a Vision Language Model platform that extracts structured data from documents, images, videos, and audio files.

Key features:

New VLMRunModel class that processes:
- Documents: Extract data from PDFs, invoices, receipts, bank statements, resumes, etc.
- Images: Generate captions, classify content, answer questions about images
- Videos: Transcribe and analyze video content with timestamps
- Audio: Transcribe audio with temporal segmentation
40+ pre-built domains ready to use without training
Visual grounding (bounding boxes showing where data was found)
Temporal grounding (timestamps for video/audio content)
Automatic conversion to FiftyOne's Classification, Detection, and Attribute formats

How is this patch tested? If it is not, please explain why.

Comprehensive unit tests covering all model operations, document processing, video/audio transcription, and output conversions
Additional standalone testing performed separately to verify real-world usage
Documentation includes working examples
Verified integration with FiftyOne's existing model infrastructure

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

Yes. Give a description of this change to be included in the release notes for FiftyOne users.

Added VLM Run integration for extracting structured data from documents (PDFs, invoices, receipts), images, videos, and audio files. Includes 40+ pre-built domains and provides visual grounding (bounding boxes) and temporal grounding (timestamps). Install with pip install fiftyone[vlmrun].

What areas of FiftyOne does this PR affect?

App: FiftyOne application changes
Build: Build and test infrastructure changes
Core: Core fiftyone Python library changes
Documentation: FiftyOne documentation changes
Other

Summary by CodeRabbit

New Features
- Added VLM Run integration to apply multimodal models to images, videos, audio, and documents.
- Supports batching, progress, confidence scores, visual/temporal grounding, and flexible output as attributes, classifications, detections, or raw data.
Documentation
- New integration guide for VLM Run with setup, quickstart, examples, and usage tips.
- Added “VLM Run” card and navigation entry in the integrations index.
Tests
- Comprehensive unit and integration tests covering configuration, predictions, conversions, and dataset application.
Chores
- Added VLM Run dependency to optional requirements.

coderabbitai · 2025-09-23T22:33:36Z

Walkthrough

Introduces a new VLM Run integration: adds documentation pages and index entry, a utility module implementing model loading, prediction, conversions, and application to datasets, accompanying unit tests with mocks, and an extras dependency on vlmrun>=0.3.5.

Changes

Cohort / File(s)	Summary of Changes
Docs: Integration guide and index `docs/source/integrations/index.rst`, `docs/source/integrations/vlm.rst`	Added “VLM Run” card, toctree entry, and a new comprehensive integration guide with setup, usage examples, API reference, and limitations.
Core: VLM Run utilities `fiftyone/utils/vlmrun.py`	New module providing VLMRunModelConfig/Model, factory/loader helpers, prediction (single/batch), domain/schema utilities, output parsers (attributes/classifications/detections/grounding), and dataset apply function with progress/timeout handling.
Dependency `requirements/extras.txt`	Added `vlmrun>=0.3.5` to extras.
Tests `tests/unittests/vlm_tests.py`	Added extensive unit/integration tests with mocked clients for initialization, prediction, conversions, grounding parsing, domain listing/schema, and dataset application.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant FO as FiftyOne
  participant VM as VLMRunModel
  participant API as VLM Run API
  participant DS as Dataset

  User->>FO: apply_vlmrun_model(samples, domain/schema, output_type, ...)
  FO->>VM: construct/load model (domain, api_key, config)
  Note over VM: Determine media_type from domain
  loop For each media (or batch)
    VM->>API: predict(media | batch) [async/polling, timeout]
    API-->>VM: result (response/data/legacy)
    VM-->>FO: raw result
    FO->>FO: convert to attributes/classifications/detections/grounding
    FO->>DS: write to label_field
  end
  DS-->>User: samples updated

sequenceDiagram
  autonumber
  participant FO as FiftyOne Utils
  participant VM as VLMRunModel
  participant API as VLM Run API

  FO->>VM: predict_all(media_items)
  rect rgba(200,230,255,0.25)
    note right of VM: Batch submit and poll
    VM->>API: submit batch
    API-->>VM: batch results
  end
  VM-->>FO: list of results
  FO->>FO: map results ->\nClassification/Detections/Attributes

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump my paws—new lanes to run,
VLM winds beneath the sun.
Docs like clover, crisp and bright,
Utils hop from left to right.
Tests burrow deep to check the ground—
Carrots of data neatly found.
Ship it! The meadow hums with sound.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title "add vlm run integration" is concise and directly summarizes the primary change — adding VLM Run integration (model, utilities, documentation, and tests) — and is clearly related to the changeset.
Description Check	✅ Passed	The pull request description follows the repository template and includes the required sections: a clear summary of proposed changes, testing details, release notes with a selected option, and the affected areas, providing sufficient information for reviewers to evaluate the change.
Docstring Coverage	✅ Passed	Docstring coverage is 87.80% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

wbrennan899 · 2025-09-23T22:36:22Z

@harpreetsahota204 @AdonaiVera VLM Run asked me to tag you.

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

fiftyone/utils/vlmrun.py (1)

734-828: apply_vlmrun_model only supports image samples; add media-aware handling

Currently validates image collections and always opens images. This breaks for document/video/audio domains (common for this integration).

-    # Validate collection
-    # Validate samples are images
-    import fiftyone.core.validation as fov
-
-    fov.validate_image_collection(samples)
+    # Validate collection per media type
+    import fiftyone.core.validation as fov
+    media_type = model.media_type if hasattr(model, "media_type") else "image"
+    if media_type == "image":
+        fov.validate_image_collection(samples)
+    elif media_type == "video" and hasattr(fov, "validate_video_collection"):
+        fov.validate_video_collection(samples)
+    # For "document"/"audio", skip strict media validation here; the client accepts file paths.
@@
-                try:
-                    # Load image
-                    img = Image.open(sample.filepath)
-
-                    # Make prediction
-                    result = model.predict(img)
+                try:
+                    # Prepare input per media type
+                    if media_type == "image":
+                        media = Image.open(sample.filepath)
+                    else:
+                        media = sample.filepath  # pass file path for document/video/audio
+
+                    # Make prediction
+                    result = model.predict(media)

Optionally, you can batch by batch_size for image domains later.

🧹 Nitpick comments (5)

docs/source/integrations/vlm.rst (1)
35-49: Installation docs: include the extras path

Since this is an integration, advertise the extras install in addition to raw vlmrun.
-To get started with VLM Run, install the `vlmrun` package:
+To get started, install the integration via FiftyOne extras (recommended):
+
+.. code-block:: shell
+
+    pip install "fiftyone[vlmrun]"
+
+Or install the `vlmrun` package directly:
fiftyone/utils/vlmrun.py (3)
85-149: Tighten confidence mapping and simplify conditionals (minor)

You can simplify the high/med/low mapping and reduce returns.
-            conf_text = response_data.get("confidence", "medium")
-            if conf_text == "hi" or conf_text == "high":
-                confidence = 0.9
-            elif conf_text == "medium":
-                confidence = 0.7
-            elif conf_text == "low":
-                confidence = 0.3
-            else:
-                confidence = 0.5
+            conf_text = str(response_data.get("confidence", "medium")).lower()
+            confidence = {"hi": 0.9, "high": 0.9, "med": 0.7, "medium": 0.7, "low": 0.3}.get(conf_text, 0.5)
466-479: Unused kwargs in VLMRunModelConfig (minor)

**kwargs are accepted but ignored; either document or plumb them through (e.g., to GenerationConfig).

555-707: Type-specific errors and messages (minor)

Where you validate input type (e.g., requiring file path), raise TypeError for wrong types and keep messages concise.
tests/unittests/vlm_tests.py (1)
262-264: Remove unused variable

img is unused.
-        img = np.zeros((100, 100, 3), dtype=np.uint8)
         sample = fo.Sample(filepath="test.jpg")
         dataset.add_sample(sample)

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a57e299 and bfa6124.

📒 Files selected for processing (5)

docs/source/integrations/index.rst (2 hunks)
docs/source/integrations/vlm.rst (1 hunks)
fiftyone/utils/vlmrun.py (1 hunks)
requirements/extras.txt (1 hunks)
tests/unittests/vlm_tests.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

tests/unittests/vlm_tests.py (2)

fiftyone/utils/cvat.py (1)

patch (3876-3886)

fiftyone/utils/vlmrun.py (16)

VLMRunModel (497-731)

media_type (513-522)

has_logits (525-527)

VLMRunModelConfig (450-494)

predict (554-711)

predict_all (713-731)

to_classification (85-148)

to_detections (151-186)

to_attributes (236-279)

apply_vlmrun_model (734-827)

convert_vlm_model (31-61)

load_vlmrun_model (64-82)

list_vlmrun_domains (393-421)

get_domain_schema (424-447)

parse_visual_grounding (282-350)

parse_temporal_grounding (353-390)

fiftyone/utils/vlmrun.py (3)

fiftyone/core/utils.py (3)

lazy_import (732-754)

ensure_package (396-438)

ProgressBar (950-996)

fiftyone/core/labels.py (2)

Detections (632-709)

Detection (438-629)

fiftyone/core/models.py (2)

ModelConfig (2106-2115)

Model (2118-2243)

🪛 Ruff (0.13.1)

tests/unittests/vlm_tests.py

262-262: Local variable img is assigned to but never used

Remove assignment to unused variable img

(F841)

fiftyone/utils/vlmrun.py

419-419: Do not catch blind exception: Exception

(BLE001)

445-445: Do not catch blind exception: Exception

(BLE001)

478-478: Unused method argument: kwargs

(ARG002)

481-481: Avoid specifying long messages outside the exception class

(TRY003)

585-585: Avoid specifying long messages outside the exception class

(TRY003)

593-593: Avoid specifying long messages outside the exception class

(TRY003)

614-616: Avoid specifying long messages outside the exception class

(TRY003)

621-623: Avoid specifying long messages outside the exception class

(TRY003)

633-633: Avoid specifying long messages outside the exception class

(TRY003)

661-663: Avoid specifying long messages outside the exception class

(TRY003)

668-670: Avoid specifying long messages outside the exception class

(TRY003)

693-693: Avoid specifying long messages outside the exception class

(TRY003)

706-706: Prefer TypeError exception for invalid type

(TRY004)

706-706: Avoid specifying long messages outside the exception class

(TRY003)

727-727: Do not catch blind exception: Exception

(BLE001)

743-743: Unused function argument: batch_size

(ARG001)

773-775: Avoid specifying long messages outside the exception class

(TRY003)

819-819: Abstract raise to an inner function

(TRY301)

819-819: Avoid specifying long messages outside the exception class

(TRY003)

824-824: Do not catch blind exception: Exception

(BLE001)

🪛 Pylint (3.3.8)

tests/unittests/vlm_tests.py

[error] 1-1: Unrecognized option found: optimize-ast, files-output, function-name-hint, variable-name-hint, const-name-hint, attr-name-hint, argument-name-hint, class-attribute-name-hint, inlinevar-name-hint, class-name-hint, module-name-hint, method-name-hint, no-space-check

(E0015)

[refactor] 1-1: Useless option value for '--disable', 'bad-continuation' was removed from pylint, see pylint-dev/pylint#3571.

(R0022)

[refactor] 17-17: Use 'from fiftyone.utils import vlmrun' instead

(R0402)

fiftyone/utils/vlmrun.py

[error] 1-1: Unrecognized option found: optimize-ast, files-output, function-name-hint, variable-name-hint, const-name-hint, attr-name-hint, argument-name-hint, class-attribute-name-hint, inlinevar-name-hint, class-name-hint, module-name-hint, method-name-hint, no-space-check

(E0015)

[refactor] 1-1: Useless option value for '--disable', 'bad-continuation' was removed from pylint, see pylint-dev/pylint#3571.

(R0022)

[refactor] 114-114: Consider merging these comparisons with 'in' by using 'conf_text in ('hi', 'high')'. Use a set instead if elements are hashable.

(R1714)

[refactor] 85-85: Too many return statements (7/6)

(R0911)

[refactor] 413-418: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

[refactor] 466-466: Too many positional arguments (11/5)

(R0917)

[refactor] 516-521: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)

[refactor] 611-616: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)

[refactor] 656-659: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

[refactor] 554-554: Too many return statements (7/6)

(R0911)

[refactor] 734-734: Too many positional arguments (10/5)

(R0917)

🔇 Additional comments (4)

requirements/extras.txt (1)

8-8: Confirm extras wiring and version compatibility

Adding vlmrun>=0.3.5 looks fine. Please verify:

The vlmrun extra is exposed in packaging (setup.cfg/pyproject) so pip install fiftyone[vlmrun] works as documented.

The minimum version matches what you tested against and any APIs used here (e.g., client.video.generate, predictions.get) exist in that range.

If needed, I can generate a quick grep script to locate where extras are declared and check for a vlmrun extra.

docs/source/integrations/index.rst (1)

198-204: LGTM: card and toctree entry

The new card and toctree link look consistent with the added vlm.rst.

tests/unittests/vlm_tests.py (2)

371-381: LGTM: factory tests cover config propagation

Covers convert_vlm_model/load_vlmrun_model happy paths well.

438-466: Nice coverage for grounding parsers

Visual grounding confidence mapping and bbox parsing are validated.

coderabbitai · 2025-09-23T22:41:35Z