
Conversation

@wbrennan899 commented Sep 23, 2025

What changes are proposed in this pull request?

This PR adds integration with VLM Run, a Vision Language Model platform that extracts structured data from documents, images, videos, and audio files.

Key features:

  • New VLMRunModel class that processes:
    • Documents: Extract data from PDFs, invoices, receipts, bank statements, resumes, etc.
    • Images: Generate captions, classify content, answer questions about images
    • Videos: Transcribe and analyze video content with timestamps
    • Audio: Transcribe audio with temporal segmentation
  • 40+ pre-built domains ready to use without training
  • Visual grounding (bounding boxes showing where data was found)
  • Temporal grounding (timestamps for video/audio content)
  • Automatic conversion to FiftyOne's Classification, Detection, and Attribute formats
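
A minimal usage sketch, using the helper names implemented in `fiftyone/utils/vlmrun.py` (the domain and field names here are illustrative):

```python
import fiftyone as fo
import fiftyone.utils.vlmrun as fouv

dataset = fo.Dataset()
dataset.add_sample(fo.Sample(filepath="/path/to/invoice.pdf"))

# Load a model for a pre-built domain and apply it to the dataset,
# storing the extracted fields as flattened sample attributes
model = fouv.load_vlmrun_model("document.invoice")
fouv.apply_vlmrun_model(
    dataset,
    model=model,
    label_field="invoice",
    output_type="attributes",
)
```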

How is this patch tested? If it is not, please explain why.

  • Comprehensive unit tests covering all model operations, document processing, video/audio transcription, and output conversions
  • Additional standalone testing performed separately to verify real-world usage
  • Documentation includes working examples
  • Verified integration with FiftyOne's existing model infrastructure

Release Notes

Is this a user-facing change that should be mentioned in the release notes?

  • Yes. The following description should be included in the release notes for FiftyOne users:

Added VLM Run integration for extracting structured data from documents (PDFs, invoices, receipts), images, videos, and audio files. Includes 40+ pre-built domains and provides visual grounding (bounding boxes) and temporal grounding (timestamps). Install with pip install fiftyone[vlmrun].

What areas of FiftyOne does this PR affect?

  • App: FiftyOne application changes
  • Build: Build and test infrastructure changes
  • Core: Core fiftyone Python library changes
  • Documentation: FiftyOne documentation changes
  • Other

Summary by CodeRabbit

  • New Features

    • Added VLM Run integration to apply multimodal models to images, videos, audio, and documents.
    • Supports batching, progress, confidence scores, visual/temporal grounding, and flexible output as attributes, classifications, detections, or raw data.
  • Documentation

    • New integration guide for VLM Run with setup, quickstart, examples, and usage tips.
    • Added “VLM Run” card and navigation entry in the integrations index.
  • Tests

    • Comprehensive unit and integration tests covering configuration, predictions, conversions, and dataset application.
  • Chores

    • Added VLM Run dependency to optional requirements.

@wbrennan899 requested review from a team as code owners on September 23, 2025 at 22:33

coderabbitai bot commented Sep 23, 2025

Walkthrough

Introduces a new VLM Run integration: adds documentation pages and index entry, a utility module implementing model loading, prediction, conversions, and application to datasets, accompanying unit tests with mocks, and an extras dependency on vlmrun>=0.3.5.

Changes

  • Docs: Integration guide and index (docs/source/integrations/index.rst, docs/source/integrations/vlm.rst)
    Added “VLM Run” card, toctree entry, and a new comprehensive integration guide with setup, usage examples, API reference, and limitations.
  • Core: VLM Run utilities (fiftyone/utils/vlmrun.py)
    New module providing VLMRunModelConfig/Model, factory/loader helpers, prediction (single/batch), domain/schema utilities, output parsers (attributes/classifications/detections/grounding), and a dataset apply function with progress/timeout handling.
  • Dependency (requirements/extras.txt)
    Added vlmrun>=0.3.5 to extras.
  • Tests (tests/unittests/vlm_tests.py)
    Added extensive unit/integration tests with mocked clients for initialization, prediction, conversions, grounding parsing, domain listing/schema, and dataset application.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant FO as FiftyOne
  participant VM as VLMRunModel
  participant API as VLM Run API
  participant DS as Dataset

  User->>FO: apply_vlmrun_model(samples, domain/schema, output_type, ...)
  FO->>VM: construct/load model (domain, api_key, config)
  Note over VM: Determine media_type from domain
  loop For each media (or batch)
    VM->>API: predict(media | batch) [async/polling, timeout]
    API-->>VM: result (response/data/legacy)
    VM-->>FO: raw result
    FO->>FO: convert to attributes/classifications/detections/grounding
    FO->>DS: write to label_field
  end
  DS-->>User: samples updated
```
```mermaid
sequenceDiagram
  autonumber
  participant FO as FiftyOne Utils
  participant VM as VLMRunModel
  participant API as VLM Run API

  FO->>VM: predict_all(media_items)
  rect rgba(200,230,255,0.25)
    note right of VM: Batch submit and poll
    VM->>API: submit batch
    API-->>VM: batch results
  end
  VM-->>FO: list of results
  FO->>FO: map results to Classification/Detections/Attributes
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump my paws—new lanes to run,
VLM winds beneath the sun.
Docs like clover, crisp and bright,
Utils hop from left to right.
Tests burrow deep to check the ground—
Carrots of data neatly found.
Ship it! The meadow hums with sound.

Pre-merge checks

✅ Passed checks (3 passed)

  • Title Check: ✅ Passed. The title "add vlm run integration" is concise, directly summarizes the primary change (adding the VLM Run integration: model, utilities, documentation, and tests), and is clearly related to the changeset.
  • Description Check: ✅ Passed. The pull request description follows the repository template and includes the required sections (a clear summary of proposed changes, testing details, release notes with a selected option, and the affected areas), providing sufficient information for reviewers to evaluate the change.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 87.80%, above the required 80.00% threshold.


@wbrennan899 (Author) commented:

@harpreetsahota204 @AdonaiVera VLM Run asked me to tag you.

@coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
fiftyone/utils/vlmrun.py (1)

734-828: apply_vlmrun_model only supports image samples; add media-aware handling

Currently validates image collections and always opens images. This breaks for document/video/audio domains (common for this integration).

-    # Validate collection
-    # Validate samples are images
-    import fiftyone.core.validation as fov
-
-    fov.validate_image_collection(samples)
+    # Validate collection per media type
+    import fiftyone.core.validation as fov
+    media_type = model.media_type if hasattr(model, "media_type") else "image"
+    if media_type == "image":
+        fov.validate_image_collection(samples)
+    elif media_type == "video" and hasattr(fov, "validate_video_collection"):
+        fov.validate_video_collection(samples)
+    # For "document"/"audio", skip strict media validation here; the client accepts file paths.
@@
-                try:
-                    # Load image
-                    img = Image.open(sample.filepath)
-
-                    # Make prediction
-                    result = model.predict(img)
+                try:
+                    # Prepare input per media type
+                    if media_type == "image":
+                        media = Image.open(sample.filepath)
+                    else:
+                        media = sample.filepath  # pass file path for document/video/audio
+
+                    # Make prediction
+                    result = model.predict(media)

Optionally, you can batch by batch_size for image domains later.

🧹 Nitpick comments (5)
docs/source/integrations/vlm.rst (1)

35-49: Installation docs: include the extras path

Since this is an integration, advertise the extras install in addition to raw vlmrun.

-To get started with VLM Run, install the `vlmrun` package:
+To get started, install the integration via FiftyOne extras (recommended):
+
+.. code-block:: shell
+
+    pip install "fiftyone[vlmrun]"
+
+Or install the `vlmrun` package directly:
fiftyone/utils/vlmrun.py (3)

85-149: Tighten confidence mapping and simplify conditionals (minor)

You can simplify the high/med/low mapping and reduce returns.

-            conf_text = response_data.get("confidence", "medium")
-            if conf_text == "hi" or conf_text == "high":
-                confidence = 0.9
-            elif conf_text == "medium":
-                confidence = 0.7
-            elif conf_text == "low":
-                confidence = 0.3
-            else:
-                confidence = 0.5
+            conf_text = str(response_data.get("confidence", "medium")).lower()
+            confidence = {"hi": 0.9, "high": 0.9, "med": 0.7, "medium": 0.7, "low": 0.3}.get(conf_text, 0.5)

466-479: Unused kwargs in VLMRunModelConfig (minor)

**kwargs are accepted but ignored; either document or plumb them through (e.g., to GenerationConfig).
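
A simplified sketch of plumbing them through instead (the `generation_kwargs` field name is illustrative):

```python
import fiftyone.core.models as fom

class VLMRunModelConfig(fom.ModelConfig):
    """Simplified sketch; the real config accepts several more options."""

    def __init__(self, domain, api_key=None, **kwargs):
        self.domain = domain
        self.api_key = api_key
        # Retain extra options rather than silently dropping them; forward
        # these when building the GenerationConfig/request at predict time
        self.generation_kwargs = kwargs
```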


555-707: Type-specific errors and messages (minor)

Where you validate input type (e.g., requiring file path), raise TypeError for wrong types and keep messages concise.
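
For example, a minimal sketch (helper and parameter names are illustrative):

```python
def _require_filepath(filepath):
    # Wrong *type* of input warrants TypeError rather than ValueError
    if not isinstance(filepath, str):
        raise TypeError(f"expected a file path (str), got {type(filepath).__name__}")
```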

tests/unittests/vlm_tests.py (1)

262-264: Remove unused variable

img is unused.

-        img = np.zeros((100, 100, 3), dtype=np.uint8)
         sample = fo.Sample(filepath="test.jpg")
         dataset.add_sample(sample)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a57e299 and bfa6124.

📒 Files selected for processing (5)
  • docs/source/integrations/index.rst (2 hunks)
  • docs/source/integrations/vlm.rst (1 hunks)
  • fiftyone/utils/vlmrun.py (1 hunks)
  • requirements/extras.txt (1 hunks)
  • tests/unittests/vlm_tests.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/unittests/vlm_tests.py (2)
fiftyone/utils/cvat.py (1)
  • patch (3876-3886)
fiftyone/utils/vlmrun.py (16)
  • VLMRunModel (497-731)
  • media_type (513-522)
  • has_logits (525-527)
  • VLMRunModelConfig (450-494)
  • predict (554-711)
  • predict_all (713-731)
  • to_classification (85-148)
  • to_detections (151-186)
  • to_attributes (236-279)
  • apply_vlmrun_model (734-827)
  • convert_vlm_model (31-61)
  • load_vlmrun_model (64-82)
  • list_vlmrun_domains (393-421)
  • get_domain_schema (424-447)
  • parse_visual_grounding (282-350)
  • parse_temporal_grounding (353-390)
fiftyone/utils/vlmrun.py (3)
fiftyone/core/utils.py (3)
  • lazy_import (732-754)
  • ensure_package (396-438)
  • ProgressBar (950-996)
fiftyone/core/labels.py (2)
  • Detections (632-709)
  • Detection (438-629)
fiftyone/core/models.py (2)
  • ModelConfig (2106-2115)
  • Model (2118-2243)
🪛 Ruff (0.13.1)
tests/unittests/vlm_tests.py

262-262: Local variable img is assigned to but never used

Remove assignment to unused variable img

(F841)

fiftyone/utils/vlmrun.py

419-419: Do not catch blind exception: Exception

(BLE001)


445-445: Do not catch blind exception: Exception

(BLE001)


478-478: Unused method argument: kwargs

(ARG002)


481-481: Avoid specifying long messages outside the exception class

(TRY003)


585-585: Avoid specifying long messages outside the exception class

(TRY003)


593-593: Avoid specifying long messages outside the exception class

(TRY003)


614-616: Avoid specifying long messages outside the exception class

(TRY003)


621-623: Avoid specifying long messages outside the exception class

(TRY003)


633-633: Avoid specifying long messages outside the exception class

(TRY003)


661-663: Avoid specifying long messages outside the exception class

(TRY003)


668-670: Avoid specifying long messages outside the exception class

(TRY003)


693-693: Avoid specifying long messages outside the exception class

(TRY003)


706-706: Prefer TypeError exception for invalid type

(TRY004)


706-706: Avoid specifying long messages outside the exception class

(TRY003)


727-727: Do not catch blind exception: Exception

(BLE001)


743-743: Unused function argument: batch_size

(ARG001)


773-775: Avoid specifying long messages outside the exception class

(TRY003)


819-819: Abstract raise to an inner function

(TRY301)


819-819: Avoid specifying long messages outside the exception class

(TRY003)


824-824: Do not catch blind exception: Exception

(BLE001)

🪛 Pylint (3.3.8)
tests/unittests/vlm_tests.py

[error] 1-1: Unrecognized option found: optimize-ast, files-output, function-name-hint, variable-name-hint, const-name-hint, attr-name-hint, argument-name-hint, class-attribute-name-hint, inlinevar-name-hint, class-name-hint, module-name-hint, method-name-hint, no-space-check

(E0015)


[refactor] 1-1: Useless option value for '--disable', 'bad-continuation' was removed from pylint, see pylint-dev/pylint#3571.

(R0022)


[refactor] 17-17: Use 'from fiftyone.utils import vlmrun' instead

(R0402)

fiftyone/utils/vlmrun.py

[error] 1-1: Unrecognized option found: optimize-ast, files-output, function-name-hint, variable-name-hint, const-name-hint, attr-name-hint, argument-name-hint, class-attribute-name-hint, inlinevar-name-hint, class-name-hint, module-name-hint, method-name-hint, no-space-check

(E0015)


[refactor] 1-1: Useless option value for '--disable', 'bad-continuation' was removed from pylint, see pylint-dev/pylint#3571.

(R0022)


[refactor] 114-114: Consider merging these comparisons with 'in' by using 'conf_text in ('hi', 'high')'. Use a set instead if elements are hashable.

(R1714)


[refactor] 85-85: Too many return statements (7/6)

(R0911)


[refactor] 413-418: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 466-466: Too many positional arguments (11/5)

(R0917)


[refactor] 516-521: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 611-616: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 656-659: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 554-554: Too many return statements (7/6)

(R0911)


[refactor] 734-734: Too many positional arguments (10/5)

(R0917)

🔇 Additional comments (4)
requirements/extras.txt (1)

8-8: Confirm extras wiring and version compatibility

Adding vlmrun>=0.3.5 looks fine. Please verify:

  • The vlmrun extra is exposed in packaging (setup.cfg/pyproject) so pip install fiftyone[vlmrun] works as documented.
  • The minimum version matches what you tested against and any APIs used here (e.g., client.video.generate, predictions.get) exist in that range.

If needed, I can generate a quick grep script to locate where extras are declared and check for a vlmrun extra.
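
For reference, a minimal sketch of the packaging side, assuming extras are declared via setuptools' extras_require (FiftyOne's actual setup.py may wire extras differently, e.g., by reading the requirements files):

```python
from setuptools import setup

setup(
    name="fiftyone",
    extras_require={
        # enables `pip install fiftyone[vlmrun]` as documented
        "vlmrun": ["vlmrun>=0.3.5"],
    },
)
```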

docs/source/integrations/index.rst (1)

198-204: LGTM: card and toctree entry

The new card and toctree link look consistent with the added vlm.rst.

tests/unittests/vlm_tests.py (2)

371-381: LGTM: factory tests cover config propagation

Covers convert_vlm_model/load_vlmrun_model happy paths well.


438-466: Nice coverage for grounding parsers

Visual grounding confidence mapping and bbox parsing are validated.

Comment on lines +63 to +66
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.vlm as fouv


🛠️ Refactor suggestion

Fix import path: use the new module name

The correct module is fiftyone.utils.vlmrun, not fiftyone.utils.vlm.

-    import fiftyone.utils.vlm as fouv
+    import fiftyone.utils.vlmrun as fouv

Comment on lines +70 to +79
# Load a VLM Run model for document invoice extraction
model = fouv.load_vlm_model("document.invoice")

# Apply the model to extract invoice data
fouv.apply_vlm_model(
    dataset,
    model=model,
    label_field="invoice_data",
    output_type="attributes"
)

🛠️ Refactor suggestion

Update API names: load/apply helpers were named vlmrun

Docs reference non-existent helpers. Use the implemented names.

-    model = fouv.load_vlm_model("document.invoice")
+    model = fouv.load_vlmrun_model("document.invoice")
@@
-    fouv.apply_vlm_model(
+    fouv.apply_vlmrun_model(
         dataset,
         model=model,
         label_field="invoice_data",
         output_type="attributes"
     )

Comment on lines +118 to +123
# Image classification
model = fouv.load_vlm_model("image.classification")

# Apply to dataset
dataset.apply_model(model, label_field="vlm_predictions")


🛠️ Refactor suggestion

dataset.apply_model() won’t work here; use apply_vlmrun_model

VLMRunModel.predict() returns a VLM Run result object, not FiftyOne labels. dataset.apply_model() expects labels. Use the provided apply helper.

-    # Apply to dataset
-    dataset.apply_model(model, label_field="vlm_predictions")
+    # Apply to dataset
+    fouv.apply_vlmrun_model(
+        dataset,
+        model=model,
+        label_field="vlm_predictions",
+        output_type="classification",
+    )

Comment on lines +160 to +169
# Create model with custom schema
model = fouv.VLMRunModel(schema=ProductInfo)

# Apply to dataset
fouv.apply_vlm_model(
    dataset,
    model=model,
    label_field="product_info",
    output_type="attributes"
)

🛠️ Refactor suggestion

Model requires a domain; or use the factory

VLMRunModelConfig requires domain. If you want schema-driven extraction, document the factory and include the domain.

-    model = fouv.VLMRunModel(schema=ProductInfo)
+    model = fouv.convert_vlm_model(domain="document.custom", schema=ProductInfo)

If there is an official domain name for custom schemas, replace "document.custom" accordingly.


Comment on lines +186 to +197
fouv.apply_vlm_model(
    dataset,
    domain="document.invoice",
    label_field="invoice",
    output_type="attributes"
)

# Access extracted fields
sample = dataset.first()
print(sample["invoice.vendor"])
print(sample["invoice.total"])


🛠️ Refactor suggestion

Fix attribute access paths

apply_vlmrun_model(..., output_type="attributes", label_field="invoice") flattens to fields like invoice_vendor, not nested paths.

-    print(sample["invoice.vendor"])
-    print(sample["invoice.total"])
+    print(sample["invoice_vendor"])
+    print(sample["invoice_total"])

Comment on lines +255 to +284
Invoice Processing
^^^^^^^^^^^^^^^^^^

Extract structured invoice data from a dataset of invoice images:

.. code-block:: python
    :linenos:

    import fiftyone as fo
    import fiftyone.utils.vlm as fouv

    # Load dataset of invoice images
    dataset = fo.Dataset()
    dataset.add_samples([
        fo.Sample(filepath="/path/to/invoice1.pdf"),
        fo.Sample(filepath="/path/to/invoice2.jpg"),
    ])

    # Extract invoice data
    fouv.apply_vlm_model(
        dataset,
        domain="document.invoice",
        label_field="invoice",
        output_type="attributes"
    )

    # Query extracted data
    high_value = dataset.filter_labels("invoice.total", F() > 1000)
    print(f"Found {len(high_value)} high-value invoices")


🛠️ Refactor suggestion

Correct filtering example

filter_labels is for label fields; here we’re filtering by flattened attributes. Use ViewField and match.

-    high_value = dataset.filter_labels("invoice.total", F() > 1000)
+    from fiftyone import ViewField as F
+    high_value = dataset.match(F("invoice_total") > 1000)

If you choose to store nested documents instead, adjust the apply helper accordingly and keep the dotted path.


Comment on lines +364 to +383
# Load a VLM Run model
fouv.load_vlm_model(domain, api_key=None, **kwargs)

# Convert model for FiftyOne
fouv.convert_vlm_model(domain=None, schema=None, **kwargs)

# Apply model to dataset
fouv.apply_vlm_model(
    samples,
    model=None,
    domain=None,
    schema=None,
    label_field="vlm_predictions",
    output_type="attributes",
    confidence_thresh=None,
    api_key=None,
    batch_size=None,
    progress=None
)


🛠️ Refactor suggestion

API reference helper names

Align with the implemented function names.

-    fouv.load_vlm_model(domain, api_key=None, **kwargs)
+    fouv.load_vlmrun_model(domain, api_key=None, **kwargs)
@@
-    fouv.apply_vlm_model(
+    fouv.apply_vlmrun_model(
         samples,
         model=None,
         domain=None,
         schema=None,
         label_field="vlm_predictions",
         output_type="attributes",
         confidence_thresh=None,
         api_key=None,
         batch_size=None,
         progress=None
     )

Comment on lines +189 to +234
def _parse_detection(item, confidence_thresh=None):
    """Parse a single detection from VLM Run output."""
    if isinstance(item, dict):
        label = item.get("label") or item.get("class") or item.get("category")
        bbox = item.get("bbox") or item.get("bounding_box") or item.get("box")
        confidence = item.get("confidence", 1.0)
    elif hasattr(item, "label") or hasattr(item, "class_name"):
        label = getattr(item, "label", None) or getattr(
            item, "class_name", None
        )
        bbox = getattr(item, "bbox", None) or getattr(
            item, "bounding_box", None
        )
        confidence = getattr(item, "confidence", 1.0)
    else:
        return None

    if label is None or bbox is None:
        return None

    if confidence_thresh is not None and confidence < confidence_thresh:
        return None

    # Convert bbox to FiftyOne format [x, y, width, height] with values in [0, 1]
    if len(bbox) == 4:
        # Assume bbox is in [x1, y1, x2, y2] format
        x1, y1, x2, y2 = bbox
        x = x1
        y = y1
        width = x2 - x1
        height = y2 - y1

        # Normalize if values are > 1 (pixel coordinates)
        if any(v > 1 for v in [x, y, width, height]):
            # We'll need image dimensions for proper normalization
            # For now, we'll assume they're already normalized
            pass

        return fol.Detection(
            label=str(label),
            bounding_box=[x, y, width, height],
            confidence=confidence,
        )

    return None


⚠️ Potential issue

BBox parsing treats [x,y,w,h] as [x1,y1,x2,y2] — fix to support both

Current logic assumes [x1,y1,x2,y2] and silently “passes” on pixel coords, producing incorrect widths/heights for normalized [x,y,w,h]. Handle both forms.

-    # Convert bbox to FiftyOne format [x, y, width, height] with values in [0, 1]
-    if len(bbox) == 4:
-        # Assume bbox is in [x1, y1, x2, y2] format
-        x1, y1, x2, y2 = bbox
-        x = x1
-        y = y1
-        width = x2 - x1
-        height = y2 - y1
-
-        # Normalize if values are > 1 (pixel coordinates)
-        if any(v > 1 for v in [x, y, width, height]):
-            # We'll need image dimensions for proper normalization
-            # For now, we'll assume they're already normalized
-            pass
-
-        return fol.Detection(
-            label=str(label),
-            bounding_box=[x, y, width, height],
-            confidence=confidence,
-        )
+    # Convert bbox to FiftyOne format [x, y, width, height] with values in [0, 1]
+    if len(bbox) == 4:
+        x1, y1, a, b = bbox
+        # Heuristic: if a/b fit within [0,1] and x1+a <= 1, y1+b <= 1, treat as [x, y, w, h]
+        if all(0 <= v <= 1 for v in bbox) and (x1 + a) <= 1 and (y1 + b) <= 1:
+            x, y, width, height = x1, y1, a, b
+        else:
+            # Treat as [x1, y1, x2, y2]
+            x, y = x1, y1
+            width, height = max(0.0, a - x1), max(0.0, b - y1)
+
+        return fol.Detection(
+            label=str(label),
+            bounding_box=[x, y, width, height],
+            confidence=confidence,
+        )

Comment on lines +609 to +616
while elapsed < max_wait:
    status = self.client.predictions.get(id=response.id)
    if status.status == "completed":
        return status
    elif status.status == "failed":
        raise RuntimeError(
            f"Audio prediction failed: {status.error}"
        )

🛠️ Refactor suggestion

❓ Verification inconclusive

Unify predictions.get signature

Mixed usage (get(id=...) vs get(prediction_id)) risks runtime errors depending on SDK signature.

-                    status = self.client.predictions.get(id=response.id)
+                    status = self.client.predictions.get(response.id)

Please confirm the correct signature for the pinned vlmrun version.


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Search for usages of predictions.get in repository..."
rg -n --hidden -S "predictions\.get\(" -C3 || true

echo
echo "Show fiftyone/utils/vlmrun.py around the reported lines (580-640)..."
sed -n '1,240p' fiftyone/utils/vlmrun.py | sed -n '560,640p' || true

echo
echo "Search for a Predictions class or function definition (look for 'class Predictions' or 'def get(' near 'predictions')..."
rg -n --hidden -S "class\s+Predictions\b|def\s+get\(" -g '!**/site-packages/**' -C3 || true

echo
echo "Search for client.predictions assignment or import (to find SDK wrapper)..."
rg -n --hidden -S "predictions\s*=" -C3 || true

echo
echo "Search for 'client.predictions' usage (how other code calls it)..."
rg -n --hidden -S "client\.predictions\." -C3 || true

Length of output: 139817


Unify predictions.get signature

Mixed usage in fiftyone/utils/vlmrun.py — line 610 uses get(id=response.id) while line 652 uses get(prediction_id); unify to the positional form used elsewhere.

-                    status = self.client.predictions.get(id=response.id)
+                    status = self.client.predictions.get(response.id)

Confirm the correct signature for the pinned vlmrun version.


@harpreetsahota204 (Contributor) commented:

Hi @wbrennan899

Thanks for submitting this PR. This is awesome, and it's gonna open up a lot of functionality.

I feel this integration would be better suited as either a Remote Source Zoo Model or a Plugin, rather than being integrated into the core library.

If you need help writing this implementation, let me know. The core patterns are already here; they just need to be wrapped appropriately.
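
For reference, a minimal sketch of what the remotely-sourced zoo model entrypoints could look like, assuming the `download_model()`/`load_model()` interface from FiftyOne's remote zoo docs and reusing this PR's factory (the model-name convention is illustrative, and in a real source this code would live outside the core library):

```python
def download_model(model_name, model_path):
    # VLM Run is API-backed, so there are no weights to download
    pass


def load_model(model_name, model_path, **kwargs):
    import fiftyone.utils.vlmrun as fouv

    # e.g., "vlmrun/document.invoice" -> domain "document.invoice"
    domain = model_name.split("/", 1)[-1]
    return fouv.convert_vlm_model(domain=domain, **kwargs)
```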
