add vlm run integration #6353
base: develop
Conversation
Walkthrough

Introduces a new VLM Run integration: adds documentation pages and an index entry, a utility module implementing model loading, prediction, conversions, and application to datasets, accompanying unit tests with mocks, and an extras dependency on `vlmrun>=0.3.5`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant FO as FiftyOne
    participant VM as VLMRunModel
    participant API as VLM Run API
    participant DS as Dataset
    User->>FO: apply_vlmrun_model(samples, domain/schema, output_type, ...)
    FO->>VM: construct/load model (domain, api_key, config)
    Note over VM: Determine media_type from domain
    loop For each media (or batch)
        VM->>API: predict(media | batch) [async/polling, timeout]
        API-->>VM: result (response/data/legacy)
        VM-->>FO: raw result
        FO->>FO: convert to attributes/classifications/detections/grounding
        FO->>DS: write to label_field
    end
    DS-->>User: samples updated
```
```mermaid
sequenceDiagram
    autonumber
    participant FO as FiftyOne Utils
    participant VM as VLMRunModel
    participant API as VLM Run API
    FO->>VM: predict_all(media_items)
    rect rgba(200,230,255,0.25)
        note right of VM: Batch submit and poll
        VM->>API: submit batch
        API-->>VM: batch results
    end
    VM-->>FO: list of results
    FO->>FO: map results ->\nClassification/Detections/Attributes
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Tip: 👮 Agentic pre-merge checks are now available in preview! Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs. Please see the documentation for more information. Example:

```yaml
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).
```
@harpreetsahota204 @AdonaiVera VLM Run asked me to tag you.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
fiftyone/utils/vlmrun.py (1)

734-828: apply_vlmrun_model only supports image samples; add media-aware handling

Currently validates image collections and always opens images. This breaks for document/video/audio domains (common for this integration).

```diff
-    # Validate collection
-    # Validate samples are images
-    import fiftyone.core.validation as fov
-
-    fov.validate_image_collection(samples)
+    # Validate collection per media type
+    import fiftyone.core.validation as fov
+    media_type = model.media_type if hasattr(model, "media_type") else "image"
+    if media_type == "image":
+        fov.validate_image_collection(samples)
+    elif media_type == "video" and hasattr(fov, "validate_video_collection"):
+        fov.validate_video_collection(samples)
+    # For "document"/"audio", skip strict media validation here; the client accepts file paths.
@@
-        try:
-            # Load image
-            img = Image.open(sample.filepath)
-
-            # Make prediction
-            result = model.predict(img)
+        try:
+            # Prepare input per media type
+            if media_type == "image":
+                media = Image.open(sample.filepath)
+            else:
+                media = sample.filepath  # pass file path for document/video/audio
+
+            # Make prediction
+            result = model.predict(media)
```

Optionally, you can batch by `batch_size` for image domains later.
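The media-aware dispatch this comment proposes can be sketched standalone (the media types and fallback behavior are assumptions taken from this comment, not the module's actual constants):

```python
def prepare_media(filepath, media_type):
    """Choose what to pass to predict() based on the model's media type."""
    if media_type == "image":
        # Images are opened eagerly (the real code uses PIL.Image.open)
        return ("pil_image", filepath)
    # Documents, videos, and audio are passed through as raw file paths,
    # since the client accepts paths for those domains
    return ("path", filepath)

print(prepare_media("invoice.jpg", "image"))    # → ('pil_image', 'invoice.jpg')
print(prepare_media("invoice.pdf", "document")) # → ('path', 'invoice.pdf')
```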
🧹 Nitpick comments (5)
docs/source/integrations/vlm.rst (1)

35-49: Installation docs: include the extras path

Since this is an integration, advertise the extras install in addition to raw `vlmrun`.

```diff
-To get started with VLM Run, install the `vlmrun` package:
+To get started, install the integration via FiftyOne extras (recommended):
+
+.. code-block:: shell
+
+   pip install "fiftyone[vlmrun]"
+
+Or install the `vlmrun` package directly:
```

fiftyone/utils/vlmrun.py (3)
85-149: Tighten confidence mapping and simplify conditionals (minor)

You can simplify the high/med/low mapping and reduce returns.

```diff
-        conf_text = response_data.get("confidence", "medium")
-        if conf_text == "hi" or conf_text == "high":
-            confidence = 0.9
-        elif conf_text == "medium":
-            confidence = 0.7
-        elif conf_text == "low":
-            confidence = 0.3
-        else:
-            confidence = 0.5
+        conf_text = str(response_data.get("confidence", "medium")).lower()
+        confidence = {"hi": 0.9, "high": 0.9, "med": 0.7, "medium": 0.7, "low": 0.3}.get(conf_text, 0.5)
```
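For reference, the dict-based mapping suggested here can be exercised as a standalone helper (the numeric cutoffs are the review's values, not an official VLM Run convention):

```python
def confidence_score(conf_text):
    """Map a textual confidence level to a numeric score."""
    # Normalize case and fall back to "medium" for missing values
    text = str(conf_text or "medium").lower()
    # Suggested values from the review; unknown labels map to a neutral 0.5
    return {"hi": 0.9, "high": 0.9, "med": 0.7, "medium": 0.7, "low": 0.3}.get(text, 0.5)

print(confidence_score("HIGH"))   # → 0.9
print(confidence_score(None))     # → 0.7
print(confidence_score("bogus"))  # → 0.5
```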
466-479: Unused kwargs in VLMRunModelConfig (minor)

`**kwargs` are accepted but ignored; either document or plumb them through (e.g., to GenerationConfig).
555-707: Type-specific errors and messages (minor)

Where you validate input type (e.g., requiring a file path), raise `TypeError` for wrong types and keep messages concise.

tests/unittests/vlm_tests.py (1)
262-264: Remove unused variable

`img` is unused.

```diff
-    img = np.zeros((100, 100, 3), dtype=np.uint8)
     sample = fo.Sample(filepath="test.jpg")
     dataset.add_sample(sample)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)

- docs/source/integrations/index.rst (2 hunks)
- docs/source/integrations/vlm.rst (1 hunk)
- fiftyone/utils/vlmrun.py (1 hunk)
- requirements/extras.txt (1 hunk)
- tests/unittests/vlm_tests.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/unittests/vlm_tests.py (2)

- fiftyone/utils/cvat.py (1)
  - patch (3876-3886)
- fiftyone/utils/vlmrun.py (16)
  - VLMRunModel (497-731)
  - media_type (513-522)
  - has_logits (525-527)
  - VLMRunModelConfig (450-494)
  - predict (554-711)
  - predict_all (713-731)
  - to_classification (85-148)
  - to_detections (151-186)
  - to_attributes (236-279)
  - apply_vlmrun_model (734-827)
  - convert_vlm_model (31-61)
  - load_vlmrun_model (64-82)
  - list_vlmrun_domains (393-421)
  - get_domain_schema (424-447)
  - parse_visual_grounding (282-350)
  - parse_temporal_grounding (353-390)

fiftyone/utils/vlmrun.py (3)

- fiftyone/core/utils.py (3)
  - lazy_import (732-754)
  - ensure_package (396-438)
  - ProgressBar (950-996)
- fiftyone/core/labels.py (2)
  - Detections (632-709)
  - Detection (438-629)
- fiftyone/core/models.py (2)
  - ModelConfig (2106-2115)
  - Model (2118-2243)
🪛 Ruff (0.13.1)
tests/unittests/vlm_tests.py
262-262: Local variable img
is assigned to but never used
Remove assignment to unused variable img
(F841)
fiftyone/utils/vlmrun.py
419-419: Do not catch blind exception: Exception
(BLE001)
445-445: Do not catch blind exception: Exception
(BLE001)
478-478: Unused method argument: kwargs
(ARG002)
481-481: Avoid specifying long messages outside the exception class
(TRY003)
585-585: Avoid specifying long messages outside the exception class
(TRY003)
593-593: Avoid specifying long messages outside the exception class
(TRY003)
614-616: Avoid specifying long messages outside the exception class
(TRY003)
621-623: Avoid specifying long messages outside the exception class
(TRY003)
633-633: Avoid specifying long messages outside the exception class
(TRY003)
661-663: Avoid specifying long messages outside the exception class
(TRY003)
668-670: Avoid specifying long messages outside the exception class
(TRY003)
693-693: Avoid specifying long messages outside the exception class
(TRY003)
706-706: Prefer TypeError
exception for invalid type
(TRY004)
706-706: Avoid specifying long messages outside the exception class
(TRY003)
727-727: Do not catch blind exception: Exception
(BLE001)
743-743: Unused function argument: batch_size
(ARG001)
773-775: Avoid specifying long messages outside the exception class
(TRY003)
819-819: Abstract raise
to an inner function
(TRY301)
819-819: Avoid specifying long messages outside the exception class
(TRY003)
824-824: Do not catch blind exception: Exception
(BLE001)
🪛 Pylint (3.3.8)
tests/unittests/vlm_tests.py
[error] 1-1: Unrecognized option found: optimize-ast, files-output, function-name-hint, variable-name-hint, const-name-hint, attr-name-hint, argument-name-hint, class-attribute-name-hint, inlinevar-name-hint, class-name-hint, module-name-hint, method-name-hint, no-space-check
(E0015)
[refactor] 1-1: Useless option value for '--disable', 'bad-continuation' was removed from pylint, see pylint-dev/pylint#3571.
(R0022)
[refactor] 17-17: Use 'from fiftyone.utils import vlmrun' instead
(R0402)
fiftyone/utils/vlmrun.py
[error] 1-1: Unrecognized option found: optimize-ast, files-output, function-name-hint, variable-name-hint, const-name-hint, attr-name-hint, argument-name-hint, class-attribute-name-hint, inlinevar-name-hint, class-name-hint, module-name-hint, method-name-hint, no-space-check
(E0015)
[refactor] 1-1: Useless option value for '--disable', 'bad-continuation' was removed from pylint, see pylint-dev/pylint#3571.
(R0022)
[refactor] 114-114: Consider merging these comparisons with 'in' by using 'conf_text in ('hi', 'high')'. Use a set instead if elements are hashable.
(R1714)
[refactor] 85-85: Too many return statements (7/6)
(R0911)
[refactor] 413-418: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 466-466: Too many positional arguments (11/5)
(R0917)
[refactor] 516-521: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 611-616: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 656-659: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 554-554: Too many return statements (7/6)
(R0911)
[refactor] 734-734: Too many positional arguments (10/5)
(R0917)
🔇 Additional comments (4)
requirements/extras.txt (1)

8-8: Confirm extras wiring and version compatibility

Adding `vlmrun>=0.3.5` looks fine. Please verify:

- The `vlmrun` extra is exposed in packaging (setup.cfg/pyproject) so `pip install fiftyone[vlmrun]` works as documented.
- The minimum version matches what you tested against and any APIs used here (e.g., `client.video.generate`, `predictions.get`) exist in that range.

If needed, I can generate a quick grep script to locate where extras are declared and check for a `vlmrun` extra.

docs/source/integrations/index.rst (1)
198-204: LGTM: card and toctree entry

The new card and toctree link look consistent with the added `vlm.rst`.

tests/unittests/vlm_tests.py (2)

371-381: LGTM: factory tests cover config propagation

Covers `convert_vlm_model`/`load_vlmrun_model` happy paths well.

438-466: Nice coverage for grounding parsers

Visual grounding confidence mapping and bbox parsing are validated.
```python
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.utils.vlm as fouv
```
🛠️ Refactor suggestion

Fix import path: use the new module name

The correct module is `fiftyone.utils.vlmrun`, not `fiftyone.utils.vlm`.

```diff
-import fiftyone.utils.vlm as fouv
+import fiftyone.utils.vlmrun as fouv
```
```python
# Load a VLM Run model for document invoice extraction
model = fouv.load_vlm_model("document.invoice")

# Apply the model to extract invoice data
fouv.apply_vlm_model(
    dataset,
    model=model,
    label_field="invoice_data",
    output_type="attributes"
)
```
🛠️ Refactor suggestion

Update API names: load/apply helpers were named vlmrun

Docs reference non-existent helpers. Use the implemented names.

```diff
-model = fouv.load_vlm_model("document.invoice")
+model = fouv.load_vlmrun_model("document.invoice")
@@
-fouv.apply_vlm_model(
+fouv.apply_vlmrun_model(
     dataset,
     model=model,
     label_field="invoice_data",
     output_type="attributes"
 )
```
```python
# Image classification
model = fouv.load_vlm_model("image.classification")

# Apply to dataset
dataset.apply_model(model, label_field="vlm_predictions")
```
🛠️ Refactor suggestion

`dataset.apply_model()` won't work here; use `apply_vlmrun_model`

`VLMRunModel.predict()` returns a VLM Run result object, not FiftyOne labels. `dataset.apply_model()` expects labels. Use the provided apply helper.

```diff
-# Apply to dataset
-dataset.apply_model(model, label_field="vlm_predictions")
+# Apply to dataset
+fouv.apply_vlmrun_model(
+    dataset,
+    model=model,
+    label_field="vlm_predictions",
+    output_type="classification",
+)
```
```python
# Create model with custom schema
model = fouv.VLMRunModel(schema=ProductInfo)

# Apply to dataset
fouv.apply_vlm_model(
    dataset,
    model=model,
    label_field="product_info",
    output_type="attributes"
)
```
🛠️ Refactor suggestion

Model requires a domain; or use the factory

`VLMRunModelConfig` requires `domain`. If you want schema-driven extraction, document the factory and include the domain.

```diff
-model = fouv.VLMRunModel(schema=ProductInfo)
+model = fouv.convert_vlm_model(domain="document.custom", schema=ProductInfo)
```

If there is an official domain name for custom schemas, replace "document.custom" accordingly.
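The required-domain check can be sketched as a toy config class (the class name mirrors the review's; this is an illustration, not the actual implementation in `fiftyone/utils/vlmrun.py`):

```python
class VLMRunModelConfigSketch:
    """Toy config illustrating a required `domain` parameter."""

    def __init__(self, domain=None, schema=None, **kwargs):
        # Fail fast with a clear message when no domain is provided
        if domain is None:
            raise ValueError(
                "a VLM Run domain is required (e.g., 'document.invoice')"
            )
        self.domain = domain
        self.schema = schema

cfg = VLMRunModelConfigSketch(domain="document.invoice")
print(cfg.domain)  # → document.invoice
```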
```python
fouv.apply_vlm_model(
    dataset,
    domain="document.invoice",
    label_field="invoice",
    output_type="attributes"
)

# Access extracted fields
sample = dataset.first()
print(sample["invoice.vendor"])
print(sample["invoice.total"])
```
🛠️ Refactor suggestion

Fix attribute access paths

`apply_vlmrun_model(..., output_type="attributes", label_field="invoice")` flattens to fields like `invoice_vendor`, not nested paths.

```diff
-print(sample["invoice.vendor"])
-print(sample["invoice.total"])
+print(sample["invoice_vendor"])
+print(sample["invoice_total"])
```
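The flattening behavior described here can be pictured with a small sketch (this mirrors the naming scheme the review describes; it is not the actual implementation in `fiftyone/utils/vlmrun.py`):

```python
def flatten_attributes(label_field, attributes):
    """Flatten extracted attributes into top-level sample field names."""
    # "invoice" + "vendor" becomes "invoice_vendor", per the review's description
    return {f"{label_field}_{key}": value for key, value in attributes.items()}

fields = flatten_attributes("invoice", {"vendor": "Acme", "total": 1234.5})
print(sorted(fields))  # → ['invoice_total', 'invoice_vendor']
```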
```rst
Invoice Processing
^^^^^^^^^^^^^^^^^^

Extract structured invoice data from a dataset of invoice images:

.. code-block:: python
   :linenos:

   import fiftyone as fo
   import fiftyone.utils.vlm as fouv

   # Load dataset of invoice images
   dataset = fo.Dataset()
   dataset.add_samples([
       fo.Sample(filepath="/path/to/invoice1.pdf"),
       fo.Sample(filepath="/path/to/invoice2.jpg"),
   ])

   # Extract invoice data
   fouv.apply_vlm_model(
       dataset,
       domain="document.invoice",
       label_field="invoice",
       output_type="attributes"
   )

   # Query extracted data
   high_value = dataset.filter_labels("invoice.total", F() > 1000)
   print(f"Found {len(high_value)} high-value invoices")
```
🛠️ Refactor suggestion

Correct filtering example

`filter_labels` is for label fields; here we're filtering by flattened attributes. Use `ViewField` and `match`.

```diff
-high_value = dataset.filter_labels("invoice.total", F() > 1000)
+from fiftyone import ViewField as F
+high_value = dataset.match(F("invoice_total") > 1000)
```

If you choose to store nested documents instead, adjust the apply helper accordingly and keep the dotted path.
```python
# Load a VLM Run model
fouv.load_vlm_model(domain, api_key=None, **kwargs)

# Convert model for FiftyOne
fouv.convert_vlm_model(domain=None, schema=None, **kwargs)

# Apply model to dataset
fouv.apply_vlm_model(
    samples,
    model=None,
    domain=None,
    schema=None,
    label_field="vlm_predictions",
    output_type="attributes",
    confidence_thresh=None,
    api_key=None,
    batch_size=None,
    progress=None
)
```
🛠️ Refactor suggestion

API reference helper names

Align with the implemented function names.

```diff
-fouv.load_vlm_model(domain, api_key=None, **kwargs)
+fouv.load_vlmrun_model(domain, api_key=None, **kwargs)
@@
-fouv.apply_vlm_model(
+fouv.apply_vlmrun_model(
     samples,
     model=None,
     domain=None,
     schema=None,
     label_field="vlm_predictions",
     output_type="attributes",
     confidence_thresh=None,
     api_key=None,
     batch_size=None,
     progress=None
 )
```
```python
def _parse_detection(item, confidence_thresh=None):
    """Parse a single detection from VLM Run output."""
    if isinstance(item, dict):
        label = item.get("label") or item.get("class") or item.get("category")
        bbox = item.get("bbox") or item.get("bounding_box") or item.get("box")
        confidence = item.get("confidence", 1.0)
    elif hasattr(item, "label") or hasattr(item, "class_name"):
        label = getattr(item, "label", None) or getattr(
            item, "class_name", None
        )
        bbox = getattr(item, "bbox", None) or getattr(
            item, "bounding_box", None
        )
        confidence = getattr(item, "confidence", 1.0)
    else:
        return None

    if label is None or bbox is None:
        return None

    if confidence_thresh is not None and confidence < confidence_thresh:
        return None

    # Convert bbox to FiftyOne format [x, y, width, height] with values in [0, 1]
    if len(bbox) == 4:
        # Assume bbox is in [x1, y1, x2, y2] format
        x1, y1, x2, y2 = bbox
        x = x1
        y = y1
        width = x2 - x1
        height = y2 - y1

        # Normalize if values are > 1 (pixel coordinates)
        if any(v > 1 for v in [x, y, width, height]):
            # We'll need image dimensions for proper normalization
            # For now, we'll assume they're already normalized
            pass

        return fol.Detection(
            label=str(label),
            bounding_box=[x, y, width, height],
            confidence=confidence,
        )

    return None
```
BBox parsing treats [x,y,w,h] as [x1,y1,x2,y2] — fix to support both

Current logic assumes `[x1,y1,x2,y2]` and silently "passes" on pixel coords, producing incorrect widths/heights for normalized `[x,y,w,h]`. Handle both forms.

```diff
-    # Convert bbox to FiftyOne format [x, y, width, height] with values in [0, 1]
-    if len(bbox) == 4:
-        # Assume bbox is in [x1, y1, x2, y2] format
-        x1, y1, x2, y2 = bbox
-        x = x1
-        y = y1
-        width = x2 - x1
-        height = y2 - y1
-
-        # Normalize if values are > 1 (pixel coordinates)
-        if any(v > 1 for v in [x, y, width, height]):
-            # We'll need image dimensions for proper normalization
-            # For now, we'll assume they're already normalized
-            pass
-
-        return fol.Detection(
-            label=str(label),
-            bounding_box=[x, y, width, height],
-            confidence=confidence,
-        )
+    # Convert bbox to FiftyOne format [x, y, width, height] with values in [0, 1]
+    if len(bbox) == 4:
+        x1, y1, a, b = bbox
+        # Heuristic: if a/b fit within [0,1] and x1+a <= 1, y1+b <= 1, treat as [x, y, w, h]
+        if all(0 <= v <= 1 for v in bbox) and (x1 + a) <= 1 and (y1 + b) <= 1:
+            x, y, width, height = x1, y1, a, b
+        else:
+            # Treat as [x1, y1, x2, y2]
+            x, y = x1, y1
+            width, height = max(0.0, a - x1), max(0.0, b - y1)
+
+        return fol.Detection(
+            label=str(label),
+            bounding_box=[x, y, width, height],
+            confidence=confidence,
+        )
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _parse_detection(item, confidence_thresh=None):
    """Parse a single detection from VLM Run output."""
    if isinstance(item, dict):
        label = item.get("label") or item.get("class") or item.get("category")
        bbox = item.get("bbox") or item.get("bounding_box") or item.get("box")
        confidence = item.get("confidence", 1.0)
    elif hasattr(item, "label") or hasattr(item, "class_name"):
        label = getattr(item, "label", None) or getattr(
            item, "class_name", None
        )
        bbox = getattr(item, "bbox", None) or getattr(
            item, "bounding_box", None
        )
        confidence = getattr(item, "confidence", 1.0)
    else:
        return None

    if label is None or bbox is None:
        return None

    if confidence_thresh is not None and confidence < confidence_thresh:
        return None

    # Convert bbox to FiftyOne format [x, y, width, height] with values in [0, 1]
    if len(bbox) == 4:
        x1, y1, a, b = bbox
        # Heuristic: if a/b fit within [0, 1] and x1 + a <= 1, y1 + b <= 1,
        # treat as [x, y, w, h]
        if all(0 <= v <= 1 for v in bbox) and (x1 + a) <= 1 and (y1 + b) <= 1:
            x, y, width, height = x1, y1, a, b
        else:
            # Treat as [x1, y1, x2, y2]
            x, y = x1, y1
            width, height = max(0.0, a - x1), max(0.0, b - y1)

        return fol.Detection(
            label=str(label),
            bounding_box=[x, y, width, height],
            confidence=confidence,
        )

    return None
```
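Note that the heuristic is inherently ambiguous: a corner-format box that happens to satisfy the `[x, y, w, h]` constraints passes through unchanged. A standalone sketch of just the coordinate logic (with `fol.Detection` omitted, since this isolates the conversion only) makes the two branches easy to exercise:

```python
def to_xywh(bbox):
    """Apply the suggested heuristic: prefer [x, y, w, h] when the box fits
    in [0, 1]; otherwise interpret it as [x1, y1, x2, y2] corners."""
    x1, y1, a, b = bbox
    if all(0 <= v <= 1 for v in bbox) and (x1 + a) <= 1 and (y1 + b) <= 1:
        # Already [x, y, width, height]
        return [x1, y1, a, b]
    # Corner format: convert to [x, y, width, height], clamping negatives
    return [x1, y1, max(0.0, a - x1), max(0.0, b - y1)]

# Fits as [x, y, w, h]: 0.125 + 0.5 <= 1 and 0.25 + 0.5 <= 1
print(to_xywh([0.125, 0.25, 0.5, 0.5]))   # [0.125, 0.25, 0.5, 0.5]

# Cannot be [x, y, w, h] (0.5 + 0.875 > 1), so treated as corners
print(to_xywh([0.25, 0.5, 0.75, 0.875]))  # [0.25, 0.5, 0.5, 0.375]
```

The first case shows the ambiguity: if that box were actually corners, it would be silently misread, so resolving the format from the VLM Run schema (rather than from the values) would be more robust.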
```python
while elapsed < max_wait:
    status = self.client.predictions.get(id=response.id)
    if status.status == "completed":
        return status
    elif status.status == "failed":
        raise RuntimeError(
            f"Audio prediction failed: {status.error}"
        )
```
🛠️ Refactor suggestion
❓ Verification inconclusive
Unify `predictions.get` signature

Mixed usage (`get(id=...)` vs `get(prediction_id)`) risks runtime errors depending on the SDK signature.

```diff
- status = self.client.predictions.get(id=response.id)
+ status = self.client.predictions.get(response.id)
```

Please confirm the correct signature for the pinned `vlmrun` version.
🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "Search for usages of predictions.get in repository..."
rg -n --hidden -S "predictions\.get\(" -C3 || true
echo
echo "Show fiftyone/utils/vlmrun.py around the reported lines (580-640)..."
sed -n '1,240p' fiftyone/utils/vlmrun.py | sed -n '560,640p' || true
echo
echo "Search for a Predictions class or function definition (look for 'class Predictions' or 'def get(' near 'predictions')..."
rg -n --hidden -S "class\s+Predictions\b|def\s+get\(" -g '!**/site-packages/**' -C3 || true
echo
echo "Search for client.predictions assignment or import (to find SDK wrapper)..."
rg -n --hidden -S "predictions\s*=" -C3 || true
echo
echo "Search for 'client.predictions' usage (how other code calls it)..."
rg -n --hidden -S "client\.predictions\." -C3 || true
```

Length of output: 139817
Unify `predictions.get` signature

Mixed usage in `fiftyone/utils/vlmrun.py`: line 610 uses `get(id=response.id)` while line 652 uses `get(prediction_id)`; unify to the positional form used elsewhere.

```diff
- status = self.client.predictions.get(id=response.id)
+ status = self.client.predictions.get(response.id)
```

Confirm the correct signature for the pinned `vlmrun` version.
📝 Committable suggestion
```python
while elapsed < max_wait:
    status = self.client.predictions.get(response.id)
    if status.status == "completed":
        return status
    elif status.status == "failed":
        raise RuntimeError(
            f"Audio prediction failed: {status.error}"
        )
```
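As a sanity check that the positional call composes with the surrounding polling loop, here is a minimal sketch against a stub client. The `get(prediction_id)` signature and the `status`/`error` attributes are assumptions drawn from the snippets in this thread, not verified against the pinned vlmrun SDK:

```python
import time

class StubPredictions:
    """Hypothetical stand-in for the vlmrun client's predictions API."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get(self, prediction_id):
        # Positional signature, matching get(prediction_id) elsewhere
        status, error = next(self._statuses)
        return type(
            "Status", (), {"id": prediction_id, "status": status, "error": error}
        )()

def wait_for_prediction(predictions, prediction_id, max_wait=30.0, interval=0.01):
    """Poll until the prediction completes or fails, or max_wait elapses."""
    elapsed = 0.0
    while elapsed < max_wait:
        status = predictions.get(prediction_id)
        if status.status == "completed":
            return status
        if status.status == "failed":  # no elif after return (Pylint R1705)
            raise RuntimeError(f"Prediction failed: {status.error}")
        time.sleep(interval)
        elapsed += interval
    raise TimeoutError(f"Prediction {prediction_id} did not complete in {max_wait}s")

stub = StubPredictions([("pending", None), ("completed", None)])
print(wait_for_prediction(stub, "pred-123").status)  # completed
```

Writing the loop this way also clears the Pylint R1705 warning flagged below, since the `failed` branch uses a plain `if` after the early `return`.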
🧰 Tools
🪛 Ruff (0.13.1)
614-616: Avoid specifying long messages outside the exception class
(TRY003)
🪛 Pylint (3.3.8)
[refactor] 611-616: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
🤖 Prompt for AI Agents
In fiftyone/utils/vlmrun.py around lines 609 to 616, the call uses
self.client.predictions.get(id=response.id) which mixes a keyword form with
other places that call get(prediction_id); update this call to the positional
form used elsewhere (e.g., self.client.predictions.get(response.id)) so
signatures are consistent, and verify the pinned vlmrun package version to
confirm the correct get() signature — if the pinned version requires a keyword,
update all usages to the keyword form or upgrade/downgrade the pin accordingly.
Hi @wbrennan899 Thanks for submitting this PR. This is awesome, and it's gonna open up a lot of functionality. I feel this integration would be better suited as either a Remote Source Zoo Model or a Plugin, rather than being integrated into the core library. If you need help writing this implementation, let me know. It seems the core patterns are already here, just need to wrap appropriately.
What changes are proposed in this pull request?

This PR adds integration with VLM Run, a Vision Language Model platform that extracts structured data from documents, images, videos, and audio files.

Key features:
- `VLMRunModel` class that processes documents, images, videos, and audio files

How is this patch tested? If it is not, please explain why.
Release Notes
Is this a user-facing change that should be mentioned in the release notes?
Added VLM Run integration for extracting structured data from documents (PDFs, invoices, receipts), images, videos, and audio files. Includes 40+ pre-built domains and provides visual grounding (bounding boxes) and temporal grounding (timestamps). Install with `pip install fiftyone[vlmrun]`.

What areas of FiftyOne does this PR affect?

- `fiftyone` Python library changes

Summary by CodeRabbit
New Features
Documentation
Tests
Chores