Open-set object detection and tracking in video using Grounding DINO + SAM 2.1 on A100 GPUs.
Give it a video and a text prompt, get back an annotated video with tracked bounding boxes/masks + per-frame detection JSON.
```bash
modal run segmentation/app.py --video-path data/IMG_4723.MOV --text-prompt "book. door. painting."
```

Outputs `_tracked.mp4` and `_detections.json` next to the input file.
| Flag | Default | Description |
|---|---|---|
| `--text-prompt` | `"person."` | Objects to detect. Period-separated (e.g. `"car. person. dog."`) |
| `--prompt-type` | `mask` | How to prompt the SAM 2 tracker: `point`, `box`, or `mask` |
| `--box-threshold` | `0.4` | Grounding DINO box confidence. Lower = more detections |
| `--text-threshold` | `0.3` | Grounding DINO text similarity threshold |
| `--ann-frame-idx` | `0` | Frame to run the initial detection on |
```bash
# More detections (lower thresholds)
modal run segmentation/app.py --video-path vid.mov --text-prompt "chair. lamp. door." --box-threshold 0.25

# Detect from a later frame
modal run segmentation/app.py --video-path vid.mov --text-prompt "person." --ann-frame-idx 30
```

Process all `data/*.MOV` files in parallel across separate A100s:

```bash
modal run segmentation/run_batch.py --text-prompt "book. door. painting. chair."
```

Results go to `data/segmentation/`.
Combines Grounded SAM 2 tracking with Depth Anything V2 to get depth maps masked to only the segmented objects.
```bash
modal run segmentation/depth_app.py --video-path data/IMG_4723.MOV --text-prompt "painting. chair. lamp. door."
```

Outputs to `data/segmentation/`:

- `{stem}_masked_depth.mp4` — depth colormap only where objects are, black elsewhere
- `{stem}_composite.mp4` — full depth dimmed to 20%, objects at full brightness with colored outlines + labels
- `{stem}_seg_depth.json` — per-frame detection metadata
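The masked and composite outputs above come down to per-pixel selection against the union of object masks. A minimal numpy sketch (the function name and the application of the 20% dim factor directly to the depth array are illustrative assumptions, not the actual `depth_app.py` code, which operates on the rendered colormap frames):

```python
import numpy as np

def mask_depth(depth, masks, dim=0.2):
    """Keep depth values under object masks; zero or dim the background.

    depth: HxW float depth map.
    masks: list of HxW boolean arrays, one per tracked object.
    Returns (masked, composite): `masked` zeroes everything outside the
    objects (the _masked_depth style); `composite` keeps the background
    but scales it to `dim` (the _composite style).
    """
    union = np.zeros(depth.shape, dtype=bool)
    for m in masks:
        union |= m  # any object present at this pixel
    masked = np.where(union, depth, 0.0)
    composite = np.where(union, depth, depth * dim)
    return masked, composite
```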
Three-panel playback (Original | Object Depth | Composite):
```bash
python segmentation/depth_viewer.py data/segmentation/ --source-dir data/
# Open http://localhost:8080
```

Keyboard: Space play/pause, ←→ step, `[` `]` ±5 frames, Home/End first/last.
Combines all three systems (Grounded SAM 2 + Depth Anything V2 + HLoc) to predict where objects are in 3D world coordinates. Uses camera pose estimation + object depth to backproject 2D detections into 3D space.
Requires a pre-built HLoc reference (see hloc_localization/).
```bash
modal run segmentation/locate_app.py \
    --video-path data/IMG_4730.MOV \
    --text-prompt "painting. chair. lamp. door." \
    --reference-path hloc_localization/data/hloc_reference/IMG_4720/reference.tar.gz \
    --localize-fps 2
```

Outputs `{stem}_objects3d.json` to `data/segmentation/` with per-object 3D positions, camera poses, and metadata.
Viser-based viewer with smooth camera path playback, object highlighting on the point cloud, and video panels (source, segmentation, depth):
```bash
python segmentation/locate_viewer.py \
    data/reconstruction/IMG_4720.glb \
    data/segmentation/IMG_4730_objects3d.json \
    --reference hloc_localization/data/hloc_reference/IMG_4720/reference.tar.gz \
    --video data/IMG_4730.MOV \
    --results-dir data/segmentation/
# Open http://localhost:8890
```

Features:
- Smooth camera path — SLERP-interpolated between keyframes, play/pause with speed control
- Object highlighting — projects SAM masks into the GLB point cloud to color object surfaces
- Video panels — auto-loads source, tracked (`_tracked.mp4`), composite depth (`_composite.mp4`), and masked depth (`_masked_depth.mp4`) from `--results-dir`
- Per-object toggles — show/hide individual objects, toggle point cloud highlights
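The smooth camera path is built on quaternion SLERP between pose keyframes. A self-contained sketch of the interpolation itself (a hypothetical helper, not the viewer's actual code):

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1.

    t in [0, 1]; returns a unit quaternion on the shortest arc between
    the two orientations (constant angular velocity, unlike plain lerp).
    """
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:          # q and -q are the same rotation: take the short arc
        q1, dot = -q1, -dot
    if dot > 0.9995:       # nearly parallel: lerp + renormalize is stable
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```

Sampling `t` at `--interp-steps` evenly spaced values between each pair of keyframes (and linearly interpolating camera positions) yields the playback path.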
| Flag | Default | Description |
|---|---|---|
| `--video` | none | Source `.MOV` to show camera frames |
| `--results-dir` | same as JSON | Directory with tracked/depth videos to auto-load |
| `--interp-steps` | `20` | Smooth interpolation steps between camera keyframes |
- Extract all video frames as JPEGs on Modal A100
- Run Grounding DINO (HuggingFace) on the annotation frame to detect objects from text prompt
- SAM 2.1 image predictor generates masks per detected object
- Register objects with SAM 2 video predictor (point/box/mask prompt)
- Propagate tracking across all frames
- Annotate with bounding boxes + masks + labels via `supervision`
- Stitch annotated frames back into mp4
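For `--prompt-type box`, registering objects with the video predictor means reducing each SAM mask to an XYXY box, which is a couple of numpy calls. A hedged sketch of that conversion (helper name assumed, not the app's actual code):

```python
import numpy as np

def mask_to_box(mask):
    """XYXY bounding box of a boolean HxW mask.

    Useful for turning a SAM 2 image-predictor mask into a box prompt
    for the video predictor.
    """
    ys, xs = np.nonzero(mask)        # row/col indices of all mask pixels
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])
```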