Open-set object detection and tracking in video using Grounding DINO + SAM 2.1 on A100 GPUs.
Give it a video and a text prompt, get back an annotated video with tracked bounding boxes/masks + per-frame detection JSON.
```bash
modal run segmentation/app.py --video-path data/IMG_4723.MOV --text-prompt "book. door. painting."
```

Outputs `_tracked.mp4` and `_detections.json` next to the input file.
| Flag | Default | Description |
|---|---|---|
| `--text-prompt` | `"person."` | Objects to detect. Period-separated (e.g. `"car. person. dog."`) |
| `--prompt-type` | `mask` | How to prompt the SAM 2 tracker: `point`, `box`, or `mask` |
| `--box-threshold` | `0.4` | Grounding DINO box confidence. Lower = more detections |
| `--text-threshold` | `0.3` | Grounding DINO text similarity threshold |
| `--ann-frame-idx` | `0` | Frame to run the initial detection on |
```bash
# More detections (lower thresholds)
modal run segmentation/app.py --video-path vid.mov --text-prompt "chair. lamp. door." --box-threshold 0.25

# Detect from a later frame
modal run segmentation/app.py --video-path vid.mov --text-prompt "person." --ann-frame-idx 30
```

Process all `data/*.MOV` files in parallel across separate A100s:

```bash
modal run segmentation/run_batch.py --text-prompt "book. door. painting. chair."
```

Results go to `data/segmentation/`.
Combines Grounded SAM 2 tracking with Depth Anything V2 to get depth maps masked to only the segmented objects.
```bash
modal run segmentation/depth_app.py --video-path data/IMG_4723.MOV --text-prompt "painting. chair. lamp. door."
```

Outputs to `data/segmentation/`:

- `{stem}_masked_depth.mp4` — depth colormap only where objects are, black elsewhere
- `{stem}_composite.mp4` — full depth dimmed to 20%, objects at full brightness with colored outlines + labels
- `{stem}_seg_depth.json` — per-frame detection metadata
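The masked and composite outputs above come down to per-pixel selection against the union of object masks. A minimal numpy sketch (the function name and the application of the 20% dim factor directly to the depth array are illustrative assumptions, not the actual `depth_app.py` code, which operates on the rendered colormap frames):

```python
import numpy as np

def mask_depth(depth, masks, dim=0.2):
    """Keep depth values under object masks; zero or dim the background.

    depth: HxW float depth map.
    masks: list of HxW boolean arrays, one per tracked object.
    Returns (masked, composite): `masked` zeroes everything outside the
    objects (the _masked_depth style); `composite` keeps the background
    but scales it to `dim` (the _composite style).
    """
    union = np.zeros(depth.shape, dtype=bool)
    for m in masks:
        union |= m  # any object present at this pixel
    masked = np.where(union, depth, 0.0)
    composite = np.where(union, depth, depth * dim)
    return masked, composite
```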
Three-panel playback (Original | Object Depth | Composite):
```bash
python segmentation/depth_viewer.py data/segmentation/ --source-dir data/
# Open http://localhost:8080
```

Keyboard: Space play/pause, ←→ step, `[` `]` ±5 frames, Home/End first/last.
Combines all three systems (Grounded SAM 2 + Depth Anything V2 + HLoc) to predict where objects are in 3D world coordinates. Uses camera pose estimation + object depth to backproject 2D detections into 3D space.
Requires a pre-built HLoc reference (see hloc_localization/).
```bash
modal run segmentation/locate_app.py \
    --video-path data/IMG_4730.MOV \
    --text-prompt "painting. chair. lamp. door." \
    --reference-path hloc_localization/data/hloc_reference/IMG_4720/reference.tar.gz \
    --localize-fps 2
```

Outputs `{stem}_objects3d.json` to `data/segmentation/` with per-object 3D positions, camera poses, and metadata.
Viser-based viewer with smooth camera path playback, object highlighting on the point cloud, and video panels (source, segmentation, depth):
```bash
python segmentation/locate_viewer.py \
    data/reconstruction/IMG_4720.glb \
    data/segmentation/IMG_4730_objects3d.json \
    --reference hloc_localization/data/hloc_reference/IMG_4720/reference.tar.gz \
    --video data/IMG_4730.MOV \
    --results-dir data/segmentation/
# Open http://localhost:8890
```

Features:
- Smooth camera path — SLERP-interpolated between keyframes, play/pause with speed control
- Object highlighting — projects SAM masks into the GLB point cloud to color object surfaces
- Video panels — auto-loads source, tracked (`_tracked.mp4`), composite depth (`_composite.mp4`), and masked depth (`_masked_depth.mp4`) from `--results-dir`
- Per-object toggles — show/hide individual objects, toggle point cloud highlights
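The smooth camera path is built on quaternion SLERP between pose keyframes. A self-contained sketch of the interpolation itself (a hypothetical helper, not the viewer's actual code):

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1.

    t in [0, 1]; returns a unit quaternion on the shortest arc between
    the two orientations (constant angular velocity, unlike plain lerp).
    """
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:          # q and -q are the same rotation: take the short arc
        q1, dot = -q1, -dot
    if dot > 0.9995:       # nearly parallel: lerp + renormalize is stable
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```

Sampling `t` at `--interp-steps` evenly spaced values between each pair of keyframes (and linearly interpolating camera positions) yields the playback path.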
| Flag | Default | Description |
|---|---|---|
| `--video` | none | Source `.MOV` to show camera frames |
| `--results-dir` | same as JSON | Directory with tracked/depth videos to auto-load |
| `--interp-steps` | `20` | Smooth interpolation steps between camera keyframes |
- Extract all video frames as JPEGs on Modal A100
- Run Grounding DINO (HuggingFace) on the annotation frame to detect objects from text prompt
- SAM 2.1 image predictor generates masks per detected object
- Register objects with SAM 2 video predictor (point/box/mask prompt)
- Propagate tracking across all frames
- Annotate with bounding boxes + masks + labels via `supervision`
- Stitch annotated frames back into mp4
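For `--prompt-type box`, registering objects with the video predictor means reducing each SAM mask to an XYXY box, which is a couple of numpy calls. A hedged sketch of that conversion (helper name assumed, not the app's actual code):

```python
import numpy as np

def mask_to_box(mask):
    """XYXY bounding box of a boolean HxW mask.

    Useful for turning a SAM 2 image-predictor mask into a box prompt
    for the video predictor.
    """
    ys, xs = np.nonzero(mask)        # row/col indices of all mask pixels
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])
```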