Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments [PDF]
# Clone this repository and navigate into the repository
git clone
cd OpenSU
# Create a conda environment, activate the environment and install PyTorch via conda
conda create --name OpenSU python=3.9
conda activate OpenSU
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge
# Install requirements via pip
pip install -r requirements.txt
# Install Segment Anything
pip install git+
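After installation, a quick check can confirm that the active environment matches the versions pinned above. The following is a minimal sketch, not part of the repository: the helper names `version_matches` and `check_environment` are hypothetical, and the only facts taken from the commands above are Python 3.9, PyTorch 1.8.0 and torchvision 0.9.0.

```python
import importlib
import importlib.util
import sys

# Versions pinned by the conda commands above.
EXPECTED = {"torch": "1.8.0", "torchvision": "0.9.0"}


def version_matches(installed, expected):
    """Compare versions while ignoring local build suffixes such as '+cu111'."""
    return installed.split("+")[0] == expected


def check_environment():
    """Return a list of human-readable problems; an empty list means no mismatch was found."""
    problems = []
    if sys.version_info[:2] != (3, 9):
        problems.append(
            f"Python 3.9 expected, found {sys.version_info.major}.{sys.version_info.minor}"
        )
    for name, expected in EXPECTED.items():
        if importlib.util.find_spec(name) is None:
            problems.append(f"{name} is not installed")
            continue
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # a broken install can fail in several ways
            problems.append(f"{name} could not be imported: {exc}")
            continue
        installed = getattr(module, "__version__", "unknown")
        if not version_matches(installed, expected):
            problems.append(f"{name} {expected} expected, found {installed}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the pinned versions"]:
        print(line)
```

Run it inside the activated `OpenSU` environment; any line it prints other than the final confirmation points at a package to reinstall.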
Download the images to the folder SWiG, and the JSON files to the folder SWiG/SWiG_jsons.
Download Swin-T to the folder ckpt/Swin, Segment Anything and MobileSAM to the folder ckpt/sam, and the GSR model to the folder ckpt.
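Before training or running inference, it can help to verify that the folders named above exist. This is a small sketch under the assumption that all paths are relative to the repository root; the helper `missing_dirs` is hypothetical and does not check the individual files inside each folder, since their names are not listed here.

```python
from pathlib import Path

# Folders named in the download instructions, relative to the repository root.
EXPECTED_DIRS = ["SWiG", "SWiG/SWiG_jsons", "ckpt", "ckpt/Swin", "ckpt/sam"]


def missing_dirs(root="."):
    """Return the expected folders that do not exist under root, in listing order."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected folders are present.")
```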
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --use_env --batch_size 4 --dataset_file swig --epochs 40 --num_workers 4 --num_glance_enc_layers 3 --num_gaze_s1_dec_layers 3 --num_gaze_s1_enc_layers 3 --num_gaze_s2_dec_layers 3 --dropout 0.15 --hidden_dim 512 --output_dir OpenSU
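The command above starts four processes, one per visible GPU. If `--batch_size` is interpreted per process (an assumption; this README does not state it), the effective global batch size is the product of the two numbers. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
def world_size(cuda_visible_devices):
    """Count the GPU ids listed in a CUDA_VISIBLE_DEVICES string such as '0,1,2,3'."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def effective_batch_size(per_process_batch, nproc_per_node, nnodes=1):
    """Global batch size, ASSUMING the batch size flag applies to each process separately."""
    return per_process_batch * nproc_per_node * nnodes


if __name__ == "__main__":
    gpus = world_size("0,1,2,3")
    print(gpus, effective_batch_size(4, gpus))
```

When training on fewer GPUs, keep `--nproc_per_node` equal to the number of ids in `CUDA_VISIBLE_DEVICES`; the global batch size shrinks accordingly under the same assumption.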
python --saved_model ckpt/OpenSU_Swin.pth --output_dir OpenSU_eva --dev # Evaluation on development set
python --saved_model ckpt/OpenSU_Swin.pth --output_dir OpenSU_eva --test # Evaluation on test set
python --image_path img/carting_214.jpg --sam sam # Using vanilla Segment Anything as segmentation map generator
python --image_path img/carting_214.jpg --sam mobilesam # Using MobileSAM as segmentation map generator
# Text information
verb: carting
role: agent, noun: dog.n.01
role: item, noun: man.n.01
role: tool, noun: cart.n.01
role: place, noun: outdoors.n.01
the dog cartes the man in a cart at a outdoors.
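The text output above is easy to post-process, for example to feed a downstream component. The sketch below parses the `verb:` and `role: ..., noun: ...` lines into a dictionary; it assumes the line format shown in the sample and that nouns follow the WordNet-style `lemma.pos.index` naming seen there. The function names are hypothetical and not part of OpenSU.

```python
import re

# Sample copied from the text information shown above.
SAMPLE = """verb: carting
role: agent, noun: dog.n.01
role: item, noun: man.n.01
role: tool, noun: cart.n.01
role: place, noun: outdoors.n.01"""

ROLE_LINE = re.compile(r"role:\s*(\w+),\s*noun:\s*(\S+)")


def parse_text_information(text):
    """Return (verb, {role: noun}) parsed from the printed text information."""
    verb = None
    roles = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("verb:"):
            verb = line.split(":", 1)[1].strip()
            continue
        match = ROLE_LINE.match(line)
        if match:
            roles[match.group(1)] = match.group(2)
    return verb, roles


def lemma(noun):
    """Drop the '.n.01' style suffix: 'dog.n.01' becomes 'dog'."""
    return noun.split(".")[0].replace("_", " ")


if __name__ == "__main__":
    verb, roles = parse_text_information(SAMPLE)
    print(verb, {role: lemma(noun) for role, noun in roles.items()})
```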
Our system is built upon the CoFormer framework. If you use OpenSU, please cite:
title={Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments},
author={Liu, Ruiping and Zhang, Jiaming and Peng, Kunyu and Zheng, Junwei and Cao, Ke and Chen, Yufan and Yang, Kailun and Stiefelhagen, Rainer},
booktitle={2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)},