Xuyang Liu1*, Siteng Huang2*, Yachen Kang2, Honggang Chen1, Donglin Wang2✉
1Sichuan University, 2Westlake University
TLDR: In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding, without any fine-tuning or additional training data.
VGDiffZero is initialized by `diffusion/models.py` and implemented by the class `VGDiffZeroExecutor`, which can be found in `executor.py`.
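For intuition, the zero-shot grounding loop can be sketched as below. This is an illustrative toy, not the repo's actual executor: the real `VGDiffZeroExecutor` scores each region proposal by the diffusion model's noise-prediction error conditioned on the expression, whereas `toy_error` here is a hypothetical stand-in scorer.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # a region proposal as (x1, y1, x2, y2)

def ground(expression: str,
           proposals: List[Box],
           denoise_error: Callable[[str, Box], float]) -> int:
    """Return the index of the proposal with the lowest
    denoising error when conditioned on the expression."""
    errors = [denoise_error(expression, box) for box in proposals]
    return min(range(len(errors)), key=errors.__getitem__)

# Hypothetical stand-in: the real scorer runs a pre-trained Stable
# Diffusion model and measures its noise-prediction error per region.
def toy_error(expression: str, box: Box) -> float:
    x1, _, x2, _ = box
    return abs(len(expression) - (x2 - x1))
```

The proposal with the minimum error is taken as the grounding prediction.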
Create a conda environment and activate it with the following commands:
```shell
# create conda env
conda env create -f environment.yml
# activate the environment
conda activate VGDiffZero
```
Download images via this link, and move them to `./data/`.
Download annotations from this google drive link, and move them to `./data/`.
```shell
python main.py --input_file INPUT_FILE --image_root IMAGE_ROOT --diffusion_model {1-4/2-1} --method {full_exp/core_exp/random} --box_representation_method {crop/mask/crop,mask} --box_method_aggregator {sum/max} {--output_file PATH_TO_OUTPUT_FILE} {--detector_file PATH_TO_DETECTION_FILE}
```

(`/` is used above to denote different options for a given argument.)
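For example, a run on ground-truth proposals (no `--detector_file`) might look like the following; the input file and image paths here are hypothetical placeholders, so adjust them to your setup:

```shell
python main.py \
    --input_file ./data/INPUT_FILE.jsonl \
    --image_root ./data/IMAGE_ROOT \
    --diffusion_model 2-1 \
    --method full_exp \
    --box_representation_method crop,mask \
    --box_method_aggregator sum
```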
`--input_file`: the processed annotations in `.jsonl` format

`--image_root`: the top-level directory containing COCO 2014 train images

`--detector_file`: if not specified, ground-truth proposals are used. For RefCOCO/g/+, the detection files should be in `{refcoco/refcocog/refcoco+}_dets_dict.json` format
Choices for `diffusion_model`: select different Stable Diffusion model versions. (default: `"2-1"`)
Choices for `method`: `"full_exp"` uses the full expression as text input, `"core_exp"` uses the core expression extracted by spaCy as text input, and `"random"` randomly selects a proposal as the prediction. (default: `"full_exp"`)
Choices for `box_representation_method`: `"crop"` uses cropping only to isolate proposals, `"mask"` uses masking only to isolate proposals, and `"crop,mask"` uses both cropping and masking for comprehensive region scoring. (default: `"crop"`)
Choices for `box_method_aggregator`: given two sets of predicted errors, `"sum"` selects the proposal with the minimum total error, while `"max"` aggregates by taking the larger of the two errors per proposal and selecting the minimum. (default: `"sum"`)
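The two isolation strategies and the two aggregators can be sketched in a few lines of NumPy; the function names here are illustrative, not the repo's API:

```python
import numpy as np

def crop(image: np.ndarray, box):
    """Crop the proposal region (x1, y1, x2, y2) out of an HxWxC image."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def mask(image: np.ndarray, box):
    """Zero out everything outside the proposal region, keeping image size."""
    x1, y1, x2, y2 = box
    out = np.zeros_like(image)
    out[y1:y2, x1:x2] = image[y1:y2, x1:x2]
    return out

def aggregate(crop_errors, mask_errors, how="sum"):
    """Combine per-proposal errors from the two representations and
    return the index of the selected proposal (lowest combined error)."""
    c = np.asarray(crop_errors, dtype=float)
    m = np.asarray(mask_errors, dtype=float)
    combined = c + m if how == "sum" else np.maximum(c, m)
    return int(np.argmin(combined))
```

Note that `"sum"` and `"max"` can disagree: a proposal with the best total error may still have the worse single-representation error.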
Our implementation of VGDiffZero is partly based on the following codebases: Stable-Diffusion, Diffusion Classifier, and ReCLIP. We sincerely thank the authors for their excellent work.
Please consider citing our paper in your publications if our findings help your research.
```bibtex
@inproceedings{liu2024vgdiffzero,
  title={VGDiffZero: Text-to-image diffusion models can be zero-shot visual grounders},
  author={Liu, Xuyang and Huang, Siteng and Kang, Yachen and Chen, Honggang and Wang, Donglin},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={2765--2769},
  year={2024},
  organization={IEEE}
}
```
For any question about our paper or code, please email [email protected].