PyTorch implementation of ConText-CIR, from the paper "ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval".
ConText-CIR is trained with a novel Text Concept-Consistency loss that encourages better alignment between noun phrases in text and their corresponding image regions.
- Python 3.10+
- PyTorch 2.0+
- CUDA 11.8+
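The requirements above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the helper names parse_version, meets_minimum, and check_environment are hypothetical, and the script assumes that a missing PyTorch install should simply be reported as a failed check.

```python
import re
import sys


def parse_version(text, width=2):
    """Turn a version string such as '2.1.0+cu118' into a tuple of ints."""
    numbers = []
    # Drop any local suffix ('+cu118') and read the leading digits of each part.
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    # Pad short versions ('12' -> (12, 0)) so tuple comparison behaves.
    numbers += [0] * (width - len(numbers))
    return tuple(numbers)


def meets_minimum(text, minimum):
    """Return True if the version string is at least `minimum`."""
    return parse_version(text, width=len(minimum)) >= tuple(minimum)


def check_environment():
    """Report whether Python, PyTorch, and CUDA meet the listed minimums."""
    checks = {"python": sys.version_info[:2] >= (3, 10)}
    try:
        import torch
    except ImportError:
        checks["torch"] = False
        checks["cuda"] = False
    else:
        checks["torch"] = meets_minimum(torch.__version__, (2, 0))
        cuda = torch.version.cuda  # None for CPU-only builds
        checks["cuda"] = cuda is not None and meets_minimum(cuda, (11, 8))
    return checks


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'does not meet requirement'}")
```

Note that torch.version.cuda reports the CUDA version PyTorch was built against, which may differ from the driver version installed on the machine.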
# Clone the repository
git clone https://github.com/yourusername/ConText-CIR.git
cd ConText-CIR
# Install dependencies
pip install -r requirements.txt

- Download the CIRR dataset from here
- Extract to data/cirr/
- Download the LaSCo dataset from here
- Place in data/lasco/
- Download rewritten captions from our release
- Place in data/cirr/rewritten_captions/
- Download Hotel-CIR from our release
- Extract to data/hotels/
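Once the datasets are in place, the directory layout above can be verified with a short script. This is a sketch under assumptions: the helper missing_dataset_dirs is hypothetical, the keys follow the dataset names accepted by Train.py's --datasets argument, and the mapping of cirr_r to data/cirr/rewritten_captions is a guess based on the directory names listed above.

```python
from pathlib import Path

# Assumed mapping from dataset name to the directory described above.
EXPECTED_DIRS = {
    "cirr": "data/cirr",
    "cirr_r": "data/cirr/rewritten_captions",  # assumption: cirr_r = rewritten captions
    "hotels": "data/hotels",
    "lasco": "data/lasco",
}


def missing_dataset_dirs(datasets, root="."):
    """Return the names of requested datasets whose directory does not exist."""
    root = Path(root)
    return [
        name
        for name in datasets
        if not (root / EXPECTED_DIRS[name]).is_dir()
    ]


if __name__ == "__main__":
    missing = missing_dataset_dirs(["cirr", "cirr_r", "hotels", "lasco"])
    if missing:
        print("Missing dataset directories for:", ", ".join(missing))
    else:
        print("All dataset directories found.")
```

The check only confirms that the directories exist; it does not validate their contents.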
Create a config.yaml file:
model_dir: ./checkpoints
cirr_data_path: ./data/cirr
hotel_data_path: ./data/hotels
lasco_data_path: ./data/lasco

# Train with CIRR dataset on single GPU
python Train.py --datasets cirr \
--backbone_size B \
--train_batch_size 256 \
--learning_rate 1e-5
# Train with multiple datasets on multiple GPUs
python Train.py --datasets cirr,cirr_r,hotels \
--backbone_size H \
--devices 4 \
--train_batch_size 64 \
--lambda_cc 0.08

| Argument | Default | Description |
|---|---|---|
| --datasets | Required | Comma-separated list: cirr, cirr_r, hotels, lasco |
| --backbone_size | B | CLIP model size: B, L, or H |
| --lambda_cc | 0.08 | Weight for concept-consistency loss |
| --epsilon_cc | 0.05 | Slack variable for CC loss |
| --max_nps | 10 | Max noun phrases per text |
| --train_batch_size | 256 | Training batch size |
| --learning_rate | 1e-5 | Learning rate |
| --devices | 1 | Number of GPUs |
| --approx_steps | 35000 | Approximate training steps |
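For readers who want to see how the arguments in the table fit together, here is a minimal argparse sketch that declares them with the listed defaults. It is an illustration only and not the repository's Train.py: the names build_parser, dataset_list, and VALID_DATASETS are hypothetical, and the types (int, float) are inferred from the default values.

```python
import argparse

VALID_DATASETS = ("cirr", "cirr_r", "hotels", "lasco")


def dataset_list(value):
    """Parse a comma-separated dataset list and reject unknown names."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in VALID_DATASETS]
    if not names or unknown:
        raise argparse.ArgumentTypeError(
            f"expected a comma-separated subset of {VALID_DATASETS}, got {value!r}"
        )
    return names


def build_parser():
    """Declare the documented arguments with their documented defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the training arguments")
    parser.add_argument("--datasets", type=dataset_list, required=True)
    parser.add_argument("--backbone_size", choices=["B", "L", "H"], default="B")
    parser.add_argument("--lambda_cc", type=float, default=0.08)
    parser.add_argument("--epsilon_cc", type=float, default=0.05)
    parser.add_argument("--max_nps", type=int, default=10)
    parser.add_argument("--train_batch_size", type=int, default=256)
    parser.add_argument("--learning_rate", type=float, default=1e-5)
    parser.add_argument("--devices", type=int, default=1)
    parser.add_argument("--approx_steps", type=int, default=35000)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--datasets", "cirr,cirr_r,hotels", "--backbone_size", "H", "--devices", "4"]
    )
    print(args)
```

Validating the dataset names at parse time means a typo such as cirr-r fails immediately rather than partway through data loading.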
# Resume from checkpoint
python Train.py --datasets cirr \
--backbone_size H \
--reload \
--output_dir ./checkpoints/experiment1

We provide utilities to produce submissions to the CIRR and CIRCO testing servers.
If you find this code useful for your research, please cite:
@article{xing2025context,
title={ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval},
author={Xing, Eric and Kolouju, Pranavi and Pless, Robert and Stylianou, Abby and Jacobs, Nathan},
journal={Computer Vision and Pattern Recognition},
year={2025}
}