Hyperbolic Data Filtering

Installation

Please follow the installation steps provided by the official repository of MERU:

This code requires python>=3.9, as well as pytorch>=1.7 and torchvision>=0.8. We recommend using Conda to set up the codebase.

git clone git@github.com:facebookresearch/meru.git
cd meru
conda create -n meru python=3.9 --yes
conda activate meru

Install torch and torchvision following the instructions on pytorch.org. Then install the remaining dependencies, and this codebase as a dev package:

python -m pip install --pre timm
python -m pip install -r requirements.txt
python setup.py develop
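
As a quick sanity check after installation, the version requirements stated above (python>=3.9, pytorch>=1.7, torchvision>=0.8) can be verified from Python. The following is a minimal sketch and not part of the repository; the helper names version_tuple and check_environment are hypothetical and introduced only for illustration.

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the installation notes above.
MINIMUMS = {"torch": (1, 7), "torchvision": (0, 8)}


def version_tuple(version):
    """Parse leading numeric fields, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    parts = []
    for field in version.split("+")[0].split("."):
        match = re.match(r"\d+", field)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_environment():
    """Return a list of human-readable problems (empty if all checks pass)."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("python>=3.9 is required")
    for name, minimum in MINIMUMS.items():
        try:
            installed = version_tuple(metadata.version(name))
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if installed < minimum:
            wanted = ".".join(map(str, minimum))
            problems.append(f"{name}>={wanted} is required")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

An empty result means the interpreter and both packages meet the stated minimums. The check says nothing about the packages listed in requirements.txt.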

For the DataComp installation and dataset download, please refer to the README.md in the datacomp subfolder.

Trained model checkpoints

We use the following MERU model to compute the MERU x_time of image-text pairs. Click the link below to download the model checkpoint. Its training config is available in the ./configs directory.

Model: MERU ViT-large (config: train_meru_vit_l.py)
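
For readers unfamiliar with x_time: in the Lorentz model of hyperbolic space used by MERU, a point with space components x_space and curvature parameter c has time component sqrt(1/c + ||x_space||^2). The sketch below illustrates only this relation on plain Python lists. How datacomp/full_compare.py derives one x_time value per image-text pair is not described here, so the function name, the default curv=1.0, and the idea of applying it to a single embedding are assumptions made for illustration.

```python
import math


def x_time(x_space, curv=1.0):
    """Time component of a point on the Lorentz hyperboloid.

    x_space: iterable of space components (hypothetical input format).
    curv: positive curvature parameter c; the hyperboloid has curvature -c.
    """
    if curv <= 0:
        raise ValueError("curv must be positive")
    squared_norm = sum(v * v for v in x_space)
    return math.sqrt(1.0 / curv + squared_norm)


if __name__ == "__main__":
    # The origin of the hyperboloid has the smallest possible x_time.
    print(x_time([0.0, 0.0]))
    # Points farther from the origin have larger x_time.
    print(x_time([3.0, 4.0]))
```

Under this relation, x_time grows with the distance of the embedding from the origin of the hyperboloid, and its minimum value is 1/sqrt(c).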

The majority of the MERU project is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: https://github.com/openai/clip, https://github.com/facebookresearch/slip, and https://github.com/kdexd/virtex are licensed under the MIT license.

Pipeline

We first run datacomp/full_compare.py to compute and save the MERU x_time for every image-text pair in DataComp-small, then merge these values into the DataComp-small metadata with datacomp/convert.py. For data filtering, training, and evaluation, run the following commands inside the datacomp directory:

python baselines.py --metadata_dir your_metadata_path --save_path your_save_path/some_name.npy --name x_time_intersect_clip_score --arch l14 --xtime_arch l --fraction value_between_0_and_1 --xtime_fraction value_between_0_and_1
python resharder.py -i your_dataset_path -o your_subdataset_path -s your_save_path/some_name.npy
torchrun --nproc_per_node your_ngpus train.py --scale small --data_dir your_subdataset_path --output_dir your_output_path --exp_name some_name
python evaluate.py --train_output_dir your_output_path --data_dir your_eval_set_path

For text-length filtering, pass "text_length_as_x_time_intersect_clip_score" as the --name argument of the baselines.py command above.
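
The filter name x_time_intersect_clip_score suggests that a sample is kept only if it passes both a CLIP-score cut (--fraction) and an x_time cut (--xtime_fraction). The sketch below shows one way such an intersection could work on small in-memory dictionaries. It is not the code in baselines.py: the function names are hypothetical, and keeping the highest-scoring fraction for both criteria is an assumption, since the actual selection direction for x_time is not stated here.

```python
def top_fraction(scores, fraction):
    """Return the uids whose score lies in the top `fraction` of all scores.

    scores: dict mapping uid -> float score (hypothetical input format).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    keep = int(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:keep])


def intersect_filter(clip_scores, x_times, fraction, xtime_fraction):
    """Keep uids that pass both cuts.

    ASSUMPTION: both cuts keep the highest values; the real filter may
    rank x_time differently.
    """
    keep_clip = top_fraction(clip_scores, fraction)
    keep_xtime = top_fraction(x_times, xtime_fraction)
    return sorted(keep_clip & keep_xtime)


if __name__ == "__main__":
    clip_scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.5}
    x_times = {"a": 1.2, "b": 3.0, "c": 2.5, "d": 2.9}
    print(intersect_filter(clip_scores, x_times, 0.5, 0.75))
```

Because the two cuts are intersected, the resulting subset can be smaller than either fraction alone, which is worth keeping in mind when choosing the two values.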

Visualization

The visualization code is in the notebooks datacomp/density.ipynb and datacomp/plots.ipynb.
