From 344068a64ce16d2fdd05cfdcc179cc8781ec9def Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Gabriel=20Ilharco=20Magalh=C3=A3es?= Date: Tue, 3 Oct 2023 18:32:39 +0200 Subject: [PATCH] Add table with results in README and overall clarify it (#652) * README revamp * update docs * update readme * update readme * update readme * Add results table * update table * update readme * update readme * update readme --- README.md | 361 +++++--------------------------------- docs/LOW_ACC.md | 38 ++++ docs/PRETRAINED.md | 148 ++++++++++++++++ docs/openclip_results.csv | 91 ++++++++++ 4 files changed, 321 insertions(+), 317 deletions(-) create mode 100644 docs/LOW_ACC.md create mode 100644 docs/PRETRAINED.md create mode 100644 docs/openclip_results.csv diff --git a/README.md b/README.md index 4922e47b8..6cb3767cd 100644 --- a/README.md +++ b/README.md @@ -1,49 +1,34 @@ # OpenCLIP -[[Paper]](https://arxiv.org/abs/2212.07143) [[Clip Colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_clip.ipynb) [[Coca Colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_coca.ipynb) +[[Paper]](https://arxiv.org/abs/2212.07143) [[Citations]](#citing) [[Clip Colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_clip.ipynb) [[Coca Colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_coca.ipynb) [![pypi](https://img.shields.io/pypi/v/open_clip_torch.svg)](https://pypi.python.org/pypi/open_clip_torch) Welcome to an open source implementation of OpenAI's [CLIP](https://arxiv.org/abs/2103.00020) (Contrastive Language-Image Pre-training). -The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift. Our starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset. -Specifically, a ResNet-50 model trained with our codebase on OpenAI's [15 million image subset of YFCC](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md) achieves **32.7%** top-1 accuracy on ImageNet. OpenAI's CLIP model reaches **31.3%** when trained on the same subset of YFCC. For ease of experimentation, we also provide code for training on the 3 million images in the [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/download) dataset, where a ResNet-50x4 trained with our codebase reaches 22.2% top-1 ImageNet accuracy. - -We further this with a replication study on a dataset of comparable size to OpenAI's, [LAION-400M](https://arxiv.org/abs/2111.02114), and with larger datasets such as [LAION-2B](https://laion.ai/blog/laion-5b/) and [DataComp-1B](https://arxiv.org/abs/2304.14108) datasets. In addition, we study scaling behavior in a paper on [reproducible scaling laws for contrastive language-image learning](https://arxiv.org/abs/2212.07143). - -We have trained the following ViT CLIP models: - * ViT-B/32 on LAION-400M with a accuracy of **62.9%**, comparable to OpenAI's **63.2%**, zero-shot top-1 on ImageNet-1k - * ViT-B/32 on LAION-2B with a accuracy of **66.6%**. 
- * ViT-B/16 on LAION-400M achieving an accuracy of **67.1%**, lower than OpenAI's **68.3%** (as measured here, 68.6% in paper) - * ViT-B/16+ 240x240 (~50% more FLOPS than B/16 224x224) on LAION-400M achieving an accuracy of **69.2%** - * ViT-B/16 on LAION-2B with a accuracy of **70.2%**. - * ViT-L/14 on LAION-400M with an accuracy of **72.77%**, vs OpenAI's **75.5%** (as measured here, 75.3% in paper) - * ViT-L/14 on LAION-2B with an accuracy of **75.3%**, vs OpenAI's **75.5%** (as measured here, 75.3% in paper) - * ViT-L/14 on [DataComp-1B](https://github.com/mlfoundations/datacomp) with an accuracy of **79.2%**. Our best ViT-L/14 so far, trained with a 13B samples seen schedule. - * CoCa ViT-L/14 on LAION-2B with an accuracy of **75.5%** (currently only 13B samples seen) vs. CLIP ViT-L/14 73.1% (on the same dataset and samples seen) - * ViT-H/14 on LAION-2B with an accuracy of **78.0%**. - * ViT-g/14 on LAION-2B with an accuracy of **76.6%**. This was trained on reduced 12B samples seen schedule, same samples seen as 400M models. - * ViT-g/14 on LAION-2B with an accuracy of **78.5%**. Full 34B samples seen schedule. - * ViT-G/14 on LAION-2B with an accuracy of **80.1%**. The best in1k zero-shot for released, open-source weights thus far. - -And the following ConvNeXt CLIP models: - * ConvNext-Base @ 224x224 on LAION-400M with an ImageNet-1k zero-shot top-1 of **66.3%** - * ConvNext-Base (W) @ 256x256 on LAION-2B with an ImageNet-1k zero-shot top-1 of **70.8%** - * ConvNext-Base (W) @ 256x256 /w augreg (extra augmentation + regularization) on LAION-2B with a top-1 of **71.5%** - * ConvNext-Base (W) @ 256x256 on LAION-A (900M sample aesthetic subset of 2B) with a top-1 of **71.0%** - * ConvNext-Base (W) @ 320x320 on LAION-A with a top-1 of **71.7%** (eval at 384x384 is **71.0**) - * ConvNext-Base (W) @ 320x320 /w augreg on LAION-A with a top-1 of **71.3%** (eval at 384x384 is **72.2%**) - * ConvNext-Large (D) @ 256x256 /w augreg on LAION-2B with a top-1 of **75.9%** - * ConvNext-Large (D) @ 320x320 fine-tune of 256x256 weights above for ~2.5B more samples on LAION-2B, top-1 of **76.6%** - * ConvNext-Large (D) @ 320x320 soup of 3 fine-tunes of 256x256 weights above on LAION-2B, top-1 of **76.9%** - * ConvNext-XXLarge @ 256x256 original run **79.1%** - * ConvNext-XXLarge @ 256x256 rewind of last 10% **79.3%** - * ConvNext-XXLarge @ 256x256 soup of original + rewind **79.4%** - -Model cards w/ additional model specific details can be found on the Hugging Face Hub under the OpenCLIP library tag: https://huggingface.co/models?library=open_clip - -As we describe in more detail [below](#why-are-low-accuracy-clip-models-interesting), CLIP models in a medium accuracy regime already allow us to draw conclusions about the robustness of larger CLIP models since the models follow [reliable scaling laws](https://arxiv.org/abs/2107.04649). - -This codebase is work in progress, and we invite all to contribute in making it more accessible and useful. In the future, we plan to add support for TPU training and release larger models. We hope this codebase facilitates and promotes further research in contrastive image-text learning. Please submit an issue or send an email if you have any other requests or suggestions. 
+Using this codebase, we have trained several models on a variety of data sources and compute budgets, ranging from [small-scale experiments](docs/LOW_ACC.md) to larger runs, including models trained on datasets such as [LAION-400M](https://arxiv.org/abs/2111.02114), [LAION-2B](https://arxiv.org/abs/2210.08402) and [DataComp-1B](https://arxiv.org/abs/2304.14108).
+Many of our models and their scaling properties are studied in detail in the paper [reproducible scaling laws for contrastive language-image learning](https://arxiv.org/abs/2212.07143).
+Some of our best models and their zero-shot ImageNet-1k accuracy are shown below, along with the ViT-L model trained by OpenAI.
+We provide more details about our full collection of pretrained models [here](docs/PRETRAINED.md), and zero-shot results for 38 datasets [here](docs/openclip_results.csv).
+
+
+| Model | Training data | Resolution | # of samples seen | ImageNet zero-shot acc. |
+| -------- | ------- | ------- | ------- | ------- |
+| ConvNext-Base | LAION-2B | 256px | 13B | 71.5% |
+| ConvNext-Large | LAION-2B | 320px | 29B | 76.9% |
+| ConvNext-XXLarge | LAION-2B | 256px | 34B | 79.5% |
+| ViT-B/32 | DataComp-1B | 256px | 34B | 72.8% |
+| ViT-B/16 | DataComp-1B | 224px | 13B | 73.5% |
+| ViT-L/14 | LAION-2B | 224px | 32B | 75.3% |
+| ViT-H/14 | LAION-2B | 224px | 32B | 78.0% |
+| ViT-L/14 | DataComp-1B | 224px | 13B | 79.2% |
+| ViT-G/14 | LAION-2B | 224px | 34B | 80.1% |
+| | | | | |
+| ViT-L/14 | OpenAI's WIT | 224px | 13B | 75.5% |
+
+Model cards with additional model-specific details can be found on the Hugging Face Hub under the OpenCLIP library tag: https://huggingface.co/models?library=open_clip.
+
+If you find this repository useful, please consider [citing](#citing).
+Please feel free to submit an issue or send us an email if you have any other requests or suggestions.
 
 Note that portions of `src/open_clip/` modelling and tokenizer code are adaptations of OpenAI's official [repository](https://github.com/openai/CLIP).
 
@@ -80,21 +65,36 @@ with torch.no_grad(), torch.cuda.amp.autocast():
 
 print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]
 ```
-See also this [[Clip Colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_clip.ipynb)
+
+See also this [[Clip Colab]](https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_clip.ipynb).
 
 To compute billions of embeddings efficiently, you can use [clip-retrieval](https://github.com/rom1504/clip-retrieval) which has openclip support.
 
+### Pretrained models
+
+We offer a simple model interface to instantiate both pre-trained and untrained models.
+To see which pretrained models are available, use the following code snippet.
+More details about our pretrained models are available [here](docs/PRETRAINED.md).
+
+```python
+>>> import open_clip
+>>> open_clip.list_pretrained()
+```
+
+NOTE: Many existing checkpoints use the QuickGELU activation from the original OpenAI models. This activation is actually less efficient than native torch.nn.GELU in recent versions of PyTorch. The model defaults are now nn.GELU, so one should use model definitions with the `-quickgelu` postfix for the OpenCLIP pretrained weights. All OpenAI pretrained weights will always default to QuickGELU. One can also use the non `-quickgelu` model definitions with pretrained weights that were trained with QuickGELU, but there will be an accuracy drop; when fine-tuning, that drop will likely vanish over longer runs.
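+
+For example, here is a minimal sketch of loading the OpenAI ViT-B/32 weights (which use QuickGELU) with a matching `-quickgelu` model definition; the model/pretrained pairs used are those reported by `open_clip.list_pretrained()` above:
+
+```python
+import open_clip
+
+# OpenAI weights were trained with QuickGELU, so pair them with a `-quickgelu` model definition.
+model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-quickgelu', pretrained='openai')
+tokenizer = open_clip.get_tokenizer('ViT-B-32-quickgelu')
+```
+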
+Future trained models will use nn.GELU. + ## Fine-tuning on classification tasks This repository is focused on training CLIP models. To fine-tune a *trained* zero-shot model on a downstream classification task such as ImageNet, please see [our other repository: WiSE-FT](https://github.com/mlfoundations/wise-ft). The [WiSE-FT repository](https://github.com/mlfoundations/wise-ft) contains code for our paper on [Robust Fine-tuning of Zero-shot Models](https://arxiv.org/abs/2109.01903), in which we introduce a technique for fine-tuning zero-shot models while preserving robustness under distribution shift. ## Data -To download datasets as webdataset, we recommend [img2dataset](https://github.com/rom1504/img2dataset) +To download datasets as webdataset, we recommend [img2dataset](https://github.com/rom1504/img2dataset). ### Conceptual Captions -See [cc3m img2dataset example](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md) +See [cc3m img2dataset example](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md). ### YFCC and other datasets @@ -442,246 +442,6 @@ python -m training.main \ --pretrained laion400m_e32 ``` -## Pretrained model details - -### LAION-400M - https://laion.ai/laion-400-open-dataset - -We are working on reproducing OpenAI's ViT results with the comparably sized (and open) LAION-400M dataset. Trained -weights may be found in release [v0.2](https://github.com/mlfoundations/open_clip/releases/tag/v0.2-weights). - -The LAION400M weights have been trained on the JUWELS supercomputer (see acknowledgements section below). - -#### ViT-B/32 224x224 - -We replicate OpenAI's results on ViT-B/32, reaching a top-1 ImageNet-1k zero-shot accuracy of 62.96%. - - - -__Zero-shot comparison (courtesy of Andreas Fürst)__ - - -ViT-B/32 was trained with 128 A100 (40 GB) GPUs for ~36 hours, 4600 GPU-hours. The per-GPU batch size was 256 for a global batch size of 32768. 256 is much lower than it could have been (~320-384) due to being sized initially before moving to 'local' contrastive loss. - -#### ViT-B/16 224x224 - -The B/16 LAION400M training reached a top-1 ImageNet-1k zero-shot validation score of 67.07. - - - -This was the first major train session using the updated webdataset 0.2.x code. A bug was found that prevented shards from being shuffled properly between nodes/workers each epoch. This was fixed part way through training (epoch 26) but likely had an impact. - -ViT-B/16 was trained with 176 A100 (40 GB) GPUS for ~61 hours, 10700 GPU-hours. Batch size per GPU was 192 for a global batch size of 33792. - -#### ViT-B/16+ 240x240 - -The B/16+ 240x240 LAION400M training reached a top-1 ImageNet-1k zero-shot validation score of 69.21. - -This model is the same depth as the B/16, but increases the - * vision width from 768 -> 896 - * text width from 512 -> 640 - * the resolution 224x224 -> 240x240 (196 -> 225 tokens) - - - -Unlike the B/16 run above, this model was a clean run with no dataset shuffling issues. - -ViT-B/16+ was trained with 224 A100 (40 GB) GPUS for ~61 hours, 13620 GPU-hours. Batch size per GPU was 160 for a global batch size of 35840. - -#### ViT-L/14 224x224 - -The L/14 LAION-400M training reached a top-1 ImageNet-1k zero-shot validation score of 72.77. - - - -ViT-L/14 was trained with 400 A100 (40 GB) GPUS for ~127 hours, 50800 GPU-hours. Batch size per GPU was 96 for a global batch size of 38400. Grad checkpointing was enabled. 
- -### LAION-2B (en) - https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/ - -A ~2B sample subset of LAION-5B with english captions (https://huggingface.co/datasets/laion/laion2B-en) - -#### ViT-B/32 224x224 -A ViT-B/32 trained on LAION-2B, reaching a top-1 ImageNet-1k zero-shot accuracy of 65.62%. - - - -ViT-B/32 was trained with 112 A100 (40 GB) GPUs. The per-GPU batch size was 416 for a global batch size of 46592. Compute generously provided by [stability.ai](https://stability.ai/). - -A second iteration of B/32 was trained on stability.ai cluster with a larger global batch size and learning rate, hitting 66.6% top-1. See https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K - -#### ViT-L/14 224x224 - -A ViT-L/14 with a 75.3% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K - -These weights use a different dataset mean and std than others. Instead of using the OpenAI mean & std, inception style normalization `[-1, 1]` is used via a mean and std of `[0.5, 0.5, 0.5]`. This is handled automatically if using `open_clip.create_model_and_transforms` from pretrained weights. - -#### ViT-H/14 224x224 - -A ViT-H/14 with a 78.0% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K - -#### ViT-g/14 224x224 - -A ViT-g/14 with a 76.6% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-g-14-laion2B-s12B-b42K - -This model was trained with a shorted schedule than other LAION-2B models with 12B samples seen instead of 32+B. It matches LAION-400M training in samples seen. Many zero-shot results are lower as a result, but despite this it performs very well in some OOD zero-shot and retrieval tasks. - - -#### ViT-B/32 roberta base - -A ViT-B/32 with roberta base encoder with a 61.7% top-1 ImageNet-1k zero-shot was trained on stability. See model details here https://huggingface.co/laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k -This is the first openclip model using a HF text tower. It has better performance on a range of tasks compared to the standard text encoder, see [metrics](https://huggingface.co/laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k/blob/main/unknown.png) - -#### ViT-B/32 xlm roberta base - -A ViT-B/32 with xlm roberta base encoder with a 62.33% top-1 ImageNet-1k zero-shot was trained on stability. See model details here https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k -This is the first openclip model trained on the full laion5B dataset; hence the first multilingual clip trained with openclip. It has better performance on a range of tasks compared to the standard text encoder, see [metrics](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k/blob/main/metrics.png) -A preliminary multilingual evaluation was run: 43% on imagenet1k italian (vs 21% for english B/32), 37% for imagenet1k japanese (vs 1% for english B/32 and 50% for B/16 clip japanese). It shows the multilingual property is indeed there as expected. Larger models will get even better performance. - -#### ViT-H/14 xlm roberta large - -A ViT-H/14 with xlm roberta large encoder with a 77.0% (vs 78% for the english equivalent) top-1 ImageNet-1k zero-shot was trained on stability. 
See model details here https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k - -This model was trained following the [LiT](https://arxiv.org/abs/2111.07991) methodology: the image tower was frozen (initialized from english openclip ViT-H/14), the text tower was initialized from [xlm roberta large](https://huggingface.co/xlm-roberta-large) and unfrozen. This reduced training cost by a 3x factor. - -See full english [metrics](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/resolve/main/results_xlm_roberta_large.png) - -On zero shot classification on imagenet with translated prompts this model reaches: - -* 56% in italian (vs 21% for https://github.com/clip-italian/clip-italian) -* 53% in japanese (vs 54.6% for https://github.com/rinnakk/japanese-clip) -* 55.7% in chinese (to be compared with https://github.com/OFA-Sys/Chinese-CLIP) - - -### CommonPool and DataComp models - -As part of [DataComp](https://github.com/mlfoundations/datacomp), we trained models on CommonPool using various data filtering strategies. - -The best performing models are specified below for the xlarge scale, see our paper [DataComp: In seearch of the next generation of multimodal datasets](https://arxiv.org/abs/2304.14108) for more details. - -Additional models and more information can be found at [/docs/datacomp_models.md](/docs/datacomp_models.md). - - -* `datacomp_xl_s13b_b90k`: A ViT-L/14 trained on DataComp-1B for 12.8B steps and batch size 90k. Achieves 79.2% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K. - -* `commonpool_xl_clip_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using CLIP scores, for 12.8B steps and batch size 90k. Achieves 76.4% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.clip-s13B-b90K. - -* `commonpool_xl_laion_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using the LAION-2B filtering scheme, for 12.8B steps and batch size 90k. Achieves 75.5% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.laion-s13B-b90K. - -* `commonpool_xl_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL without any filtering, for 12.8B steps and batch size 90k. Achieves 72.3% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL-s13B-b90K. - - -#### YFCC-15M - -Below are checkpoints of models trained on YFCC-15M, along with their zero-shot top-1 accuracies on ImageNet and ImageNetV2. These models were trained using 8 GPUs and the same hyperparameters described in the "Sample running code" section, with the exception of `lr=5e-4` and `epochs=32`. - -* [ResNet-50](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn50-quickgelu-yfcc15m-455df137.pt) (32.7% / 27.9%) -* [ResNet-101](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn101-quickgelu-yfcc15m-3e04b30e.pt) (34.8% / 30.0%) - -#### CC12M - https://github.com/google-research-datasets/conceptual-12m - -* [ResNet-50](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn50-quickgelu-cc12m-f000538c.pt) (36.45%) - -### Pretrained Model Interface - -We offer a simple model interface to instantiate both pre-trained and untrained models. - -NOTE: Many existing checkpoints use the QuickGELU activation from the original OpenAI models. 
This activation is actually less efficient than native torch.nn.GELU in recent versions of PyTorch. The model defaults are now nn.GELU, so one should use model definitions with `-quickgelu` postfix for the OpenCLIP pretrained weights. All OpenAI pretrained weights will always default to QuickGELU. One can also use the non `-quickgelu` model definitions with pretrained weights using QuickGELU but there will be an accuracy drop, for fine-tune that will likely vanish for longer runs. - -Future trained models will use nn.GELU. - -```python ->>> import open_clip ->>> open_clip.list_pretrained() -[('RN50', 'openai'), -('RN50', 'yfcc15m'), -('RN50', 'cc12m'), -('RN50-quickgelu', 'openai'), -('RN50-quickgelu', 'yfcc15m'), -('RN50-quickgelu', 'cc12m'), -('RN101', 'openai'), -('RN101', 'yfcc15m'), -('RN101-quickgelu', 'openai'), -('RN101-quickgelu', 'yfcc15m'), -('RN50x4', 'openai'), -('RN50x16', 'openai'), -('RN50x64', 'openai'), -('ViT-B-32', 'openai'), -('ViT-B-32', 'laion400m_e31'), -('ViT-B-32', 'laion400m_e32'), -('ViT-B-32', 'laion2b_e16'), -('ViT-B-32', 'laion2b_s34b_b79k'), -('ViT-B-32', 'datacomp_m_s128m_b4k'), -('ViT-B-32', 'commonpool_m_clip_s128m_b4k'), -('ViT-B-32', 'commonpool_m_laion_s128m_b4k'), -('ViT-B-32', 'commonpool_m_image_s128m_b4k'), -('ViT-B-32', 'commonpool_m_text_s128m_b4k'), -('ViT-B-32', 'commonpool_m_basic_s128m_b4k'), -('ViT-B-32', 'commonpool_m_s128m_b4k'), -('ViT-B-32', 'datacomp_s_s13m_b4k'), -('ViT-B-32', 'commonpool_s_clip_s13m_b4k'), -('ViT-B-32', 'commonpool_s_laion_s13m_b4k'), -('ViT-B-32', 'commonpool_s_image_s13m_b4k'), -('ViT-B-32', 'commonpool_s_text_s13m_b4k'), -('ViT-B-32', 'commonpool_s_basic_s13m_b4k'), -('ViT-B-32', 'commonpool_s_s13m_b4k'), -('ViT-B-32-quickgelu', 'openai'), -('ViT-B-32-quickgelu', 'laion400m_e31'), -('ViT-B-32-quickgelu', 'laion400m_e32'), -('ViT-B-16', 'openai'), -('ViT-B-16', 'laion400m_e31'), -('ViT-B-16', 'laion400m_e32'), -('ViT-B-16', 'laion2b_s34b_b88k'), -('ViT-B-16', 'datacomp_l_s1b_b8k'), -('ViT-B-16', 'commonpool_l_clip_s1b_b8k'), -('ViT-B-16', 'commonpool_l_laion_s1b_b8k'), -('ViT-B-16', 'commonpool_l_image_s1b_b8k'), -('ViT-B-16', 'commonpool_l_text_s1b_b8k'), -('ViT-B-16', 'commonpool_l_basic_s1b_b8k'), -('ViT-B-16', 'commonpool_l_s1b_b8k'), -('ViT-B-16-plus-240', 'laion400m_e31'), -('ViT-B-16-plus-240', 'laion400m_e32'), -('ViT-L-14', 'openai'), -('ViT-L-14', 'laion400m_e31'), -('ViT-L-14', 'laion400m_e32'), -('ViT-L-14', 'laion2b_s32b_b82k'), -('ViT-L-14', 'datacomp_xl_s13b_b90k'), -('ViT-L-14', 'commonpool_xl_clip_s13b_b90k'), -('ViT-L-14', 'commonpool_xl_laion_s13b_b90k'), -('ViT-L-14', 'commonpool_xl_s13b_b90k'), -('ViT-L-14-336', 'openai'), -('ViT-H-14', 'laion2b_s32b_b79k'), -('ViT-g-14', 'laion2b_s12b_b42k'), -('ViT-g-14', 'laion2b_s34b_b88k'), -('ViT-bigG-14', 'laion2b_s39b_b160k'), -('roberta-ViT-B-32', 'laion2b_s12b_b32k'), -('xlm-roberta-base-ViT-B-32', 'laion5b_s13b_b90k'), -('xlm-roberta-large-ViT-H-14', 'frozen_laion5b_s13b_b90k'), -('convnext_base', 'laion400m_s13b_b51k'), -('convnext_base_w', 'laion2b_s13b_b82k'), -('convnext_base_w', 'laion2b_s13b_b82k_augreg'), -('convnext_base_w', 'laion_aesthetic_s13b_b82k'), -('convnext_base_w_320', 'laion_aesthetic_s13b_b82k'), -('convnext_base_w_320', 'laion_aesthetic_s13b_b82k_augreg'), -('convnext_large_d', 'laion2b_s26b_b102k_augreg'), -('convnext_large_d_320', 'laion2b_s29b_b131k_ft'), -('convnext_large_d_320', 'laion2b_s29b_b131k_ft_soup'), -('convnext_xxlarge', 'laion2b_s34b_b82k_augreg'), -('convnext_xxlarge', 'laion2b_s34b_b82k_augreg_rewind'), 
-('convnext_xxlarge', 'laion2b_s34b_b82k_augreg_soup'), -('coca_ViT-B-32', 'laion2b_s13b_b90k'), -('coca_ViT-B-32', 'mscoco_finetuned_laion2b_s13b_b90k'), -('coca_ViT-L-14', 'laion2b_s13b_b90k'), -('coca_ViT-L-14', 'mscoco_finetuned_laion2b_s13b_b90k'), -('EVA01-g-14', 'laion400m_s11b_b41k'), -('EVA01-g-14-plus', 'merged2b_s11b_b114k'), -('EVA02-B-16', 'merged2b_s8b_b131k'), -('EVA02-L-14', 'merged2b_s4b_b131k'), -('EVA02-L-14-336', 'merged2b_s6b_b61k'), -('EVA02-E-14', 'laion2b_s4b_b115k'), -('EVA02-E-14-plus', 'laion2b_s9b_b144k') -] - ->>> model, train_transform, eval_transform = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k') -``` ### Model distillation You can distill from a pre-trained by using `--distill-model` and `--distill-pretrained` to specify the model you'd like to distill from. @@ -735,40 +495,7 @@ The module `open_clip.push_to_hf_hub` includes helpers for pushing models /w wei The tool can be run from command line, ex: `python -m open_clip.push_to_hf_hub --model convnext_large_d_320 --pretrained /train/checkpoints/epoch_12.pt --repo-id laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft` -## Scaling trends - -The plot below shows how zero-shot performance of CLIP models varies as we scale the number of samples used for training. Zero-shot performance increases steadily for both ImageNet and [ImageNetV2](https://arxiv.org/abs/1902.10811), and is far from saturated at ~15M samples. - - - -## Why are low-accuracy CLIP models interesting? - -**TL;DR:** CLIP models have high effective robustness, even at small scales. - -CLIP models are particularly intriguing because they are more robust to natural distribution shifts (see Section 3.3 in the [CLIP paper](https://arxiv.org/abs/2103.00020)). -This phenomena is illustrated by the figure below, with ImageNet accuracy on the x-axis -and [ImageNetV2](https://arxiv.org/abs/1902.10811) (a reproduction of the ImageNet validation set with distribution shift) accuracy on the y-axis. -Standard training denotes training on the ImageNet train set and the CLIP zero-shot models -are shown as stars. - -![CLIP scatter plot](https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/effective_robustness.png) - -As observed by [Taori et al., 2020](https://arxiv.org/abs/2007.00644) and [Miller et al., 2021](https://arxiv.org/abs/2107.04649), the in-distribution -and out-of-distribution accuracies of models trained on ImageNet follow a predictable linear trend (the red line in the above plot). *Effective robustness* -quantifies robustness as accuracy beyond this baseline, i.e., how far a model lies above the red line. Ideally a model would not suffer from distribution shift and fall on the y = x line ([trained human labelers are within a percentage point of the y = x line](http://proceedings.mlr.press/v119/shankar20c.html)). - -Even though the CLIP models trained with -this codebase achieve much lower accuracy than those trained by OpenAI, our models still lie on the same -trend of improved effective robustness (the purple line). Therefore, we can study what makes -CLIP robust without requiring industrial-scale compute. - -For more information on effective robustness, please see: - -- [Recht et al., 2019](https://arxiv.org/abs/1902.10811). -- [Taori et al., 2020](https://arxiv.org/abs/2007.00644). -- [Miller et al., 2021](https://arxiv.org/abs/2107.04649). -To know more about the factors that contribute to CLIP's robustness refer to [Fang et al., 2022](https://arxiv.org/abs/2205.01397). 
## Acknowledgments @@ -776,7 +503,7 @@ We gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-ce ## The Team -Current development of this repository is led by [Ross Wightman](https://rwightman.com/), [Cade Gordon](http://cadegordon.io/), and [Vaishaal Shankar](http://vaishaal.com/). +Current development of this repository is led by [Ross Wightman](https://rwightman.com/), [Romain Beaumont](https://github.com/rom1504), [Cade Gordon](http://cadegordon.io/), and [Vaishaal Shankar](http://vaishaal.com/). The original version of this repository is from a group of researchers at UW, Google, Stanford, Amazon, Columbia, and Berkeley. diff --git a/docs/LOW_ACC.md b/docs/LOW_ACC.md new file mode 100644 index 000000000..adcad6f22 --- /dev/null +++ b/docs/LOW_ACC.md @@ -0,0 +1,38 @@ +As we describe in more detail below, CLIP models in a medium accuracy regime already allow us to draw conclusions about the robustness of larger CLIP models since the models follow reliable scaling laws. + +[Cherti et al., 2022](https://arxiv.org/abs/2212.07143) and [Gadre et al., 2023](https://arxiv.org/abs/2304.14108) show additional discussions about the scaling behavior of CLIP models. + +## Scaling trends + +The plot below shows how zero-shot performance of CLIP models varies as we scale the number of samples used for training. Zero-shot performance increases steadily for both ImageNet and [ImageNetV2](https://arxiv.org/abs/1902.10811), and is far from saturated at ~15M samples. + + + +## Why are low-accuracy CLIP models interesting? + +**TL;DR:** CLIP models have high effective robustness, even at small scales. + +CLIP models are particularly intriguing because they are more robust to natural distribution shifts (see Section 3.3 in the [CLIP paper](https://arxiv.org/abs/2103.00020)). +This phenomena is illustrated by the figure below, with ImageNet accuracy on the x-axis +and [ImageNetV2](https://arxiv.org/abs/1902.10811) (a reproduction of the ImageNet validation set with distribution shift) accuracy on the y-axis. +Standard training denotes training on the ImageNet train set and the CLIP zero-shot models +are shown as stars. + +![CLIP scatter plot](https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/effective_robustness.png) + +As observed by [Taori et al., 2020](https://arxiv.org/abs/2007.00644) and [Miller et al., 2021](https://arxiv.org/abs/2107.04649), the in-distribution +and out-of-distribution accuracies of models trained on ImageNet follow a predictable linear trend (the red line in the above plot). *Effective robustness* +quantifies robustness as accuracy beyond this baseline, i.e., how far a model lies above the red line. Ideally a model would not suffer from distribution shift and fall on the y = x line ([trained human labelers are within a percentage point of the y = x line](http://proceedings.mlr.press/v119/shankar20c.html)). + +Even though the CLIP models trained with +this codebase achieve much lower accuracy than those trained by OpenAI, our models still lie on the same +trend of improved effective robustness (the purple line). Therefore, we can study what makes +CLIP robust without requiring industrial-scale compute. + +For more information on effective robustness, please see: + +- [Recht et al., 2019](https://arxiv.org/abs/1902.10811). +- [Taori et al., 2020](https://arxiv.org/abs/2007.00644). +- [Miller et al., 2021](https://arxiv.org/abs/2107.04649). 
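+
+As a concrete illustration of the definition above, effective robustness can be estimated by fitting the baseline trend on standard ImageNet-trained models and measuring how far a given model lies above it. Below is a minimal sketch (the helper name is illustrative, and it uses a simple linear fit on raw accuracies, whereas the papers above fit the trend on logit-transformed accuracies):
+
+```python
+import numpy as np
+
+def effective_robustness(id_acc, ood_acc, baseline_id_acc, baseline_ood_acc):
+    """Accuracy above the baseline in-distribution -> out-of-distribution trend.
+
+    baseline_*_acc: accuracies (in [0, 1]) of standard ImageNet-trained models, used to fit the red line.
+    id_acc / ood_acc: accuracies of the model(s) being evaluated, e.g. CLIP zero-shot models.
+    """
+    # Fit the baseline trend (simplified: a linear fit directly on accuracies).
+    slope, intercept = np.polyfit(baseline_id_acc, baseline_ood_acc, deg=1)
+    predicted_ood = slope * np.asarray(id_acc) + intercept
+    # Effective robustness is how far a model lies above the fitted line.
+    return np.asarray(ood_acc) - predicted_ood
+```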
+ +To know more about the factors that contribute to CLIP's robustness refer to [Fang et al., 2022](https://arxiv.org/abs/2205.01397). \ No newline at end of file diff --git a/docs/PRETRAINED.md b/docs/PRETRAINED.md new file mode 100644 index 000000000..37eae61e8 --- /dev/null +++ b/docs/PRETRAINED.md @@ -0,0 +1,148 @@ +## Pretrained model results + +We evaluate the full collection of available models on a suite of 38 datasets in a zero-shot setting (i.e., without fine-tuning), following [Gadre et al., 2023](https://arxiv.org/abs/2304.14108). +Click below to see the full results. + +[Full results](docs/openclip_results.csv) + +## Pretrained model details + +Below are details for several of our pretrained models. + +### LAION-400M - https://laion.ai/laion-400-open-dataset + +We ran experiments in an attempt to reproduce OpenAI's ViT results with the comparably sized (and open) LAION-400M dataset. Trained +weights can be found in release [v0.2](https://github.com/mlfoundations/open_clip/releases/tag/v0.2-weights). + +The LAION400M weights have been trained on the JUWELS supercomputer (see acknowledgements section below). + +#### ViT-B/32 224x224 + +We replicate OpenAI's results on ViT-B/32, reaching a top-1 ImageNet-1k zero-shot accuracy of 62.96%. + + + +__Zero-shot comparison (courtesy of Andreas Fürst)__ + + +ViT-B/32 was trained with 128 A100 (40 GB) GPUs for ~36 hours, 4600 GPU-hours. The per-GPU batch size was 256 for a global batch size of 32768. 256 is much lower than it could have been (~320-384) due to being sized initially before moving to 'local' contrastive loss. + +#### ViT-B/16 224x224 + +The B/16 LAION400M training reached a top-1 ImageNet-1k zero-shot validation score of 67.07. + + + +This was the first major train session using the updated webdataset 0.2.x code. A bug was found that prevented shards from being shuffled properly between nodes/workers each epoch. This was fixed part way through training (epoch 26) but likely had an impact. + +ViT-B/16 was trained with 176 A100 (40 GB) GPUS for ~61 hours, 10700 GPU-hours. Batch size per GPU was 192 for a global batch size of 33792. + +#### ViT-B/16+ 240x240 + +The B/16+ 240x240 LAION400M training reached a top-1 ImageNet-1k zero-shot validation score of 69.21. + +This model is the same depth as the B/16, but increases the + * vision width from 768 -> 896 + * text width from 512 -> 640 + * the resolution 224x224 -> 240x240 (196 -> 225 tokens) + + + +Unlike the B/16 run above, this model was a clean run with no dataset shuffling issues. + +ViT-B/16+ was trained with 224 A100 (40 GB) GPUS for ~61 hours, 13620 GPU-hours. Batch size per GPU was 160 for a global batch size of 35840. + +#### ViT-L/14 224x224 + +The L/14 LAION-400M training reached a top-1 ImageNet-1k zero-shot validation score of 72.77. + + + +ViT-L/14 was trained with 400 A100 (40 GB) GPUS for ~127 hours, 50800 GPU-hours. Batch size per GPU was 96 for a global batch size of 38400. Grad checkpointing was enabled. + +### LAION-2B (en) - https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/ + +A ~2B sample subset of LAION-5B with english captions (https://huggingface.co/datasets/laion/laion2B-en) + +#### ViT-B/32 224x224 +A ViT-B/32 trained on LAION-2B, reaching a top-1 ImageNet-1k zero-shot accuracy of 65.62%. + + + +ViT-B/32 was trained with 112 A100 (40 GB) GPUs. The per-GPU batch size was 416 for a global batch size of 46592. Compute generously provided by [stability.ai](https://stability.ai/). 
+
+A second iteration of B/32 was trained on the stability.ai cluster with a larger global batch size and learning rate, hitting 66.6% top-1. See https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K
+
+#### ViT-L/14 224x224
+
+A ViT-L/14 with a 75.3% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K
+
+These weights use a different dataset mean and std than others. Instead of using the OpenAI mean & std, Inception-style normalization `[-1, 1]` is used via a mean and std of `[0.5, 0.5, 0.5]`. This is handled automatically if using `open_clip.create_model_and_transforms` from pretrained weights.
+
+#### ViT-H/14 224x224
+
+A ViT-H/14 with a 78.0% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K
+
+#### ViT-g/14 224x224
+
+A ViT-g/14 with a 76.6% top-1 ImageNet-1k zero-shot was trained on JUWELS Booster. See model details here https://huggingface.co/laion/CLIP-ViT-g-14-laion2B-s12B-b42K
+
+This model was trained with a shorter schedule than other LAION-2B models, with 12B samples seen instead of 32+B. It matches LAION-400M training in samples seen. Many zero-shot results are lower as a result, but despite this it performs very well on some OOD zero-shot and retrieval tasks.
+
+
+#### ViT-B/32 roberta base
+
+A ViT-B/32 with a roberta base text encoder and a 61.7% top-1 ImageNet-1k zero-shot was trained on the stability.ai cluster. See model details here https://huggingface.co/laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k
+This is the first openclip model using an HF text tower. It has better performance on a range of tasks compared to the standard text encoder, see [metrics](https://huggingface.co/laion/CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k/blob/main/unknown.png)
+
+#### ViT-B/32 xlm roberta base
+
+A ViT-B/32 with an xlm roberta base text encoder and a 62.33% top-1 ImageNet-1k zero-shot was trained on the stability.ai cluster. See model details here https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k
+This is the first openclip model trained on the full laion5B dataset; hence the first multilingual clip trained with openclip. It has better performance on a range of tasks compared to the standard text encoder, see [metrics](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k/blob/main/metrics.png)
+A preliminary multilingual evaluation was run: 43% on imagenet1k italian (vs 21% for english B/32), 37% for imagenet1k japanese (vs 1% for english B/32 and 50% for B/16 clip japanese). It shows that the multilingual property is indeed there as expected. Larger models will get even better performance.
+
+#### ViT-H/14 xlm roberta large
+
+A ViT-H/14 with an xlm roberta large text encoder and a 77.0% (vs 78% for the english equivalent) top-1 ImageNet-1k zero-shot was trained on the stability.ai cluster. See model details here https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k
+
+This model was trained following the [LiT](https://arxiv.org/abs/2111.07991) methodology: the image tower was frozen (initialized from the english openclip ViT-H/14), the text tower was initialized from [xlm roberta large](https://huggingface.co/xlm-roberta-large) and unfrozen. This reduced training cost by a factor of 3.
+ +See full english [metrics](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k/resolve/main/results_xlm_roberta_large.png) + +On zero shot classification on imagenet with translated prompts this model reaches: + +* 56% in italian (vs 21% for https://github.com/clip-italian/clip-italian) +* 53% in japanese (vs 54.6% for https://github.com/rinnakk/japanese-clip) +* 55.7% in chinese (to be compared with https://github.com/OFA-Sys/Chinese-CLIP) + + +#### YFCC-15M + +Below are checkpoints of models trained on YFCC-15M, along with their zero-shot top-1 accuracies on ImageNet and ImageNetV2. These models were trained using 8 GPUs and the same hyperparameters described in the "Sample running code" section, with the exception of `lr=5e-4` and `epochs=32`. + +* [ResNet-50](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn50-quickgelu-yfcc15m-455df137.pt) (32.7% / 27.9%) +* [ResNet-101](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn101-quickgelu-yfcc15m-3e04b30e.pt) (34.8% / 30.0%) + +#### CC12M - https://github.com/google-research-datasets/conceptual-12m + +* [ResNet-50](https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/rn50-quickgelu-cc12m-f000538c.pt) (36.45%) + + +### CommonPool and DataComp models + +As part of [DataComp](https://github.com/mlfoundations/datacomp), we trained models on CommonPool using various data filtering strategies. + +The best performing models are specified below for the xlarge scale, see our paper [DataComp: In seearch of the next generation of multimodal datasets](https://arxiv.org/abs/2304.14108) for more details. + +Additional models and more information can be found at [/docs/datacomp_models.md](/docs/datacomp_models.md). + + +* `datacomp_xl_s13b_b90k`: A ViT-L/14 trained on DataComp-1B for 12.8B steps and batch size 90k. Achieves 79.2% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K. + +* `commonpool_xl_clip_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using CLIP scores, for 12.8B steps and batch size 90k. Achieves 76.4% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.clip-s13B-b90K. + +* `commonpool_xl_laion_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL filtered using the LAION-2B filtering scheme, for 12.8B steps and batch size 90k. Achieves 75.5% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL.laion-s13B-b90K. + +* `commonpool_xl_s13b_b90k`: A ViT-L/14 trained on CommonPool-XL without any filtering, for 12.8B steps and batch size 90k. Achieves 72.3% zero-shot accuracy on ImageNet. Available at https://huggingface.co/laion/CLIP-ViT-L-14-CommonPool.XL-s13B-b90K. + + diff --git a/docs/openclip_results.csv b/docs/openclip_results.csv new file mode 100644 index 000000000..293748d9c --- /dev/null +++ b/docs/openclip_results.csv @@ -0,0 +1,91 @@ +name,pretrained,Average perf. 
on 38 datasets,ImageNet 1k,Caltech-101,CIFAR-10,CIFAR-100,CLEVR Counts,CLEVR Distance,Country211,Describable Textures,EuroSAT,FGVC Aircraft,Food-101,GTSRB,ImageNet Sketch,ImageNet v2,ImageNet-A,ImageNet-O,ImageNet-R,KITTI Vehicle Distance,MNIST,ObjectNet,Oxford Flowers-102,Oxford-IIIT Pet,Pascal VOC 2007,PatchCamelyon,Rendered SST2,RESISC45,Stanford Cars,STL-10,SUN397,SVHN,Flickr,MSCOCO,WinoGAViL,iWildCam,Camelyon17,FMoW,Dollar Street,GeoDE +EVA02-E-14-plus,laion2b_s9b_b144k,0.6930,0.8201,0.9535,0.9934,0.9316,0.2991,0.1998,0.3564,0.6777,0.7574,0.5360,0.9496,0.6740,0.7162,0.7564,0.8223,0.3540,0.9456,0.1842,0.7463,0.7937,0.8433,0.9567,0.8569,0.6442,0.6271,0.7490,0.9457,0.9926,0.7510,0.7560,0.8648,0.5991,0.4403,0.2591,0.6948,0.2668,0.6951,0.9244 +EVA02-E-14,laion2b_s4b_b115k,0.6690,0.8196,0.9541,0.9925,0.9258,0.1632,0.2499,0.3482,0.6878,0.7446,0.4892,0.9523,0.6729,0.7151,0.7566,0.8044,0.3340,0.9407,0.1294,0.7581,0.7674,0.8210,0.9569,0.8136,0.4972,0.5859,0.7324,0.9438,0.9926,0.7658,0.6381,0.8515,0.5892,0.4429,0.2289,0.4894,0.2801,0.6682,0.9182 +ViT-bigG-14,laion2b_s39b_b160k,0.6667,0.8009,0.9484,0.9824,0.8752,0.2989,0.2002,0.3379,0.6867,0.6919,0.4953,0.9309,0.6244,0.6894,0.7359,0.6933,0.3785,0.9213,0.1308,0.7157,0.7284,0.8163,0.9529,0.8077,0.6364,0.6535,0.7235,0.9460,0.9850,0.7450,0.6961,0.8623,0.5938,0.4488,0.1760,0.5905,0.2352,0.6857,0.9127 +ViT-L-14,datacomp_xl_s13b_b90k,0.6627,0.7921,0.9465,0.9824,0.8736,0.3555,0.2443,0.3157,0.6649,0.7124,0.4750,0.9452,0.5853,0.6795,0.7205,0.6959,0.3255,0.9083,0.2785,0.8661,0.7425,0.8262,0.9506,0.8247,0.5118,0.6101,0.6941,0.9305,0.9925,0.7427,0.6769,0.8119,0.5451,0.4666,0.1614,0.5089,0.2403,0.6624,0.9152 +EVA01-g-14-plus,merged2b_s11b_b114k,0.6624,0.7933,0.9506,0.9910,0.9008,0.2302,0.2293,0.3087,0.6734,0.7280,0.3947,0.9366,0.6644,0.6814,0.7214,0.7416,0.3415,0.9246,0.1491,0.7176,0.7491,0.7959,0.9490,0.8285,0.6244,0.5854,0.7079,0.9073,0.9949,0.7426,0.5951,0.8535,0.5925,0.4684,0.1882,0.7100,0.2283,0.6589,0.9148 +EVA02-L-14-336,merged2b_s6b_b61k,0.6583,0.8039,0.9525,0.9892,0.8980,0.3635,0.2485,0.3354,0.6473,0.7139,0.3758,0.9421,0.5759,0.6891,0.7380,0.8289,0.2850,0.9324,0.2377,0.6421,0.7789,0.7645,0.9424,0.8267,0.5487,0.6463,0.6910,0.9158,0.9966,0.7480,0.4575,0.8381,0.5605,0.5053,0.2105,0.5691,0.2198,0.6811,0.9136 +convnext_xxlarge,laion2b_s34b_b82k_augreg_soup,0.6530,0.7947,0.9448,0.9822,0.8687,0.1454,0.2365,0.3170,0.7053,0.6128,0.4434,0.9321,0.5508,0.6840,0.7260,0.6719,0.4060,0.9160,0.2363,0.8277,0.7273,0.8241,0.9445,0.8090,0.5142,0.6952,0.7190,0.9409,0.9810,0.7458,0.6254,0.8521,0.5867,0.4702,0.1730,0.6071,0.0000,0.6764,0.9215 +convnext_xxlarge,laion2b_s34b_b82k_augreg_rewind,0.6521,0.7931,0.9452,0.9823,0.8686,0.1651,0.2534,0.3155,0.7016,0.6331,0.4398,0.9308,0.5491,0.6825,0.7228,0.6657,0.3975,0.9139,0.2419,0.7930,0.7252,0.8241,0.9438,0.8100,0.5014,0.6897,0.7168,0.9406,0.9801,0.7459,0.6137,0.8498,0.5871,0.4741,0.1735,0.6071,0.0000,0.6799,0.9228 +xlm-roberta-large-ViT-H-14,frozen_laion5b_s13b_b90k,0.6515,0.7695,0.9422,0.9718,0.8430,0.3358,0.2050,0.3172,0.6926,0.6793,0.4673,0.9236,0.6239,0.6581,0.6944,0.5935,0.3390,0.8940,0.1364,0.7804,0.6911,0.7532,0.9431,0.7995,0.5792,0.6436,0.6825,0.9362,0.9889,0.7551,0.5950,0.8461,0.5758,0.5206,0.1392,0.6749,0.2098,0.6460,0.9111 
+ViT-L-14,commonpool_xl_clip_s13b_b90k,0.6501,0.7637,0.9502,0.9797,0.8615,0.2547,0.2451,0.2984,0.6521,0.6681,0.3860,0.9355,0.5980,0.6538,0.6953,0.6197,0.3525,0.8924,0.2982,0.9040,0.7165,0.8006,0.9424,0.8336,0.5688,0.6178,0.6978,0.9352,0.9875,0.7351,0.6853,0.7768,0.5156,0.4728,0.1439,0.5100,0.1705,0.6776,0.9056 +EVA02-L-14,merged2b_s4b_b131k,0.6488,0.7977,0.9512,0.9908,0.9071,0.3176,0.2462,0.3091,0.6319,0.6994,0.3638,0.9340,0.5718,0.6813,0.7295,0.7619,0.2880,0.9272,0.2518,0.6729,0.7489,0.7631,0.9398,0.8220,0.5431,0.6150,0.6968,0.9055,0.9961,0.7410,0.4793,0.8351,0.5556,0.5081,0.1886,0.5124,0.2017,0.6624,0.9073 +convnext_xxlarge,laion2b_s34b_b82k_augreg,0.6479,0.7907,0.9429,0.9816,0.8677,0.1399,0.1195,0.3127,0.7096,0.6030,0.4250,0.9295,0.5454,0.6806,0.7223,0.6692,0.4025,0.9131,0.2616,0.8687,0.7235,0.8091,0.9455,0.8116,0.5340,0.6782,0.7100,0.9399,0.9824,0.7436,0.6379,0.8531,0.5834,0.4536,0.1616,0.5719,0.0000,0.6729,0.9228 +ViT-g-14,laion2b_s34b_b88k,0.6427,0.7847,0.9452,0.9815,0.8465,0.3768,0.1870,0.3091,0.6856,0.6530,0.4441,0.9241,0.4964,0.6754,0.7158,0.6092,0.3705,0.9020,0.2700,0.7191,0.6908,0.8010,0.9379,0.8166,0.5384,0.5678,0.6960,0.9394,0.9893,0.7411,0.5611,0.8456,0.5758,0.4104,0.1524,0.4771,0.2090,0.6671,0.9090 +ViT-H-14,laion2b_s32b_b79k,0.6419,0.7796,0.9421,0.9745,0.8473,0.2676,0.2358,0.2986,0.6782,0.7278,0.4265,0.9273,0.5832,0.6657,0.7090,0.5935,0.3825,0.8934,0.1097,0.7284,0.6941,0.7982,0.9438,0.7768,0.5430,0.6392,0.6995,0.9338,0.9848,0.7521,0.5252,0.8417,0.5770,0.4247,0.1528,0.5638,0.2264,0.6343,0.9086 +convnext_large_d_320,laion2b_s29b_b131k_ft_soup,0.6387,0.7685,0.9348,0.9659,0.8304,0.4293,0.2010,0.2654,0.6830,0.7161,0.3621,0.9162,0.5822,0.6504,0.6944,0.6044,0.4410,0.8862,0.1027,0.7434,0.6898,0.7755,0.9358,0.8129,0.4814,0.5585,0.7078,0.9369,0.9856,0.7376,0.6712,0.8467,0.5665,0.4549,0.1786,0.4088,0.1901,0.6449,0.9094 +ViT-L-14,commonpool_xl_laion_s13b_b90k,0.6360,0.7545,0.9352,0.9796,0.8585,0.3819,0.2489,0.2503,0.6191,0.7378,0.2869,0.9200,0.6018,0.6352,0.6851,0.5747,0.3730,0.8708,0.1378,0.7740,0.6846,0.7435,0.9308,0.8107,0.5069,0.5986,0.7065,0.8912,0.9903,0.7327,0.5730,0.8130,0.5513,0.4966,0.1421,0.5671,0.2337,0.6600,0.9115 +EVA01-g-14,laion400m_s11b_b41k,0.6358,0.7852,0.9477,0.9829,0.8865,0.1966,0.2467,0.2862,0.6144,0.7237,0.3226,0.9345,0.4913,0.6730,0.7152,0.7359,0.3285,0.9250,0.2405,0.6218,0.7200,0.7427,0.9414,0.8325,0.4987,0.5832,0.6976,0.9171,0.9889,0.7416,0.5889,0.8037,0.5293,0.4640,0.1975,0.4999,0.1859,0.6741,0.8969 +convnext_large_d_320,laion2b_s29b_b131k_ft,0.6345,0.7660,0.9341,0.9647,0.8313,0.3688,0.1999,0.2673,0.6846,0.7131,0.3770,0.9160,0.5688,0.6472,0.6929,0.5933,0.4400,0.8823,0.1027,0.7695,0.6813,0.7696,0.9346,0.8002,0.4576,0.5623,0.6989,0.9348,0.9854,0.7355,0.6496,0.8415,0.5599,0.4558,0.1664,0.4342,0.1782,0.6355,0.9090 +coca_ViT-L-14,laion2b_s13b_b90k,0.6305,0.7564,0.9433,0.9717,0.8318,0.3565,0.2365,0.2546,0.6271,0.6850,0.3622,0.9045,0.5572,0.6459,0.6794,0.5345,0.3540,0.8819,0.1899,0.7567,0.6414,0.7628,0.9400,0.8112,0.5278,0.6661,0.6883,0.9282,0.9905,0.7394,0.6205,0.8155,0.5431,0.4701,0.1348,0.4125,0.1917,0.6495,0.8969 +ViT-g-14,laion2b_s12b_b42k,0.6299,0.7663,0.9415,0.9706,0.8392,0.3317,0.2225,0.2878,0.6824,0.6469,0.3768,0.9155,0.4985,0.6516,0.6956,0.5716,0.3785,0.8869,0.1350,0.6840,0.6761,0.7800,0.9431,0.8108,0.5624,0.6425,0.7176,0.9292,0.9865,0.7541,0.3930,0.8366,0.5647,0.4427,0.1486,0.4948,0.2040,0.6542,0.9132 
+convnext_large_d,laion2b_s26b_b102k_augreg,0.6294,0.7591,0.9365,0.9655,0.8309,0.3461,0.1997,0.2525,0.6739,0.6959,0.3610,0.9055,0.5299,0.6430,0.6826,0.5352,0.4425,0.8767,0.1027,0.8063,0.6618,0.7667,0.9282,0.7891,0.5309,0.5612,0.6768,0.9316,0.9829,0.7307,0.6812,0.8384,0.5550,0.4646,0.1549,0.3964,0.1793,0.6402,0.9019 +ViT-L-14-336,openai,0.6284,0.7656,0.9225,0.9493,0.7436,0.2003,0.1895,0.3445,0.5559,0.6144,0.3346,0.9386,0.5239,0.6100,0.7089,0.7748,0.3265,0.8905,0.2616,0.7916,0.7183,0.7852,0.9369,0.7815,0.6073,0.7057,0.6379,0.7932,0.9943,0.6865,0.5560,0.7730,0.4751,0.4145,0.1490,0.6456,0.2325,0.6390,0.9015 +ViT-L-14,commonpool_xl_s13b_b90k,0.6207,0.7229,0.9327,0.9801,0.8410,0.1985,0.2461,0.2962,0.6202,0.6889,0.1957,0.9107,0.5467,0.6118,0.6511,0.5625,0.2855,0.8594,0.3390,0.9084,0.7022,0.6966,0.9060,0.8076,0.5248,0.5953,0.5756,0.8939,0.9890,0.7103,0.6589,0.7339,0.4652,0.5072,0.1229,0.5246,0.1948,0.6811,0.8990 +ViT-L-14,laion2b_s32b_b82k,0.6205,0.7525,0.9388,0.9662,0.8332,0.3123,0.2234,0.2631,0.6293,0.6459,0.3652,0.9100,0.5618,0.6328,0.6780,0.5385,0.3870,0.8742,0.2293,0.5410,0.6529,0.7479,0.9309,0.8053,0.5641,0.5925,0.6687,0.9263,0.9885,0.7434,0.4087,0.8251,0.5493,0.4385,0.1257,0.5972,0.2007,0.6402,0.8919 +ViT-L-14,openai,0.6173,0.7554,0.9249,0.9559,0.7582,0.1943,0.2021,0.3187,0.5537,0.6263,0.3181,0.9305,0.5055,0.5959,0.6983,0.7075,0.3235,0.8784,0.2180,0.7634,0.6889,0.7923,0.9323,0.7828,0.5204,0.6881,0.6337,0.7788,0.9936,0.6756,0.5840,0.7508,0.4642,0.4136,0.1211,0.6741,0.2229,0.6297,0.8839 +ViT-B-16,datacomp_xl_s13b_b90k,0.6147,0.7349,0.9380,0.9624,0.8212,0.3267,0.2461,0.2215,0.5793,0.5883,0.2970,0.9047,0.5523,0.6044,0.6598,0.4840,0.4285,0.8362,0.2883,0.7649,0.6350,0.7701,0.9254,0.8178,0.6002,0.5162,0.6535,0.8883,0.9811,0.7051,0.6272,0.7633,0.4880,0.4832,0.1181,0.4799,0.1504,0.6168,0.8990 +coca_ViT-L-14,mscoco_finetuned_laion2b_s13b_b90k,0.6138,0.7210,0.9459,0.9626,0.7966,0.3649,0.2488,0.1810,0.6218,0.5904,0.2344,0.8449,0.5532,0.6116,0.6486,0.4568,0.3905,0.8579,0.3502,0.8220,0.6257,0.7078,0.9104,0.8127,0.4687,0.6134,0.6232,0.8875,0.9864,0.7377,0.5317,0.8373,0.6038,0.5178,0.1309,0.4097,0.1682,0.6729,0.8768 +ViT-B-32-256,datacomp_s34b_b86k,0.6087,0.7281,0.9348,0.9653,0.8287,0.2489,0.2271,0.1968,0.6064,0.6469,0.3645,0.8909,0.5152,0.6065,0.6481,0.3757,0.4635,0.8344,0.2658,0.7939,0.5960,0.7822,0.9115,0.7880,0.5880,0.5294,0.6505,0.8990,0.9731,0.7021,0.6708,0.7486,0.4892,0.4300,0.0910,0.6252,0.0000,0.6238,0.8923 +RN50x64,openai,0.6061,0.7391,0.9026,0.8510,0.5985,0.2254,0.1994,0.2981,0.5314,0.5765,0.3103,0.9205,0.4792,0.5593,0.6706,0.7077,0.3830,0.8441,0.3094,0.8583,0.6820,0.7745,0.9360,0.7398,0.5387,0.7106,0.6265,0.7581,0.9829,0.6661,0.6044,0.7794,0.4683,0.3936,0.1469,0.5280,0.1939,0.6472,0.8898 +ViT-L-14,laion400m_e32,0.5971,0.7277,0.9266,0.9464,0.7741,0.2421,0.2452,0.2302,0.6053,0.6233,0.2490,0.9007,0.4989,0.5964,0.6545,0.4647,0.4190,0.8467,0.1997,0.7612,0.5969,0.7306,0.9170,0.7561,0.4968,0.5601,0.6741,0.8962,0.9808,0.7258,0.4955,0.7891,0.5137,0.3932,0.1254,0.4555,0.1708,0.6168,0.8839 +ViT-L-14,laion400m_e31,0.5964,0.7271,0.9259,0.9465,0.7738,0.2420,0.2452,0.2290,0.5973,0.6322,0.2462,0.9002,0.4965,0.5944,0.6547,0.4596,0.4225,0.8466,0.1997,0.7668,0.5962,0.7323,0.9154,0.7585,0.4877,0.5651,0.6710,0.8964,0.9804,0.7247,0.4956,0.7885,0.5129,0.3949,0.1239,0.4595,0.1651,0.6075,0.8831 
+EVA02-B-16,merged2b_s8b_b131k,0.5890,0.7472,0.9302,0.9846,0.8773,0.2125,0.2254,0.2136,0.5282,0.6635,0.2506,0.8943,0.4630,0.5771,0.6701,0.5396,0.3410,0.8244,0.2208,0.4729,0.6214,0.7245,0.9211,0.8019,0.5091,0.5415,0.6037,0.7855,0.9949,0.7064,0.2497,0.7873,0.5044,0.4722,0.1515,0.7095,0.1724,0.6086,0.8810
+convnext_base_w_320,laion_aesthetic_s13b_b82k_augreg,0.5869,0.7128,0.9255,0.8823,0.6515,0.2825,0.2225,0.2243,0.6074,0.5124,0.2632,0.8947,0.4365,0.5646,0.6362,0.4157,0.5075,0.8136,0.2180,0.7219,0.5237,0.7524,0.9239,0.7530,0.5696,0.5508,0.6421,0.8918,0.9755,0.7037,0.4443,0.8009,0.5142,0.4293,0.1392,0.5502,0.1215,0.6297,0.8935
+ViT-B-16,laion2b_s34b_b88k,0.5866,0.7023,0.9287,0.9494,0.7684,0.2149,0.2455,0.2029,0.5633,0.5346,0.2695,0.8663,0.4826,0.5608,0.6228,0.3823,0.4625,0.8061,0.1730,0.6577,0.5598,0.7084,0.9048,0.7886,0.5639,0.5969,0.6275,0.8848,0.9786,0.7085,0.5002,0.7807,0.5087,0.4601,0.1217,0.6249,0.1211,0.5841,0.8735
+convnext_base_w,laion2b_s13b_b82k_augreg,0.5835,0.7147,0.9258,0.9561,0.8021,0.3307,0.2450,0.2016,0.6144,0.4828,0.2235,0.8675,0.4654,0.5890,0.6329,0.3817,0.5110,0.8253,0.2068,0.6441,0.5732,0.7017,0.9191,0.7979,0.4823,0.5925,0.6056,0.9126,0.9705,0.7113,0.5376,0.7985,0.5222,0.4390,0.1285,0.3801,0.0000,0.5935,0.8881
+ViT-B-32,datacomp_xl_s13b_b90k,0.5795,0.6917,0.9230,0.9561,0.8031,0.1294,0.2423,0.1756,0.5713,0.5746,0.2463,0.8632,0.5185,0.5676,0.6075,0.3035,0.4975,0.7818,0.1632,0.8124,0.5510,0.7353,0.9002,0.8151,0.5284,0.4849,0.6343,0.8728,0.9654,0.6780,0.6240,0.7004,0.4534,0.4594,0.0863,0.6656,0.0000,0.5643,0.8731
+convnext_base_w,laion_aesthetic_s13b_b82k,0.5766,0.7099,0.9061,0.8305,0.6116,0.2960,0.1956,0.2228,0.6229,0.4519,0.2938,0.8847,0.4016,0.5546,0.6342,0.4123,0.4750,0.7986,0.2630,0.6739,0.5559,0.7170,0.9199,0.7548,0.5517,0.5579,0.6162,0.8661,0.9709,0.7143,0.2802,0.8093,0.5238,0.4764,0.1378,0.5859,0.1284,0.6343,0.8722
+convnext_base_w,laion2b_s13b_b82k,0.5761,0.7078,0.9222,0.9383,0.7519,0.2385,0.1866,0.2018,0.5957,0.5678,0.2825,0.8711,0.4930,0.5712,0.6234,0.3993,0.4815,0.8070,0.1505,0.5435,0.5795,0.6955,0.9189,0.8038,0.4154,0.6041,0.6284,0.8957,0.9775,0.7128,0.3459,0.7992,0.5171,0.4706,0.1181,0.4812,0.1072,0.6075,0.8802
+ViT-B-16-plus-240,laion400m_e32,0.5724,0.6919,0.9239,0.9273,0.7377,0.2387,0.2348,0.1894,0.5548,0.5820,0.1852,0.8734,0.4944,0.5442,0.6148,0.3689,0.4980,0.8049,0.2813,0.5709,0.5384,0.6886,0.9015,0.7636,0.5524,0.5799,0.6137,0.8448,0.9698,0.6985,0.3777,0.7730,0.4979,0.4069,0.1163,0.4876,0.1616,0.5923,0.8697
+ViT-B-16-plus-240,laion400m_e31,0.5713,0.6904,0.9219,0.9247,0.7329,0.2413,0.2346,0.1884,0.5548,0.5702,0.1861,0.8735,0.4897,0.5443,0.6138,0.3676,0.5030,0.8038,0.2799,0.5722,0.5374,0.6825,0.9035,0.7634,0.5512,0.5859,0.6144,0.8450,0.9689,0.6991,0.3767,0.7675,0.4954,0.4093,0.1164,0.4837,0.1618,0.5841,0.8689
+ViT-B-32,laion2b_s34b_b79k,0.5694,0.6656,0.9105,0.9358,0.7555,0.1535,0.2451,0.1667,0.5569,0.4806,0.2453,0.8269,0.4933,0.5366,0.5814,0.2627,0.4995,0.7643,0.2630,0.6996,0.4883,0.7024,0.9076,0.7910,0.5993,0.5728,0.6106,0.8607,0.9656,0.6872,0.4257,0.7544,0.4783,0.4479,0.0930,0.6392,0.1479,0.5666,0.8543
+RN50x16,openai,0.5670,0.7072,0.8856,0.8134,0.5209,0.1953,0.2095,0.2437,0.5266,0.4328,0.2783,0.9051,0.3984,0.5063,0.6420,0.5724,0.4495,0.7933,0.2307,0.6798,0.6071,0.7188,0.8956,0.6800,0.6249,0.6771,0.5883,0.7286,0.9775,0.6391,0.4548,0.7552,0.4538,0.3946,0.1079,0.6248,0.1593,0.6121,0.8539
+convnext_base_w_320,laion_aesthetic_s13b_b82k,0.5665,0.7167,0.9136,0.8613,0.5900,0.2283,0.2255,0.2237,0.5931,0.3519,0.2834,0.8930,0.4459,0.5639,0.6398,0.4225,0.4745,0.8054,0.0928,0.6647,0.5616,0.7165,0.9244,0.7240,0.4899,0.5541,0.6176,0.8821,0.9664,0.7161,0.2606,0.8236,0.5247,0.4610,0.1473,0.4729,0.1813,0.6273,0.8856
+xlm-roberta-base-ViT-B-32,laion5b_s13b_b90k,0.5643,0.6236,0.9079,0.9366,0.7654,0.1675,0.2025,0.1896,0.6037,0.6006,0.2692,0.8010,0.4561,0.5071,0.5425,0.2355,0.4825,0.7410,0.1814,0.7407,0.4607,0.6235,0.8690,0.7856,0.6423,0.5354,0.6137,0.8556,0.9668,0.6785,0.5532,0.7359,0.4566,0.4827,0.0801,0.5770,0.1292,0.5771,0.8647
+ViT-B-16,openai,0.5626,0.6834,0.8901,0.9077,0.6695,0.2123,0.2231,0.2282,0.4495,0.5594,0.2421,0.8872,0.4339,0.4824,0.6188,0.4995,0.4230,0.7770,0.2644,0.5135,0.5531,0.6907,0.8886,0.7831,0.5072,0.6068,0.5822,0.6477,0.9825,0.6435,0.5190,0.7218,0.4275,0.4316,0.1099,0.6808,0.1888,0.5876,0.8614
+ViT-B-16,laion400m_e32,0.5621,0.6705,0.9131,0.9172,0.7116,0.2869,0.2451,0.1810,0.5133,0.5019,0.1765,0.8613,0.4346,0.5238,0.5963,0.3324,0.5075,0.7793,0.1814,0.6624,0.5152,0.6691,0.8917,0.7684,0.5960,0.5437,0.5852,0.8373,0.9698,0.6961,0.3413,0.7458,0.4688,0.4326,0.1028,0.5999,0.1546,0.5935,0.8534
+ViT-B-16,laion400m_e31,0.5617,0.6698,0.9159,0.9169,0.7130,0.2889,0.2451,0.1804,0.5138,0.5033,0.1742,0.8587,0.4353,0.5233,0.5943,0.3327,0.5035,0.7777,0.1997,0.6531,0.5128,0.6693,0.8911,0.7678,0.5925,0.5459,0.5849,0.8365,0.9703,0.6958,0.3388,0.7451,0.4674,0.4225,0.1056,0.5976,0.1546,0.5946,0.8534
+convnext_base,laion400m_s13b_b51k,0.5576,0.6627,0.9151,0.8899,0.6462,0.2386,0.2209,0.1700,0.5404,0.4850,0.1556,0.8515,0.4551,0.5196,0.5859,0.3092,0.4925,0.7575,0.2925,0.6114,0.5058,0.6900,0.8853,0.7528,0.6116,0.5376,0.5683,0.8409,0.9656,0.6845,0.4038,0.7438,0.4615,0.4045,0.1095,0.6565,0.1589,0.5537,0.8530
+coca_ViT-B-32,laion2b_s13b_b90k,0.5547,0.6359,0.9115,0.9389,0.7396,0.1889,0.2057,0.1444,0.5388,0.4615,0.1882,0.7901,0.4474,0.5139,0.5569,0.2160,0.4995,0.7352,0.2686,0.7148,0.4518,0.6296,0.8875,0.7805,0.5974,0.5772,0.6010,0.8414,0.9634,0.6751,0.5519,0.7297,0.4560,0.4588,0.0943,0.5609,0.1088,0.5736,0.8447
+ViT-B-32,laion2b_e16,0.5483,0.6565,0.9104,0.9403,0.7544,0.1923,0.2310,0.1652,0.5383,0.5030,0.2298,0.8166,0.3655,0.5287,0.5739,0.2615,0.5030,0.7588,0.1758,0.6347,0.4877,0.6732,0.8903,0.7877,0.5072,0.5437,0.6190,0.8437,0.9653,0.6851,0.4164,0.7539,0.4768,0.4602,0.0971,0.4648,0.0000,0.5724,0.8526
+roberta-ViT-B-32,laion2b_s12b_b32k,0.5411,0.6171,0.9039,0.9325,0.7505,0.1472,0.2007,0.1472,0.5920,0.5215,0.1725,0.7812,0.4082,0.4912,0.5331,0.2120,0.5075,0.7224,0.3854,0.6636,0.4499,0.5893,0.8670,0.7804,0.4985,0.5420,0.6117,0.8315,0.9564,0.6627,0.4526,0.7302,0.4590,0.4583,0.0606,0.4098,0.1161,0.5549,0.8426
+ViT-B-16,datacomp_l_s1b_b8k,0.5372,0.6310,0.8969,0.9381,0.7540,0.2314,0.2513,0.1434,0.4691,0.5011,0.1001,0.8311,0.4343,0.4976,0.5521,0.2545,0.4955,0.7177,0.4008,0.5400,0.5298,0.6261,0.8352,0.8089,0.4973,0.5294,0.5273,0.7718,0.9576,0.6431,0.4595,0.6428,0.4045,0.4465,0.0729,0.5000,0.0976,0.5748,0.8493
+ViT-B-16,commonpool_l_clip_s1b_b8k,0.5294,0.5777,0.8853,0.9349,0.7313,0.2691,0.2313,0.1417,0.4500,0.4728,0.0822,0.7995,0.4657,0.4589,0.4995,0.2165,0.4950,0.6843,0.3755,0.7032,0.4914,0.5667,0.7561,0.7821,0.4962,0.5036,0.5295,0.8171,0.9496,0.6295,0.5985,0.5956,0.3658,0.4359,0.0741,0.4920,0.1257,0.5818,0.8501
+ViT-B-32-quickgelu,laion400m_e32,0.5272,0.6293,0.9118,0.9074,0.7029,0.1624,0.2391,0.1475,0.5457,0.5143,0.1658,0.8086,0.4197,0.4939,0.5506,0.2172,0.5345,0.7342,0.2897,0.3733,0.4389,0.6620,0.8671,0.7582,0.5592,0.5228,0.5454,0.7926,0.9560,0.6700,0.3039,0.7025,0.4395,0.4072,0.0745,0.4709,0.1296,0.5491,0.8380
+ViT-B-32-quickgelu,laion400m_e31,0.5263,0.6294,0.9121,0.9060,0.7021,0.1659,0.2397,0.1476,0.5447,0.5085,0.1675,0.8080,0.4230,0.4937,0.5487,0.2161,0.5335,0.7349,0.2911,0.3656,0.4374,0.6638,0.8629,0.7539,0.5543,0.5217,0.5446,0.7914,0.9553,0.6702,0.3144,0.7022,0.4395,0.4034,0.0788,0.4554,0.1310,0.5467,0.8363
+ViT-B-32-quickgelu,openai,0.5245,0.6332,0.8758,0.8983,0.6423,0.2320,0.2335,0.1720,0.4436,0.5044,0.1953,0.8400,0.3258,0.4229,0.5592,0.3155,0.4775,0.6933,0.2743,0.4839,0.4431,0.6670,0.8700,0.7640,0.6224,0.5865,0.5362,0.5963,0.9713,0.6248,0.3159,0.6884,0.4028,0.4125,0.0732,0.6061,0.1676,0.5386,0.8217
+ViT-B-32,openai,0.5245,0.6332,0.8758,0.8983,0.6423,0.2320,0.2335,0.1720,0.4436,0.5044,0.1953,0.8400,0.3258,0.4229,0.5592,0.3155,0.4775,0.6933,0.2743,0.4839,0.4431,0.6670,0.8700,0.7640,0.6224,0.5865,0.5362,0.5963,0.9713,0.6248,0.3159,0.6884,0.4028,0.4125,0.0732,0.6061,0.1676,0.5386,0.8217
+RN50x4,openai,0.5188,0.6627,0.8661,0.7943,0.4514,0.2045,0.0905,0.2039,0.4862,0.3354,0.2102,0.8640,0.3622,0.4468,0.5944,0.4145,0.4955,0.7274,0.2335,0.4903,0.5141,0.6766,0.8829,0.6814,0.5675,0.6716,0.5338,0.6673,0.9658,0.6089,0.3190,0.7234,0.4318,0.3912,0.0870,0.5435,0.1130,0.5654,0.8376
+ViT-B-32,laion400m_e31,0.5077,0.6022,0.8916,0.8825,0.6781,0.1549,0.2261,0.1356,0.5218,0.4694,0.1437,0.7814,0.4082,0.4648,0.5234,0.1957,0.5085,0.7079,0.1224,0.4108,0.4281,0.6319,0.8541,0.7312,0.5495,0.5162,0.5108,0.7436,0.9494,0.6508,0.2891,0.6890,0.4327,0.4262,0.0745,0.4975,0.1076,0.5491,0.8328
+ViT-B-32,laion400m_e32,0.5074,0.6024,0.8918,0.8840,0.6773,0.1536,0.2261,0.1349,0.5229,0.4754,0.1467,0.7817,0.4070,0.4646,0.5237,0.1953,0.5080,0.7084,0.1181,0.4000,0.4292,0.6323,0.8513,0.7328,0.5490,0.5206,0.5094,0.7454,0.9498,0.6509,0.2759,0.6866,0.4337,0.4265,0.0741,0.5084,0.1068,0.5444,0.8326
+RN101-quickgelu,openai,0.5033,0.6228,0.8527,0.8078,0.4764,0.2437,0.0923,0.1693,0.4335,0.3131,0.1853,0.8367,0.3753,0.4106,0.5612,0.2944,0.5085,0.6817,0.2644,0.5254,0.4515,0.6532,0.8652,0.6512,0.5819,0.6403,0.5476,0.6100,0.9680,0.5803,0.3185,0.6852,0.4025,0.4130,0.0888,0.4723,0.1615,0.5631,0.8164
+RN101,openai,0.5033,0.6228,0.8527,0.8078,0.4764,0.2437,0.0923,0.1693,0.4335,0.3131,0.1853,0.8367,0.3753,0.4106,0.5612,0.2944,0.5085,0.6817,0.2644,0.5254,0.4515,0.6532,0.8652,0.6512,0.5819,0.6403,0.5476,0.6100,0.9680,0.5803,0.3185,0.6852,0.4025,0.4130,0.0888,0.4723,0.1615,0.5631,0.8164
+ViT-B-16,commonpool_l_laion_s1b_b8k,0.5011,0.5526,0.8766,0.9296,0.7184,0.2681,0.2173,0.1119,0.4144,0.4115,0.0714,0.7661,0.3296,0.4315,0.4790,0.2004,0.4930,0.6501,0.3432,0.4753,0.4638,0.5023,0.7769,0.7686,0.5158,0.5228,0.5314,0.6760,0.9409,0.6278,0.4301,0.6447,0.3924,0.4476,0.0490,0.5127,0.1026,0.5514,0.8463
+ViT-B-16,commonpool_l_image_s1b_b8k,0.4812,0.5719,0.8856,0.9321,0.6955,0.2143,0.2453,0.1308,0.4170,0.3193,0.0735,0.7797,0.2514,0.4343,0.4872,0.2143,0.4725,0.6356,0.3826,0.2219,0.4793,0.4817,0.7784,0.7841,0.5002,0.4986,0.4622,0.6627,0.9489,0.6335,0.2673,0.6026,0.3622,0.4787,0.0424,0.5000,0.0000,0.5946,0.8422
+RN50-quickgelu,openai,0.4810,0.5982,0.8329,0.7157,0.4030,0.2171,0.1623,0.1542,0.4154,0.4081,0.1703,0.8080,0.3510,0.3544,0.5284,0.2327,0.5720,0.6073,0.1730,0.5755,0.4141,0.6522,0.8529,0.6510,0.6393,0.5645,0.4521,0.5453,0.9419,0.5994,0.2883,0.6868,0.3869,0.3622,0.0623,0.5624,0.0000,0.5222,0.8129
+RN50,openai,0.4810,0.5982,0.8329,0.7157,0.4030,0.2171,0.1623,0.1542,0.4154,0.4081,0.1703,0.8080,0.3510,0.3544,0.5284,0.2327,0.5720,0.6073,0.1730,0.5755,0.4141,0.6522,0.8529,0.6510,0.6393,0.5645,0.4521,0.5453,0.9419,0.5994,0.2883,0.6868,0.3869,0.3622,0.0623,0.5624,0.0000,0.5222,0.8129
+ViT-B-16,commonpool_l_text_s1b_b8k,0.4760,0.5605,0.8720,0.9391,0.7054,0.1843,0.2373,0.0995,0.3941,0.3830,0.0451,0.7724,0.2317,0.4437,0.4835,0.2220,0.4770,0.6708,0.2686,0.2593,0.4911,0.5164,0.7049,0.7669,0.4857,0.4931,0.4663,0.6525,0.9523,0.6088,0.2122,0.6078,0.3730,0.4570,0.0623,0.5697,0.0000,0.5643,0.8564
+ViT-B-16,commonpool_l_basic_s1b_b8k,0.4585,0.5155,0.8444,0.8289,0.5251,0.2061,0.2277,0.1173,0.4133,0.3820,0.0481,0.7461,0.2021,0.3932,0.4325,0.1913,0.4600,0.6087,0.3333,0.2809,0.4493,0.4357,0.6956,0.7151,0.5899,0.5387,0.4313,0.7216,0.9373,0.5974,0.1173,0.6015,0.3583,0.4812,0.0436,0.5712,0.0000,0.5421,0.8384
+ViT-B-16,commonpool_l_s1b_b8k,0.4370,0.4593,0.8089,0.9133,0.6421,0.1594,0.2203,0.1177,0.3383,0.3348,0.0316,0.6735,0.2766,0.3448,0.3914,0.1592,0.4335,0.5265,0.2686,0.3603,0.4126,0.3681,0.5587,0.7093,0.5516,0.5118,0.4154,0.6060,0.9339,0.5713,0.3047,0.4948,0.2855,0.4777,0.0399,0.5102,0.0000,0.5654,0.8305
+ViT-B-32,datacomp_m_s128m_b4k,0.3281,0.2972,0.7159,0.8252,0.5476,0.1365,0.2249,0.0453,0.2133,0.3393,0.0304,0.4168,0.1366,0.1930,0.2440,0.0493,0.4085,0.3402,0.2110,0.1147,0.1971,0.2965,0.4311,0.5459,0.5862,0.5316,0.2778,0.2803,0.8365,0.3637,0.1500,0.2241,0.1407,0.3287,0.0142,0.6669,0.0000,0.4498,0.6559
+ViT-B-32,commonpool_m_clip_s128m_b4k,0.3278,0.2725,0.6678,0.8405,0.5549,0.1402,0.2238,0.0458,0.2176,0.2589,0.0215,0.3999,0.1586,0.1844,0.2247,0.0420,0.3925,0.3297,0.3235,0.1778,0.2093,0.2551,0.3828,0.6074,0.5210,0.5014,0.2641,0.4123,0.8370,0.3875,0.1931,0.2465,0.1476,0.3581,0.0154,0.5369,0.0000,0.4451,0.6610
+RN50-quickgelu,cc12m,0.3260,0.3647,0.6581,0.5404,0.2079,0.2063,0.1574,0.0431,0.1910,0.2146,0.0226,0.4392,0.1284,0.2412,0.3098,0.0759,0.4160,0.4468,0.3713,0.1261,0.2320,0.2383,0.5651,0.4394,0.5033,0.4789,0.2137,0.1837,0.8751,0.4442,0.0918,0.5373,0.2891,0.3876,0.0476,0.5000,0.0000,0.4883,0.7119
+RN50,cc12m,0.3247,0.3591,0.6432,0.5241,0.2093,0.2076,0.1576,0.0422,0.2074,0.2202,0.0178,0.4241,0.1155,0.2354,0.3065,0.0763,0.4165,0.4466,0.3713,0.0919,0.2326,0.2465,0.5504,0.4700,0.5035,0.4871,0.2351,0.1818,0.8696,0.4440,0.0923,0.5357,0.2890,0.3828,0.0464,0.5000,0.0000,0.4907,0.7086
+ViT-B-32,commonpool_m_image_s128m_b4k,0.3118,0.2678,0.6650,0.7815,0.5203,0.1298,0.2248,0.0466,0.1910,0.2261,0.0219,0.3553,0.1513,0.1623,0.2183,0.0385,0.3795,0.2959,0.2996,0.1079,0.1837,0.2383,0.3482,0.6147,0.5742,0.5266,0.2275,0.1593,0.8171,0.3706,0.1294,0.2303,0.1354,0.4026,0.0149,0.6905,0.0000,0.4638,0.6397
+ViT-B-32,commonpool_m_text_s128m_b4k,0.3066,0.2548,0.6632,0.8164,0.5133,0.1891,0.2449,0.0355,0.1995,0.3587,0.0212,0.3568,0.1048,0.1655,0.2142,0.0431,0.3705,0.3107,0.2897,0.1034,0.1889,0.2184,0.2991,0.5355,0.5495,0.5008,0.2627,0.1935,0.7966,0.3535,0.1265,0.2386,0.1452,0.3618,0.0063,0.5336,0.0000,0.4544,0.6317
+RN101-quickgelu,yfcc15m,0.2993,0.3487,0.5437,0.5298,0.2262,0.1609,0.2504,0.0683,0.1851,0.2030,0.0420,0.4686,0.0940,0.0888,0.3003,0.1568,0.3370,0.2643,0.2068,0.1239,0.1988,0.4942,0.2970,0.4603,0.5004,0.4992,0.2138,0.0373,0.8661,0.4085,0.0781,0.4422,0.2326,0.3147,0.0357,0.5000,0.0546,0.4930,0.6483
+RN101,yfcc15m,0.2988,0.3407,0.5538,0.5048,0.2197,0.1369,0.2257,0.0699,0.1899,0.2076,0.0443,0.4729,0.1092,0.0888,0.2933,0.1611,0.3240,0.2629,0.2138,0.1086,0.1991,0.4886,0.3068,0.4886,0.5013,0.4920,0.2011,0.0381,0.8803,0.4235,0.1348,0.4426,0.2404,0.2987,0.0371,0.5000,0.0000,0.5035,0.6509
+ViT-B-32,commonpool_m_laion_s128m_b4k,0.2919,0.2304,0.6312,0.7744,0.5009,0.1623,0.2261,0.0345,0.2043,0.1880,0.0169,0.3131,0.0906,0.1515,0.1895,0.0424,0.3480,0.2801,0.2827,0.1520,0.1763,0.2090,0.2973,0.5302,0.6225,0.4964,0.2470,0.2189,0.7774,0.3327,0.0881,0.2276,0.1353,0.3348,0.0167,0.5054,0.0000,0.4357,0.6234
+ViT-B-32,commonpool_m_basic_s128m_b4k,0.2849,0.2255,0.6118,0.6321,0.3531,0.1417,0.2217,0.0423,0.1973,0.2191,0.0155,0.3165,0.1225,0.1434,0.1820,0.0383,0.3505,0.2684,0.2982,0.1229,0.1754,0.1853,0.2752,0.5323,0.5402,0.5014,0.2305,0.2900,0.7793,0.3490,0.0638,0.2187,0.1333,0.4015,0.0133,0.5137,0.0285,0.4591,0.6322
+RN50,yfcc15m,0.2811,0.3238,0.5095,0.4943,0.1862,0.1315,0.2003,0.0642,0.1745,0.1811,0.0373,0.4304,0.0844,0.0729,0.2806,0.1371,0.3265,0.2231,0.2602,0.1004,0.1824,0.4680,0.2777,0.3888,0.5331,0.4992,0.1494,0.0429,0.8161,0.3999,0.0640,0.4106,0.2236,0.3023,0.0324,0.5256,0.0501,0.4673,0.6289
+RN50-quickgelu,yfcc15m,0.2776,0.3275,0.5089,0.4919,0.2033,0.1305,0.1990,0.0637,0.1729,0.1596,0.0371,0.4493,0.0956,0.0715,0.2793,0.1373,0.3315,0.2220,0.2560,0.0924,0.1772,0.4718,0.2771,0.3845,0.5131,0.4992,0.1424,0.0407,0.7914,0.3919,0.0642,0.4045,0.2182,0.3130,0.0261,0.5058,0.0000,0.4638,0.6343
+ViT-B-32,commonpool_m_s128m_b4k,0.2580,0.1755,0.5231,0.7459,0.4391,0.1263,0.2265,0.0362,0.1606,0.2537,0.0115,0.2342,0.0869,0.0952,0.1440,0.0388,0.2780,0.1983,0.2743,0.0933,0.1574,0.1128,0.1676,0.5448,0.5048,0.5003,0.1810,0.1332,0.7690,0.3066,0.0933,0.1599,0.0974,0.3983,0.0127,0.5015,0.0000,0.4276,0.5942
+ViT-B-32,commonpool_s_clip_s13m_b4k,0.1731,0.0505,0.2483,0.4768,0.1937,0.1529,0.2313,0.0119,0.0782,0.2067,0.0083,0.0801,0.0732,0.0200,0.0380,0.0181,0.1380,0.0655,0.2785,0.0874,0.0506,0.0539,0.0796,0.3379,0.6367,0.5014,0.0806,0.0276,0.5353,0.1126,0.1166,0.0343,0.0224,0.2994,0.0004,0.6874,0.0000,0.2605,0.2827
+ViT-B-32,commonpool_s_text_s13m_b4k,0.1573,0.0460,0.2231,0.4679,0.1844,0.1350,0.1899,0.0121,0.0670,0.0896,0.0139,0.0618,0.0411,0.0175,0.0398,0.0187,0.1270,0.0606,0.3980,0.0771,0.0494,0.0428,0.0581,0.2942,0.5027,0.5008,0.1029,0.0204,0.5019,0.1051,0.0933,0.0424,0.0214,0.3120,0.0015,0.5000,0.0000,0.2745,0.2843
+ViT-B-32,commonpool_s_image_s13m_b4k,0.1449,0.0392,0.2238,0.3176,0.1329,0.1121,0.2217,0.0109,0.0521,0.1593,0.0120,0.0604,0.0579,0.0186,0.0308,0.0155,0.1055,0.0578,0.2883,0.0991,0.0436,0.0528,0.0474,0.2666,0.5273,0.4646,0.0794,0.0173,0.4601,0.0725,0.1305,0.0171,0.0130,0.2525,0.0033,0.5425,0.0085,0.2150,0.2752
+ViT-B-32,datacomp_s_s13m_b4k,0.1449,0.0392,0.2238,0.3176,0.1329,0.1121,0.2217,0.0109,0.0521,0.1593,0.0120,0.0604,0.0579,0.0186,0.0308,0.0155,0.1055,0.0578,0.2883,0.0991,0.0436,0.0528,0.0474,0.2666,0.5273,0.4646,0.0794,0.0173,0.4601,0.0725,0.1305,0.0171,0.0130,0.2525,0.0033,0.5425,0.0085,0.2150,0.2752
+ViT-B-32,commonpool_s_basic_s13m_b4k,0.1423,0.0377,0.1806,0.2664,0.1154,0.1245,0.2335,0.0120,0.0553,0.0587,0.0103,0.0588,0.0638,0.0151,0.0319,0.0203,0.0985,0.0499,0.3390,0.1085,0.0440,0.0351,0.0488,0.3081,0.5096,0.4986,0.0795,0.0200,0.4659,0.0879,0.0810,0.0328,0.0168,0.3033,0.0003,0.5001,0.0000,0.2325,0.2643
+ViT-B-32,commonpool_s_s13m_b4k,0.1420,0.0270,0.1564,0.4079,0.1296,0.1305,0.2233,0.0126,0.0574,0.1487,0.0081,0.0473,0.0654,0.0108,0.0234,0.0141,0.1000,0.0404,0.3460,0.0708,0.0360,0.0338,0.0443,0.2235,0.5268,0.5008,0.0698,0.0143,0.4266,0.0766,0.1121,0.0257,0.0132,0.3126,0.0002,0.5124,0.0000,0.2290,0.2167
+ViT-B-32,commonpool_s_laion_s13m_b4k,0.1332,0.0305,0.1549,0.3364,0.1347,0.1309,0.1299,0.0098,0.0553,0.1578,0.0134,0.0501,0.0538,0.0125,0.0271,0.0147,0.1015,0.0443,0.2518,0.1387,0.0369,0.0244,0.0399,0.3030,0.4216,0.4992,0.0583,0.0155,0.4874,0.0659,0.1473,0.0223,0.0121,0.2410,0.0017,0.3703,0.0000,0.2079,0.2580
+coca_ViT-B-32,mscoco_finetuned_laion2b_s13b_b90k,0.1064,0.0091,0.0441,0.2002,0.0173,0.1315,0.2019,0.0047,0.0452,0.0844,0.0139,0.0177,0.0298,0.0034,0.0086,0.0091,0.0230,0.0158,0.2714,0.1442,0.0159,0.0131,0.0438,0.1247,0.5183,0.4992,0.0589,0.0058,0.2913,0.0211,0.1519,0.0104,0.0061,0.2375,0.0003,0.5140,0.0000,0.1729,0.0814