Commit

update

Yanqing0327 committed Nov 26, 2024
1 parent ed0ea55 commit 12a55d0
Showing 148 changed files with 8,377 additions and 5 deletions.
63 changes: 58 additions & 5 deletions README.md

**Official implementation of the paper "_CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions_".**

---

## **Authors**

- [Yanqing Liu](https://yanqing0327.github.io/Yanqing.github.io/)<sup>1</sup>, [Xianhang Li](https://xhl-video.github.io/xianhangli/)<sup>1</sup>, [Zeyu Wang](https://zw615.github.io/)<sup>1</sup>, [Bingchen Zhao](https://bzhao.me/)<sup>2</sup>, [Cihang Xie](https://cihangxie.github.io/)<sup>1</sup>

<sup>1</sup>UC Santa Cruz, <sup>2</sup>University of Edinburgh

---

## **Proposed Method**

![Method Pipeline](./docs/resources/method.jpg)

Previous works show that noisy, web-crawled image-text pairs can limit vision-language pretraining frameworks such as CLIP, and they propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions:

1. Observing a strong inverse effect with synthetic captions, we feed only **partial synthetic captions** to the text encoder, which yields significantly better performance (a minimal sketch of this sub-caption sampling is shown below).
2. We incorporate an **autoregressive captioner** that mimics the recaptioning process, predicting the full-length synthetic caption conditioned on the image and the original web-crawled caption.

Our method achieves **state-of-the-art (SOTA)** results in zero-shot image-text retrieval on MSCOCO and Flickr30K, while enhancing the visual capability of LLaVA.
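
To make the sub-caption idea concrete, here is a minimal, hypothetical sketch of sampling a partial synthetic caption before tokenization. The helper name `sample_partial_caption`, the word-level truncation, and the length range are illustrative assumptions; the exact sub-caption sampling scheme used in CLIPS may differ.

```python
import random

def sample_partial_caption(synthetic_caption: str, min_words: int = 8, max_words: int = 32) -> str:
    """Keep only a prefix of a long synthetic caption (illustrative scheme)."""
    words = synthetic_caption.split()
    k = random.randint(min_words, max_words)  # sampled target length in words
    return " ".join(words[:k])                # full caption is kept if already shorter

full_caption = (
    "A golden retriever lies on a wooden porch in warm afternoon light, "
    "its head resting on its front paws while scattered autumn leaves surround it."
)
partial_caption = sample_partial_caption(full_caption)

# Only `partial_caption` would be tokenized and fed to the contrastive text
# encoder, while the autoregressive captioner is still trained to decode the
# full-length synthetic caption from the image and the web-crawled caption.
print(partial_caption)
```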

---

## **Key Results**

### **Inverse Effect with Synthetic Captions**
Replacing OpenAI-CLIP with **CLIPS** significantly boosts LLaVA's performance across benchmarks.
| Model | Link |
|:-----:|:-----:|
| CLIPS-Large-14 | [🤗 HuggingFace Model](https://huggingface.co/UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B) |
| CLIPS-Huge-14 | Coming Soon... |

## **Model Usage**
### **Environment**
Install dependencies:
```bash
pip3 install -r requirements.txt
```
### **With OpenCLIP**
```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

model, preprocess = create_model_from_pretrained('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')
tokenizer = get_tokenizer('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')

# Download and preprocess the example image
image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

# Tokenize the candidate captions
text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

# Encode and L2-normalize image and text features
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

# Scaled cosine similarities, turned into probabilities over the captions
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0., 0., 0., 1.0]]
```
#### Note: Due to differences in the default LayerNorm epsilon values between JAX and PyTorch, we modified `open_clip/transformer.py` to align the model's behavior.
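
For context, a minimal sketch of the mismatch the note refers to: Flax's `LayerNorm` defaults to `epsilon=1e-6`, whereas PyTorch's `torch.nn.LayerNorm` defaults to `eps=1e-5`. The snippet below only illustrates this default difference; the actual changes made in `open_clip/transformer.py` may go beyond it.

```python
import torch.nn as nn

default_ln = nn.LayerNorm(1024)                # PyTorch default eps=1e-5
jax_aligned_ln = nn.LayerNorm(1024, eps=1e-6)  # matches the Flax/JAX default of 1e-6

print(default_ln.eps, jax_aligned_ln.eps)  # 1e-05 1e-06
```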
## Acknowledgement

This PyTorch repo is built on [OpenCLIP](https://github.com/mlfoundations/open_clip).
Many thanks to the open-source community for their awesome work!

We would like to thank the TPU Research Cloud (TRC) program, the Google Cloud Research Credits program, and the AWS Cloud Credit for Research program for supporting our computing needs.
<!-- ---
## **Citation**
25 changes: 25 additions & 0 deletions inference.py
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

model, preprocess = create_model_from_pretrained('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')
tokenizer = get_tokenizer('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')

# Download and preprocess the example image
image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

# Tokenize the candidate captions
text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

# Encode and L2-normalize image and text features
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

# Scaled cosine similarities, turned into probabilities over the captions
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0., 0., 0., 1.0]]
18 changes: 18 additions & 0 deletions open_clip/__init__.py
from .version import __version__

from .coca_model import CoCa
from .constants import OPENAI_DATASET_MEAN, OPENAI_DATASET_STD
from .factory import create_model, create_model_and_transforms, create_model_from_pretrained, get_tokenizer, create_loss
from .factory import list_models, add_model_config, get_model_config, load_checkpoint
from .loss import ClipLoss, DistillClipLoss, CoCaLoss
from .model import CLIP, CustomTextCLIP, CLIPTextCfg, CLIPVisionCfg, \
convert_weights_to_lp, convert_weights_to_fp16, trace_model, get_cast_dtype, get_input_dtype, \
get_model_tokenize_cfg, get_model_preprocess_cfg, set_model_preprocess_cfg
from .openai import load_openai_model, list_openai_models
from .pretrained import list_pretrained, list_pretrained_models_by_tag, list_pretrained_tags_by_model, \
get_pretrained_url, download_pretrained_from_url, is_pretrained_cfg, get_pretrained_cfg, download_pretrained
from .push_to_hf_hub import push_pretrained_to_hf_hub, push_to_hf_hub
from .tokenizer import SimpleTokenizer, tokenize, decode
from .transform import image_transform, AugmentationCfg
from .zero_shot_classifier import build_zero_shot_classifier, build_zero_shot_classifier_legacy
from .zero_shot_metadata import OPENAI_IMAGENET_TEMPLATES, SIMPLE_IMAGENET_TEMPLATES, IMAGENET_CLASSNAMES
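
As a hedged illustration of how these exports fit together, the sketch below builds a zero-shot classifier with `build_zero_shot_classifier`, `OPENAI_IMAGENET_TEMPLATES`, and `IMAGENET_CLASSNAMES`. The argument names follow upstream OpenCLIP and are assumed to be unchanged here; the hub id is the CLIPS-Large-14 checkpoint listed above, and running the full 1000-class build on CPU is slow.

```python
import torch
from open_clip import (
    create_model_and_transforms,
    get_tokenizer,
    build_zero_shot_classifier,
    IMAGENET_CLASSNAMES,
    OPENAI_IMAGENET_TEMPLATES,
)

# Load the model with its train/eval transforms and the matching tokenizer.
model, _, preprocess = create_model_and_transforms('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')
tokenizer = get_tokenizer('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')

# Encode each class name under the OpenAI prompt templates and average the
# text embeddings into one L2-normalized classifier weight per class.
with torch.no_grad():
    classifier = build_zero_shot_classifier(
        model,
        tokenizer=tokenizer,
        classnames=IMAGENET_CLASSNAMES,
        templates=OPENAI_IMAGENET_TEMPLATES,
        num_classes_per_batch=10,
        device='cpu',
    )

print(classifier.shape)  # (embed_dim, num_classes)
```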
Binary files added (not shown): compiled bytecode under `open_clip/__pycache__/` (*.cpython-312.pyc) and `open_clip/bpe_simple_vocab_16e6.txt.gz`.
