Commit

update

Yanqing0327 committed Nov 26, 2024
1 parent ed0ea55 commit 12a55d0
Showing 148 changed files with 8,377 additions and 5 deletions.
63 changes: 58 additions & 5 deletions README.md

**Official implementation of the paper "_CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions_".**

---

## **Authors**

- [Yanqing Liu](https://yanqing0327.github.io/Yanqing.github.io/)<sup>1</sup>, [Xianhang Li](https://xhl-video.github.io/xianhangli/)<sup>1</sup>, [Zeyu Wang](https://zw615.github.io/)<sup>1</sup>, [Bingchen Zhao](https://bzhao.me/)<sup>2</sup>, [Cihang Xie](https://cihangxie.github.io/)<sup>1</sup>

<sup>1</sup>UC Santa Cruz, <sup>2</sup>University of Edinburgh

---

## **Proposed Method**

![Method Pipeline](./docs/resources/method.jpg)

Previous works show that noisy, web-crawled image-text pairs can limit vision-language pretraining frameworks such as CLIP, and they propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions:

1. Observing a strong inverse effect with synthetic captions, we feed only **partial synthetic captions** to the text encoder, which yields significantly better performance (a minimal sketch of this sub-caption sampling is shown below).
2. We incorporate an **autoregressive captioner** that mimics the recaptioning process, predicting the full-length synthetic caption conditioned on the image and the original web-crawled caption.

Our method achieves **state-of-the-art (SOTA)** results in zero-shot image-text retrieval on MSCOCO and Flickr30K, while enhancing the visual capability of LLaVA.
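
To make the sub-caption idea concrete, here is a minimal, hypothetical sketch of sampling a partial synthetic caption before tokenization. The helper name `sample_partial_caption`, the word-level truncation, and the length range are illustrative assumptions; the exact sub-caption sampling scheme used in CLIPS may differ.

```python
import random

def sample_partial_caption(synthetic_caption: str, min_words: int = 8, max_words: int = 32) -> str:
    """Keep only a prefix of a long synthetic caption (illustrative scheme)."""
    words = synthetic_caption.split()
    k = random.randint(min_words, max_words)  # sampled target length in words
    return " ".join(words[:k])                # full caption is kept if already shorter

full_caption = (
    "A golden retriever lies on a wooden porch in warm afternoon light, "
    "its head resting on its front paws while scattered autumn leaves surround it."
)
partial_caption = sample_partial_caption(full_caption)

# Only `partial_caption` would be tokenized and fed to the contrastive text
# encoder, while the autoregressive captioner is still trained to decode the
# full-length synthetic caption from the image and the web-crawled caption.
print(partial_caption)
```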

---

## **Key Results**

### **Inverse Effect with Synthetic Captions**
Replacing OpenAI-CLIP with **CLIPS** significantly boosts LLaVA's performance across benchmarks.
| Model | Link |
|:-----:|:-----:|
| CLIPS-Large-14 | [🤗 HuggingFace Model](https://huggingface.co/UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B) |
| CLIPS-Huge-14 | Coming Soon... |

## **Model Usage**
### **Environment**
Install dependencies:
```bash
pip3 install -r requirements.txt
```
### **With OpenCLIP**
```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

model, preprocess = create_model_from_pretrained('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')
tokenizer = get_tokenizer('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')

# Download and preprocess the example image
image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

# Tokenize the candidate captions
text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

# Encode and L2-normalize image and text features
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

# Scaled cosine similarities, turned into probabilities over the captions
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0., 0., 0., 1.0]]
```
#### Note: Due to differences in the default LayerNorm epsilon values between JAX and PyTorch, we modified `open_clip/transformer.py` to align the model's behavior.
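
For context, a minimal sketch of the mismatch the note refers to: Flax's `LayerNorm` defaults to `epsilon=1e-6`, whereas PyTorch's `torch.nn.LayerNorm` defaults to `eps=1e-5`. The snippet below only illustrates this default difference; the actual changes made in `open_clip/transformer.py` may go beyond it.

```python
import torch.nn as nn

default_ln = nn.LayerNorm(1024)                # PyTorch default eps=1e-5
jax_aligned_ln = nn.LayerNorm(1024, eps=1e-6)  # matches the Flax/JAX default of 1e-6

print(default_ln.eps, jax_aligned_ln.eps)  # 1e-05 1e-06
```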
## Acknowledgement

This PyTorch repo is built on [OpenCLIP](https://github.com/mlfoundations/open_clip).
Many thanks to the open-source community for their awesome work!

We would like to thank the TPU Research Cloud (TRC) program, the Google Cloud Research Credits program, and the AWS Cloud Credit for Research program for supporting our computing needs.
<!-- ---
## **Citation**
25 changes: 25 additions & 0 deletions inference.py
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

model, preprocess = create_model_from_pretrained('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')
tokenizer = get_tokenizer('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')

# Download and preprocess the example image
image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

# Tokenize the candidate captions
text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

# Encode and L2-normalize image and text features
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

# Scaled cosine similarities, turned into probabilities over the captions
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0., 0., 0., 1.0]]
18 changes: 18 additions & 0 deletions open_clip/__init__.py
from .version import __version__

from .coca_model import CoCa
from .constants import OPENAI_DATASET_MEAN, OPENAI_DATASET_STD
from .factory import create_model, create_model_and_transforms, create_model_from_pretrained, get_tokenizer, create_loss
from .factory import list_models, add_model_config, get_model_config, load_checkpoint
from .loss import ClipLoss, DistillClipLoss, CoCaLoss
from .model import CLIP, CustomTextCLIP, CLIPTextCfg, CLIPVisionCfg, \
convert_weights_to_lp, convert_weights_to_fp16, trace_model, get_cast_dtype, get_input_dtype, \
get_model_tokenize_cfg, get_model_preprocess_cfg, set_model_preprocess_cfg
from .openai import load_openai_model, list_openai_models
from .pretrained import list_pretrained, list_pretrained_models_by_tag, list_pretrained_tags_by_model, \
get_pretrained_url, download_pretrained_from_url, is_pretrained_cfg, get_pretrained_cfg, download_pretrained
from .push_to_hf_hub import push_pretrained_to_hf_hub, push_to_hf_hub
from .tokenizer import SimpleTokenizer, tokenize, decode
from .transform import image_transform, AugmentationCfg
from .zero_shot_classifier import build_zero_shot_classifier, build_zero_shot_classifier_legacy
from .zero_shot_metadata import OPENAI_IMAGENET_TEMPLATES, SIMPLE_IMAGENET_TEMPLATES, IMAGENET_CLASSNAMES
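
As a hedged illustration of how these exports fit together, the sketch below builds a zero-shot classifier with `build_zero_shot_classifier`, `OPENAI_IMAGENET_TEMPLATES`, and `IMAGENET_CLASSNAMES`. The argument names follow upstream OpenCLIP and are assumed to be unchanged here; the hub id is the CLIPS-Large-14 checkpoint listed above, and running the full 1000-class build on CPU is slow.

```python
import torch
from open_clip import (
    create_model_and_transforms,
    get_tokenizer,
    build_zero_shot_classifier,
    IMAGENET_CLASSNAMES,
    OPENAI_IMAGENET_TEMPLATES,
)

# Load the model with its train/eval transforms and the matching tokenizer.
model, _, preprocess = create_model_and_transforms('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')
tokenizer = get_tokenizer('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')

# Encode each class name under the OpenAI prompt templates and average the
# text embeddings into one L2-normalized classifier weight per class.
with torch.no_grad():
    classifier = build_zero_shot_classifier(
        model,
        tokenizer=tokenizer,
        classnames=IMAGENET_CLASSNAMES,
        templates=OPENAI_IMAGENET_TEMPLATES,
        num_classes_per_batch=10,
        device='cpu',
    )

print(classifier.shape)  # (embed_dim, num_classes)
```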
Binary files added (not shown): compiled bytecode under `open_clip/__pycache__/` (*.cpython-312.pyc) and `open_clip/bpe_simple_vocab_16e6.txt.gz`.
