
Commit

update
Yanqing0327 committed Nov 26, 2024
1 parent eb13d47 commit 5e9077c
Showing 10 changed files with 97 additions and 15 deletions.
Binary file modified .DS_Store
Binary file not shown.
86 changes: 84 additions & 2 deletions README.md
@@ -1,2 +1,84 @@
# CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions
An Enhanced CLIP Framework for Learning with Synthetic Captions
# **CLIPS**

**Official implementation of the paper "_CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions_".**

---

## **Authors**

- [Yanqing Liu](https://yanqing0327.github.io/Yanqing.github.io/)<sup>1</sup>, [Xianhang Li](https://xhl-video.github.io/xianhangli/)<sup>1</sup>, [Zeyu Wang](https://zw615.github.io/)<sup>1</sup>, [Bingchen Zhao](https://bzhao.me/)<sup>2</sup>, [Cihang Xie](https://cihangxie.github.io/)<sup>1</sup>

<sup>1</sup>UC Santa Cruz, <sup>2</sup>University of Edinburgh

---

## **Links**
- [📄 Paper (arXiv)](https://arxiv.org/abs/2406.08478)
- [🤗 Pretrained Model on HuggingFace](https://huggingface.co/UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B)

---

## **Proposed Method**

### **CLIPS Pipeline**
<img src="./docs/resources/method.jpg" alt="Method Pipeline" style="width: 40%; display: block; margin: 0 auto;" />

Previous works show that noisy, web-crawled image-text pairs can limit vision-language pretraining frameworks such as CLIP, and they propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions:

1. Observing a strong inverse effect with synthetic captions, we feed only **partial synthetic captions** to the text encoder, achieving significantly better performance.
2. We incorporate an **autoregressive captioner** that mimics the recaptioning process, predicting full-length synthetic captions conditioned on the image and original web-crawled captions.

Our method achieves **state-of-the-art (SOTA)** results in zero-shot image-text retrieval on MSCOCO and Flickr30K, while enhancing the visual capability of LLaVA.
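
Below is a minimal PyTorch sketch of how these two designs could be combined in a single training step. It is an illustrative approximation, not the official implementation: `clip_model`, `captioner`, the argument names, and the token handling are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def clips_training_step(clip_model, captioner, images, web_caption_ids,
                        synthetic_caption_ids, keep_len=32, temperature=0.07):
    # Design 1: the contrastive branch sees only a short prefix of the
    # synthetic caption (a "partial synthetic caption").
    partial_ids = synthetic_caption_ids[:, :keep_len]
    image_feat = F.normalize(clip_model.encode_image(images), dim=-1)
    text_feat = F.normalize(clip_model.encode_text(partial_ids), dim=-1)

    logits = image_feat @ text_feat.t() / temperature
    targets = torch.arange(images.size(0), device=images.device)
    contrastive_loss = 0.5 * (F.cross_entropy(logits, targets) +
                              F.cross_entropy(logits.t(), targets))

    # Design 2: an autoregressive captioner predicts the full-length synthetic
    # caption conditioned on the image and the original web-crawled caption.
    caption_logits = captioner(images=images,
                               web_caption_ids=web_caption_ids,
                               target_ids=synthetic_caption_ids[:, :-1])
    caption_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        synthetic_caption_ids[:, 1:].reshape(-1))

    return contrastive_loss + caption_loss
```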

---

## **Key Results**

### **Inverse Effect with Synthetic Captions**
<img src="./docs/resources/mask_strategy.jpg" alt="Inverse Effect Visualization" style="width: 50%; display: block; margin: 0 auto;" />

Visualization of four different token-reduction strategies. Each improves the model's learning efficiency on synthetic captions to varying degrees; among them, the sub-caption and block-mask strategies perform best.
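
As an illustration, here is a hypothetical sketch of the two best-performing strategies applied to a tokenized synthetic caption; the exact rules used in the paper's implementation may differ.

```python
import random

def sub_caption(token_ids, period_id):
    """Keep only the first sub-caption, i.e. tokens up to the first period."""
    if period_id in token_ids:
        return token_ids[:token_ids.index(period_id) + 1]
    return token_ids

def block_mask(token_ids, keep_len):
    """Keep one contiguous block of `keep_len` tokens at a random offset."""
    if len(token_ids) <= keep_len:
        return token_ids
    start = random.randint(0, len(token_ids) - keep_len)
    return token_ids[start:start + keep_len]
```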

---

### **Zero-Shot Cross-Modal Retrieval**
<img src="./docs/resources/retrieval.png" alt="Zero-Shot Retrieval Results" style="width: 50%; display: block; margin: 0 auto;" />

Our method consistently achieves superior performance across all benchmarks and model sizes, yielding significant improvements over the baselines.

---

### **Comparison with State-of-the-Art Methods**
<img src="./docs/resources/sota.png" alt="SOTA Comparison" style="width: 50%; display: block; margin: 0 auto;" />

With increased computational resources and scaling, our best model further achieves 76.4% and 96.6% R@1 text retrieval performance on MSCOCO and Flickr30K respectively, and 57.2% and 83.9% R@1 image retrieval performance on the same datasets, setting new state-of-the-art (SOTA) results.

---

### **CLIPS in LLaVA**
<img src="./docs/resources/LLaVA.png" alt="LLaVA Results" style="width: 50%; display: block; margin: 0 auto;" />

Replacing OpenAI-CLIP with **CLIPS** significantly boosts LLaVA's performance across various benchmarks.

---

## **Model Zoo**

| Model | Link |
|----------------|------------------------------------------------------------------------------------------|
| CLIPS-Large-14 | [🤗 HuggingFace Model](https://huggingface.co/UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B) |
| CLIPS-Huge-14 | Coming Soon... |
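
A possible loading sketch, assuming the released checkpoint follows the standard open_clip `hf-hub:` convention (the image path and prompt texts are placeholders; please check the HuggingFace model card for the exact usage):

```python
import torch
import open_clip
from PIL import Image

model_id = 'hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B'
model, preprocess = open_clip.create_model_from_pretrained(model_id)
tokenizer = open_clip.get_tokenizer(model_id)

image = preprocess(Image.open('example.jpg')).unsqueeze(0)
texts = tokenizer(['a diagram', 'a dog', 'a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Cosine similarities between the image and each candidate caption.
    sims = (image_features / image_features.norm(dim=-1, keepdim=True)) @ \
           (text_features / text_features.norm(dim=-1, keepdim=True)).t()
print(sims)
```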

<!-- ---
## **Citation**
If you use our work, please cite it:
```bibtex
@article{liu2024clips,
title={CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions},
author={Liu, Yanqing and Li, Xianhang and Wang, Zeyu and Zhao, Bingchen and Xie, Cihang},
journal={arXiv preprint arXiv:2406.08478},
year={2024}
}
```
-->
26 changes: 13 additions & 13 deletions docs/index.html
@@ -78,7 +78,7 @@ <h1 class="title is-1 publication-title">CLIPS: An Enhanced CLIP Framework for L
</span>

<span class="link-block">
<a href="https://huggingface.co/tennant/llava-llama-3-8b-hqedit"
<a href="https://huggingface.co/UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fa-solidasasa fa-face-smiling-hands"></i>
@@ -102,8 +102,8 @@ <h1 class="title is-1 publication-title">CLIPS: An Enhanced CLIP Framework for L
<section class="hero teaser">
<div class="container">
<div class="hero-body", style="text-align: center;">
<img src="./resources/pipeline_1110.jpg" alt="alt text"
style="width: 50%; object-fit: cover; max-width:50%;"></a>
<img src="./resources/method.jpg" alt="alt text"
style="width: 60%; object-fit: cover; max-width:60%;"></a>
<h2 class="subtitle has-text-centered">
The pipeline of our proposed CLIPS. We introduce two simple yet effective designs:
<br>1) only a subpart of the synthetic caption is used in contrastive learning, and
@@ -172,7 +172,7 @@ <h2 class="title is-3">Inverse Effect with Synthetic Captions</h2>
<img src="./resources/subcaption_mask.jpg" alt="Pipeline 4" style="width: 100%;">
</div>
<p class="has-text-left">
<strong style="font-weight: 900">The inverse scaling effect of synthetic captions</strong>
<strong style="font-weight: 900">The inverse effect of synthetic captions.</strong>
Unlike the performance drop from reducing token length in original captions, shortening the token length of synthetic captions consistently improves model performance.
</p>
</div>
@@ -187,10 +187,10 @@ <h2 class="title is-3">Inverse Effect with Synthetic Captions</h2>
<div class="column is-four-fifths">
<h2 class="title is-3">Zero-Shot Cross-Modal Retrieval</h2>
<div class="content has-text-justified">
<center><img class="center" src="./resources/image_text_retrieval.png" width="100%"></center>
<center><img class="center" src="./resources/retrieval.png" width="100%"></center>
<p>
<strong style="font-weight: 900">Zero-shot image-text retrieval results on MSCOCO and Flickr30K.</strong style="font-weight: 900">
The CLIPA and CoCa results are reproduced by us. Both methods are implemented with a mixing ratio of 0.8, where the original caption accounts for 0.8 and the synthetic caption accounts for 0.2. Experimental results show that CLIPS significantly enhances zero-shot performance in cross-modal retrieval.
The CLIPA and CoCa results are reproduced by us. Both methods are implemented with a mixture training, where the original caption accounts for 80% and the synthetic caption accounts for 20%. Our method consistently achieves superior performance across all benchmarks and model sizes, yielding significant improvements over the baselines.
</p>
</div>
</div>
@@ -226,7 +226,7 @@ <h2 class="title is-3">CLIPS in LLaVA</h2>
<center><img class="center" src="./resources/LLaVA.png" width="100%"></center>
<p>
<strong style="font-weight: 900">Comparison of LLaVA-1.5 performance.</strong style="font-weight: 900">
We directly replace the original OpenAI-CLIP-Large-14 with the CLIPS-Large-14 and use LLaMA-3 as the language model. Our method achieves strong performance improvements across multiple metrics, effectively enhancing the cross-modal understanding capability of MLLM.
We directly replace the original OpenAI-CLIP-Large-14 with the CLIPS-Large-14 and use LLaMA-3 as the language model. The results demonstrate that integrating CLIPS significantly enhances LLaVA's performance across multiple metrics compared to using the original OpenAI-CLIP visual encoder.
</p>
</div>
</div>
@@ -242,7 +242,7 @@ <h2 class="title is-3">CLIPS in LLaVA</h2>
<h2 class="title is-3">Model Zoo</h2>
<div class="content has-text-justified">
<p>
We will release the model soon!
We have released CLIPS-Large-14, and more models will be available soon!

<h3>Models</h3>
<table>
@@ -251,11 +251,11 @@ <h3>Models</h3>
<th>url</th>
</tr>
<tr>
<td>CLIPS-Large-14</td>
<td>Coming Soon...</td>
<td>CLIPS-Large-14-336</td>
<td><a href="https://huggingface.co/UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B" target="_blank">https://huggingface.co/UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B</a></td>
</tr>
<tr>
<td>CLIPS-Huge-14</td>
<td>CLIPS-Huge-14-336</td>
<td>Coming Soon...</td>
</tr>
</table>
@@ -280,14 +280,14 @@ <h2 class="title is-3">Acknowledge</h2>
</div>
</section>

<section class="section" id="BibTeX">
<!-- <section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>
-----------------------------
</code></pre>
</div>
</section>
</section> -->


<footer class="footer">
Binary file modified docs/resources/.DS_Store
Binary file not shown.
Binary file modified docs/resources/LLaVA.png
Binary file removed docs/resources/image_text_retrieval.png
Binary file not shown.
Binary file added docs/resources/method.jpg
Binary file removed docs/resources/pipeline_1110.jpg
Binary file not shown.
Binary file added docs/resources/retrieval.png
Binary file modified docs/resources/sota.png
