# **CLIPS**

**Official implementation of the paper "_CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions_".**

---

## **Authors**

- [Yanqing Liu](https://yanqing0327.github.io/Yanqing.github.io/)<sup>1</sup>, [Xianhang Li](https://xhl-video.github.io/xianhangli/)<sup>1</sup>, [Zeyu Wang](https://zw615.github.io/)<sup>1</sup>, [Bingchen Zhao](https://bzhao.me/)<sup>2</sup>, [Cihang Xie](https://cihangxie.github.io/)<sup>1</sup>

<sup>1</sup>UC Santa Cruz, <sup>2</sup>University of Edinburgh

---

## **Links**

- [📄 Paper (arXiv)](https://arxiv.org/abs/2406.08478)
- [🤗 Pretrained Model on HuggingFace](https://huggingface.co/UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B)

---

## **Proposed Method**

### **CLIPS Pipeline**
<img src="./docs/resources/method.jpg" alt="Method Pipeline" style="width: 40%; display: block; margin: 0 auto;" />

Previous works show that noisy, web-crawled image-text pairs can limit vision-language pretraining such as CLIP, and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions:

1. Observing a strong inverse effect with synthetic captions, we feed only **partial synthetic captions** to the text encoder, which yields significantly better performance.
2. We incorporate an **autoregressive captioner** that mimics the recaptioning process, predicting the full-length synthetic caption conditioned on the image and the original web-crawled caption (a minimal training-objective sketch follows this list).
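
To make the two designs concrete, here is a minimal PyTorch sketch of the combined objective. This is not the released training code: the module interfaces (`image_encoder`, `text_encoder`, `captioner`) and the equal loss weighting are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def clips_step(image_encoder, text_encoder, captioner,
               images, partial_caption_tokens,
               web_caption_tokens, full_caption_tokens, logit_scale):
    """One training step combining both designs (hypothetical interfaces)."""
    # Design 1: the contrastive loss sees only *partial* synthetic captions.
    img_emb = F.normalize(image_encoder(images), dim=-1)                 # (B, D)
    txt_emb = F.normalize(text_encoder(partial_caption_tokens), dim=-1)  # (B, D)

    # Standard CLIP-style symmetric InfoNCE over the batch.
    logits = logit_scale * img_emb @ txt_emb.t()                         # (B, B)
    labels = torch.arange(images.size(0), device=images.device)
    contrastive = (F.cross_entropy(logits, labels)
                   + F.cross_entropy(logits.t(), labels)) / 2

    # Design 2: an autoregressive captioner predicts the *full* synthetic
    # caption, conditioned on the image and the original web caption
    # (teacher forcing with the usual next-token shift).
    cap_logits = captioner(images, web_caption_tokens,
                           full_caption_tokens[:, :-1])                  # (B, T-1, V)
    captioning = F.cross_entropy(
        cap_logits.reshape(-1, cap_logits.size(-1)),
        full_caption_tokens[:, 1:].reshape(-1))

    return contrastive + captioning
```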

Our method achieves **state-of-the-art (SOTA)** results in zero-shot image-text retrieval on MSCOCO and Flickr30K, while also enhancing the visual capability of LLaVA.

---

## **Key Results**

### **Inverse Effect with Synthetic Captions**
<img src="./docs/resources/mask_strategy.jpg" alt="Inverse Effect Visualization" style="width: 50%; display: block; margin: 0 auto;" />

Visualization of four token reduction strategies. Each improves the model's learning efficiency on synthetic captions to a varying degree; among them, the sub-caption and block-mask strategies perform best. A sketch of these two strategies follows.
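
The two best-performing strategies can be sketched as follows. The function names and exact truncation rules are assumptions for illustration, not the paper's implementation:

```python
import random

def sub_caption(caption: str, max_tokens: int = 32) -> list:
    """Sub-caption: keep one randomly chosen sentence, then truncate."""
    sentences = [s.strip() for s in caption.split('.') if s.strip()]
    return random.choice(sentences).split()[:max_tokens]

def block_mask(tokens: list, keep_ratio: float = 0.5) -> list:
    """Block mask: keep one contiguous block of tokens (not scattered ones)."""
    keep = max(1, int(len(tokens) * keep_ratio))
    start = random.randrange(len(tokens) - keep + 1)
    return tokens[start:start + keep]
```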

---

### **Zero-Shot Cross-Modal Retrieval**
<img src="./docs/resources/retrieval.png" alt="Zero-Shot Retrieval Results" style="width: 50%; display: block; margin: 0 auto;" />

Our method consistently achieves superior performance across all benchmarks and model sizes, yielding significant improvements over the baselines.
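
For reference, recall@1 (R@1) counts a retrieval as correct only when the top-ranked item is the true pair. An illustrative computation (not the paper's evaluation code) for text-to-image retrieval:

```python
import torch

def text_to_image_recall_at_1(img_emb, txt_emb):
    """R@1 over N paired, L2-normalized image/text embeddings."""
    sims = txt_emb @ img_emb.t()                    # (N, N) cosine similarities
    nearest = sims.argmax(dim=1)                    # top-ranked image per caption
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return (nearest == targets).float().mean().item()
```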

---

### **Comparison with State-of-the-Art Methods**
<img src="./docs/resources/sota.png" alt="SOTA Comparison" style="width: 50%; display: block; margin: 0 auto;" />

With increased computational resources and scaling, our best model further achieves 76.4% and 96.6% R@1 text retrieval performance on MSCOCO and Flickr30K, respectively, and 57.2% and 83.9% R@1 image retrieval performance on the same datasets, setting new state-of-the-art (SOTA) results.

---

### **CLIPS in LLaVA**
<img src="./docs/resources/LLaVA.png" alt="LLaVA Results" style="width: 50%; display: block; margin: 0 auto;" />

Replacing OpenAI-CLIP with **CLIPS** significantly boosts LLaVA's performance across various benchmarks.

---

## **Model Zoo**

| Model          | Link                                                                                       |
|----------------|--------------------------------------------------------------------------------------------|
| CLIPS-Large-14 | [🤗 HuggingFace Model](https://huggingface.co/UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B) |
| CLIPS-Huge-14  | Coming Soon...                                                                             |
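
A hedged loading sketch, assuming the released checkpoint follows the standard OpenCLIP `hf-hub:` convention (check the model card for the exact usage):

```python
import torch
import open_clip
from PIL import Image

# Assumption: the HuggingFace checkpoint is OpenCLIP-compatible.
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')
tokenizer = open_clip.get_tokenizer(
    'hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')

image = preprocess(Image.open('example.jpg')).unsqueeze(0)
text = tokenizer(['a diagram', 'a dog', 'a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)  # similarity of the image to each caption
```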

<!-- ---
## **Citation**
If you use our work, please cite it:
```bibtex
@article{liu2024clips,
  title={CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions},
  author={Liu, Yanqing and Li, Xianhang and Wang, Zeyu and Zhao, Bingchen and Xie, Cihang},
  journal={arXiv preprint arXiv:2406.08478},
  year={2024}
}
``` -->