-
I just wanted to compare CLIP and open-CLIP, as I noticed a difference in the text-transformer architectures between the two, as shown in the figure. Is there something I'm missing? One can compare the architectures of these two models using:

```python
import clip
import open_clip
```
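For reference, a minimal sketch of such a comparison might look like the following; the `RN50` model name and the `openai` pretrained tag are just one example pairing, not the only option:

```python
import clip
import open_clip

# Load the original OpenAI CLIP model (returns the model and its preprocess transform)
model, _ = clip.load("RN50", device="cpu")

# Load the corresponding OpenCLIP model with OpenAI's released weights
open_model, _, _ = open_clip.create_model_and_transforms("RN50", pretrained="openai")

# Print both text transformers side by side to inspect any architectural difference
print(model.transformer)
print(open_model.transformer)
```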
-
Also, how has the modified open-CLIP architecture been trained on the proprietary OpenAI dataset?
-
Not sure about this additional layer norm; I'll let other people answer.
Regarding the dataset, we're using LAION-400M and LAION-5B. Proprietary datasets can't be used outside of their respective companies.
-
Hi @rom1504, thanks a lot for your thoughts. Hopefully somebody will be able to answer the question about the additional layer norm and how its mean and variance parameters have been trained. Yes, open-CLIP models can be loaded with pre-trained weights from the LAION datasets. In this case, however, we loaded a modified ResNet-50 model (with additional layers) with pre-trained weights from OpenAI (note the `pretrained` argument):

```python
open_clip.create_model_and_transforms('RN50', pretrained='openai')
```

How does this work? If the ResNet-50 model is modified, it needs to be trained from scratch or fine-tuned, and to my knowledge you can't do that without access to OpenAI's custom dataset.
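One way to sanity-check this is to compare the two state dicts directly; if the parameter names and shapes line up, the OpenAI checkpoint fits the OpenCLIP RN50 without any retraining. A rough sketch (variable names are illustrative):

```python
import clip
import open_clip

# Load both RN50 variants; device="cpu" keeps the OpenAI model in float32
model, _ = clip.load("RN50", device="cpu")
open_model, _, _ = open_clip.create_model_and_transforms("RN50", pretrained="openai")

sd, open_sd = model.state_dict(), open_model.state_dict()

# Parameter names present in only one of the models would point to a genuine
# architectural difference; empty lists suggest the modules line up one-to-one
print(sorted(set(sd) - set(open_sd)))
print(sorted(set(open_sd) - set(sd)))

# Shapes should also match for every shared parameter
print([k for k in sd if k in open_sd and sd[k].shape != open_sd[k].shape])
```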
-
@PatCH0816 They are the same; compare open_clip/src/open_clip/transformer.py, lines 214 to 236 at aebead1, with https://github.com/openai/CLIP/blob/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1/clip/model.py#L175-L192
-
@PatCH0816 FYI, OpenCLIP was intended to be used under an AMP autocast context for mixed precision, whereas CLIP uses a 'manual' mixed precision of sorts with float16 weights.
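For illustration, a rough sketch of that intended OpenCLIP usage, assuming a CUDA device is available (the dummy input just stands in for a preprocessed image batch):

```python
import torch
import open_clip

open_model, _, preprocess = open_clip.create_model_and_transforms("RN50", pretrained="openai")
open_model = open_model.cuda().eval()

# Dummy batch standing in for preprocessed images
images = torch.randn(4, 3, 224, 224, device="cuda")

# OpenCLIP keeps float32 weights and lets autocast run selected ops in
# reduced precision, instead of CLIP's hand-rolled float16 weight conversion
with torch.no_grad(), torch.autocast("cuda"):
    image_features = open_model.encode_image(images)
```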
-
To ensure that CLIP and open-CLIP produce the same outputs, `model` and `open_model` should be set to evaluation mode, so that both models behave consistently during inference.
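A minimal sketch of that check; the tokenizers and the comparison tolerance are assumptions for illustration:

```python
import torch
import clip
import open_clip

model, _ = clip.load("RN50", device="cpu")
open_model, _, _ = open_clip.create_model_and_transforms("RN50", pretrained="openai")

# eval() disables dropout and freezes batch-norm running statistics,
# so both models behave deterministically during inference
model.eval()
open_model.eval()

texts = ["a photo of a cat"]
tokens = clip.tokenize(texts)
open_tokens = open_clip.get_tokenizer("RN50")(texts)

with torch.no_grad():
    feats = model.encode_text(tokens)
    open_feats = open_model.encode_text(open_tokens)

print(torch.allclose(feats, open_feats, atol=1e-5))
```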