Training open_clip with custom text and image encoder #237

nahidalam opened this issue Nov 19, 2022 · 6 comments

@nahidalam

I understand you can specify --model when training open_clip if you want to use a different image or text encoder.

But I want to use a custom image AND text encoder simultaneously to train the open_clip model. How do I specify both?

FYI - I have the weights of those models. I might also explore picking up something from the timm library.

@rom1504
Collaborator

rom1504 commented Nov 19, 2022 via email

@nahidalam
Author

Thanks for the quick reply @rom1504.
Can you please share an example of what you did for ViT-H + XLM-RoBERTa-large?

@rom1504
Collaborator

rom1504 commented Nov 21, 2022

yeah, so it looks like this:

import torch
import open_clip

# get text pretrained tower
model_text, _, _ = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14')
state_merged = model_text.state_dict()

# get image pretrained tower
model_visual, _, _ = open_clip.create_model_and_transforms('ViT-H-14', pretrained="laion2b_s32b_b79k")
state_visual = model_visual.state_dict()

# merge the pretrained visual weights into state_merged
visual_keys = [k for k in state_visual.keys() if 'visual' in k]
for k in visual_keys:
    state_merged[k] = state_visual[k]

# save in the checkpoint format the training script expects
with open("merged.pt", "wb") as f:
    torch.save({"epoch": 0, "name": "go", "state_dict": state_merged}, f)

# check it works
model, _, preprocess = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14', pretrained="merged.pt")

Then you can give that merged.pt to the --pretrained param of the training script and it'll initialize from it.

We probably need an automated way to do that directly in the training script, but for now this works.
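A quick way to confirm the transplant worked is to reload the merged checkpoint and compare its visual tower against the pretrained one, parameter by parameter (a minimal sketch, assuming the script above has already written merged.pt):

import torch
import open_clip

# reload the merged checkpoint into the multilingual architecture
merged, _, _ = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14', pretrained="merged.pt")

# reload the original pretrained image tower for reference
reference, _, _ = open_clip.create_model_and_transforms('ViT-H-14', pretrained="laion2b_s32b_b79k")

# every visual parameter should now match the pretrained tower exactly
for (name, p_merged), (_, p_ref) in zip(merged.visual.state_dict().items(),
                                        reference.visual.state_dict().items()):
    assert torch.equal(p_merged, p_ref), f"mismatch at visual.{name}"
print("visual tower matches the pretrained ViT-H-14 weights")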

@iejMac
Contributor

iejMac commented Nov 23, 2022

@rom1504 Just getting to this now... hmm, so the problem is that we have a bunch of pretrained (image_encoder, text_encoder) pairs, and until now they have been just that - pairs - because of how CLIP is trained. Now we want to explore starting from independently pretrained models and post-pretraining with CLIP to align their latent spaces, so it would indeed be useful to have a more flexible structure that lets you specify individual encoders...

What about having functions like create_model_and_transforms take either one arg - the config name - or two named args - image_encoder and text_encoder? Then we add some code to put them together like you do... Looking at it more, maybe we can just take the image config from one and the text config from the other:

def get_model_config(model_name):

And also something similar for loading pretrained weights.
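A rough sketch of what that merge could look like (hypothetical: merge_model_configs is not an open_clip function, and it assumes get_model_config, referenced above, returns the model's JSON-style config dict with embed_dim, vision_cfg, and text_cfg keys):

from copy import deepcopy

from open_clip import get_model_config

def merge_model_configs(image_model_name, text_model_name):
    """Build a two-tower config: vision_cfg from one model, text_cfg from another."""
    image_cfg = get_model_config(image_model_name)
    text_cfg = get_model_config(text_model_name)
    # both towers must project into the same embedding space
    assert image_cfg["embed_dim"] == text_cfg["embed_dim"], "embed_dim mismatch"
    merged = deepcopy(text_cfg)
    merged["vision_cfg"] = deepcopy(image_cfg["vision_cfg"])
    return merged

# e.g. the pairing discussed above:
# merge_model_configs('ViT-H-14', 'xlm-roberta-large-ViT-H-14')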

@rom1504
Collaborator

rom1504 commented Nov 23, 2022

yeah seems reasonable!

@Quan-Sun
Contributor

@nahidalam In #255, you can specify --pretrained-image and --pretrained-text to simultaneously load custom image and text encoders.
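An invocation would then look something like this (a sketch: the two tower flags are from #255 as described above, the checkpoint paths are placeholders, and the remaining flags are the usual open_clip training arguments):

python -m training.main \
    --model xlm-roberta-large-ViT-H-14 \
    --pretrained-image /path/to/image_tower.pt \
    --pretrained-text /path/to/text_tower.pt \
    --train-data "/path/to/train_data.csv"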
