Training open_clip with custom text and image encoder #237

nahidalam opened this issue Nov 19, 2022 · 6 comments

@nahidalam

I understand you can specify --model when training open_clip if you want to use a different image or text encoder.

But I want to use a custom image AND text encoder simultaneously to train the open_clip model. How do I specify both?

FYI - I have the weights of those models. I might also explore picking up something from the timm library.

@rom1504
Collaborator

rom1504 commented Nov 19, 2022 via email

@nahidalam
Author

Thanks for the quick reply @rom1504.
Can you please share an example of what you did for ViT-H + XLM-RoBERTa-large?

@rom1504
Collaborator

rom1504 commented Nov 21, 2022

yeah, so it looks like this:

import torch
import open_clip

# get text pretrained tower
model_text, _, _ = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14')
state_merged = model_text.state_dict()

# get image pretrained tower
model_visual, _, _ = open_clip.create_model_and_transforms('ViT-H-14', pretrained="laion2b_s32b_b79k")
state_visual = model_visual.state_dict()

# merge the pretrained visual weights into state_merged
visual_keys = [k for k in state_visual.keys() if 'visual' in k]
for k in visual_keys:
    state_merged[k] = state_visual[k]

# save in the checkpoint format the training script expects
with open("merged.pt", "wb") as f:
    torch.save({"epoch": 0, "name": "go", "state_dict": state_merged}, f)

# check it works
model, _, preprocess = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14', pretrained="merged.pt")

Then you can give that merged.pt to the --pretrained param of the training script and it'll initialize from it.

We probably need an automated way to do that directly in the training script, but for now this works.
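A quick way to confirm the transplant worked is to reload the merged checkpoint and compare its visual tower against the pretrained one, parameter by parameter (a minimal sketch, assuming the script above has already written merged.pt):

import torch
import open_clip

# reload the merged checkpoint into the multilingual architecture
merged, _, _ = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14', pretrained="merged.pt")

# reload the original pretrained image tower for reference
reference, _, _ = open_clip.create_model_and_transforms('ViT-H-14', pretrained="laion2b_s32b_b79k")

# every visual parameter should now match the pretrained tower exactly
for (name, p_merged), (_, p_ref) in zip(merged.visual.state_dict().items(),
                                        reference.visual.state_dict().items()):
    assert torch.equal(p_merged, p_ref), f"mismatch at visual.{name}"
print("visual tower matches the pretrained ViT-H-14 weights")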

@iejMac
Contributor

iejMac commented Nov 23, 2022

@rom1504 Just getting to this now... hmm, so the problem is that we have a bunch of pretrained (image_encoder, text_encoder) pairs, and until now they have been just that - pairs - because of how CLIP is trained. Now we want to explore starting from independently pretrained models and post-pretraining with CLIP to align their latent spaces, so it would indeed be useful to have a more flexible structure that lets you specify individual encoders...

What about having functions like create_model_and_transforms take either one arg - the config name - or two named args - image_encoder and text_encoder? Then we add some code to put them together like you do... Looking at it more, maybe we can just take the image config from one and the text config from the other:

def get_model_config(model_name):

And also something similar for loading pretrained weights.
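A rough sketch of what that merge could look like (hypothetical: merge_model_configs is not an open_clip function, and it assumes get_model_config, referenced above, returns the model's JSON-style config dict with embed_dim, vision_cfg, and text_cfg keys):

from copy import deepcopy

from open_clip import get_model_config

def merge_model_configs(image_model_name, text_model_name):
    """Build a two-tower config: vision_cfg from one model, text_cfg from another."""
    image_cfg = get_model_config(image_model_name)
    text_cfg = get_model_config(text_model_name)
    # both towers must project into the same embedding space
    assert image_cfg["embed_dim"] == text_cfg["embed_dim"], "embed_dim mismatch"
    merged = deepcopy(text_cfg)
    merged["vision_cfg"] = deepcopy(image_cfg["vision_cfg"])
    return merged

# e.g. the pairing discussed above:
# merge_model_configs('ViT-H-14', 'xlm-roberta-large-ViT-H-14')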

@rom1504
Collaborator

rom1504 commented Nov 23, 2022

yeah seems reasonable!

@Quan-Sun
Contributor

@nahidalam In #255, you can specify --pretrained-image and --pretrained-text to simultaneously load custom image and text encoders.
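An invocation would then look something like this (a sketch: the two tower flags are from #255 as described above, the checkpoint paths are placeholders, and the remaining flags are the usual open_clip training arguments):

python -m training.main \
    --model xlm-roberta-large-ViT-H-14 \
    --pretrained-image /path/to/image_tower.pt \
    --pretrained-text /path/to/text_tower.pt \
    --train-data "/path/to/train_data.csv"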
