Training open_clip with custom text and image encoder #237
Comments
Currently the only way is to load a checkpoint and plug your weights in.
That's what I did for ViT-H-14 + XLM-RoBERTa-large.
Would be best to add a feature to do it automatically indeed. (A rough sketch of the manual approach follows the quoted question below.)
On Sat, Nov 19, 2022, 17:08 nahidalam wrote:
I understand you can specify --model while training open_clip if you want to use a different image or text encoder. But I want to use a custom image AND text encoder simultaneously to train the open_clip model. How do I specify both?
FYI - I have the weights of those models. Might also explore picking up something from the timm library.
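A minimal sketch of what "plug your weights in" can look like at the module level (this is just an illustration, assuming the vision config inside xlm-roberta-large-ViT-H-14 matches ViT-H-14, which it does in the bundled configs):

```python
import open_clip

# multilingual model: the text tower comes pretrained from HF, the vision tower is freshly initialised
model, _, preprocess = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14')

# same vision architecture, but with the LAION-2B pretrained weights
vit_h, _, _ = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2b_s32b_b79k')

# plug the pretrained vision weights into the multilingual model in memory
model.visual.load_state_dict(vit_h.visual.state_dict())
```

The snippet further down in the thread does the same thing at the checkpoint level, which is more convenient for handing off to the training script.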
Thanks for the quick reply @rom1504
yeah, so it looks like this:

```python
import torch
import open_clip

# get text pretrained tower
model_text, _, _ = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14')
state_merged = model_text.state_dict()

# get image pretrained tower
model_visual, _, _ = open_clip.create_model_and_transforms('ViT-H-14', pretrained="laion2b_s32b_b79k")
state_visual = model_visual.state_dict()

# merge the visual tower weights into state_merged
visual_keys = [k for k in state_visual.keys() if 'visual' in k]
for k in visual_keys:
    state_merged[k] = state_visual[k]

# save as a checkpoint the factory can load
with open("merged.pt", "wb") as f:
    torch.save({"epoch": 0, "name": "go", "state_dict": state_merged}, f)

# check it works
model, _, preprocess = open_clip.create_model_and_transforms('xlm-roberta-large-ViT-H-14', pretrained="merged.pt")
```

then you can give that merged checkpoint to the training script. We probably need an automated way to do that directly in the training script, but for now this works.
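As a quick sanity check, continuing with the variables from the snippet above (just a sketch, not part of the original recipe), you can confirm the visual tower of the merged model really carries the ViT-H-14 weights:

```python
import torch

# every copied key should now match the LAION-2B ViT-H-14 weights exactly
merged_state = model.state_dict()
for k in visual_keys[:5]:
    assert torch.equal(merged_state[k], state_visual[k]), f"mismatch at {k}"
print("visual tower weights match the pretrained ViT-H-14 checkpoint")
```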
@rom1504 Just getting to this now... hmmm, so the problem is we have a bunch of pretrained (image_encoder, text_encoder) pairs, and up until now they have just been that - pairs - because of how you train CLIP. Now we want to explore starting with pretrained models and then post-pretraining with CLIP to align the latent spaces, so it would indeed be useful to have a more flexible structure which allows you to specify singular encoders.

What about having functions like create_model_and_transforms take either one arg (the config name) or two named args (image_encoder, text_encoder)? Then we add some code to put them together like you do. Looking at it more, maybe we can just take the image config from one and the text config from the other (open_clip/src/open_clip/factory.py, line 66 in bb6e834), and also something similar for loading pretrained weights.
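A rough sketch of what that config-level merge could look like, assuming get_model_config and add_model_config from factory.py behave as sketched here (each named config is a dict with 'embed_dim', 'vision_cfg' and 'text_cfg', and a config added from a json file is registered under the file stem); this is only an illustration, not an actual API proposal:

```python
import copy
import json
import open_clip

# donor configs: vision side from ViT-H-14, text side from the multilingual model
vision_donor = open_clip.get_model_config('ViT-H-14')
text_donor = open_clip.get_model_config('xlm-roberta-large-ViT-H-14')

# keep the text_cfg (and embed_dim) from the multilingual config, swap in the ViT-H-14 vision_cfg
merged_cfg = copy.deepcopy(text_donor)
merged_cfg['vision_cfg'] = copy.deepcopy(vision_donor['vision_cfg'])

# both towers project to embed_dim, so the two source configs must agree on it
assert vision_donor['embed_dim'] == text_donor['embed_dim']

# register the merged config under a new name so the normal factory path can build it
with open("my-merged-clip.json", "w") as f:
    json.dump(merged_cfg, f)
open_clip.add_model_config("my-merged-clip.json")

model, _, preprocess = open_clip.create_model_and_transforms('my-merged-clip')
```

Loading the matching pretrained weights into each tower would still need something like the state_dict merge above.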
yeah seems reasonable!
@nahidalam In #255, you can specify --pretrained-image and --pretrained-text to load custom image and text encoders simultaneously.