Train Model With Multiple Input Images #435
There are multiple ways to do this:
1. Late fusion: run the existing model on each of your images, then aggregate
the embeddings (a simple average, or something stronger).
2. Early fusion: for this you would indeed need to adapt the open_clip code,
then produce a large (N images, text) dataset and retrain.
I'd advise starting with 1.
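As a minimal sketch of option 1 (late fusion), assuming you already have one CLIP embedding per image (e.g. from `model.encode_image` in open_clip), the aggregation step is just normalize, average, renormalize — the random vectors below are stand-ins for real CLIP features:

```python
import numpy as np

def aggregate_embeddings(embeddings: list) -> np.ndarray:
    """Late fusion: L2-normalize each per-image embedding, average
    them, then renormalize so the result is still a unit vector
    directly comparable to CLIP text embeddings via cosine/dot."""
    stacked = np.stack([e / np.linalg.norm(e) for e in embeddings])
    mean = stacked.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Stand-ins for CLIP image features (512-d, as for ViT-B/32).
rng = np.random.default_rng(0)
embs = [rng.standard_normal(512) for _ in range(3)]
fused = aggregate_embeddings(embs)
```

The renormalization matters: CLIP similarity is computed on unit vectors, so the fused embedding should be a unit vector too. Swapping the mean for a max-pool or a learned attention pooling is the "something stronger" variant.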
Thanks, I was really looking for approach #2 😁 Is it possible, and if so, any hints on where to start?
Maybe you could have a look at
https://github.com/LAION-AI/temporal-embedding-aggregation first
Is it possible to change the model to accept more than one image as the input?
If I'm not mistaken, CLIP takes an image and a text as the inputs, extracts the features of these two inputs and finally gives us the logits of the distance of the image to the text.
So, is it possible to give two (or more) input images and extract ONE feature from the input images (just like before)?
I want to somehow mix the two inputs. For example, inputting an image alongside its semantic segmentation as the input to the model. If it's possible, what parts of the code should I change? Or is this already implemented and usable?
Thanks.
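For the image-plus-segmentation case, one common early-fusion starting point is to widen the visual tower's patch-embedding convolution from 3 input channels to 3 + K and zero-initialize the new weights, so the pretrained model's behavior is unchanged at the start of fine-tuning. The sketch below uses a plain `nn.Conv2d` as a stand-in for the actual open_clip patch-embedding module (the real layer's name and location depend on the model config):

```python
import torch
import torch.nn as nn

def widen_patch_embed(conv: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    """Early fusion: extend a pretrained patch-embedding conv to accept
    extra input channels (e.g. a segmentation mask). The new channels'
    weights are zero-initialized, so with any mask the output initially
    equals the original RGB-only output."""
    new_conv = nn.Conv2d(
        conv.in_channels + extra_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, : conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Stand-in for a ViT-B/32 patch embedding (3 -> 768, 32x32 patches).
old = nn.Conv2d(3, 768, kernel_size=32, stride=32, bias=False)
new = widen_patch_embed(old, extra_channels=1)  # RGB + 1 mask channel

rgb = torch.randn(1, 3, 224, 224)
mask = torch.zeros(1, 1, 224, 224)
out_new = new(torch.cat([rgb, mask], dim=1))   # (1, 768, 7, 7)
out_old = old(rgb)
```

Because the extra-channel weights start at zero, `out_new` matches `out_old` before any fine-tuning, which makes the transition from the pretrained checkpoint smooth. You would still need the retraining pipeline the comments above describe for the fused input to become useful.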