Train Model With Multiple Input Images #435
There are multiple ways to do this:
1. Late fusion: run the existing model on each of your images, then aggregate
the embeddings (a simple average, or something stronger).
2. Early fusion: for this you would indeed need to adapt the open_clip code,
then produce a large (N images, text) dataset and retrain.
I'd advise starting with 1.
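As a minimal sketch of option 1 (late fusion), assuming you already have one CLIP embedding per image (e.g. from `model.encode_image` in open_clip), the aggregation step is just normalize, average, renormalize — the random vectors below are stand-ins for real CLIP features:

```python
import numpy as np

def aggregate_embeddings(embeddings: list) -> np.ndarray:
    """Late fusion: L2-normalize each per-image embedding, average
    them, then renormalize so the result is still a unit vector
    directly comparable to CLIP text embeddings via cosine/dot."""
    stacked = np.stack([e / np.linalg.norm(e) for e in embeddings])
    mean = stacked.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Stand-ins for CLIP image features (512-d, as for ViT-B/32).
rng = np.random.default_rng(0)
embs = [rng.standard_normal(512) for _ in range(3)]
fused = aggregate_embeddings(embs)
```

The renormalization matters: CLIP similarity is computed on unit vectors, so the fused embedding should be a unit vector too. Swapping the mean for a max-pool or a learned attention pooling is the "something stronger" variant.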
Thanks, I was really looking for approach #2 😁 Is it possible, and if so, any hints on where to start?
Maybe you could have a look at
https://github.com/LAION-AI/temporal-embedding-aggregation first
Is it possible to change the model to accept more than one image as the input?
If I'm not mistaken, CLIP takes an image and a text as the inputs, extracts the features of these two inputs and finally gives us the logits of the distance of the image to the text.
So, is it possible to give two (or more) input images and extract ONE feature from the input images (just like before)?
I want to somehow mix the two inputs. For example, inputting an image alongside its semantic segmentation as the input to the model. If it's possible, what parts of the code should I change? Or is this already implemented and usable?
Thanks.
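For the image-plus-segmentation case, one common early-fusion starting point is to widen the visual tower's patch-embedding convolution from 3 input channels to 3 + K and zero-initialize the new weights, so the pretrained model's behavior is unchanged at the start of fine-tuning. The sketch below uses a plain `nn.Conv2d` as a stand-in for the actual open_clip patch-embedding module (the real layer's name and location depend on the model config):

```python
import torch
import torch.nn as nn

def widen_patch_embed(conv: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    """Early fusion: extend a pretrained patch-embedding conv to accept
    extra input channels (e.g. a segmentation mask). The new channels'
    weights are zero-initialized, so with any mask the output initially
    equals the original RGB-only output."""
    new_conv = nn.Conv2d(
        conv.in_channels + extra_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, : conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Stand-in for a ViT-B/32 patch embedding (3 -> 768, 32x32 patches).
old = nn.Conv2d(3, 768, kernel_size=32, stride=32, bias=False)
new = widen_patch_embed(old, extra_channels=1)  # RGB + 1 mask channel

rgb = torch.randn(1, 3, 224, 224)
mask = torch.zeros(1, 1, 224, 224)
out_new = new(torch.cat([rgb, mask], dim=1))   # (1, 768, 7, 7)
out_old = old(rgb)
```

Because the extra-channel weights start at zero, `out_new` matches `out_old` before any fine-tuning, which makes the transition from the pretrained checkpoint smooth. You would still need the retraining pipeline the comments above describe for the fused input to become useful.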