InfoNCELoss seems different from ImageBind paper #10

> this is handled by the masking. Notice that we set the similarity score between a sample and itself to 0. Additionally, we mask one half out and keep the other for computing the similarity between the images and other (in this case, text) modalities. The positive sample is the only one corresponding to the image `batch_size/2` samples away. The remaining text samples are negative samples

<img width="399" alt="image" src="https://github.com/fabawi/ImageBind-LoRA/assets/61657290/0ecb1f5e-7bcb-40ff-aa19-2ab5320f36c1">

```python
# Mask out cosine similarity to itself
self_mask = torch.eye(
    cos_sim.shape[0], dtype=torch.bool, device=cos_sim.device)
cos_sim.masked_fill_(self_mask, -9e15)
```

Based on your explanation, let's say our q_i is one image. Then the positive sample is the single text sample that sits `batch_size/2` positions away from that image. But the negative samples are not only the remaining text samples; the remaining image samples are added as well, aren't they? The code masks only each sample's similarity with itself and leaves the other image samples in the denominator. I think this adds q_i dot q_j (i != j) and k_i dot k_j (i != j) terms to the loss.
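To make the difference concrete, here is a minimal sketch of the two formulations side by side. It is an illustration under assumptions, not the repository's actual code: the function names `simclr_style_loss` and `cross_modal_loss` are hypothetical, the temperature of 0.07 is an arbitrary placeholder, and the batch is assumed to be laid out as `[images; texts]` so that positives are `batch_size/2` rows apart, as described in the quoted reply.

```python
import torch
import torch.nn.functional as F


def simclr_style_loss(feats: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over one concatenated batch [images; texts] (hypothetical helper).

    Only the diagonal is masked, so every anchor is contrasted against all
    other rows, including samples from its own modality.
    """
    cos_sim = F.cosine_similarity(feats[:, None, :], feats[None, :, :], dim=-1)
    # Mask out cosine similarity to itself
    self_mask = torch.eye(cos_sim.shape[0], dtype=torch.bool, device=cos_sim.device)
    cos_sim = cos_sim.masked_fill(self_mask, -9e15)
    # The positive for row i sits batch_size/2 rows away
    pos_mask = self_mask.roll(shifts=cos_sim.shape[0] // 2, dims=0)
    cos_sim = cos_sim / temperature
    # The denominator sums over q_i.k_j and also q_i.q_j (i != j)
    nll = -cos_sim[pos_mask] + torch.logsumexp(cos_sim, dim=-1)
    return nll.mean()


def cross_modal_loss(q: torch.Tensor, k: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-modal InfoNCE (hypothetical helper).

    The denominator contains only q_i.k_j terms, so there are no
    intra-modal negatives.
    """
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.T / temperature  # shape: (N, N), rows = images, cols = texts
    targets = torch.arange(q.shape[0], device=q.device)
    # image -> text plus text -> image
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)


# Example usage with random embeddings (assumed shapes: N=4 pairs, dim=8)
torch.manual_seed(0)
images = torch.randn(4, 8)
texts = torch.randn(4, 8)
loss_concat = simclr_style_loss(torch.cat([images, texts], dim=0))
loss_cross = cross_modal_loss(images, texts)
```

In the first function, each row's log-sum-exp runs over all `2N - 1` other samples, while in the second it runs over the `N` samples of the other modality only. That extra set of same-modality terms in the denominator is the discrepancy this issue is asking about.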

_Originally posted by @alex6095 in https://github.com/fabawi/ImageBind-LoRA/issues/7#issuecomment-1804086364_
            
