> this is handled by the masking. Notice that we set the similarity score between a sample and itself to 0. Additionally we mask one half out and keep the other for computing the similarity between the images and other (in this case, text) modalities. The positive sample is the only one corresponding to the image `batch_size/2` samples away. The remaining text samples are negative samples
```python
# Mask out cosine similarity to itself
self_mask = torch.eye(
    cos_sim.shape[0], dtype=torch.bool, device=cos_sim.device)
cos_sim.masked_fill_(self_mask, -9e15)
```
Based on your explanation, let's say `q_i` is one image. Then its positive sample is the single text sample located `batch_size/2` positions away. But the negative samples are not only the remaining text samples; the remaining image samples are included as well, aren't they? This code masks only the similarity of a sample with itself, so the other image samples stay in the denominator. I think this adds the terms `q_i · q_j` (i != j) and `k_i · k_j` (i != j).
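To make the question concrete, here is a minimal sketch, not the repository's code. It assumes a batch laid out as `n_pairs` image embeddings followed by their `n_pairs` text embeddings, uses random features, and introduces the hypothetical names `same_modality` and `info_nce` purely for illustration. It counts how many similarities remain in each row under the self-mask alone versus a mask that also removes same-modality pairs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n_pairs = 4  # hypothetical batch: 4 images followed by their 4 texts
feats = F.normalize(torch.randn(2 * n_pairs, 8), dim=-1)
cos_sim = feats @ feats.T  # (2N, 2N) cosine similarities

# Self-mask only, as in the quoted snippet.
self_mask = torch.eye(2 * n_pairs, dtype=torch.bool)
self_only = cos_sim.masked_fill(self_mask, -9e15)

# The positive for row i sits n_pairs positions away (wrapping around).
pos_mask = self_mask.roll(shifts=n_pairs, dims=0)

# Assumed alternative: also mask every image-image and text-text pair,
# so that only cross-modal similarities remain in each row.
is_text = torch.arange(2 * n_pairs) >= n_pairs
same_modality = is_text[:, None] == is_text[None, :]
cross_only = cos_sim.masked_fill(same_modality, -9e15)


def info_nce(sim, pos_mask, temperature=0.07):
    """InfoNCE over the rows of an already-masked similarity matrix."""
    sim = sim / temperature
    return (-sim[pos_mask] + torch.logsumexp(sim, dim=-1)).mean()


# Number of unmasked similarities per row (positive plus negatives).
terms_self_only = (self_only > -1e15).sum(dim=-1)
terms_cross_only = (cross_only > -1e15).sum(dim=-1)

loss_self = info_nce(self_only, pos_mask)
loss_cross = info_nce(cross_only, pos_mask)

print(terms_self_only.tolist())   # 2N - 1 terms per row
print(terms_cross_only.tolist())  # N terms per row
```

With the self-mask alone, each row keeps `2N - 1` similarities, so the `N - 1` same-modality samples do act as negatives, which is the behaviour this issue asks about. Whether that is intended depends on the objective: a SimCLR-style loss keeps them, while a purely cross-modal loss would need something like the second mask.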
Originally posted by @alex6095 in #7 (comment)