InfoNCELoss seems different from ImageBind paper #10

@alex6095

> This is handled by the masking. Notice that we set the similarity score between a sample and itself to 0. Additionally, we mask one half out and keep the other for computing the similarity between the images and the other (in this case, text) modality. The positive sample is the only one corresponding to the image `batch_size/2` samples away. The remaining text samples are negative samples.
```python
# Mask out cosine similarity to itself
self_mask = torch.eye(
    cos_sim.shape[0], dtype=torch.bool, device=cos_sim.device)
cos_sim.masked_fill_(self_mask, -9e15)
```

Based on your explanation, let's say our q_i is one image. Then the positive sample is the single text `batch_size/2` samples away from that image. But the negative samples are not only the remaining text samples; the remaining image samples are added as well, aren't they? This code masks only the sample itself and leaves the other image samples in the denominator. I think this adds the terms q_i · q_j (i != j) and k_i · k_j (i != j).
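To make the point concrete, here is a minimal, framework-free sketch of which indices end up as negatives for one anchor. It assumes the layout described in the quoted reply (images at indices `0..N-1`, texts at `N..2N-1`, positive `N` positions away); the function names and `n_pairs` are hypothetical and do not come from the repository.

```python
def negatives_self_mask_only(i, n_pairs):
    """Negatives for anchor i when only the diagonal (i == j) is masked."""
    total = 2 * n_pairs
    pos = (i + n_pairs) % total  # positive sits batch_size/2 samples away
    return [j for j in range(total) if j != i and j != pos]


def negatives_cross_modal_only(i, n_pairs):
    """Negatives for anchor i when the whole same-modality half is masked too."""
    total = 2 * n_pairs
    pos = (i + n_pairs) % total
    # Assumed layout: first half is one modality, second half is the other.
    same_modality = range(0, n_pairs) if i < n_pairs else range(n_pairs, total)
    return [j for j in range(total) if j not in same_modality and j != pos]


n_pairs = 4   # 4 image-text pairs, so 8 concatenated embeddings
anchor = 0    # an image, i.e. q_0; its positive text is index 4

a = negatives_self_mask_only(anchor, n_pairs)
b = negatives_cross_modal_only(anchor, n_pairs)
intra = sorted(set(a) - set(b))  # extra same-modality terms, q_i . q_j with i != j

print(a)      # [1, 2, 3, 5, 6, 7]
print(b)      # [5, 6, 7]
print(intra)  # [1, 2, 3]
```

With self-masking only, indices 1 to 3 (the other images) stay in the denominator; those are the q_i · q_j terms I mean. If the intent is to use only cross-modal negatives, the mask would also need to cover the same-modality block.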

Originally posted by @alex6095 in #7 (comment)
