
Potential duplicates in rewritten subsets #92

Open
21chenb opened this issue Jan 18, 2025 · 0 comments


21chenb commented Jan 18, 2025

Hello DataComp team!

I'm seeking some clarification on the problem setup. To my understanding, when specifying a subset, assigning a weight > 1 to a particular datapoint means it can appear multiple times in the rewritten dataset. This duplication could cause the same datapoint to appear twice in a single batch during contrastive training, potentially degrading performance, since the datapoint would then be contrasted against a copy of itself.

Do you have any mechanisms or suggestions within DataComp to help detect or handle these duplicate datapoints? If not, how would you recommend mitigating potential issues caused by having duplicates in the final dataset?
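For reference, here is a minimal sketch of how one might check a rewritten subset for duplicates before training, assuming the subset is stored as a numpy array of sample uids (the uid values below are made up for illustration):

```python
import numpy as np

# Hypothetical subset of sample uids; in practice this would be
# loaded from the rewritten subset file, e.g. np.load("subset.npy").
uids = np.array([101, 202, 303, 202, 404])

# Count occurrences of each uid to find duplicates.
unique, counts = np.unique(uids, return_counts=True)
duplicates = unique[counts > 1]

print(f"total uids: {len(uids)}, unique: {len(unique)}")
print(f"duplicated uids: {duplicates.tolist()}")

# One simple mitigation: deduplicate before writing the final subset.
deduped = np.unique(uids)
```

This only detects exact uid-level duplication introduced by weights > 1; it would not catch near-duplicate images or captions with distinct uids.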

Thank you in advance for your guidance!
