Dataset mixture at the shard level #425
Conversation
So, these sorts of changes are pretty high risk re changing train behaviour. I'm not updating the current code on the cluster because the last change got merged earlier than I would have liked (w/o verification of value). Are we sure this is correct? It's still not equivalent to #107, which samples per local batch from the same dataset (which I think is probably a bit more desired). If sampling across datasets at the sample level is not as good, then the batch at the transition between two shards is maybe not as good either, and mixing local batches from different datasets into a global batch is not as good as having a global batch all from one dataset (that last one is not easily solvable)... Either way, my point is that doing experiments and figuring out the best approach is not something to do on main for this sort of change.
Why these changes are (subtle) trouble and need lots of testing: this change will alter the sample progression for the same seed in training. From https://docs.python.org/dev/library/random.html#random.choices: "For a given seed, the choices() function with equal weighting typically produces a different sequence than repeated calls to choice(). The algorithm used by choices() uses floating point arithmetic for internal consistency and speed. The algorithm used by choice() defaults to integer arithmetic with repeated selections to avoid small biases from round-off error."
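To make that concrete, here is a small standalone sketch (the shard names are made up) of how the two APIs diverge for the same seed:

```python
import random

population = ["shard-a", "shard-b", "shard-c"]

# Repeated choice() calls: integer arithmetic under the hood.
rng = random.Random(42)
via_choice = [rng.choice(population) for _ in range(5)]

# A single choices() call with uniform weights: floating point under the hood.
rng = random.Random(42)
via_choices = rng.choices(population, weights=[1, 1, 1], k=5)

# For the same seed the two sequences will generally differ, so swapping
# one for the other silently changes the sample progression in training.
print(via_choice)
print(via_choices)
```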
Thanks for the comments! I agree with you that we should be careful about these changes. I also agree that this is not equivalent to #107 (it is not intended to be). Maybe we could implement that, but I think it should be a separate PR (this PR deals with upsampling/downsampling data sources; we could have a separate one adding the option to form global batches from only one data source). Re. correctness, I've tested both #398 and this PR on standard runs (without using the new flag) and observed virtually no difference in performance when training on CC12M. I'm happy to do any other training runs you think would better gauge the impact of this PR. This code has also been tested on two other experiments where the upsampling weights are integers; the new code yielded the same results as duplicating the upsampled shards in the input string.
Re. random seeds, indeed the code currently won't produce exactly the same sequence as before for the same random seed, because of the difference between choice and choices noted above.
@gabrielilharco yeah, those were actually my review comments, but as usual I forgot to hit the final button to make the review active. If no weights are passed, self.weights should be None. If weights are None, use rng.choice; otherwise use choices with the weights.
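A minimal sketch of what that branching could look like (the class and attribute names are illustrative, not the repository's actual code):

```python
import random

class ShardSampler:
    """Illustrative only: picks the next shard URL to read from."""

    def __init__(self, urls, weights=None, seed=0):
        self.urls = urls
        self.weights = weights          # None when no weights are passed
        self.rng = random.Random(seed)

    def next_shard(self):
        if self.weights is None:
            # Unweighted path keeps the historical rng.choice() behaviour,
            # so the sample progression for a given seed is unchanged.
            return self.rng.choice(self.urls)
        # Weighted path: choices() supports per-element weights.
        return self.rng.choices(self.urls, weights=self.weights, k=1)[0]
```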
Great, I'll make the change and test it with another CC12M run and will let you know once it's done |
Oh yeah, one other comment... does this actually sample dataset A 2x more frequently (in samples) than B? Wouldn't the mix of samples seen from the datasets change significantly based on the shard composition (# samples per shard) of the two datasets, and be more complicated than just 1::2? EDIT: looping back to the original motivation to switch to this approach from the per-sample one, when you compared them, was there verification that the ratios of samples seen (across the datasets) were the same?
In this PR, 1::2 weights for datasets A::B does not mean that B will be sampled twice as often as A; it means that we will sample from B 2x more often than normal (and 1x as often for A). This is different from the previous PR, where 1::1 meant sampling from both datasets with equal frequency. In this PR, 1::1 is equivalent to not passing the new flag (in expectation we sample proportionally to the sizes of the datasets). I took this into account when comparing the PRs.
1::1 made intuitive sense for the previous approach though; selecting per sample or per batch from 1 of N datasets with those ratios, you know what you get. With mixing across datasets like this, the end mix depends on how the datasets are sharded, and that changes per instance. If you mix, say, ImageNet-22k and LAION-2B on two different clusters, you'd end up with different ratios of samples for the same 'input' weight, which seems rather non-obvious and confusing.
It should depend only on the dataset sizes (not the shard sizes) in expectation. With this approach, sampling with equal frequency from the different sources does require knowing the sizes of the datasets, though (i.e. if dataset A has size 10 and dataset B has size 100, using 10::1 would lead to seeing each dataset with equal frequency in expectation). We could add another flag for specifying the sizes, but that seems a bit messy to me. Maybe we can change the flag name to avoid confusion? E.g.
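To make that weighting semantics concrete, a hypothetical worked example (dataset sizes and weights are made-up numbers):

```python
# Expected per-dataset sample fractions under this weighting scheme:
# each dataset is seen proportionally to (its size) x (its weight).
sizes   = {"A": 10, "B": 100}
weights = {"A": 10, "B": 1}

scaled = {name: sizes[name] * weights[name] for name in sizes}
total = sum(scaled.values())
fractions = {name: scaled[name] / total for name in scaled}
print(fractions)  # {'A': 0.5, 'B': 0.5} -> equal frequency in expectation
```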
We now use rng.choice when no weights are passed and choices with the weights otherwise, as suggested. I also added some new tests for this.
@gabrielilharco right, I convinced myself that changing the ratio between # of shards and samples per shard for same ds sample count does not alter the overall sampling frequency per dataset |
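A quick Monte Carlo sketch of that sanity check. It assumes a picked shard contributes all of its samples, and the shard layouts, sizes, and weights are made up:

```python
import random

def simulate(shard_sizes, weights, picks=200_000, seed=0):
    """shard_sizes: {dataset: [samples per shard, ...]}; weights: {dataset: w}.
    Picks shards with probability proportional to the dataset weight and
    counts how many samples each dataset contributes."""
    rng = random.Random(seed)
    shards = [(name, n) for name, sizes in shard_sizes.items() for n in sizes]
    w = [weights[name] for name, _ in shards]
    counts = {name: 0 for name in shard_sizes}
    for name, n in rng.choices(shards, weights=w, k=picks):
        counts[name] += n  # a picked shard contributes all of its samples
    total = sum(counts.values())
    return {name: round(counts[name] / total, 3) for name in counts}

# Same total sample count per dataset, two different shard layouts.
layout_1 = {"A": [100] * 10, "B": [100] * 20}  # A: 1000 samples, B: 2000
layout_2 = {"A": [50] * 20,  "B": [200] * 10}  # A: 1000 samples, B: 2000
weights = {"A": 1, "B": 1}

print(simulate(layout_1, weights))  # roughly {'A': 0.33, 'B': 0.67}
print(simulate(layout_2, weights))  # roughly the same, despite different sharding
```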
@gabrielilharco tests are great to have, thanks!! |
Following #398, this moves the logic for mixing datasets from the sample level to the shard level.
Empirically this gives better results than the previous sample-level logic and also simplifies the code: on two experiments mixing 6 data sources, I got 60.1% vs 59.5% and 61.6% vs 59.8% ImageNet accuracy respectively for shard-level vs sample-level mixing. This is in line with findings from #107 and https://arxiv.org/abs/2112.09331.
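For context, a rough sketch of the two mixing strategies being compared here (hypothetical generators, not the repository's actual webdataset pipeline):

```python
import random

def sample_level_mix(sample_iters, weights, rng):
    """Previous approach (sketch): pick the source dataset for every sample."""
    names = list(sample_iters)
    w = [weights[n] for n in names]
    while True:
        name = rng.choices(names, weights=w, k=1)[0]
        yield next(sample_iters[name])

def shard_level_mix(shards_by_dataset, weights, rng):
    """This PR (sketch): pick a shard, weighted by its dataset, then read it."""
    shards = [(name, shard) for name, shard_list in shards_by_dataset.items()
              for shard in shard_list]
    w = [weights[name] for name, _ in shards]
    while True:
        _, shard = rng.choices(shards, weights=w, k=1)[0]
        yield from shard  # consecutive samples come from the same shard

# Tiny usage example for the shard-level variant (toy shards).
rng = random.Random(0)
shards = {"A": [["a1", "a2"], ["a3", "a4"]], "B": [["b1", "b2", "b3"]]}
stream = shard_level_mix(shards, {"A": 1, "B": 2}, rng)
print([next(stream) for _ in range(8)])
```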
I also tested this with the standard workflow (i.e. no --train-data-weights is used), and did not observe any impact when training on CC12M (25.04% vs 24.95% zero-shot ImageNet accuracy).

CC @rwightman @rom1504