DCLM-baseline dataset for nanochat (drop-in replacement) #469
ddudek started this conversation in Show and tell
Replies: 1 comment
> Disappointing given the DCLM paper, where they claim a gap on fineweb. But there are many things to be careful with.
👋
I've repackaged the DCLM-baseline dataset to test it out on nanochat and see how it impacts training in general, CORE, and the other metrics.
It's a 4% (~150B-token) sample of the full DCLM dataset. Compared to Andrej's 100B-token sample of fineweb-edu, this one is 1.5x larger, which I'd guess is more than plenty for the scales we're testing now.
Part of my motivation was that several people in the initial threads here were curious as well, asking why DCLM isn't used instead of fineweb-edu, given the claims that it's a better dataset.
Here's the dataset:
https://huggingface.co/datasets/ddudek/nanochat-dclm-baseline-150b-shuffle
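For anyone curious how a sample like this can be produced, here's a rough sketch of the idea, not the exact script behind the dataset above: stream the full DCLM-baseline, keep a random ~4% of documents, shuffle, and write fixed-size parquet shards. The upstream repo id, the "text" column name, and the shard naming below are assumptions; adjust them to whatever you actually pull.

```python
# Rough sketch: take a ~4% random sample of DCLM-baseline and re-shard it as
# shuffled parquet files. The upstream dataset id and the "text" column are
# assumptions -- adjust them to whatever you actually stream from.
import random
import pyarrow as pa
import pyarrow.parquet as pq
from datasets import load_dataset

SAMPLE_RATE = 0.04        # ~4% of the full dataset, roughly 150B tokens
DOCS_PER_SHARD = 250_000  # arbitrary; pick a shard size matching the fineweb-edu shards
rng = random.Random(1337)

ds = load_dataset("mlfoundations/dclm-baseline-1.0-parquet", split="train", streaming=True)

buffer, shard_idx = [], 0
for doc in ds:
    if rng.random() > SAMPLE_RATE:
        continue
    buffer.append(doc["text"])
    if len(buffer) >= DOCS_PER_SHARD:
        rng.shuffle(buffer)  # local shuffle within each shard
        pq.write_table(pa.table({"text": buffer}), f"shard_{shard_idx:05d}.parquet")
        buffer, shard_idx = [], shard_idx + 1

# flush the remaining tail into a final shard
if buffer:
    rng.shuffle(buffer)
    pq.write_table(pa.table({"text": buffer}), f"shard_{shard_idx:05d}.parquet")
```

Note that this only gives a local shuffle within each shard; a global shuffle needs an extra pass (or enough disk/RAM to reorder the sampled documents before writing).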
It's a drop-in replacement, so the only thing you need to do is swap the dataset URL in nanochat/dataset.py. Feel free to use it in any way, test it yourself, and share your findings.
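To illustrate, the change is essentially a one-liner. The constant name and the fineweb-edu repo shown below are my assumption of roughly what dataset.py looks like; check your checkout for where the shard download URL is actually defined.

```python
# nanochat/dataset.py -- repoint the shard download URL
# (constant name is illustrative and may differ in your checkout)

# before: Andrej's fineweb-edu sample
# BASE_URL = "https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle/resolve/main"

# after: the DCLM-baseline repackage
BASE_URL = "https://huggingface.co/datasets/ddudek/nanochat-dclm-baseline-150b-shuffle/resolve/main"
```

Since the repackage mirrors the original shard layout, everything downstream (shard download, tokenization, data loading) should work unchanged.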
Initial testing
I did some basic runs at D12 and D14 (depth-12 and depth-14 models). The main point of interest was the impact on CORE and the other benchmarks, since loss values are not comparable across different datasets.
Hypothesis / goal:
Using the DCLM dataset would improve the CORE metric or some of the individual benchmarks.
Results:
TL;DR:
No gains in CORE on small models; better performance on lambada_openai, but worse performance on the ARC benchmarks and possibly some others.
Methodology:
Runs are based on commit 6a477ee ("fix: pass device_type to....").
CORE:
D12 / D14: see the charts in the linked W&B reports; no CORE gains at either size.
lambada_openai:
This benchmark seems to show an improvement over fineweb-edu at both model sizes.
D12 / D14: see the charts in the linked W&B reports.
ARC and some others:
Unfortunately, the dataset also seems to have a negative impact on some of the benchmarks, ARC being one example.
D12 / D14: see the charts in the linked W&B reports.
Other benchmarks were either inconsistent across the two model sizes or too noisy to draw any conclusions.
Full set of results:
D12: https://api.wandb.ai/links/ddudek-ai/hc8nuvr0
D14: https://api.wandb.ai/links/ddudek-ai/bn7mqnos