DCLM-baseline dataset for nanochat (drop-in replacement) #469
ddudek started this conversation in Show and tell
Replies: 1 comment
> Disappointing given the DCLM paper, where they claim a gap on fineweb. But there are many things to be careful with.
👋
I've repackaged the DCLM-baseline dataset to test it out on nanochat and see how it impacts training in general, CORE, and the other metrics.
It's a 4% (~150B-token) sample of the full DCLM dataset. Compared to Andrej's 100B-token sample of fineweb-edu, this one is 1.5x larger, which I'd guess is more than plenty for the scales we're testing now.
Part of my motivation was that several people in the initial threads here were curious as well, asking why DCLM isn't used instead of fineweb-edu, given the claims that it's a better dataset.
Here's the dataset:
https://huggingface.co/datasets/ddudek/nanochat-dclm-baseline-150b-shuffle
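For anyone curious how a sample like this can be produced, here's a rough sketch of the idea, not the exact script behind the dataset above: stream the full DCLM-baseline, keep a random ~4% of documents, shuffle, and write fixed-size parquet shards. The upstream repo id, the "text" column name, and the shard naming below are assumptions; adjust them to whatever you actually pull.

```python
# Rough sketch: take a ~4% random sample of DCLM-baseline and re-shard it as
# shuffled parquet files. The upstream dataset id and the "text" column are
# assumptions -- adjust them to whatever you actually stream from.
import random
import pyarrow as pa
import pyarrow.parquet as pq
from datasets import load_dataset

SAMPLE_RATE = 0.04        # ~4% of the full dataset, roughly 150B tokens
DOCS_PER_SHARD = 250_000  # arbitrary; pick a shard size matching the fineweb-edu shards
rng = random.Random(1337)

ds = load_dataset("mlfoundations/dclm-baseline-1.0-parquet", split="train", streaming=True)

buffer, shard_idx = [], 0
for doc in ds:
    if rng.random() > SAMPLE_RATE:
        continue
    buffer.append(doc["text"])
    if len(buffer) >= DOCS_PER_SHARD:
        rng.shuffle(buffer)  # local shuffle within each shard
        pq.write_table(pa.table({"text": buffer}), f"shard_{shard_idx:05d}.parquet")
        buffer, shard_idx = [], shard_idx + 1

# flush the remaining tail into a final shard
if buffer:
    rng.shuffle(buffer)
    pq.write_table(pa.table({"text": buffer}), f"shard_{shard_idx:05d}.parquet")
```

Note that this only gives a local shuffle within each shard; a global shuffle needs an extra pass (or enough disk/RAM to reorder the sampled documents before writing).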
It's a drop-in replacement, so the only thing you need to do is swap the dataset URL in nanochat/dataset.py. Feel free to use it in any way, test it yourself, and share your findings.
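To illustrate, the change is essentially a one-liner. The constant name and the fineweb-edu repo shown below are my assumption of roughly what dataset.py looks like; check your checkout for where the shard download URL is actually defined.

```python
# nanochat/dataset.py -- repoint the shard download URL
# (constant name is illustrative and may differ in your checkout)

# before: Andrej's fineweb-edu sample
# BASE_URL = "https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle/resolve/main"

# after: the DCLM-baseline repackage
BASE_URL = "https://huggingface.co/datasets/ddudek/nanochat-dclm-baseline-150b-shuffle/resolve/main"
```

Since the repackage mirrors the original shard layout, everything downstream (shard download, tokenization, data loading) should work unchanged.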
Initial testing
I did some basic runs at D12 and D14 (depth-12 and depth-14 models). The main point of interest was the impact on CORE and the other benchmarks, since loss values are not comparable across different datasets.
Hypothesis / goal:
Using the DCLM dataset would improve the CORE metric or some of the individual benchmarks.
Results:
TL;DR:
No gains in CORE on small models; better performance on lambada_openai, but worse performance on the ARC benchmarks and possibly some others.
Methodology:
Runs are based on commit 6a477ee ("fix: pass device_type to....").
CORE:
D12 / D14: see the charts in the linked W&B reports; no CORE gains at either size.
lambada_openai:
This benchmark seems to show an improvement over fineweb-edu at both model sizes.
D12 / D14: see the charts in the linked W&B reports.
ARC and some others:
Unfortunately, the dataset also seems to have a negative impact on some of the benchmarks, ARC being one example.
D12 / D14: see the charts in the linked W&B reports.
Other benchmarks were either inconsistent across the two model sizes or too noisy to draw any conclusions.
Full set of results:
D12: https://api.wandb.ai/links/ddudek-ai/hc8nuvr0
D14: https://api.wandb.ai/links/ddudek-ai/bn7mqnos