Answering some questions from Cheolmin Kim sent via email:
While preparing the datasets, I encountered a few issues and was hoping you could provide some guidance.
a) algebraic-stack, open-web-math, pes2o, and wiki: I downloaded these and ran the Dolma tokenizer for each directory with num_processes equal to the number of files. However, I noticed the file ordering is not preserved in the output. For example, wiki-part-00-00000.npy corresponds to wiki-0001.json.gz, while wiki-part-01-00000.npy and wiki-part-01-00001.npy correspond to wiki-0000.json.gz. In your config, it is wiki-part-00 that has two sub-partitions, not wiki-part-01. Did you make specific modifications to preserve order, or is this random?
I don't recall making any modifications to preserve order; all the data is randomized prior to training anyway, so the mapping between input files and output shards shouldn't matter.
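For reference, a minimal sketch of the kind of per-directory run described above, assuming the Dolma CLI's `dolma tokens` subcommand; the flag names and tokenizer identifier here are assumptions, not the exact command used for these datasets. Since each worker process writes its own output shard independently, the shard numbering need not follow the input file order.

```python
# Rough per-directory driver; the `dolma tokens` flag names and the tokenizer
# identifier below are assumptions based on the Dolma CLI, not the exact
# command used for these datasets.
import glob
import subprocess

DATA_DIR = "data/wiki"        # directory of *.json.gz shards (hypothetical layout)
OUT_DIR = "tokenized/wiki"

files = sorted(glob.glob(f"{DATA_DIR}/*.json.gz"))
subprocess.run(
    [
        "dolma", "tokens",
        "--documents", f"{DATA_DIR}/*.json.gz",
        "--destination", OUT_DIR,
        "--tokenizer.name_or_path", "allenai/gpt-neox-olmo-dolma-v1_5",  # assumed tokenizer
        "--processes", str(len(files)),  # one worker per input file, as described above
    ],
    check=True,
)
```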
b) arxiv: I downloaded the dataset from https://huggingface.co/datasets/EleutherAI/proof-pile-2 and tried running the Dolma tokenizer after converting it to .json.gz format. However, I ran into a "missing id" error. Is this expected, and is there a recommended fix?
I haven't run into this error. Maybe the documents with missing IDs can just be skipped?
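If the error comes from documents that lack an `id` field, another option is to synthesize one while converting to .json.gz, since Dolma-format documents generally need `id` and `text` fields. A hypothetical conversion sketch (the proof-pile-2 config name and row fields are assumptions):

```python
# Hypothetical conversion sketch: write proof-pile-2 records as Dolma-style
# .json.gz lines, synthesizing an "id" when the source row has none.
# The dataset config name and row fields are assumptions.
import gzip
import json

from datasets import load_dataset

ds = load_dataset("EleutherAI/proof-pile-2", "arxiv", split="train", streaming=True)

with gzip.open("arxiv-0000.json.gz", "wt", encoding="utf-8") as f:
    for i, row in enumerate(ds):
        doc = {
            "id": row.get("id") or f"arxiv-{i}",  # avoid the "missing id" error
            "text": row["text"],
            "source": "proof-pile-2/arxiv",
        }
        f.write(json.dumps(doc) + "\n")
```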
c) starcoder: I ran the Dolma tokenizer with --processes 49 to match the partition counts in your config. However, the sub-partition counts do not match. I am wondering how you obtained sub-partitions up to 0003 (4 files). Did you use --max_size 2_147_483_648? The Starcoder datasets on Hugging Face are only 102GB in total, so I am surprised that the tokenized datasets resulted in 3-4 sub-partitions of 4GB each.
Hm, yeah, maybe it was a different --max_size.
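For what it's worth, the sub-partition count per partition is roughly the tokenized partition size divided by --max_size, so a different cap alone would explain the mismatch. A quick illustrative sketch (the token count and uint16 dtype are assumptions, not actual Starcoder figures):

```python
# Back-of-the-envelope shard-count estimate (all numbers illustrative).
import math

bytes_per_token = 2                      # uint16 token ids (assumed dtype)
tokens_in_partition = 7_000_000_000      # hypothetical tokens in one Starcoder partition

for max_size in (2_147_483_648, 4_294_967_296):  # 2 GiB vs 4 GiB shard cap
    shards = math.ceil(tokens_in_partition * bytes_per_token / max_size)
    print(f"max_size={max_size}: {shards} sub-partitions")
```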
d) dclm: How did you tokenize this dataset? Did you run the Dolma tokenizer on all 1969 files, and did you adjust the --max_size?
Yeah, I think we ran the Dolma tokenizer on everything.
cc @soldni, who may have better answers to some of these.