Dataset questions #41

@Muennighoff

Answering some questions from Cheolmin Kim sent via email:

a) algebraic-stack, open-web-math- pes2o, and wiki: I downloaded these and ran the Dolma tokenizer for each directory with num_processes equal to the number of files. However, I noticed the file ordering is not preserved in the output. For example, wiki-part-00-00000.npy corresponds to wiki-0001.json.gz, while wiki-part-01-00000.npy and wiki-part-01-00001.npy corresponds to wiki-0000.json.gz. In your config, it is wiki-part-00 that has two sub-partitions not wiki-part-01. Did you make specific modifications to preserve order, or is this random?

I don't recall making modifications to preserve order; all the data is randomized prior to training anyway.
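As far as I know, each worker process writes its own part-XX shard as it picks input files off the queue, so which input file lands in which part isn't something we controlled. The runs themselves were just plain per-directory invocations, roughly like the sketch below (paths and the tokenizer name are placeholders, and flag spellings other than --processes and --max_size may differ by dolma version):

    # Sketch of a per-directory run (placeholders, not the exact command we used).
    dolma tokens \
        --documents "wiki/documents/*.json.gz" \
        --destination "wiki/tokens" \
        --tokenizer.name_or_path "allenai/gpt-neox-olmo-dolma-v1_5" \
        --processes 2 \
        --max_size 2_147_483_648 \
        --seed 0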

b) arxiv: I downloaded the dataset from https://huggingface.co/datasets/EleutherAI/proof-pile-2 and tried running the Dolma tokenizer after converting it to .json.gz format. However, I ran into a "missing id" error. Is this expected, and is there a recommended fix?

I haven't faced this error. Maybe it is just some IDs that can be skipped?
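If the failure is on records without an id field, one workaround could be to add ids before tokenizing. A minimal, untested sketch (assuming the tokenizer only needs a non-empty id string per document, and that jq is available; the path is a placeholder):

    # Untested sketch: give every record that lacks an "id" a per-file line-number id
    # before tokenizing.
    for f in arxiv/*.json.gz; do
      src="$(basename "$f" .json.gz)"
      zcat "$f" \
        | jq -c --arg src "$src" \
            'if has("id") and .id != null then . else .id = ($src + "-" + (input_line_number | tostring)) end' \
        | gzip > "${f%.json.gz}.with-ids.json.gz"
    done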

c) starcoder: I ran the Dolma tokenizer with --processes 49 to match the partition counts in your config. However, the sub-partition counts do not match. I am wondering how you obtained sub-partitions up to 0003 (4 files). Did you use --max_size 2_147_483_648? The StarCoder datasets on Hugging Face are only 102 GB in total, so I am surprised that the tokenized datasets resulted in 3-4 sub-partitions of 4 GB each.

Hm, yeah, maybe it is a different max size.
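For what it's worth, a back-of-envelope check, assuming --max_size is the per-file byte cap (my understanding of the flag): ~4 GB sub-partitions would point to a cap around 4_294_967_296 (4 GiB) rather than 2_147_483_648 (2 GiB), and 3-4 such files per partition would mean each of the 49 processes wrote roughly 12-16 GB of tokens. I haven't re-checked the actual run, though.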

d) dclm: How did you tokenize this dataset? Did you run the Dolma tokenizer on all 1969 files, and did you adjust the --max_size?

Yeah, I think we ran the Dolma tokenizer on everything.
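Something along the lines of the sketch above, just pointed at a glob over all the shards; the process count and max size here are illustrative, not the values we actually used:

    # Illustrative only: one run over every dclm shard (paths and values are placeholders).
    dolma tokens \
        --documents "dclm/documents/*.json.gz" \
        --destination "dclm/tokens" \
        --tokenizer.name_or_path "allenai/gpt-neox-olmo-dolma-v1_5" \
        --processes 128 \
        --max_size 4_294_967_296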

cc @soldni, who may have better answers to some of these.
