📑 Paper | 🔨 fastText Classifier | 🤗 Released Dataset | 📦 Repo
We release our trained fastText classifier and a 100B-token filtered high-quality dataset on Hugging Face for direct use.
| Name | Type | Huggingface Link |
|---|---|---|
| preselect-fasttext-classifier | Model | 🤗Huggingface |
| preselect-100B | Dataset | 🤗Huggingface |
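If you prefer to script the download, here is a minimal sketch using `huggingface_hub`; the `repo_id` below is a placeholder, so substitute the actual repository linked in the table:

```python
# Minimal sketch: fetch the released classifier from Hugging Face.
from huggingface_hub import hf_hub_download

classifier_path = hf_hub_download(
    repo_id="<org>/preselect-fasttext-classifier",  # placeholder; use the repo linked above
    filename="PreSelect-classifier.bin",            # file name used by the filtering script below
)
print(classifier_path)
```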
We provide a Dockerfile that contains the environment for filtering, training, and evaluation.
```bash
docker build -t preselect:latest .
docker run --gpus all --network host -it --shm-size=20g --privileged preselect:latest
```
After that, you need to prepare your pretraining corpus (e.g., download a Common Crawl subset). We provide an example to download DCLM's RefinedWeb. Note that this requires setting up AWS access beforehand.
```bash
cd data_processing/data/clean_pool
python download.py
python unzip.py
```
You can also prepare your own data.
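If you bring your own data, note that the filtering script below reads JSONL files with the document text under a `"text"` key. A minimal sketch of producing such a file:

```python
# Minimal sketch: write a corpus as JSONL with a "text" field,
# the key the filtering script below reads (text_key="text").
import json

docs = ["first document ...", "second document ..."]  # your raw documents
with open("my_corpus.jsonl", "w") as f:
    for text in docs:
        f.write(json.dumps({"text": text}) + "\n")
```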
If you want to directly use our trained fastText classifier, you can download it from Hugging Face and run the following code:
```python
import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, help="input path name")
parser.add_argument("--output_path", type=str, help="output path name")
args = parser.parse_args()

Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=True,
    pipeline=[
        # Read JSONL documents whose text lives under the "text" key
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents the classifier labels "1" with score >= 0.5
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        JsonlWriter(args.output_path, compression=None),
    ],
    tasks=100,
)
dist_executor.run()
```
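Assuming you save the snippet above as, say, `filter.py` (the file name is arbitrary), you can run it with `python filter.py --input_path <raw_jsonl_dir> --output_path <filtered_dir>`.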
The first step is to pick a small subset of the corpus and compute the bits-per-character (BPC) of each example under each reference model.
```bash
cd data_processing/bpc
python -u main.py \
    --model_name {MODEL_NAME} \
    --block_size 1900 \
    --stride 512 \
    --batch_size 1
```
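For intuition, BPC is a model's average next-token cross-entropy over a document, converted to bits and normalized by the document's character count. Below is a minimal sketch using Hugging Face `transformers`; the model name is a stand-in, and this is not the repo's exact sliding-window implementation (see `data_processing/bpc/main.py` for that):

```python
# Minimal BPC sketch (illustrative; the repo's main.py handles long documents
# with a sliding window controlled by --block_size/--stride).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for {MODEL_NAME}
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def bits_per_character(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # .loss is the mean next-token cross-entropy in nats
        loss = model(ids, labels=ids).loss.item()
    total_nats = loss * (ids.shape[1] - 1)          # sum over predicted tokens
    return total_nats / (math.log(2) * len(text))   # nats -> bits, per character

print(bits_per_character("An example document whose compressibility we measure."))
```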
Then you can train the fastText classifier using the data computed in Step 1.
```bash
cd data_processing/fasttext
python train_fasttext.py
```
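Under the hood this step fits a supervised fastText classifier on the BPC-derived labels. A minimal sketch with the `fasttext` package follows; the input file name and hyperparameters are illustrative, not the settings of `train_fasttext.py`:

```python
# Illustrative supervised fastText training. The training file holds one
# example per line in fastText format, e.g.:
#   __label__1 <text of a document to keep>
#   __label__0 <text of a document to drop>
import fasttext

model = fasttext.train_supervised(
    input="fasttext_train.txt",  # hypothetical path to the Step-1-derived data
    epoch=3,
    lr=0.1,
    wordNgrams=2,
)
model.save_model("PreSelect-classifier.bin")  # file name the filtering script expects
print(model.predict("some new document text"))  # -> (labels, probabilities)
```

The `"1"` label here is what the filtering script's `keep_labels=[("1", 0.5)]` threshold refers to.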
Finally, you can filter your large corpus with the trained fastText classifier. The provided script runs on a single CPU machine, but it can easily be extended to multi-machine filtering.
```bash
bash pipeline.sh {FASTTEXT_NAME} filter NO NO NO NO 0 NO 1 0.1
```
If you are training with a single node (e.g., 8 GPUs), you can use the following command:
```bash
bash pipeline.sh {FASTTEXT_NAME} NO tokenize train convert NO 0 NO 1 0.1 {HOME_PATH} 1 {TRAINING_STEPS}
```
If you are training with multiple nodes (e.g., 8 GPUs × 4 nodes), you can use the following command:
```bash
bash pipeline_multi_node.sh {FASTTEXT_NAME} NO tokenize train convert NO {MAIN_NODE_ADDRESS} NO 1 0.1 {HOME_PATH} {N_NODE} {TRAINING_STEPS}
```
For more information, refer to the pipeline scripts.
You can refer to OpenCompass and LM-Evaluation-Harness to set up evaluation of trained checkpoints to fit your needs.
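For example, with LM-Evaluation-Harness a checkpoint can be evaluated via its standard CLI, e.g. `lm_eval --model hf --model_args pretrained=/path/to/checkpoint --tasks hellaswag --batch_size 8`; adjust the task list to your needs.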
If you find this work helpful, please cite it as:
```bibtex
@article{shum2025predictivedataselectiondata,
  title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
  author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
  journal={arXiv preprint arXiv:2503.00808},
  year={2025},
  eprint={2503.00808},
}
```
We thank the following open-source projects, from which some of the code in this repository is adapted and modified:
