Skip to content

Code for FinerWeb-10BT – tools for cleaning web data line by line using LLMs

License

Notifications You must be signed in to change notification settings

TurkuNLP/finerweb-10bt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FinerWeb-10BT

Code for paper "FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering". This repository contains tools for cleaning web data line by line using LLMs and training language models on the filtered data.

Pipeline Steps

1. Generate LLM Annotations

Generate initial line quality annotations using GPT-4o mini:

python annotate_line_quality.py

2. Train and Use Classifier

Navigate to classifier directory:

cd classifier

Prepare Data

Stratify the data for training:

python stratify.py

Train Model

Fine-tune DeBERTa-v3 classifier:

python run.py --train --base_model "microsoft/deberta-v3-base"

Create Platt Scaler

Calibrate model probabilities:

python run.py --finetuned_model_path FINETUNED_MODEL_PATH \
              --base_model "microsoft/deberta-v3-base" --platt

Predict Quality Scores

Apply classifier to FineWeb-10BT:

python run.py --finetuned_model_path FINETUNED_MODEL_PATH \
              --base_model "microsoft/deberta-v3-base" --predict_fineweb

3. Tokenize Data

Prepare data for GPT-2 training:

cd GPT2/scripts
python tokenize_documents_parallel_exquisiteweb.py \
    --dataset_path PATH_TO_FINERWEB-10BT \
    --local_dir LOCAL_DIR \
    --quality_threshold QUALITY_THRESHOLD

4. Train GPT-2

Train GPT-2 on filtered data:

cd GPT2
./slurm.sh torchrun --standalone --nproc_per_node=4 scripts/train_gpt.py \
    --data_root TOKENIZED_DATASET \
    --n_batches 16 \
    --n_tokens 1024 \
    --vocab_size 50304 \
    --emb_dim 768 \
    --context_length 1024 \
    --n_layers 12 \
    --n_heads 12 \
    --log_dir LOG_DIRECTORY

Data

The enhanced FineWeb-10BT dataset with quality scores is available at: https://huggingface.co/datasets/TurkuNLP/finerweb-10bt

Citation

[Coming soon]

License

MIT

About

Code for FinerWeb-10BT – tools for cleaning web data line by line using LLMs

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published