[Dataset Performance] Add num workers on dataset processing - labels, tokenization #1189

horheynm · 2025-02-25T15:54:40Z

SUMMARY:

Add preprocessing_num_workers so that dataset processing (tokenization and adding labels) runs in parallel in the 2:4 example.
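For readers unfamiliar with the mechanism, here is a minimal sketch of how a worker count is typically forwarded to Hugging Face `datasets` via the `num_proc` argument of `Dataset.map`. It is not the code from this PR: the `preprocess` helper, the toy whitespace "tokenizer", and the rule that labels are a copy of `input_ids` are all assumptions made for illustration. Only `preprocessing_num_workers`, `num_proc`, and the "Tokenizing" / "Adding labels" step names come from the description.

```python
def tokenize_batch(batch):
    # Toy stand-in for a real tokenizer (assumption): one integer id per
    # whitespace-separated token, here simply the token length.
    return {"input_ids": [[len(tok) for tok in text.split()] for text in batch["text"]]}


def add_labels(example):
    # Assumption: labels for causal-LM training are a copy of input_ids.
    return {"labels": list(example["input_ids"])}


def preprocess(texts, preprocessing_num_workers=None):
    # Hypothetical helper. `datasets` is imported here so the pure functions
    # above stay usable without the library installed.
    from datasets import Dataset

    ds = Dataset.from_dict({"text": texts})
    # num_proc=None processes in the main process; num_proc=N shards the
    # dataset across N worker processes.
    ds = ds.map(
        tokenize_batch,
        batched=True,
        num_proc=preprocessing_num_workers,
        desc="Tokenizing",
    )
    ds = ds.map(
        add_labels,
        num_proc=preprocessing_num_workers,
        desc="Adding labels",
    )
    return ds


if __name__ == "__main__":
    sample = ["hello world", "sparse models are fast"] * 4
    processed = preprocess(sample, preprocessing_num_workers=2)
    print(processed[0])
```

When `num_proc` is set, `datasets` appends it to the progress-bar description, which is why the "After" logs read `Tokenizing (num_proc=8)`.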

Before:
Tokenizing: 371.12 examples/s,
Adding labels: 1890.18 examples/s,
Tokenizing: 333.39 examples/s

Tokenizing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:34<00:00, 371.12 examples/s]
Adding labels: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:06<00:00, 1890.18 examples/s]
Tokenizing:   9%|█████████▌                                                                                                     | 22077/256032 [00:59<11:41, 333.39 examples/s]

After (num_proc=8):
Tokenizing: 2703.93 examples/s,
Adding labels: 5524.98 examples/s,
Tokenizing: 2925.98 examples/s

Tokenizing (num_proc=8): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:04<00:00, 2703.93 examples/s]
Adding labels (num_proc=8): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 12802/12802 [00:02<00:00, 5524.98 examples/s]
Tokenizing (num_proc=8): 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 256032/256032 [01:27<00:00, 2925.98 examples/s]
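As a quick reading aid, the throughput figures above can be turned into speedup factors. The snippet below is only arithmetic on the numbers reported in this description; the dictionary keys are labels chosen here for illustration. Note that the "before" rate for the 256032-example run was captured at 9% progress, so that ratio is approximate.

```python
# Throughput in examples/s, copied from the logs in this description.
before = {
    "tokenizing_12802": 371.12,
    "adding_labels_12802": 1890.18,
    "tokenizing_256032": 333.39,  # snapshot taken at 9% progress
}
after = {
    "tokenizing_12802": 2703.93,
    "adding_labels_12802": 5524.98,
    "tokenizing_256032": 2925.98,
}

# Speedup = rate after / rate before, for each step.
speedups = {step: after[step] / before[step] for step in before}

for step, factor in speedups.items():
    print(f"{step}: {factor:.2f}x")
```

With `num_proc=8`, tokenization comes close to the ideal 8x, while adding labels gains less, roughly 3x. One plausible reason is that a step this cheap per example is dominated by fixed per-process overhead.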

TEST PLAN:

Pass existing tests

github-actions · 2025-02-25T15:54:57Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

kylesayrs

Nice, thanks

horheynm added 2 commits February 25, 2025 10:47

add num workers on dataset processing - labels, tokenization

379c188

Merge branch 'main' into num-proc-dataset

70ce6ba

horheynm added the ready label (When a PR is ready for review) Feb 25, 2025

brian-dellabetta approved these changes Feb 25, 2025

kylesayrs approved these changes Feb 25, 2025

Merge branch 'main' into num-proc-dataset

cb458e2

dsikka enabled auto-merge (squash) February 25, 2025 20:33

dsikka merged commit 77e4f4c into main Feb 25, 2025
7 checks passed

dsikka deleted the num-proc-dataset branch February 25, 2025 21:40
