[Dataset Performance] Add num workers on dataset processing - labels, tokenization #1189
SUMMARY:
Add preprocessing_num_workers to run dataset processing in parallel for the 2:4 example.
Before:
Tokenizing: 371.12 examples/s,
Adding labels: 1890.18 examples/s,
Tokenizing: 333.39 examples/s
After (num_proc=8):
Tokenizing: 2703.93 examples/s,
Adding labels: 5524.98 examples/s,
Tokenizing: 2925.98 examples/s
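For a quick read of the numbers above, the per-stage speedup can be worked out as the ratio of the "after" rate to the "before" rate. The sketch below only does that arithmetic. It assumes the three lines in each list describe the same stages in the same order; the stage labels with "(first pass)" and "(second pass)" are illustrative names, not taken from the logs.

```python
# Throughput figures copied from the Before / After lists (examples/s).
# Assumption: the stages are paired by position in each list.
before = [
    ("Tokenizing (first pass)", 371.12),
    ("Adding labels", 1890.18),
    ("Tokenizing (second pass)", 333.39),
]
after = [
    ("Tokenizing (first pass)", 2703.93),
    ("Adding labels", 5524.98),
    ("Tokenizing (second pass)", 2925.98),
]

# Speedup = rate after / rate before, rounded to two decimals.
speedups = [
    (name, round(rate_after / rate_before, 2))
    for (name, rate_before), (_, rate_after) in zip(before, after)
]

for name, factor in speedups:
    print(f"{name}: {factor}x")
# Tokenizing (first pass): 7.29x
# Adding labels: 2.92x
# Tokenizing (second pass): 8.78x
```

With num_proc=8, the two tokenization passes land close to the worker count, while label assignment gains less; the logs alone do not say why.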
TEST PLAN: