From a2579f6b090aa29dde9a98dcbe9d68771812e0ae Mon Sep 17 00:00:00 2001
From: Svyatoslav Pchelintsev
Date: Mon, 19 May 2025 20:12:06 +0300
Subject: [PATCH] Fixed typos

---
 CONTRIBUTING.md        | 2 +-
 docs/source/stream.mdx | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index d20d8a08b54..f1f022b6fd7 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -95,7 +95,7 @@ Note that if any files were formatted by `pre-commit` hooks during committing, y
 git push -u origin a-descriptive-name-for-my-changes
 ```
 
- Go the webpage of your fork on GitHub. Click on "Pull request" to send your to the project maintainers for review.
+ Go the webpage of your fork on GitHub. Click on "Pull request" to send your changes to the project maintainers for review.
 
 ## Datasets on Hugging Face
 
diff --git a/docs/source/stream.mdx b/docs/source/stream.mdx
index 2073785aebf..13450a4d971 100644
--- a/docs/source/stream.mdx
+++ b/docs/source/stream.mdx
@@ -190,7 +190,7 @@ Define sampling probabilities from each of the original datasets for more contro
 {'text': 'Chevrolet Cavalier Usados en Bogota - Carros en Vent...'}]
 ```
 
-Around 80% of the final dataset is made of the `en_dataset`, and 20% of the `fr_dataset`.
+Around 80% of the final dataset is made of the `es_dataset`, and 20% of the `fr_dataset`.
 
 You can also specify the `stopping_strategy`. The default strategy, `first_exhausted`, is a subsampling strategy, i.e the dataset construction is stopped as soon one of the dataset runs out of samples. You can specify `stopping_strategy=all_exhausted` to execute an oversampling strategy. In this case, the dataset construction is stopped as soon as every samples in every dataset has been added at least once. In practice, it means that if a dataset is exhausted, it will return to the beginning of this dataset until the stop criterion has been reached.