
Migrate train_bi-encoder_mnrl.py from v2 to v3#3634

Open
ritoban23 wants to merge 1 commit into huggingface:main from ritoban23:migrate-train-bi-encoder-mnrl-v2-to-v3

Conversation

@ritoban23

Part of #3621

Changes

  • Migrated examples/sentence_transformer/training/ms_marco/train_bi-encoder_mnrl.py from v2 to v3
  • Replaced model.fit() with SentenceTransformerTrainer and SentenceTransformerTrainingArguments

@tomaarsen
Member

I'm afraid this won't work very nicely. The SentenceTransformerTrainer expects its train_dataset to be a datasets.Dataset, and not a torch.utils.data.Dataset. This is definitely one of the more difficult files that hasn't been updated to the v3 format.

  • Tom Aarsen

@ritoban23
Author

@tomaarsen
Yes, I caught that earlier. The torch Dataset didn't have the column_names attribute that validate_column_names expects.

I'm planning to:

  • Replace the MSMARCODataset class with a function that uses Dataset.from_dict()
  • Create anchor/positive/negative columns for the triplets

Question: the original MSMARCODataset class rotates through multiple positives/negatives per query across batches (using pop/append). Should I:

  1. Use just the first pos/neg per query and rely on trainer shuffling, or
  2. would you suggest a different approach?

@tomaarsen
Member

I think you've effectively found what made this file so hard to upgrade: the MSMARCODataset pos/neg rotations. I think if we just take the first pos/neg and run 10 epochs like the script currently does, then we'll likely have worse performance than the old script.
But if we can create a datasets.Dataset with triplets that contains many more triplets than the existing MSMARCODataset, then we should be able to get equivalent results. To give you some more context, I believe that https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives/resolve/main/msmarco-hard-negatives.jsonl.gz is simply a zip of this folder: https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives/tree/main/nearest_neighbors, and the files from that folder have been uploaded in separate datasets here: https://huggingface.co/collections/sentence-transformers/ms-marco-mined-triplets
And the https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives/resolve/main/cross-encoder-ms-marco-MiniLM-L-6-v2-scores.pkl.gz file has been uploaded here in a few different formats: https://huggingface.co/datasets/sentence-transformers/msmarco-scores-ms-marco-MiniLM-L6-v2

The triplet-hard splits should already be equivalent to the ce_score_threshold filtering that the script does, so in theory it might be possible to (roughly):

  1. Load the triplet-hard or triplet-hard-ids datasets from the "systems to use"
  2. Maybe apply some more filtering for e.g. num_negs_per_system per dataset to not have too many negatives for the same query.
  3. concatenate_datasets the datasets together into one big train dataset
  4. Pass this big dataset to the Trainer and rely on the Trainer shuffling to help out.
  5. Use batch_sampler=BatchSamplers.NO_DUPLICATES: the idea is that this prevents the same text from occurring multiple times in a batch, which can be bad for performance as the loss here uses in-batch negatives, i.e. other texts in the batch are always used as negatives (even if they might be the same as the positive).
  6. Always use num_epochs=1, but now allow setting a max_steps instead, e.g. defaulting to 1e7. Users can then specify None for "all data".
  7. Perhaps use evaluator = NanoBEIREvaluator(dataset_names=["msmarco", "nq"]) with eval_steps=0.1 to get some evaluation during training. I like the "ratio" options for eval_steps/save_steps/logging_steps so you always know that you're e.g. saving and evaluating 30 times and logging 300 times.

I think something like that could work, what are your impressions?

  • Tom Aarsen
