
replace ms-marco datasets and migrate examples#3649

Open
omkar-334 wants to merge 15 commits into huggingface:main from omkar-334:bienc-margin

Conversation

@omkar-334
Contributor

@omkar-334 omkar-334 commented Feb 2, 2026

Part of #3620 and #3621

@omkar-334 omkar-334 changed the title migrate train_bi-encoder_margin-mse.py from v2 and v3 migrate train_bi-encoder_margin-mse.py from v2 to v3 Feb 2, 2026
@tomaarsen
Member

The dataset replacement part might not be necessary if we follow an approach like in #3634 (comment). It could simplify things a lot.

  • Tom Aarsen

@omkar-334
Contributor Author

The dataset replacement part might not be necessary if we follow an approach like in #3634 (comment). It could simplify things a lot.

  • Tom Aarsen

Yes, I am going through your comments at #3634 and #3635. Thanks!

@omkar-334 omkar-334 marked this pull request as ready for review February 6, 2026 17:48
@omkar-334
Contributor Author

I've replaced the datasets and migrated the training logic. Now I have to check the performance and look into Tom's suggestions.

@omkar-334 omkar-334 changed the title migrate train_bi-encoder_margin-mse.py from v2 to v3 replace ms-marco datasets and migrate train_bi-encoder_margin-mse.py from v2 to v3 Feb 6, 2026
@omkar-334
Contributor Author

omkar-334 commented Feb 8, 2026

Old approach -

[Screenshot: old approach results, 2026-02-08]

...

New approach -

[Screenshot: new approach results, 2026-02-08]

@omkar-334
Contributor Author

  1. The issue with queries_hf = load_dataset("sentence-transformers/msmarco-corpus", "query", split="train") is that it contains the train, test, and val splits all merged into a single split. This causes errors in the dataset logic where some keys are not found. I uploaded the original queries files to my user account, and this works perfectly.

  2. Issue with

hard_negatives_hf = load_dataset(
    "sentence-transformers/msmarco-hard-negatives",
    split="train",
    streaming=True,  # ← key
)

The column structure is different, leading to cast errors. The workaround is to enforce streaming=True, but then loading the final dataset takes more than 10 minutes. So I reverted to the original file, using it with hf_hub_download. This works perfectly.
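For context, the file-based route can be sketched as follows. This is a toy sketch, not the PR's code: it parses a tiny in-memory stand-in instead of the real msmarco-hard-negatives.jsonl.gz, and the qid/pos/neg record layout is an assumption based on the description above.

```python
import gzip
import io
import json

# In the actual script the file path comes from hf_hub_download; here we
# build and parse a tiny in-memory stand-in with the assumed record layout.
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"qid": 1, "pos": [100], "neg": {"bm25": [200, 201]}}) + "\n")

# One JSON object per line, keyed by qid, as the original .jsonl.gz is consumed.
queries = {}
with gzip.open(io.BytesIO(buf.getvalue()), "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        queries[record["qid"]] = {"pos": record["pos"], "neg": record["neg"]}
```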

self.queries[qid]["pos"] = list(self.queries[qid]["pos"])
self.queries[qid]["neg"] = list(self.queries[qid]["neg"])
random.shuffle(self.queries[qid]["neg"])
neg = random.choice(negs)
Contributor Author

This step ensures that the number of training samples is the same as in the old approach.
The alternative is to loop through negs and yield a sample for each item.
What do you think, @tomaarsen?
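The two options can be sketched with a toy query dict (the query/pos/neg field names are assumptions mirroring the snippet above):

```python
import random

# Toy query entry with one positive and three hard negatives.
query = {"query": "q1", "pos": ["doc_pos"], "neg": ["doc_n1", "doc_n2", "doc_n3"]}

def one_sample_per_query(q):
    # Current approach: one triplet per query, negative drawn at random,
    # keeping the total sample count equal to the old MSMARCODataset.
    yield {"query": q["query"], "positive": q["pos"][0], "negative": random.choice(q["neg"])}

def one_sample_per_negative(q):
    # Alternative: one triplet per negative, yielding more samples per query.
    for neg in q["neg"]:
        yield {"query": q["query"], "positive": q["pos"][0], "negative": neg}

assert len(list(one_sample_per_query(query))) == 1
assert len(list(one_sample_per_negative(query))) == 3
```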

Contributor Author

But if we can create a datasets.Dataset with triplets that contains many more triplets than the existing MSMARCODataset, then we should be able to get equivalent results.


@omkar-334 omkar-334 changed the title replace ms-marco datasets and migrate train_bi-encoder_margin-mse.py from v2 to v3 replace ms-marco datasets and migrate examples Feb 10, 2026
@omkar-334
Contributor Author

TrainOutput(global_step=7859, training_loss=6.986085102718619, metrics={'train_runtime': 5418.4237, 'train_samples_per_second': 92.82, 'train_steps_per_second': 1.45, 'total_flos': 0.0, 'train_loss': 6.986085102718619, 'epoch': 1.0})

The result for training bi-encoder_margin-mse.

@omkar-334
Contributor Author

Hey @tomaarsen, could you review this when you're available?

@tomaarsen
Member

Will try to have a look tomorrow. I've been updating #3554 locally considerably the last few days.

  • Tom Aarsen

@omkar-334
Contributor Author

Thanks! Is there anything I can help out with?

@tomaarsen
Member

Not at this point, I think. There's still a lot of back and forth in terms of the implementation. My apologies also, I haven't gotten around to PR reviews today.

  • Tom Aarsen

neg_id = query["neg"].pop(0) # Pop negative and add at end
neg_text = self.corpus[neg_id]
query["neg"].append(neg_id)
train_dataset = Dataset.from_generator(build_samples)
Member

Hmm, does this create the batch once and then reuse it if you have multiple epochs? The old one had different batches each time.
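A minimal sketch of the concern, using a plain Python list as a stand-in for the Arrow table that Dataset.from_generator builds (an assumption about its caching behavior): once materialized, the random choices inside the generator are frozen, so every epoch iterates identical rows.

```python
import random

random.seed(0)

def build_samples():
    # Randomness runs only while the generator is consumed.
    for qid in range(3):
        yield {"qid": qid, "neg": random.choice(["a", "b", "c"])}

# Materializing once, as Dataset.from_generator does when writing rows out,
# freezes the random choices: every epoch then sees the same rows.
frozen = list(build_samples())
epoch_1 = [row["neg"] for row in frozen]
epoch_2 = [row["neg"] for row in frozen]
assert epoch_1 == epoch_2

# By contrast, consuming a fresh generator per epoch would re-draw negatives.
redrawn = [row["neg"] for row in build_samples()]
```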

Comment on lines +69 to +73
hard_negatives_filepath = hf_hub_download(
repo_id="sentence-transformers/msmarco-hard-negatives",
filename="msmarco-hard-negatives.jsonl.gz",
repo_type="dataset",
)

Member

cc @omkar-334. I'd like to avoid the msmarco-hard-negatives.jsonl.gz file itself if possible, and I think these datasets are in the collection I linked.

Contributor Author

@omkar-334 omkar-334 Feb 24, 2026

Hey @tomaarsen, apologies for the delay on this. I was initially confused: I was looking for one dataset in the collection that was a replica of the old file. I inspected a bit more with the help of Claude and realised the collection is structured differently: each system (BM25, distilbert-tas-b, etc.) is its own dataset, and to reconstruct the original format you need to load all 13 and merge by query_id.

I did that and compared against the old file:

  • Common qids: 502,939
  • Missing from new: 305,792 — but these all have empty pos lists in the old file, so they were never usable for training anyway
  • Extra in new: 0
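The comparison above can be sketched with toy dicts (the qid -> {"pos": [...]} layout is an assumption based on the old file's format):

```python
# Toy stand-ins: qid 2 has an empty pos list in the old file and is
# absent from the merged new datasets.
old = {1: {"pos": [10]}, 2: {"pos": []}, 3: {"pos": [30]}}
new = {1: {"pos": [10]}, 3: {"pos": [30]}}

common = old.keys() & new.keys()
missing_from_new = old.keys() - new.keys()
extra_in_new = new.keys() - old.keys()

# Every qid missing from the new datasets had an empty pos list in the old
# file, so it was never usable for training anyway.
assert all(not old[qid]["pos"] for qid in missing_from_new)
assert len(extra_in_new) == 0
```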

On my local machine (Mac, 24 GB RAM), the old file took 2.5-4 minutes to load, whereas the new approach takes 8-9 minutes.

Old approach -

[Screenshot: old approach timing, 2026-02-25]

Here's the script for the new approach -

import pandas as pd
from datasets import load_dataset

SYSTEMS = {
    "bm25": "sentence-transformers/msmarco-bm25",
    "msmarco-distilbert-base-tas-b": "sentence-transformers/msmarco-msmarco-distilbert-base-tas-b",
    "msmarco-distilbert-base-v3": "sentence-transformers/msmarco-msmarco-distilbert-base-v3",
    "msmarco-MiniLM-L-6-v3": "sentence-transformers/msmarco-msmarco-MiniLM-L6-v3",
    "distilbert-margin_mse-cls-dot-v2": "sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v2",
    "distilbert-margin_mse-cls-dot-v1": "sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v1",
    "distilbert-margin_mse-mean-dot-v1": "sentence-transformers/msmarco-distilbert-margin-mse-mean-dot-v1",
    "mpnet-margin_mse-mean-v1": "sentence-transformers/msmarco-mpnet-margin-mse-mean-v1",
    "co-condenser-margin_mse-cls-v1": "sentence-transformers/msmarco-co-condenser-margin-mse-cls-v1",
    "distilbert-margin_mse-mnrl-mean-v1": "sentence-transformers/msmarco-distilbert-margin-mse-mnrl-mean-v1",
    "distilbert-margin_mse-sym_mnrl-mean-v1": "sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v1",
    "distilbert-margin_mse-sym_mnrl-mean-v2": "sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v2",
    "co-condenser-margin_mse-sym_mnrl-mean-v1": "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1",
}

NEG_COLS = [f"negative_{i}" for i in range(1, 51)]
KEEP_COLS = ["query", "positive"] + NEG_COLS  # adjust if some datasets use _id suffix

dfs = []

for system_key, repo_id in SYSTEMS.items():
    print(f"Loading {system_key}...")
    ds = load_dataset(repo_id, "triplet-50-ids", split="train")

    # Normalize column names to query/positive regardless of dataset
    rename = {}
    if "query_id" in ds.column_names:
        rename["query_id"] = "query"
    if "positive_id" in ds.column_names:
        rename["positive_id"] = "positive"

    # Drop text columns if any exist (keep only ID columns)
    drop = [c for c in ds.column_names if c not in KEEP_COLS and c not in rename]
    ds = ds.remove_columns(drop)
    if rename:
        ds = ds.rename_columns(rename)

    df = ds.to_pandas()
    df["system"] = system_key
    dfs.append(df)

# Single concatenated dataframe: one row per (query, system)
print("Concatenating...")
combined_df = pd.concat(dfs, ignore_index=True)
del dfs  # free memory

# Aggregate: group by query, collect pos and all negs per system
print("Aggregating...")


def agg(group):
    pos = group["positive"].unique().tolist()
    neg = {row["system"]: [row[c] for c in NEG_COLS if pd.notna(row[c])] for _, row in group.iterrows()}
    return pd.Series({"pos": pos, "neg": neg})


df_new = combined_df.groupby("query").apply(agg).reset_index()
df_new.rename(columns={"query": "qid"}, inplace=True)

Comparison -

[Screenshot: comparison, 2026-02-25]

Contributor Author

@omkar-334 omkar-334 Feb 24, 2026

What do you think about this? I'm still a bit confused about the time taken and the memory usage.

Contributor Author

Hey @tomaarsen, just a nudge.

omkar-334 and others added 3 commits February 25, 2026 02:00
@tomaarsen
Member

tomaarsen commented Mar 2, 2026

Apologies for the delay. I took some time today to revisit this, and I pushed some changes primarily to simplify it all. The performance will definitely be different (e.g. I'm training with n-tuples instead of purely triplets), and I haven't had time to run this myself yet. So the comments and defaults are still a bit off.

It also requires #3680

  • Tom Aarsen

@omkar-334
Contributor Author

I took some time today to revisit this, and I pushed some changes primarily to simplify it all. The performance will definitely be different (e.g. I'm training with n-tuples instead of purely triplets), and I haven't had time to run this myself yet. So the comments and defaults are still a bit off.

Thanks for taking the time and checking this out... Out of interest, were there particular parts of my script that influenced this commit, or did you mostly rework it from scratch?

Also, what do you think about the time it takes to load and merge the datasets? Is it better to upload the final merged dataset to huggingface so we can just load that instead?
