
replace ms-marco datasets and migrate examples#3649

Open
omkar-334 wants to merge 15 commits into huggingface:main from omkar-334:bienc-margin

Conversation

@omkar-334
Contributor

@omkar-334 omkar-334 commented Feb 2, 2026

Part of #3620 and #3621

@omkar-334 omkar-334 changed the title migrate train_bi-encoder_margin-mse.py from v2 and v3 migrate train_bi-encoder_margin-mse.py from v2 to v3 Feb 2, 2026
@tomaarsen
Member

The dataset replacement part might not be necessary if we follow an approach like in #3634 (comment). It could simplify things a lot.

  • Tom Aarsen

@omkar-334
Contributor Author

The dataset replacement part might not be necessary if we follow an approach like in #3634 (comment). It could simplify things a lot.

  • Tom Aarsen

Yes, I am going through your comments at #3634 and #3635. Thanks!

@omkar-334 omkar-334 marked this pull request as ready for review February 6, 2026 17:48
@omkar-334
Contributor Author

I've replaced the datasets and migrated the training logic. Now I have to check the performance and look into Tom's suggestions.

@omkar-334 omkar-334 changed the title migrate train_bi-encoder_margin-mse.py from v2 to v3 replace ms-marco datasets and migrate train_bi-encoder_margin-mse.py from v2 to v3 Feb 6, 2026
@omkar-334
Contributor Author

omkar-334 commented Feb 8, 2026

Old approach -

[Screenshot: old approach results, 2026-02-08]

...

New approach -

[Screenshot: new approach results, 2026-02-08]

@omkar-334
Contributor Author

  1. The issue with queries_hf = load_dataset("sentence-transformers/msmarco-corpus", "query", split="train") is that it contains the train, test, and val splits all merged into a single split. This causes errors in the dataset logic where some keys are not found. I uploaded the original queries files to my user account, and this works perfectly.

  2. Issue with

hard_negatives_hf = load_dataset(
    "sentence-transformers/msmarco-hard-negatives",
    split="train",
    streaming=True,  # ← key
)

The column structure is different, leading to cast errors. The workaround is to enforce streaming=True, but then loading the final dataset takes more than 10 minutes. So I reverted to the original file, using it with hf_hub_download. This works perfectly.
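For context, the file-based route can be sketched as follows. This is a toy sketch, not the PR's code: it parses a tiny in-memory stand-in instead of the real msmarco-hard-negatives.jsonl.gz, and the qid/pos/neg record layout is an assumption based on the description above.

```python
import gzip
import io
import json

# In the actual script the file path comes from hf_hub_download; here we
# build and parse a tiny in-memory stand-in with the assumed record layout.
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"qid": 1, "pos": [100], "neg": {"bm25": [200, 201]}}) + "\n")

# One JSON object per line, keyed by qid, as the original .jsonl.gz is consumed.
queries = {}
with gzip.open(io.BytesIO(buf.getvalue()), "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        queries[record["qid"]] = {"pos": record["pos"], "neg": record["neg"]}
```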

self.queries[qid]["pos"] = list(self.queries[qid]["pos"])
self.queries[qid]["neg"] = list(self.queries[qid]["neg"])
random.shuffle(self.queries[qid]["neg"])
neg = random.choice(negs)
Contributor Author

This step ensures that the number of training samples is the same as in the old approach.
The alternative is to loop through negs and yield a sample for each item.
What do you think, @tomaarsen?
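The two options can be sketched with a toy query dict (the query/pos/neg field names are assumptions mirroring the snippet above):

```python
import random

# Toy query entry with one positive and three hard negatives.
query = {"query": "q1", "pos": ["doc_pos"], "neg": ["doc_n1", "doc_n2", "doc_n3"]}

def one_sample_per_query(q):
    # Current approach: one triplet per query, negative drawn at random,
    # keeping the total sample count equal to the old MSMARCODataset.
    yield {"query": q["query"], "positive": q["pos"][0], "negative": random.choice(q["neg"])}

def one_sample_per_negative(q):
    # Alternative: one triplet per negative, yielding more samples per query.
    for neg in q["neg"]:
        yield {"query": q["query"], "positive": q["pos"][0], "negative": neg}

assert len(list(one_sample_per_query(query))) == 1
assert len(list(one_sample_per_negative(query))) == 3
```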

Contributor Author

But if we can create a datasets.Dataset with triplets that contains many more triplets than the existing MSMARCODataset, then we should be able to get equivalent results.


@omkar-334 omkar-334 changed the title replace ms-marco datasets and migrate train_bi-encoder_margin-mse.py from v2 to v3 replace ms-marco datasets and migrate examples Feb 10, 2026
@omkar-334
Contributor Author

TrainOutput(global_step=7859, training_loss=6.986085102718619, metrics={'train_runtime': 5418.4237, 'train_samples_per_second': 92.82, 'train_steps_per_second': 1.45, 'total_flos': 0.0, 'train_loss': 6.986085102718619, 'epoch': 1.0})

The result for training bi-encoder_margin-mse.

@omkar-334
Contributor Author

Hey @tomaarsen, could you review this when you're available?

@tomaarsen
Member

Will try to have a look tomorrow. I've been updating #3554 locally considerably the last few days.

  • Tom Aarsen

@omkar-334
Contributor Author

Thanks! Is there anything I can help out with?

@tomaarsen
Member

Not at this point, I think. There's still a lot of back and forth in terms of the implementation. My apologies also, I haven't gotten around to PR reviews today.

  • Tom Aarsen

neg_id = query["neg"].pop(0) # Pop negative and add at end
neg_text = self.corpus[neg_id]
query["neg"].append(neg_id)
train_dataset = Dataset.from_generator(build_samples)
Member

Hmm, does this create the batch once and then reuse it if you have multiple epochs? The old one had different batches each time.
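A minimal sketch of the concern, using a plain Python list as a stand-in for the Arrow table that Dataset.from_generator builds (an assumption about its caching behavior): once materialized, the random choices inside the generator are frozen, so every epoch iterates identical rows.

```python
import random

random.seed(0)

def build_samples():
    # Randomness runs only while the generator is consumed.
    for qid in range(3):
        yield {"qid": qid, "neg": random.choice(["a", "b", "c"])}

# Materializing once, as Dataset.from_generator does when writing rows out,
# freezes the random choices: every epoch then sees the same rows.
frozen = list(build_samples())
epoch_1 = [row["neg"] for row in frozen]
epoch_2 = [row["neg"] for row in frozen]
assert epoch_1 == epoch_2

# By contrast, consuming a fresh generator per epoch would re-draw negatives.
redrawn = [row["neg"] for row in build_samples()]
```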

Comment on lines +69 to +73
hard_negatives_filepath = hf_hub_download(
repo_id="sentence-transformers/msmarco-hard-negatives",
filename="msmarco-hard-negatives.jsonl.gz",
repo_type="dataset",
)

Member

cc @omkar-334. I'd like to avoid the msmarco-hard-negatives.jsonl.gz file itself if possible, and I think these datasets are in the collection I linked.

Contributor Author

@omkar-334 omkar-334 Feb 24, 2026

Hey @tomaarsen, apologies for the delay on this. I was initially confused: I was looking for one dataset in the collection that was a replica of the old file. I inspected a bit more with the help of Claude and realised the collection is structured differently: each system (BM25, distilbert-tas-b, etc.) is its own dataset, and to reconstruct the original format you need to load all 13 and merge by query_id.

I did that and compared against the old file:

  • Common qids: 502,939
  • Missing from new: 305,792 — but these all have empty pos lists in the old file, so they were never usable for training anyway
  • Extra in new: 0
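The comparison above can be sketched with toy dicts (the qid -> {"pos": [...]} layout is an assumption based on the old file's format):

```python
# Toy stand-ins: qid 2 has an empty pos list in the old file and is
# absent from the merged new datasets.
old = {1: {"pos": [10]}, 2: {"pos": []}, 3: {"pos": [30]}}
new = {1: {"pos": [10]}, 3: {"pos": [30]}}

common = old.keys() & new.keys()
missing_from_new = old.keys() - new.keys()
extra_in_new = new.keys() - old.keys()

# Every qid missing from the new datasets had an empty pos list in the old
# file, so it was never usable for training anyway.
assert all(not old[qid]["pos"] for qid in missing_from_new)
assert len(extra_in_new) == 0
```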

On my local machine (Mac, 24 GB RAM), the old file took 2.5-4 minutes to load, whereas the new approach takes 8-9 minutes.

Old approach -

[Screenshot: old approach timing, 2026-02-25]

Here's the script for the new approach -

import pandas as pd
from datasets import load_dataset

SYSTEMS = {
    "bm25": "sentence-transformers/msmarco-bm25",
    "msmarco-distilbert-base-tas-b": "sentence-transformers/msmarco-msmarco-distilbert-base-tas-b",
    "msmarco-distilbert-base-v3": "sentence-transformers/msmarco-msmarco-distilbert-base-v3",
    "msmarco-MiniLM-L-6-v3": "sentence-transformers/msmarco-msmarco-MiniLM-L6-v3",
    "distilbert-margin_mse-cls-dot-v2": "sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v2",
    "distilbert-margin_mse-cls-dot-v1": "sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v1",
    "distilbert-margin_mse-mean-dot-v1": "sentence-transformers/msmarco-distilbert-margin-mse-mean-dot-v1",
    "mpnet-margin_mse-mean-v1": "sentence-transformers/msmarco-mpnet-margin-mse-mean-v1",
    "co-condenser-margin_mse-cls-v1": "sentence-transformers/msmarco-co-condenser-margin-mse-cls-v1",
    "distilbert-margin_mse-mnrl-mean-v1": "sentence-transformers/msmarco-distilbert-margin-mse-mnrl-mean-v1",
    "distilbert-margin_mse-sym_mnrl-mean-v1": "sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v1",
    "distilbert-margin_mse-sym_mnrl-mean-v2": "sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v2",
    "co-condenser-margin_mse-sym_mnrl-mean-v1": "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1",
}

NEG_COLS = [f"negative_{i}" for i in range(1, 51)]
KEEP_COLS = ["query", "positive"] + NEG_COLS  # adjust if some datasets use _id suffix

dfs = []

for system_key, repo_id in SYSTEMS.items():
    print(f"Loading {system_key}...")
    ds = load_dataset(repo_id, "triplet-50-ids", split="train")

    # Normalize column names to query/positive regardless of dataset
    rename = {}
    if "query_id" in ds.column_names:
        rename["query_id"] = "query"
    if "positive_id" in ds.column_names:
        rename["positive_id"] = "positive"

    # Drop text columns if any exist (keep only ID columns)
    drop = [c for c in ds.column_names if c not in KEEP_COLS and c not in rename]
    ds = ds.remove_columns(drop)
    if rename:
        ds = ds.rename_columns(rename)

    df = ds.to_pandas()
    df["system"] = system_key
    dfs.append(df)

# Single concatenated dataframe: one row per (query, system)
print("Concatenating...")
combined_df = pd.concat(dfs, ignore_index=True)
del dfs  # free memory

# Aggregate: group by query, collect pos and all negs per system
print("Aggregating...")


def agg(group):
    pos = group["positive"].unique().tolist()
    neg = {row["system"]: [row[c] for c in NEG_COLS if pd.notna(row[c])] for _, row in group.iterrows()}
    return pd.Series({"pos": pos, "neg": neg})


df_new = combined_df.groupby("query").apply(agg).reset_index()
df_new.rename(columns={"query": "qid"}, inplace=True)

Comparison -

[Screenshot: comparison, 2026-02-25]

Contributor Author

@omkar-334 omkar-334 Feb 24, 2026

What do you think about this? I'm still a bit confused about the time taken and the memory usage.

Contributor Author

Hey @tomaarsen, just a nudge.

omkar-334 and others added 3 commits February 25, 2026 02:00
@tomaarsen
Member

tomaarsen commented Mar 2, 2026

Apologies for the delay. I took some time today to revisit this, and I pushed some changes primarily to simplify it all. The performance will definitely be different (e.g. I'm training with n-tuples instead of purely triplets), and I haven't had time to run this myself yet. So the comments and defaults are still a bit off.

It also requires #3680

  • Tom Aarsen

@omkar-334
Contributor Author

I took some time today to revisit this, and I pushed some changes primarily to simplify it all. The performance will definitely be different (e.g. I'm training with n-tuples instead of purely triplets), and I haven't had time to run this myself yet. So the comments and defaults are still a bit off.

Thanks for taking the time and checking this out... Out of interest, were there particular parts of my script that influenced this commit, or did you mostly rework it from scratch?

Also, what do you think about the time it takes to load and merge the datasets? Is it better to upload the final merged dataset to huggingface so we can just load that instead?
