replace ms-marco datasets and migrate examples #3649

omkar-334 wants to merge 15 commits into huggingface:main
Retitled: train_bi-encoder_margin-mse.py from v2 and v3 → train_bi-encoder_margin-mse.py from v2 to v3
The dataset replacement part might not be necessary if we follow an approach like in #3634 (comment). It could simplify things a lot.

Yes, I am going through your comments at #3634 and #3635. Thanks!

I've replaced the datasets and migrated the training logic. Now I have to check the performance and look into Tom's suggestions.
Retitled: train_bi-encoder_margin-mse.py from v2 to v3 → ms-marco datasets and migrate train_bi-encoder_margin-mse.py from v2 to v3
```python
hard_negatives_hf = load_dataset(
    "sentence-transformers/msmarco-hard-negatives",
    split="train",
    streaming=True,  # ← key
)
```

The column structure is different, leading to cast errors. The solution to this is to enforce
```python
self.queries[qid]["pos"] = list(self.queries[qid]["pos"])
self.queries[qid]["neg"] = list(self.queries[qid]["neg"])
random.shuffle(self.queries[qid]["neg"])
neg = random.choice(negs)
```
This step ensures that the number of training samples is the same as in the old approach.
The alternative is to loop through `negs` and yield a sample for each item.
What do you think, @tomaarsen?
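The per-negative alternative could look roughly like this; the `corpus` and `queries` structures here are toy stand-ins for the script's real data, not the actual implementation:

```python
# Toy stand-ins for the script's real structures (hypothetical data)
corpus = {1: "pos text", 10: "neg a", 11: "neg b"}
queries = {"q1": {"query": "some query", "pos": [1], "neg": [10, 11]}}

def build_samples():
    # One training sample per (query, positive, negative) combination,
    # instead of picking a single random negative per query
    for qid, query in queries.items():
        for pos_id in query["pos"]:
            for neg_id in query["neg"]:
                yield {
                    "query": query["query"],
                    "positive": corpus[pos_id],
                    "negative": corpus[neg_id],
                }

samples = list(build_samples())
```

With two negatives for the single query above, this yields two samples, so the dataset grows with the number of negatives rather than staying one sample per query.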
But if we can create a `datasets.Dataset` with triplets that contains many more triplets than the existing `MSMARCODataset`, then we should be able to get equivalent results.
Retitled: ms-marco datasets and migrate train_bi-encoder_margin-mse.py from v2 to v3 → ms-marco datasets and migrate examples
The result for training

Hey @tomaarsen, could you review this when you're available?
Will try to have a look tomorrow. I've been updating #3554 locally considerably the last few days.
Thanks! Is there anything I can help out with?

Not at this point, I think. There's still a lot of back and forth in terms of the implementation. My apologies also, I haven't gotten around to PR reviews today.
```python
neg_id = query["neg"].pop(0)  # Pop negative and add at end
neg_text = self.corpus[neg_id]
query["neg"].append(neg_id)

train_dataset = Dataset.from_generator(build_samples)
```
Hmm, does this create the batch once and then reuse it if you have multiple epochs? The old one had different batches each time.
examples/sentence_transformer/training/ms_marco/train_bi-encoder_margin-mse.py
```python
hard_negatives_filepath = hf_hub_download(
    repo_id="sentence-transformers/msmarco-hard-negatives",
    filename="msmarco-hard-negatives.jsonl.gz",
    repo_type="dataset",
)
```
Is this roughly the same data as in these? https://huggingface.co/collections/sentence-transformers/ms-marco-mined-triplets
cc @omkar-334. I'd like to avoid the msmarco-hard-negatives.jsonl.gz file itself if possible, and I think these datasets are in the collection I linked.
Hey @tomaarsen, apologies for the delay on this. I was initially confused: I was looking for one dataset in the collection that was a replica of the old file. I inspected a bit more with the help of Claude and realised the collection is structured differently: each system (BM25, distilbert-tas-b, etc.) is its own dataset, and to reconstruct the original format you need to load all 13 and merge by query_id.
I did that and compared against the old file:

- Common qids: 502,939
- Missing from new: 305,792 — but these all have empty `pos` lists in the old file, so they were never usable for training anyway
- Extra in new: 0
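The comparison itself boils down to set arithmetic over the two qid collections; a minimal sketch with toy ids (the real counts came from the full files):

```python
# Toy qid sets standing in for the real data
old_qids = {1, 2, 3, 4}  # qids in msmarco-hard-negatives.jsonl.gz
new_qids = {1, 2, 3}     # qids after merging the collection datasets

common = old_qids & new_qids            # present in both
missing_from_new = old_qids - new_qids  # only in the old file
extra_in_new = new_qids - old_qids      # only in the merged result
```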
On my local machine (Mac, 24 GB RAM), the old file used to take 2.5-4 minutes to load, whereas the new approach takes 8-9 minutes.

Old approach -

Here's the script for the new approach:
```python
import pandas as pd
from datasets import load_dataset

SYSTEMS = {
    "bm25": "sentence-transformers/msmarco-bm25",
    "msmarco-distilbert-base-tas-b": "sentence-transformers/msmarco-msmarco-distilbert-base-tas-b",
    "msmarco-distilbert-base-v3": "sentence-transformers/msmarco-msmarco-distilbert-base-v3",
    "msmarco-MiniLM-L-6-v3": "sentence-transformers/msmarco-msmarco-MiniLM-L6-v3",
    "distilbert-margin_mse-cls-dot-v2": "sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v2",
    "distilbert-margin_mse-cls-dot-v1": "sentence-transformers/msmarco-distilbert-margin-mse-cls-dot-v1",
    "distilbert-margin_mse-mean-dot-v1": "sentence-transformers/msmarco-distilbert-margin-mse-mean-dot-v1",
    "mpnet-margin_mse-mean-v1": "sentence-transformers/msmarco-mpnet-margin-mse-mean-v1",
    "co-condenser-margin_mse-cls-v1": "sentence-transformers/msmarco-co-condenser-margin-mse-cls-v1",
    "distilbert-margin_mse-mnrl-mean-v1": "sentence-transformers/msmarco-distilbert-margin-mse-mnrl-mean-v1",
    "distilbert-margin_mse-sym_mnrl-mean-v1": "sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v1",
    "distilbert-margin_mse-sym_mnrl-mean-v2": "sentence-transformers/msmarco-distilbert-margin-mse-sym-mnrl-mean-v2",
    "co-condenser-margin_mse-sym_mnrl-mean-v1": "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1",
}

NEG_COLS = [f"negative_{i}" for i in range(1, 51)]
KEEP_COLS = ["query", "positive"] + NEG_COLS  # adjust if some datasets use _id suffix

dfs = []
for system_key, repo_id in SYSTEMS.items():
    print(f"Loading {system_key}...")
    ds = load_dataset(repo_id, "triplet-50-ids", split="train")
    # Normalize column names to query/positive regardless of dataset
    rename = {}
    if "query_id" in ds.column_names:
        rename["query_id"] = "query"
    if "positive_id" in ds.column_names:
        rename["positive_id"] = "positive"
    # Drop text columns if any exist (keep only ID columns)
    drop = [c for c in ds.column_names if c not in KEEP_COLS and c not in rename]
    ds = ds.remove_columns(drop)
    if rename:
        ds = ds.rename_columns(rename)
    df = ds.to_pandas()
    df["system"] = system_key
    dfs.append(df)

# Single concatenated dataframe: one row per (query, system)
print("Concatenating...")
combined_df = pd.concat(dfs, ignore_index=True)
del dfs  # free memory

# Aggregate: group by query, collect pos and all negs per system
print("Aggregating...")
def agg(group):
    pos = group["positive"].unique().tolist()
    neg = {row["system"]: [row[c] for c in NEG_COLS if pd.notna(row[c])] for _, row in group.iterrows()}
    return pd.Series({"pos": pos, "neg": neg})

df_new = combined_df.groupby("query").apply(agg).reset_index()
df_new.rename(columns={"query": "qid"}, inplace=True)
```
What do you think about this? I'm still a bit confused about the time taken and the memory usage.
examples/sentence_transformer/training/ms_marco/train_bi-encoder_mnrl.py
I've not had time to run these yet, so I don't have nice defaults, but the structure might be nice
Apologies for the delay. I took some time today to revisit this, and I pushed some changes primarily to simplify it all. The performance will definitely be different (e.g. I'm training with n-tuples instead of purely triplets), and I haven't had time to run this myself yet. So the comments and defaults are still a bit off. It also requires #3680
Thanks for taking the time and checking this out... Out of interest, were there particular parts of my script that influenced this commit, or did you mostly rework it from scratch? Also, what do you think about the time it takes to load and merge the datasets? Would it be better to upload the final merged dataset to Hugging Face so we can just load that instead?



Part of #3620 and #3621