
Migrate train_bi-encoder_mnrl.py from v2 to v3#3634

Open
ritoban23 wants to merge 1 commit into huggingface:main from ritoban23:migrate-train-bi-encoder-mnrl-v2-to-v3

Conversation

@ritoban23

Part of #3621

Changes

  • Migrated examples/sentence_transformer/training/ms_marco/train_bi-encoder_mnrl.py from v2 to v3
  • Replaced model.fit() with SentenceTransformerTrainer and SentenceTransformerTrainingArguments

@tomaarsen
Member

I'm afraid this won't work very nicely. The SentenceTransformerTrainer expects its train_dataset to be a datasets.Dataset, and not a torch.utils.data.Dataset. This is definitely one of the more difficult files that hasn't been updated to the v3 format.

  • Tom Aarsen

@ritoban23
Author

@tomaarsen
Yes, I caught that earlier. The torch Dataset didn't have the column_names attribute that validate_column_names expects.

I'm planning to:

  • Replace the MSMARCODataset class with a function that uses Dataset.from_dict()
  • Create anchor/positive/negative columns for the triplets

Question: the original MSMARCODataset class rotates through multiple positives/negatives per query across batches (using pop/append). Should I:

  1. Use just the first pos/neg per query and rely on trainer shuffling, or
  2. would you suggest a different approach?

@tomaarsen
Member

I think you've effectively found what made this file so hard to upgrade: the MSMARCODataset pos/neg rotations. I think if we just take the first pos/neg and run 10 epochs like the script currently does, then we'll likely have worse performance than the old script.
But if we can create a datasets.Dataset with triplets that contains many more triplets than the existing MSMARCODataset, then we should be able to get equivalent results. To give you some more context, I believe that https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives/resolve/main/msmarco-hard-negatives.jsonl.gz is simply a zip of this folder: https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives/tree/main/nearest_neighbors, and the files from that folder have been uploaded in separate datasets here: https://huggingface.co/collections/sentence-transformers/ms-marco-mined-triplets
And the https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives/resolve/main/cross-encoder-ms-marco-MiniLM-L-6-v2-scores.pkl.gz file has been uploaded here in a few different formats: https://huggingface.co/datasets/sentence-transformers/msmarco-scores-ms-marco-MiniLM-L6-v2

The triplet-hard splits should already be equivalent to the ce_score_threshold filtering that the script does, so in theory it might be possible to (roughly):

  1. Load the triplet-hard or triplet-hard-ids datasets from the "systems to use"
  2. Maybe apply some more filtering for e.g. num_negs_per_system per dataset to not have too many negatives for the same query.
  3. concatenate_datasets the datasets together into one big train dataset
  4. Pass this big dataset to the Trainer and rely on the Trainer shuffling to help out.
  5. Use batch_sampler=BatchSamplers.NO_DUPLICATES: the idea is that this prevents the same text from occurring multiple times in a batch, which can be bad for performance as the loss here uses in-batch negatives, i.e. other texts in the batch are always used as negatives (even if they might be the same as the positive).
  6. Always use num_epochs=1, but now allow setting a max_steps instead, e.g. defaulting to 1e7. Users can then specify None for "all data".
  7. Perhaps use evaluator = NanoBEIREvaluator(dataset_names=["msmarco", "nq"]) with eval_steps=0.1 to get some evaluation during training. I like the "ratio" options for eval_steps/save_steps/logging_steps so you always know that you're e.g. saving and evaluating 30 times and logging 300 times.

I think something like that could work, what are your impressions?

  • Tom Aarsen
