
Need Advice on Training Data Prep #3205

Closed
rupeshgx opened this issue Jan 31, 2025 · 1 comment


@rupeshgx

I have a training dataset where a sample record looks like:
query, [positive1, positive2], [negative1, negative2, negative3, negative4, negative5]

where positive1 and positive2 are documents that were clicked, and negative1 through negative5 are documents that were not clicked.

I want to use this dataset to fine-tune an embedding model. If I want to use Multiple Negatives Ranking loss, then I would need to create multiple records from the record above, so that each record contains only one positive document:
record1: query, positive1, [negative1, negative2, negative3, negative4, negative5]
record2: query, positive2, [negative1, negative2, negative3, negative4, negative5]
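For illustration, that splitting step could be sketched like this (plain Python with placeholder texts; the helper name is hypothetical, not from any library):

```python
def split_record(query, positives, negatives):
    """Split one multi-positive record into one record per positive,
    repeating the shared negatives for each."""
    return [(query, pos, list(negatives)) for pos in positives]

records = split_record(
    "query",
    ["positive1", "positive2"],
    ["negative1", "negative2", "negative3", "negative4", "negative5"],
)
# records[0] -> ("query", "positive1", [five negatives])
# records[1] -> ("query", "positive2", [five negatives])
```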

However, if I understand correctly, when these 2 records fall into the same batch during training, positive2 will be treated as a negative for (query, positive1) and positive1 will be treated as a negative for (query, positive2).

Any advice on how to construct a good quality training dataset? I can throw away record2 but I would like to use that information if possible.

@tomaarsen
Collaborator

Hello!

You can still create records like

record1: query, positive1, negative1, negative2, negative3, negative4, negative5
record2: query, positive2, negative1, negative2, negative3, negative4, negative5

(i.e. a dataset with 7 columns)
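As a sketch, such a 7-column dataset could look like this as plain Python dicts (the column names here are illustrative; in practice you would load these rows into a `datasets.Dataset`, and with MultipleNegativesRankingLoss the column order matters, not the names: anchor first, positive second, then the negatives):

```python
# One row per (query, positive) pair; the negatives are repeated across rows.
rows = [
    {"query": "q", "positive": "positive1",
     "neg_1": "negative1", "neg_2": "negative2", "neg_3": "negative3",
     "neg_4": "negative4", "neg_5": "negative5"},
    {"query": "q", "positive": "positive2",
     "neg_1": "negative1", "neg_2": "negative2", "neg_3": "negative3",
     "neg_4": "negative4", "neg_5": "negative5"},
]
```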

And then use it together with the NoDuplicatesBatchSampler: https://sbert.net/docs/package_reference/sentence_transformer/sampler.html#sentence_transformers.training_args.BatchSamplers
This batch sampler iterates over all samples in random order and iteratively checks whether each one should be included in the batch it is building. It does so by maintaining a set of all texts that have already been included in the batch. If a candidate sample contains any text that is already in the batch, then that sample is skipped for this batch! It can be considered again for the next batch.
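The idea behind that skip logic can be sketched as follows (an illustrative simplification, not the actual NoDuplicatesBatchSampler implementation):

```python
def build_batch(samples, batch_size):
    """Greedily build one batch, skipping any sample that shares a text
    with a sample already in the batch. Skipped samples are returned so
    they can be reconsidered for the next batch."""
    batch, seen, leftover = [], set(), []
    for sample in samples:
        texts = set(sample)
        if texts & seen:
            # Shares at least one text with the batch -> skip for now.
            leftover.append(sample)
            continue
        batch.append(sample)
        seen |= texts
        if len(batch) == batch_size:
            break
    return batch, leftover

samples = [
    ("q", "positive1", "negative1"),
    ("q", "positive2", "negative1"),  # overlaps with the first sample
    ("q2", "positive3", "negative2"),
]
batch, leftover = build_batch(samples, batch_size=2)
# batch holds samples 1 and 3; sample 2 is left over for the next batch.
```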

If you have enough data and not too much overlap between samples, then you can reasonably expect this batch sampler to create normal batches out of your data, such that a positive is never accidentally treated as an in-batch negative.

  • Tom Aarsen

@rupeshgx rupeshgx closed this as completed Feb 3, 2025