I have a training dataset where a sample record looks like:
query, [positive1, positive2], [negative1, negative2, negative3, negative4, negative5]
where positive1 and positive2 are documents that were clicked, and negative1 through negative5 are documents that were not clicked.
I want to use this dataset to fine-tune an embedding model. If I want to use Multiple Negatives Ranking loss, then I would need to split the above record into multiple records, so that each record contains only one positive document:
record1: query, positive1, [negative1, negative2, negative3, negative4, negative5]
record2: query, positive2, [negative1, negative2, negative3, negative4, negative5]
However, if I understand correctly, if the above 2 records fall into the same batch during training, then positive2 will be treated as a negative for (query, positive1) and positive1 will be treated as a negative for (query, positive2).
Any advice on how to construct a good-quality training dataset? I could throw away record2, but I would like to use that information if possible.
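For concreteness, here is a minimal sketch of the flattening I have in mind (the record layout and field names are just for illustration):

```python
# Hypothetical record layout: one query, its clicked documents, and its
# non-clicked documents.
raw_record = {
    "query": "query",
    "positives": ["positive1", "positive2"],
    "negatives": ["negative1", "negative2", "negative3", "negative4", "negative5"],
}

def flatten(record):
    """Yield one (query, positive, negatives...) record per clicked document."""
    for positive in record["positives"]:
        yield {
            "query": record["query"],
            "positive": positive,
            # Duplicate the full negative list into each flattened record.
            **{f"negative_{i}": neg for i, neg in enumerate(record["negatives"], 1)},
        }

flat_records = list(flatten(raw_record))
# flat_records[0] is record1 (query + positive1), flat_records[1] is record2
# (query + positive2); both share the same five negatives.
```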
You can split each record into one record per positive, exactly as you described, and then use the result together with the NoDuplicatesBatchSampler: https://sbert.net/docs/package_reference/sentence_transformer/sampler.html#sentence_transformers.training_args.BatchSamplers
This batch sampler iterates over all samples in random order and iteratively checks whether each one should be included in the batch it is currently building. It does so by maintaining a set of all texts that have already been added to the batch: if a candidate sample contains any text that is already in the batch, that sample is skipped for this batch and considered again for the next one.
If you have enough data and not too much overlap between samples, you can reasonably expect this batch sampler to build normal batches out of your data, so that a positive is never accidentally treated as an in-batch negative.
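A minimal sketch of wiring this up with the trainer (the model name and the texts are placeholders; the loss consumes the dataset columns in order: anchor, positive, then negatives):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import (
    BatchSamplers,
    SentenceTransformerTrainingArguments,
)

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model

# One record per positive, as described above; texts are placeholders.
train_dataset = Dataset.from_list([
    {"query": "query", "positive": "positive1",
     "negative_1": "negative1", "negative_2": "negative2",
     "negative_3": "negative3", "negative_4": "negative4",
     "negative_5": "negative5"},
    {"query": "query", "positive": "positive2",
     "negative_1": "negative1", "negative_2": "negative2",
     "negative_3": "negative3", "negative_4": "negative4",
     "negative_5": "negative5"},
])

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=32,
    # Skip any sample whose texts already appear in the batch being built,
    # so record1 and record2 (which share the query and negatives) never
    # end up in the same batch.
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```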