You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Multi-process tokenization (Linux only) for the model encode function. Significant speed-up when encoding large sets
Tokenization of datasets for training can now run in parallel (Linux Only)
New example for Quora Duplicate Questions Retrieval: See examples-folder
Many small improvements for training better models for Information Retrieval
Fixed LabelSampler (can be used to get batches with certain number of matching labels. Used for BatchHardTripletLoss). Moved it to DatasetFolder
Added new Evaluators for ParaphraseMining and InformationRetrieval
evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It computes the optimal threshold and measure accuracy
model.encode - When the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors
New function: util.paraphrase_mining to perform paraphrase mining in a corpus. For an example see examples/training_quora_duplicate_questions/
New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/
Breaking Changes
The evaluators (like EmbeddingSimilarityEvaluator) no longer accept a DataLoader as argument. Instead, the sentence and scores are directly passed. Old code that uses the previous evaluators needs to be changed. They can use the class method from_input_examples(). See examples/training_transformers/training_nli.py how to use the new evaluators.