
v0.3.3 - Multi-Process Tokenization and Information Retrieval Improvements

nreimers released this 06 Aug 08:16

New Functions

  • Multi-process tokenization (Linux only) for the model's encode function. This gives a significant speed-up when encoding large sets.
  • Tokenization of datasets for training can now run in parallel (Linux only)
  • New example for Quora Duplicate Questions Retrieval: See examples-folder
  • Many small improvements for training better models for Information Retrieval
  • Fixed LabelSampler (it can be used to get batches with a certain number of matching labels and is used for BatchHardTripletLoss). Moved it to DatasetFolder
  • Added new Evaluators for ParaphraseMining and InformationRetrieval
  • evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It computes the optimal threshold and measures accuracy at that threshold
  • model.encode - When the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors
  • New function: util.paraphrase_mining to perform paraphrase mining in a corpus. For an example see examples/training_quora_duplicate_questions/
  • New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/
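
To illustrate what "computes the optimal threshold" can mean for a binary similarity evaluator, here is a minimal pure-Python sketch. It is not the library's implementation: the function name best_accuracy_threshold and the rule "predict a match when the score is above the threshold" are assumptions made for illustration. The sketch sorts the pairs by similarity score and tries a cut between each pair of neighbouring scores, keeping the cut with the highest accuracy, so it does not depend on the labels being split 50-50.

```python
from typing import List, Tuple


def best_accuracy_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (best_accuracy, threshold) for the rule 'predict 1 if score > threshold'.

    Hypothetical helper for illustration only; labels are 1 (match) or 0 (no match).
    """
    if len(scores) != len(labels) or len(scores) < 2:
        raise ValueError("need at least two (score, label) pairs of equal length")

    # Sort pairs by similarity score, highest first.
    rows = sorted(zip(scores, labels), key=lambda r: r[0], reverse=True)

    positives_above = 0                                   # positives ranked above the cut
    negatives_below = sum(1 for _, lab in rows if lab == 0)  # negatives ranked below the cut
    best_acc, best_threshold = 0.0, -1.0

    # Try a cut between every pair of neighbouring scores.
    for i in range(len(rows) - 1):
        _, label = rows[i]
        if label == 1:
            positives_above += 1
        else:
            negatives_below -= 1
        acc = (positives_above + negatives_below) / len(rows)
        if acc > best_acc:
            best_acc = acc
            best_threshold = (rows[i][0] + rows[i + 1][0]) / 2
    return best_acc, best_threshold


# Toy data with an uneven label split (3 matches, 2 non-matches).
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0]
toy_acc, toy_threshold = best_accuracy_threshold(toy_scores, toy_labels)
```

Ties between equal scores and the degenerate case where every pair is a non-match are not handled specially in this sketch.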

Breaking Changes

  • The evaluators (like EmbeddingSimilarityEvaluator) no longer accept a DataLoader as an argument. Instead, the sentences and scores are passed directly. Old code that uses the previous evaluators needs to be changed; it can use the class method from_input_examples(). See examples/training_transformers/training_nli.py for how to use the new evaluators.
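
As a rough illustration of the new data shape, the sketch below converts a list of labeled sentence pairs into the parallel lists of sentences and scores that an evaluator would now receive directly. PairExample and to_parallel_lists are hypothetical stand-ins written for this example, not classes or functions from the library, and the exact argument layout of the real evaluators is an assumption. In real code, the from_input_examples() class method mentioned above is the intended entry point.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PairExample:
    """Hypothetical stand-in for a labeled sentence pair (not the library's class)."""
    texts: Tuple[str, str]
    label: float


def to_parallel_lists(examples: List[PairExample]) -> Tuple[List[str], List[str], List[float]]:
    """Split labeled pairs into first sentences, second sentences, and gold scores."""
    sentences1 = [ex.texts[0] for ex in examples]
    sentences2 = [ex.texts[1] for ex in examples]
    scores = [ex.label for ex in examples]
    return sentences1, sentences2, scores


dev_examples = [
    PairExample(texts=("A man is eating.", "A man eats food."), label=0.9),
    PairExample(texts=("A plane is taking off.", "A dog is barking."), label=0.1),
]

# Old style (no longer supported per the note above): wrap the data in a DataLoader
# and hand that to the evaluator.
# New style: sentences and scores are passed directly, for example via
#   EmbeddingSimilarityEvaluator.from_input_examples(...)
sentences1, sentences2, scores = to_parallel_lists(dev_examples)
```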