
Releases: UKPLab/sentence-transformers

v0.4.0 - Upgrade Transformers Version

22 Dec 13:42
  • Updated the dependencies so that Sentence-Transformers works with Huggingface Transformers version 4. It still works with transformers version 3, but upgrading to version 4 is recommended; future changes might break compatibility with version 3.
  • New naming scheme for pre-trained models: models are now named {task}-{transformer_model}, so 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models remain available under their old names, but newer models will follow the updated naming scheme (see the loading example after this list).
  • New application examples for information retrieval and question answering retrieval, together with respective pre-trained models.
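
A minimal sketch of loading a model under the new naming scheme (the model name follows the notes above; the old name still works):

```python
from sentence_transformers import SentenceTransformer

# 'stsb-bert-base' is the new name of 'bert-base-nli-stsb-mean-tokens'
model = SentenceTransformer("stsb-bert-base")
embeddings = model.encode(["This is an example sentence."])
```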

v0.3.9 - Small updates

18 Nov 08:25

This release only includes some smaller updates:

  • The code was tested with transformers 3.5.1; the requirement was updated so that it works with transformers 3.5.1
  • As some parts and models require PyTorch >= 1.6.0, the requirement was updated to at least PyTorch 1.6.0. Most of the code and models still work with older PyTorch versions.
  • model.encode() stored the embeddings on the GPU, which required quite a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to CPU once they are computed.
  • The CrossEncoder class now accepts a max_length parameter to control the truncation of inputs
  • The CrossEncoder predict method now has an apply_softmax parameter that applies softmax on top of a multi-class output (see the sketch after this list)
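
A minimal sketch of the two new parameters (the model name here is only a placeholder):

```python
from sentence_transformers.cross_encoder import CrossEncoder

# max_length controls how input pairs are truncated
model = CrossEncoder("path-or-name-of-a-cross-encoder-model", num_labels=3, max_length=128)

# apply_softmax=True turns the multi-class logits into probabilities
scores = model.predict([("A man is eating food.", "A man eats something.")], apply_softmax=True)
```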

v0.3.8 - CrossEncoder, Data Augmentation, new Models

19 Oct 14:23
  • Added support for training and using CrossEncoder models
  • Added the data augmentation method AugSBERT
  • New models trained on large-scale paraphrase data: distilroberta-base-paraphrase-v1 and xlm-r-distilroberta-base-paraphrase-v1. They perform much better on our internal benchmark than previous models
  • New model for Information Retrieval trained on MS MARCO: distilroberta-base-msmarco-v1
  • Improved MultipleNegativesRankingLoss: the similarity function can now be changed and defaults to cosine similarity (previously dot product); similarity scores can additionally be multiplied by a scaling factor. This allows the usage of NTXentLoss / InfoNCE loss (see the sketch after this list)
  • New MegaBatchMarginLoss, inspired by the ParaNMT paper
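
A sketch of the changed loss, assuming the scale and similarity_fct keyword arguments; the model name is one of the paraphrase models listed above:

```python
from sentence_transformers import SentenceTransformer, util, losses

model = SentenceTransformer("distilroberta-base-paraphrase-v1")

# Cosine similarity (the new default) multiplied by a scaling factor
# gives the NTXentLoss / InfoNCE formulation
train_loss = losses.MultipleNegativesRankingLoss(
    model, scale=20.0, similarity_fct=util.pytorch_cos_sim
)
```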

Smaller changes:

  • Updated InformationRetrievalEvaluator so that it works with large corpora (millions of entries). Removed the query_chunk_size parameter from the evaluator
  • The SentenceTransformer.encode method now detaches tensors from the compute graph
  • SentenceTransformer.fit(): the output_path_ignore_not_empty parameter is deprecated; the target folder is no longer required to be empty

v0.3.7 - Upgrade transformers, Model Distillation Example, Multi-Input to Transformers Model

29 Sep 20:17
  • Upgraded the transformers dependency; transformers 3.1.0, 3.2.0, and 3.3.1 are supported
  • Added example code for model distillation: sentence embedding models can be drastically reduced to, e.g., only 2-4 layers while keeping 98+% of their performance. Code can be found in examples/training/distillation
  • Transformer models can now accept two inputs ['sentence 1', 'context for sent1'], which are encoded as the two inputs for BERT

Minor changes:

  • Tokenization in the multi-process encoding setup now happens in the child processes, not in the parent process.
  • Added models.Normalize() to normalize embeddings to unit length (see the sketch below)
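
A minimal sketch of building a model with the new module (the transformer model name is illustrative):

```python
from sentence_transformers import SentenceTransformer, models

# Embeddings produced by this model are normalized to unit length
word_embedding_model = models.Transformer("distilroberta-base")
pooling = models.Pooling(word_embedding_model.get_word_embedding_dimension())
normalize = models.Normalize()
model = SentenceTransformer(modules=[word_embedding_model, pooling, normalize])
```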

v0.3.6 - Update transformers to v3.1.0

11 Sep 08:06

Huggingface Transformers version 3.1.0 introduced a breaking change compared to the previous version 3.0.2.

This release fixes the issue so that Sentence-Transformers is compatible with Huggingface Transformers 3.1.0. Note that this and future versions will not be compatible with transformers < 3.1.0.

v0.3.5 - Automatic Mixed Precision & Bugfixes

01 Sep 13:09
  • The old FP16 training code in model.fit() was replaced with PyTorch 1.6.0 automatic mixed precision (AMP). Setting model.fit(use_amp=True) enables AMP. On suitable GPUs, this leads to a significant speed-up while requiring less memory (see the sketch after this list).
  • Performance improvements in paraphrase mining & semantic search by replacing np.argpartition with torch.topk
  • If a sentence-transformers model is not found, the library falls back to the Huggingface Transformers repository and creates the model with mean pooling.
  • The Huggingface transformers dependency is pinned to version 3.0.2. The next release will be compatible with Huggingface transformers 3.1.0
  • Several bugfixes: downloading of files, multi-GPU encoding
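
A minimal training sketch with AMP enabled (model name, data, and loss are illustrative):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses

model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")
train_examples = [InputExample(texts=["First sentence", "Second sentence"], label=0.8)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
train_loss = losses.CosineSimilarityLoss(model)

# use_amp=True enables PyTorch >= 1.6 automatic mixed precision
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, use_amp=True)
```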

v0.3.4 - Improved Documentation, Improved Tokenization Speed, Multi-GPU encoding

24 Aug 16:24
  • The documentation has been substantially improved and can be found at www.SBERT.net - feedback is welcome
  • The dataset that holds training InputExamples (dataset.SentencesDataset) now uses lazy tokenization, i.e., examples are tokenized when they are needed for a batch. If you set num_workers to a positive integer in your DataLoader, tokenization will happen in the background. This substantially reduces the start-up time for training.
  • model.encode() also uses a PyTorch Dataset + DataLoader. If you set num_workers to a positive integer, tokenization will happen in the background, leading to faster encoding speed for large corpora.
  • Added functions and an example for multi-GPU encoding - this can be used to encode a corpus with multiple GPUs in parallel (see the sketch after this list). No multi-GPU support for training yet.
  • Removed the parallel_tokenization parameter from encode & SentencesDataset - no longer needed with lazy tokenization and DataLoader workers.
  • Smaller bugfixes
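
A sketch of multi-GPU encoding, assuming the start_multi_process_pool / encode_multi_process helpers shown in the examples folder:

```python
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")
    sentences = ["Sentence {}".format(i) for i in range(100000)]

    # One worker process per available GPU
    pool = model.start_multi_process_pool()
    embeddings = model.encode_multi_process(sentences, pool)
    model.stop_multi_process_pool(pool)
```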

Breaking changes:

  • Renamed evaluation.BinaryEmbeddingSimilarityEvaluator to evaluation.BinaryClassificationEvaluator

v0.3.3 - Multi-Process Tokenization and Information Retrieval Improvements

06 Aug 08:16

New Functions

  • Multi-process tokenization (Linux only) for the model encode function. Significant speed-up when encoding large sets
  • Tokenization of datasets for training can now run in parallel (Linux Only)
  • New example for Quora Duplicate Questions Retrieval: See examples-folder
  • Many small improvements for training better models for Information Retrieval
  • Fixed LabelSampler (can be used to get batches with a certain number of matching labels; used for BatchHardTripletLoss). Moved it to DatasetFolder
  • Added new Evaluators for ParaphraseMining and InformationRetrieval
  • evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It computes the optimal threshold and measures accuracy
  • model.encode - When the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors
  • New function: util.paraphrase_mining to perform paraphrase mining in a corpus (see the sketch after this list). For an example see examples/training_quora_duplicate_questions/
  • New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/
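
A minimal sketch of the new paraphrase mining helper (the model name is illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
]

# Returns (score, index1, index2) triples for the most similar sentence pairs
pairs = util.paraphrase_mining(model, sentences)
for score, i, j in pairs:
    print(f"{score:.3f}\t{sentences[i]}\t{sentences[j]}")
```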

Breaking Changes

  • The evaluators (like EmbeddingSimilarityEvaluator) no longer accept a DataLoader as argument. Instead, the sentences and scores are passed directly. Old code that uses the previous evaluators needs to be changed; it can use the class method from_input_examples() (see the sketch below). See examples/training_transformers/training_nli.py for how to use the new evaluators.
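
A minimal sketch of the new evaluator construction (names and data are illustrative):

```python
from sentence_transformers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

dev_examples = [
    InputExample(texts=["A man is eating food.", "A man eats something."], label=0.9),
    InputExample(texts=["A man is eating food.", "The girl carries a baby."], label=0.1),
]

# Sentences and scores are passed directly instead of a DataLoader
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_examples, name="sts-dev")
```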

v0.3.2 - Lazy tokenization for Parallel Sentence Training & Improved Semantic Search

23 Jul 15:03

This is a minor release. There should be no breaking changes.

  • ParallelSentencesDataset: Datasets are tokenized on-the-fly, saving some start-up time
  • New method util.pytorch_cos_sim to compute cosine similarity with PyTorch. About 100 times faster than scipy cdist. The semantic_search.py example has been updated accordingly.
  • SentenceTransformer.encode: new parameter convert_to_tensor. If set to True, encode returns one large PyTorch tensor with your embeddings (see the sketch after this list)
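
A minimal sketch combining the two additions (the model name is illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")

# convert_to_tensor=True returns one stacked PyTorch tensor
embeddings = model.encode(
    ["A man is eating food.", "A man eats something."], convert_to_tensor=True
)

# Cosine similarity computed with PyTorch instead of scipy cdist
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)
```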

v0.3.1 - Updates on Multilingual Training

22 Jul 13:54

This is a minor update that changes some classes for training & evaluating multilingual sentence embedding methods.

The examples for training multi-lingual sentence embeddings models have been significantly extended. See docs/training/multilingual-models.md for details. An automatic script that downloads suitable data and extends sentence embeddings to multiple languages has been added.

The following classes/files have been changed:

  • datasets/ParallelSentencesDataset.py: The dataset with parallel sentences is encoded on-the-fly, reducing the start-up time for extending a sentence embedding model to new languages. An embedding cache can be configured to store previously computed sentence embeddings during training (see the sketch below).
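
A rough sketch of the on-the-fly dataset (student model, teacher model, and file path are illustrative; the full setup is in docs/training/multilingual-models.md):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.datasets import ParallelSentencesDataset

# English teacher model and a multilingual student model
teacher_model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")
word_embedding_model = models.Transformer("xlm-roberta-base")
pooling = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling])

# Parallel (source \t translation) sentences are encoded on-the-fly during training
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data("parallel-sentences.tsv.gz")  # hypothetical file with tab-separated sentence pairs
train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)
```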

New evaluation files:

  • evaluation/MSEEvaluator.py - breaking change. This class now expects lists of strings with parallel (translated) sentences. The old class has been renamed to MSEEvaluatorFromDataLoader.py
  • evaluation/EmbeddingSimilarityEvaluatorFromList.py - Semantic Textual Similarity data can be passed as lists of strings & scores
  • evaluation/MSEEvaluatorFromDataFrame.py - MSE Evaluation of teacher and student embeddings based on data in a data frame
  • evaluation/MSEEvaluatorFromDataLoader.py - MSE Evaluation if data is passed as a data loader

Bugfixes:

  • model.encode() did not correctly sort sentences by length. This has been fixed, boosting encoding speed by reducing the overhead of padding tokens.