Releases: UKPLab/sentence-transformers
v0.4.0 - Upgrade Transformers Version
- Updated the dependencies so that it works with Huggingface Transformers version 4. Sentence-Transformers still works with huggingface transformers version 3, but an update to version 4 of transformers is recommended. Future changes might break with transformers version 3.
- New naming of pre-trained models. Models will be named: {task}-{transformer_model}. So 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models will still be available under their old names, but newer models will follow the updated naming scheme.
- New application examples for information retrieval and question-answering retrieval, together with the respective pre-trained models.
v0.3.9 - Small updates
This release only includes some smaller updates:
- Code was tested with transformers 3.5.1, requirement was updated so that it works with transformers 3.5.1
- As some parts and models require Pytorch >= 1.6.0, requirement was updated to require at least pytorch 1.6.0. Most of the code and models will work with older pytorch versions.
- model.encode() stored the embeddings on the GPU, which required quite a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to CPU once they are computed.
- The CrossEncoder-Class now accepts a max_length parameter to control the truncation of inputs
- The CrossEncoder predict method now has an apply_softmax parameter, which applies softmax on top of a multi-class output.
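To illustrate what applying softmax on top of a multi-class output means, here is a minimal pure-Python sketch. The helper name `softmax` and the example logits are illustrative assumptions; this is not the library's implementation.

```python
import math

def softmax(logits):
    """Convert raw class scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the result.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one sentence pair over three classes.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
print(probs)
```

The class with the highest logit keeps the highest probability; softmax only rescales the scores so they can be read as a distribution.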
v0.3.8 - CrossEncoder, Data Augmentation, new Models
- Added support for training and using CrossEncoder
- Data Augmentation method AugSBERT added
- New models trained on large-scale paraphrase data: distilroberta-base-paraphrase-v1 and xlm-r-distilroberta-base-paraphrase-v1. On an internal benchmark, they perform much better than previous models.
- New model for Information Retrieval trained on MS Marco: distilroberta-base-msmarco-v1
- Improved MultipleNegativesRankingLoss loss function: The similarity function can be changed and is now cosine similarity by default (it was dot product before). In addition, similarity scores can be multiplied by a scaling factor. This allows the loss to be used as NTXentLoss / InfoNCE loss.
- New MegaBatchMarginLoss, inspired by the ParaNMT paper.
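The idea behind a ranking loss with in-batch negatives, scaled cosine similarity, and cross-entropy can be sketched in plain Python as follows. The function names and the `scale=20.0` value are assumptions chosen for illustration; this is not the library's code.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and all other positives[j] act as in-batch negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor to every candidate in the batch.
        scores = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(scores)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        # Cross-entropy with the matching candidate as the target class.
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
aligned = multiple_negatives_ranking_loss(anchors, positives)
swapped = multiple_negatives_ranking_loss(anchors, positives[::-1])
print(aligned, swapped)
```

With correctly aligned pairs the loss is close to zero; swapping the positives makes every anchor prefer the wrong candidate and the loss grows accordingly.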
Smaller changes:
- Update InformationRetrievalEvaluator, so that it can work with large corpora (Millions of entries). Removed the query_chunk_size parameter from the evaluator
- SentenceTransformer.encode method detaches tensors from compute graph
- SentenceTransformer.fit() method - Parameter output_path_ignore_not_empty deprecated. No longer checks that target folder must be empty
v0.3.7 - Upgrade transformers, Model Distillation Example, Multi-Input to Transformers Model
- Upgrade transformers dependency, transformers 3.1.0, 3.2.0 and 3.3.1 are working
- Added example code for model distillation: Sentence Embeddings models can be drastically reduced to e.g. only 2-4 layers while keeping 98+% of their performance. Code can be found in examples/training/distillation
- Transformer models can now accept two inputs, e.g. ['sentence 1', 'context for sent1'], which are encoded as the two inputs for BERT.
Minor changes:
- Tokenization in the multi-processes encoding setup now happens in the child processes, not in the parent process.
- Added models.Normalize() to allow the normalization of embeddings to unit length
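What normalizing an embedding to unit length means can be sketched without any dependencies. The `normalize` helper below is a hypothetical stand-in written for illustration, not the `models.Normalize()` module itself.

```python
import math

def normalize(embedding):
    """Scale a vector so that its Euclidean (L2) norm becomes 1."""
    norm = math.sqrt(sum(x * x for x in embedding))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged.
        return list(embedding)
    return [x / norm for x in embedding]

unit = normalize([3.0, 4.0])
length = math.sqrt(sum(x * x for x in unit))
print(unit, length)
```

One practical consequence: for unit-length vectors, the dot product and the cosine similarity give the same value.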
v0.3.6 - Update transformers to v3.1.0
Huggingface Transformers version 3.1.0 introduced a breaking change compared to the previous version 3.0.2.
This release fixes the issue so that Sentence-Transformers is compatible with Huggingface Transformers 3.1.0. Note that this and future versions will not be compatible with transformers < 3.1.0.
v0.3.5 - Automatic Mixed Precision & Bugfixes
- The old FP16 training code in model.fit() was replaced by PyTorch 1.6.0 automatic mixed precision (AMP). When setting model.fit(use_amp=True), AMP will be used. On suitable GPUs, this leads to a significant speed-up while requiring less memory.
- Performance improvements in paraphrase mining & semantic search by replacing np.argpartition with torch.topk
- If a sentence-transformer model is not found, it will fall back to huggingface transformers repository and create it with mean pooling.
- Fixing huggingface transformers to version 3.0.2. Next release will make it compatible with huggingface transformers 3.1.0
- Several bugfixes: Downloading of files, multi-GPU encoding
v0.3.4 - Improved Documentation, Improved Tokenization Speed, Multi-GPU encoding
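Both np.argpartition and torch.topk pick the k highest scores without fully sorting all candidates. The same selection idea is sketched below with the standard library's `heapq.nlargest` as a stand-in (a swapped-in technique, not what the library uses); the function name `top_k_hits` and the scores are made up for illustration.

```python
import heapq

def top_k_hits(scores, k):
    """Return the k highest (index, score) pairs, best first,
    without sorting the full list of scores."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

# Hypothetical similarity scores of one query against five corpus entries.
scores = [0.12, 0.87, 0.45, 0.91, 0.33]
hits = top_k_hits(scores, k=2)
print(hits)
```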
- The documentation is substantially improved and can be found at: www.SBERT.net - Feedback welcome
- The dataset to hold training InputExamples (dataset.SentencesDataset) now uses lazy tokenization, i.e., examples are tokenized once they are needed for a batch. If you set num_workers to a positive integer in your DataLoader, tokenization will happen in a background thread. This substantially reduces the start-up time for training.
- model.encode() also uses a PyTorch Dataset + DataLoader. If you set num_workers to a positive integer, tokenization will happen in the background, leading to faster encoding speed for large corpora.
- Added functions and an example for multi-GPU encoding. This method can be used to encode a corpus with multiple GPUs in parallel. No multi-GPU support for training yet.
- Removed parallel_tokenization parameters from encode & SentencesDatasets - No longer needed with lazy tokenization and DataLoader worker threads.
- Smaller bugfixes
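Lazy tokenization, as described above, can be sketched with a small dataset-like class that tokenizes an example only when it is requested. The class name `LazyTokenizedDataset`, the call counter, and the whitespace tokenizer are assumptions made for this sketch; the library's SentencesDataset is not reproduced here.

```python
class LazyTokenizedDataset:
    """Stores raw sentences and tokenizes an item only when it is requested."""

    def __init__(self, sentences, tokenize):
        self.sentences = sentences
        self.tokenize = tokenize
        self.tokenize_calls = 0  # counts how often tokenization actually ran

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, index):
        # Tokenization happens here, at access time, not in __init__.
        self.tokenize_calls += 1
        return self.tokenize(self.sentences[index])

# str.split stands in for a real tokenizer.
dataset = LazyTokenizedDataset(["A first sentence", "Another one"], tokenize=str.split)
first = dataset[0]
print(first, dataset.tokenize_calls)
```

Because construction does no tokenization, start-up is cheap. An object with `__len__` and `__getitem__` like this one matches the map-style dataset interface that a PyTorch DataLoader can iterate over, in which case item access would run in its workers.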
Breaking changes:
- Renamed evaluation.BinaryEmbeddingSimilarityEvaluator to evaluation.BinaryClassificationEvaluator
v0.3.3 - Multi-Process Tokenization and Information Retrieval Improvements
New Functions
- Multi-process tokenization (Linux only) for the model encode function. Significant speed-up when encoding large sets
- Tokenization of datasets for training can now run in parallel (Linux Only)
- New example for Quora Duplicate Questions Retrieval: See examples-folder
- Many small improvements for training better models for Information Retrieval
- Fixed LabelSampler (can be used to get batches with certain number of matching labels. Used for BatchHardTripletLoss). Moved it to DatasetFolder
- Added new Evaluators for ParaphraseMining and InformationRetrieval
- evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It computes the optimal threshold and measures the accuracy
- model.encode - When the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors
- New function: util.paraphrase_mining to perform paraphrase mining in a corpus. For an example see examples/training_quora_duplicate_questions/
- New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/
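The optimal-threshold computation mentioned above for the binary evaluator can be sketched as a scan over the sorted similarity scores. The function name and the convention "score >= threshold means positive" are assumptions for this illustration and may differ from what the evaluator does internally.

```python
def best_accuracy_threshold(scores, labels):
    """Return (accuracy, threshold) maximizing accuracy when pairs with
    score >= threshold are predicted as positive (label 1)."""
    pairs = sorted(zip(scores, labels), reverse=True)
    total_negatives = sum(1 for label in labels if label == 0)
    # Baseline: predict every pair as negative.
    best_acc, best_threshold = total_negatives / len(pairs), float("inf")
    positives_seen = 0
    negatives_seen = 0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            positives_seen += 1
        else:
            negatives_seen += 1
        # Pairs with identical scores must fall on the same side of the threshold.
        if i + 1 < len(pairs) and pairs[i + 1][0] == score:
            continue
        # Everything seen so far is predicted positive, the rest negative.
        correct = positives_seen + (total_negatives - negatives_seen)
        acc = correct / len(pairs)
        if acc > best_acc:
            best_acc, best_threshold = acc, score
    return best_acc, best_threshold

# Hypothetical data with an uneven split: three positives, two negatives.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0]
acc, threshold = best_accuracy_threshold(scores, labels)
print(acc, threshold)
```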
Breaking Changes
- The evaluators (like EmbeddingSimilarityEvaluator) no longer accept a DataLoader as argument. Instead, the sentences and scores are passed directly. Old code that uses the previous evaluators needs to be changed; it can use the class method from_input_examples(). See examples/training_transformers/training_nli.py for how to use the new evaluators.
v0.3.2 - Lazy tokenization for Parallel Sentence Training & Improved Semantic Search
This is a minor release. There should be no breaking changes.
- ParallelSentencesDataset: Datasets are tokenized on-the-fly, saving some start-up time
- util.pytorch_cos_sim: New method to compute cosine similarity with PyTorch. About 100 times faster than scipy cdist. The semantic_search.py example has been updated accordingly.
- SentenceTransformer.encode: New parameter: convert_to_tensor. If set to true, encode returns one large pytorch tensor with your embeddings
v0.3.1 - Updates on Multilingual Training
This is a minor update that changes some classes for training & evaluating multilingual sentence embedding methods.
The examples for training multi-lingual sentence embeddings models have been significantly extended. See docs/training/multilingual-models.md for details. An automatic script that downloads suitable data and extends sentence embeddings to multiple languages has been added.
The following classes/files have been changed:
- datasets/ParallelSentencesDataset.py: The dataset with parallel sentences is encoded on-the-fly, reducing the start-up time for extending a sentence embedding model to new languages. An embedding cache can be configured to store previously computed sentence embeddings during training.
New evaluation files:
- evaluation/MSEEvaluator.py - breaking change. Now, this class expects lists of strings with parallel (translated) sentences. The old class has been renamed to MSEEvaluatorFromDataLoader.py
- evaluation/EmbeddingSimilarityEvaluatorFromList.py - Semantic Textual Similarity data can be passed as lists of strings & scores
- evaluation/MSEEvaluatorFromDataFrame.py - MSE Evaluation of teacher and student embeddings based on data in a data frame
- evaluation/MSEEvaluatorFromDataLoader.py - MSE Evaluation if data is passed as a data loader
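The quantity these MSE evaluators report can be sketched as the mean squared difference between teacher and student embeddings. The function name and the toy embeddings are assumptions made for illustration; the evaluator classes are not reproduced here.

```python
def mean_squared_error(teacher_embeddings, student_embeddings):
    """Average squared difference over all dimensions of all embedding pairs."""
    total = 0.0
    count = 0
    for t_vec, s_vec in zip(teacher_embeddings, student_embeddings):
        for t, s in zip(t_vec, s_vec):
            total += (t - s) ** 2
            count += 1
    return total / count

# Hypothetical embeddings: the student matches the teacher on the first
# sentence and misses one dimension on the second.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[1.0, 0.0], [0.0, 0.0]]
mse = mean_squared_error(teacher, student)
print(mse)
```

A lower value means the student reproduces the teacher's embeddings more closely.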
Bugfixes:
- model.encode() failed to sort sentences by length. This has been fixed, which boosts encoding speed by reducing the overhead of padding tokens.
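Why sorting by length helps: sentences of similar length end up in the same batch, so fewer padding tokens are needed, and the results are then put back into the original order. A minimal sketch of that pattern follows; `encode_sorted_by_length` and `fake_encode_batch` are hypothetical names, and upper-casing stands in for the real encoder.

```python
def encode_sorted_by_length(sentences, encode_batch, batch_size=2):
    """Encode sentences in length-sorted batches and return the results
    in the original input order."""
    # Indices of the sentences, shortest first.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    results = [None] * len(sentences)
    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        outputs = encode_batch([sentences[i] for i in batch_idx])
        # Write each output back to the position of its source sentence.
        for i, out in zip(batch_idx, outputs):
            results[i] = out
    return results

batches = []  # records which sentences were grouped together

def fake_encode_batch(batch):
    batches.append(list(batch))
    return [s.upper() for s in batch]

sentences = ["a much longer sentence", "hi", "medium one", "yo"]
encoded = encode_sorted_by_length(sentences, fake_encode_batch)
print(batches)
print(encoded)
```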