
v0.3.2 - Lazy tokenization for Parallel Sentence Training & Improved Semantic Search

@nreimers released this 23 Jul 15:03

This is a minor release. There should be no breaking changes.

  • ParallelSentencesDataset: Datasets are now tokenized on-the-fly instead of at load time, reducing start-up time (see the training sketch after this list).
  • util.pytorch_cos_sim: New utility method that computes cosine similarity with PyTorch, about 100 times faster than scipy's cdist. The semantic_search.py example has been updated to use it (see the search sketch after this list).
  • SentenceTransformer.encode: New parameter convert_to_tensor. If set to True, encode returns one large PyTorch tensor containing your embeddings.
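
A minimal training sketch showing where the lazy tokenization applies. The model names and file path are placeholders, not part of this release:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, ParallelSentencesDataset, losses

# Illustrative teacher/student pair; substitute your own models
teacher_model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
student_model = SentenceTransformer('distiluse-base-multilingual-cased')

train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
# With this release, sentences are tokenized per batch as training runs,
# not all at once here when the data is loaded
train_data.load_data('parallel-sentences.tsv.gz')  # placeholder path to a tab-separated parallel corpus

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student_model)
```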
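
And a sketch of the updated semantic-search flow, combining convert_to_tensor with util.pytorch_cos_sim. The model name and sentences are illustrative:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

corpus = ['A man is eating food.', 'A monkey is playing drums.', 'Someone is riding a horse.']
queries = ['A person eats pasta.']

# convert_to_tensor=True returns one stacked PyTorch tensor
# instead of a list of numpy arrays
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)

# Cosine-similarity matrix of shape (len(queries), len(corpus)),
# computed entirely in PyTorch
cos_scores = util.pytorch_cos_sim(query_embeddings, corpus_embeddings)

# Top matches for the first (and only) query
top_results = torch.topk(cos_scores[0], k=2)
for score, idx in zip(top_results.values, top_results.indices):
    print(corpus[int(idx)], float(score))
```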