Merge branch 'main' into update_docs
Samoed authored Jan 26, 2025
2 parents 6be88c9 + 7e7571e commit d9a5d99
Showing 49 changed files with 1,932 additions and 196 deletions.
13 changes: 10 additions & 3 deletions .github/pull_request_template.md
```diff
@@ -4,11 +4,18 @@
 <!-- add additional description, question etc. related to the new dataset -->
 
 
-## Checklist
+### Code Quality
 <!-- Please do not delete this -->
+- [ ] **Code Formatted**: Format the code using `make lint` to maintain consistent style.
 
-- [ ] Run tests locally to make sure nothing is broken using `make test`.
-- [ ] Run the formatter to format the code using `make lint`.
+### Documentation
+<!-- Please do not delete this -->
+- [ ] **Updated Documentation**: Add or update documentation to reflect the changes introduced in this PR.
+
+### Testing
+<!-- Please do not delete this -->
+- [ ] **New Tests Added**: Write tests to cover new functionality. Validate with `make test-with-coverage`.
+- [ ] **Tests Passed**: Run tests locally using `make test` or `make test-with-coverage` to ensure no existing functionality is broken.
 
 
 ### Adding datasets checklist
```
1 change: 1 addition & 0 deletions README.md
```diff
@@ -517,5 +517,6 @@ You may also want to read and cite the amazing work that has extended MTEB & int
 - Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini. "[FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions](https://arxiv.org/abs/2403.15246)" arXiv 2024
 - Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li. "[LongEmbed: Extending Embedding Models for Long Context Retrieval](https://arxiv.org/abs/2404.12096)" arXiv 2024
 - Kenneth Enevoldsen, Márton Kardos, Niklas Muennighoff, Kristoffer Laigaard Nielbo. "[The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding](https://arxiv.org/abs/2406.02396)" arXiv 2024
+- Ali Shiraee Kasmaee, Mohammad Khodadad, Mohammad Arshi Saloot, Nick Sherck, Stephen Dokas, Hamidreza Mahyar, Soheila Samiee. "[ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain](https://arxiv.org/abs/2412.00532)" arXiv 2024
 
 For works that have used MTEB for benchmarking, you can find them on the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
```
17 changes: 17 additions & 0 deletions docs/adding_a_model.md
@@ -102,6 +102,7 @@ Internally, `mteb` uses `query` for encoding the queries and `passage` as the pr

You can directly add the prompts when saving and uploading your model to the Hub. For an example, refer to this [configuration file](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5/blob/3b5a16eaf17e47bd997da998988dce5877a57092/config_sentence_transformers.json). These prompts can then be specified in the ModelMeta object.


```python
model = ModelMeta(
loader=partial( # type: ignore
        # ... unchanged lines collapsed in the diff (@@ -115,3 +116,19 @@)
),
)
```
If you are unable to directly add the prompts in the model configuration, you can instantiate the model using the `sentence_transformers_loader` and pass `prompts` as an argument. For more details, see the `mteb/models/bge_models.py` file.
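
As a minimal sketch of that pattern (the import path and the `model_prompts` keyword are assumed from the convention in `mteb/models/bge_models.py`; the model name, revision, and prompt text are placeholders):

```python
from functools import partial

from mteb.model_meta import ModelMeta
from mteb.models.sentence_transformer_wrapper import sentence_transformers_loader

model = ModelMeta(
    loader=partial(
        sentence_transformers_loader,
        model_name="your-org/your-model",  # placeholder
        revision="main",  # placeholder
        # prompt names follow mteb's conventions, e.g. "query" and "passage"
        model_prompts={"query": "Represent this sentence for searching relevant passages: "},
    ),
    ...
)
```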

##### Adding instruction models

Models that use instructions can use the [`InstructSentenceTransformerWrapper`](../mteb/models/instruct_wrapper.py). For example:
```python
model = ModelMeta(
loader=partial(
InstructSentenceTransformerWrapper,
model="nvidia/NV-Embed-v1",
revision="7604d305b621f14095a1aa23d351674c2859553a",
instruction_template="Instruct: {instruction}\nQuery: ",
),
...
)
```
33 changes: 22 additions & 11 deletions docs/benchmarks.md

Large diffs are not rendered by default.

200 changes: 100 additions & 100 deletions docs/mmteb/points_table.md

Large diffs are not rendered by default.

55 changes: 41 additions & 14 deletions docs/tasks.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions mteb/abstasks/TaskMetadata.py
```diff
@@ -70,6 +70,7 @@
     "Web",
     "Written",
     "Programming",
+    "Chemistry",
 ]
 
 SAMPLE_CREATION_METHOD = Literal[
```
44 changes: 44 additions & 0 deletions mteb/benchmarks/benchmarks.py
@@ -1232,3 +1232,47 @@ def load_results(
primaryClass={cs.CL}
}""",
)

CHEMTEB = Benchmark(
name="ChemTEB",
tasks=get_tasks(
tasks=[
"PubChemSMILESBitextMining",
"SDSEyeProtectionClassification",
"SDSGlovesClassification",
"WikipediaBioMetChemClassification",
"WikipediaGreenhouseEnantiopureClassification",
"WikipediaSolidStateColloidalClassification",
"WikipediaOrganicInorganicClassification",
"WikipediaCryobiologySeparationClassification",
"WikipediaChemistryTopicsClassification",
"WikipediaTheoreticalAppliedClassification",
"WikipediaChemFieldsClassification",
"WikipediaLuminescenceClassification",
"WikipediaIsotopesFissionClassification",
"WikipediaSaltsSemiconductorsClassification",
"WikipediaBiolumNeurochemClassification",
"WikipediaCrystallographyAnalyticalClassification",
"WikipediaCompChemSpectroscopyClassification",
"WikipediaChemEngSpecialtiesClassification",
"WikipediaChemistryTopicsClustering",
"WikipediaSpecialtiesInChemistryClustering",
"PubChemAISentenceParaphrasePC",
"PubChemSMILESPC",
"PubChemSynonymPC",
"PubChemWikiParagraphsPC",
"PubChemWikiPairClassification",
"ChemNQRetrieval",
"ChemHotpotQARetrieval",
],
),
description="ChemTEB evaluates the performance of text embedding models on chemical domain data.",
reference="https://arxiv.org/abs/2412.00532",
citation="""
@article{kasmaee2024chemteb,
title={ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance \& Efficiency on a Specific Domain},
author={Kasmaee, Ali Shiraee and Khodadad, Mohammad and Saloot, Mohammad Arshi and Sherck, Nick and Dokas, Stephen and Mahyar, Hamidreza and Samiee, Soheila},
journal={arXiv preprint arXiv:2412.00532},
year={2024}
}""",
)
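
A short usage sketch for this benchmark (assuming the standard `mteb` API; the model below is only an illustrative choice):

```python
import mteb

benchmark = mteb.get_benchmark("ChemTEB")
evaluation = mteb.MTEB(tasks=benchmark)
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")  # any compatible encoder
results = evaluation.run(model)
```
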
264 changes: 264 additions & 0 deletions mteb/models/bedrock_models.py
@@ -0,0 +1,264 @@
from __future__ import annotations

import json
import logging
import re
from functools import partial
from typing import Any

import numpy as np
import tqdm

from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
from mteb.models.cohere_models import model_prompts as cohere_model_prompts
from mteb.models.cohere_models import supported_languages as cohere_supported_languages
from mteb.requires_package import requires_package

from .wrapper import Wrapper

logger = logging.getLogger(__name__)


class BedrockWrapper(Wrapper):
def __init__(
self,
model_id: str,
provider: str,
max_tokens: int,
model_prompts: dict[str, str] | None = None,
**kwargs,
) -> None:
requires_package(self, "boto3", "The AWS SDK for Python")
import boto3

boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
        self._client = boto3.client("bedrock-runtime", region_name=region_name)

self._model_id = model_id
self._provider = provider.lower()

if self._provider == "cohere":
self.model_prompts = (
self.validate_task_to_prompt_name(model_prompts)
if model_prompts
else None
)
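            # Cohere embed calls are batched (96 texts max per request) and
            # inputs are capped by character count, assuming ~4 chars/token.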
self._max_batch_size = 96
self._max_sequence_length = max_tokens * 4
else:
self._max_tokens = max_tokens

def encode(
self,
sentences: list[str],
*,
task_name: str | None = None,
prompt_type: PromptType | None = None,
**kwargs: Any,
) -> np.ndarray:
requires_package(self, "boto3", "Amazon Bedrock")
        show_progress_bar = kwargs.pop("show_progress_bar", False)
if self._provider == "amazon":
return self._encode_amazon(sentences, show_progress_bar)
elif self._provider == "cohere":
prompt_name = self.get_prompt_name(
self.model_prompts, task_name, prompt_type
)
            cohere_task_type = (self.model_prompts or {}).get(prompt_name, "search_document")
return self._encode_cohere(sentences, cohere_task_type, show_progress_bar)
else:
raise ValueError(
f"Unknown provider '{self._provider}'. Must be 'amazon' or 'cohere'."
)

def _encode_amazon(
self, sentences: list[str], show_progress_bar: bool = False
) -> np.ndarray:
        from botocore.exceptions import ClientError

all_embeddings = []
# https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html
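        # Assume roughly 4.5 characters per token; truncate over-length
        # inputs by character count before calling the API.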
max_sequence_length = int(self._max_tokens * 4.5)

for sentence in tqdm.tqdm(
sentences, leave=False, disable=not show_progress_bar
):
if len(sentence) > max_sequence_length:
truncated_sentence = sentence[:max_sequence_length]
else:
truncated_sentence = sentence

try:
embedding = self._embed_amazon(truncated_sentence)
all_embeddings.append(embedding)

            except ClientError as e:
                # Bedrock reports oversized inputs as a ValidationException,
                # which botocore surfaces as a ClientError.
error_str = str(e)
pattern = r"request input token count:\s*(\d+)"
match = re.search(pattern, error_str)
if match:
num_tokens = int(match.group(1))

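                    # Retry once with a shorter input: scale the character cutoff
                    # by (token budget / reported tokens) with a 10% safety margin.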
ratio = 0.9 * (self._max_tokens / num_tokens)
dynamic_cutoff = int(len(truncated_sentence) * ratio)

embedding = self._embed_amazon(truncated_sentence[:dynamic_cutoff])
all_embeddings.append(embedding)
else:
raise e

return np.array(all_embeddings)

def _encode_cohere(
self,
sentences: list[str],
cohere_task_type: str,
show_progress_bar: bool = False,
) -> np.ndarray:
batches = [
sentences[i : i + self._max_batch_size]
for i in range(0, len(sentences), self._max_batch_size)
]

all_embeddings = []

for batch in tqdm.tqdm(batches, leave=False, disable=not show_progress_bar):
response = self._client.invoke_model(
body=json.dumps(
{
"texts": [sent[: self._max_sequence_length] for sent in batch],
"input_type": cohere_task_type,
}
),
modelId=self._model_id,
accept="*/*",
contentType="application/json",
)
all_embeddings.extend(self._to_numpy(response))

return np.array(all_embeddings)

def _embed_amazon(self, sentence: str) -> np.ndarray:
response = self._client.invoke_model(
body=json.dumps({"inputText": sentence}),
modelId=self._model_id,
accept="application/json",
contentType="application/json",
)
return self._to_numpy(response)

def _to_numpy(self, embedding_response) -> np.ndarray:
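        # The response body is a JSON stream; Titan returns the vector under
        # "embedding", Cohere under "embeddings".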
response = json.loads(embedding_response.get("body").read())
key = "embedding" if self._provider == "amazon" else "embeddings"
return np.array(response[key])


amazon_titan_embed_text_v1 = ModelMeta(
name="bedrock/amazon-titan-embed-text-v1",
revision="1",
release_date="2023-09-27",
languages=None, # not specified
loader=partial(
BedrockWrapper,
model_id="amazon.titan-embed-text-v1",
provider="amazon",
max_tokens=8192,
),
max_tokens=8192,
embed_dim=1536,
open_weights=False,
n_parameters=None,
public_training_code=None,
public_training_data=None, # assumed
training_datasets=None,
license=None,
reference="https://aws.amazon.com/about-aws/whats-new/2023/09/amazon-titan-embeddings-generally-available/",
similarity_fn_name="cosine",
framework=["API"],
use_instructions=False,
)

amazon_titan_embed_text_v2 = ModelMeta(
name="bedrock/amazon-titan-embed-text-v2",
revision="1",
release_date="2024-04-30",
languages=None, # not specified
loader=partial(
BedrockWrapper,
model_id="amazon.titan-embed-text-v2:0",
provider="amazon",
max_tokens=8192,
),
max_tokens=8192,
embed_dim=1024,
open_weights=False,
n_parameters=None,
public_training_code=None,
public_training_data=None, # assumed
training_datasets=None,
license=None,
reference="https://aws.amazon.com/about-aws/whats-new/2024/04/amazon-titan-text-embeddings-v2-amazon-bedrock/",
similarity_fn_name="cosine",
framework=["API"],
use_instructions=False,
)
# Note: For the original Cohere API implementation, refer to:
# https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/cohere_models.py
# This implementation uses the Amazon Bedrock endpoint for Cohere models.
cohere_embed_english_v3 = ModelMeta(
loader=partial(
BedrockWrapper,
model_id="cohere.embed-english-v3",
provider="cohere",
max_tokens=512,
model_prompts=cohere_model_prompts,
),
name="bedrock/cohere-embed-english-v3",
languages=["eng-Latn"],
open_weights=False,
reference="https://cohere.com/blog/introducing-embed-v3",
revision="1",
release_date="2023-11-02",
n_parameters=None,
public_training_code=None,
public_training_data=None, # assumed
training_datasets=None,
max_tokens=512,
embed_dim=1024,
license=None,
similarity_fn_name="cosine",
framework=["API"],
use_instructions=True,
)

cohere_embed_multilingual_v3 = ModelMeta(
loader=partial(
BedrockWrapper,
model_id="cohere.embed-multilingual-v3",
provider="cohere",
max_tokens=512,
model_prompts=cohere_model_prompts,
),
name="bedrock/cohere-embed-multilingual-v3",
languages=cohere_supported_languages,
open_weights=False,
reference="https://cohere.com/blog/introducing-embed-v3",
revision="1",
release_date="2023-11-02",
n_parameters=None,
public_training_code=None,
public_training_data=None, # assumed
training_datasets=None,
max_tokens=512,
embed_dim=1024,
license=None,
similarity_fn_name="cosine",
framework=["API"],
use_instructions=True,
)
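
As an end-to-end sketch, a registered Bedrock model can be evaluated like any other `mteb` model (this assumes AWS credentials with Bedrock access are configured for `boto3`; the task is only an example):

```python
import mteb

model = mteb.get_model("bedrock/amazon-titan-embed-text-v2")
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
results = mteb.MTEB(tasks=tasks).run(model)
```
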
1 change: 0 additions & 1 deletion mteb/models/gme_models.py
```diff
@@ -1,7 +1,6 @@
 from __future__ import annotations
 
 import logging
-from functools import partial
 
 from mteb.model_meta import ModelMeta
 
```