
Discrepancy in MTEB Benchmark Results for jinaai/jina-embeddings-v2-small-en #1862

Closed
noworneverev opened this issue Jan 23, 2025 · 4 comments
Labels
replication question and issues related to replication

Comments

noworneverev commented Jan 23, 2025

I tried reproducing the results for jinaai/jina-embeddings-v2-small-en on the MTEB HotpotQA task, but my results are significantly lower than the self-reported results shared on HuggingFace. Specifically, my ndcg_at_10 score is 1.966, while the score reported on HuggingFace for the same task and metric is 56.482.

Here’s the code I used:

import mteb

model_names = [
    "jinaai/jina-embeddings-v2-small-en",
]

tasks = [
    mteb.get_task("HotpotQA", languages=["eng"]),
]

for model_name in model_names:
    model = mteb.get_model(model_name)  # if the model is not implemented in MTEB, this is equivalent to SentenceTransformer(model_name)
    evaluation = mteb.MTEB(tasks=tasks)
    results = evaluation.run(model, output_folder="./results", batch_size=2)

Below is my result for HotpotQA:

{
  "dataset_revision": "ab518f4d6fcca38d87c25209f94beba119d02014",
  "task_name": "HotpotQA",
  "mteb_version": "1.28.6",
  "scores": {
    "test": [
      {
        "ndcg_at_1": 0.02026,
        "ndcg_at_3": 0.01637,
        "ndcg_at_5": 0.01779,
        "ndcg_at_10": 0.01966,
        "ndcg_at_20": 0.02163,
        "ndcg_at_100": 0.02613,
        "ndcg_at_1000": 0.03391,
        "map_at_1": 0.01013,
        "map_at_3": 0.01259,
        "map_at_5": 0.01327,
        "map_at_10": 0.01392,
        "map_at_20": 0.01436,
        "map_at_100": 0.01487,
        "map_at_1000": 0.01507,
        "recall_at_1": 0.01013,
        "recall_at_3": 0.01573,
        "recall_at_5": 0.01857,
        "recall_at_10": 0.0233,
        "recall_at_20": 0.02971,
        "recall_at_100": 0.04997,
        "recall_at_1000": 0.10324,
        "precision_at_1": 0.02026,
        "precision_at_3": 0.01049,
        "precision_at_5": 0.00743,
        "precision_at_10": 0.00466,
        "precision_at_20": 0.00297,
        "precision_at_100": 0.001,
        "precision_at_1000": 0.00021,
        "mrr_at_1": 0.020257,
        "mrr_at_3": 0.024916,
        "mrr_at_5": 0.026144,
        "mrr_at_10": 0.027337,
        "mrr_at_20": 0.028173,
        "mrr_at_100": 0.029057,
        "mrr_at_1000": 0.029395,
        "nauc_ndcg_at_1_max": 0.394906,
        "nauc_ndcg_at_1_std": -0.018751,
        "nauc_ndcg_at_1_diff1": 0.463571,
        "nauc_ndcg_at_3_max": 0.311009,
        "nauc_ndcg_at_3_std": 0.015145,
        "nauc_ndcg_at_3_diff1": 0.364772,
        "nauc_ndcg_at_5_max": 0.28746,
        "nauc_ndcg_at_5_std": 0.032298,
        "nauc_ndcg_at_5_diff1": 0.32874,
        "nauc_ndcg_at_10_max": 0.275356,
        "nauc_ndcg_at_10_std": 0.042002,
        "nauc_ndcg_at_10_diff1": 0.304791,
        "nauc_ndcg_at_20_max": 0.267466,
        "nauc_ndcg_at_20_std": 0.046603,
        "nauc_ndcg_at_20_diff1": 0.278059,
        "nauc_ndcg_at_100_max": 0.239217,
        "nauc_ndcg_at_100_std": 0.053876,
        "nauc_ndcg_at_100_diff1": 0.234819,
        "nauc_ndcg_at_1000_max": 0.219262,
        "nauc_ndcg_at_1000_std": 0.062277,
        "nauc_ndcg_at_1000_diff1": 0.200398,
        "nauc_map_at_1_max": 0.394906,
        "nauc_map_at_1_std": -0.018751,
        "nauc_map_at_1_diff1": 0.463571,
        "nauc_map_at_3_max": 0.328157,
        "nauc_map_at_3_std": 0.008326,
        "nauc_map_at_3_diff1": 0.387754,
        "nauc_map_at_5_max": 0.311805,
        "nauc_map_at_5_std": 0.020036,
        "nauc_map_at_5_diff1": 0.365008,
        "nauc_map_at_10_max": 0.30468,
        "nauc_map_at_10_std": 0.026132,
        "nauc_map_at_10_diff1": 0.351843,
        "nauc_map_at_20_max": 0.301578,
        "nauc_map_at_20_std": 0.028036,
        "nauc_map_at_20_diff1": 0.341274,
        "nauc_map_at_100_max": 0.295421,
        "nauc_map_at_100_std": 0.029309,
        "nauc_map_at_100_diff1": 0.330846,
        "nauc_map_at_1000_max": 0.293883,
        "nauc_map_at_1000_std": 0.030278,
        "nauc_map_at_1000_diff1": 0.328026,
        "nauc_recall_at_1_max": 0.394906,
        "nauc_recall_at_1_std": -0.018751,
        "nauc_recall_at_1_diff1": 0.463571,
        "nauc_recall_at_3_max": 0.267543,
        "nauc_recall_at_3_std": 0.032556,
        "nauc_recall_at_3_diff1": 0.313037,
        "nauc_recall_at_5_max": 0.229513,
        "nauc_recall_at_5_std": 0.062325,
        "nauc_recall_at_5_diff1": 0.251517,
        "nauc_recall_at_10_max": 0.215012,
        "nauc_recall_at_10_std": 0.076261,
        "nauc_recall_at_10_diff1": 0.214116,
        "nauc_recall_at_20_max": 0.209414,
        "nauc_recall_at_20_std": 0.080019,
        "nauc_recall_at_20_diff1": 0.171319,
        "nauc_recall_at_100_max": 0.165092,
        "nauc_recall_at_100_std": 0.086918,
        "nauc_recall_at_100_diff1": 0.112164,
        "nauc_recall_at_1000_max": 0.146849,
        "nauc_recall_at_1000_std": 0.092047,
        "nauc_recall_at_1000_diff1": 0.080955,
        "nauc_precision_at_1_max": 0.394906,
        "nauc_precision_at_1_std": -0.018751,
        "nauc_precision_at_1_diff1": 0.463571,
        "nauc_precision_at_3_max": 0.267543,
        "nauc_precision_at_3_std": 0.032556,
        "nauc_precision_at_3_diff1": 0.313037,
        "nauc_precision_at_5_max": 0.229513,
        "nauc_precision_at_5_std": 0.062325,
        "nauc_precision_at_5_diff1": 0.251517,
        "nauc_precision_at_10_max": 0.215012,
        "nauc_precision_at_10_std": 0.076261,
        "nauc_precision_at_10_diff1": 0.214116,
        "nauc_precision_at_20_max": 0.209414,
        "nauc_precision_at_20_std": 0.080019,
        "nauc_precision_at_20_diff1": 0.171319,
        "nauc_precision_at_100_max": 0.165092,
        "nauc_precision_at_100_std": 0.086918,
        "nauc_precision_at_100_diff1": 0.112164,
        "nauc_precision_at_1000_max": 0.146849,
        "nauc_precision_at_1000_std": 0.092047,
        "nauc_precision_at_1000_diff1": 0.080955,
        "nauc_mrr_at_1_max": 0.394906,
        "nauc_mrr_at_1_std": -0.018751,
        "nauc_mrr_at_1_diff1": 0.463571,
        "nauc_mrr_at_3_max": 0.331074,
        "nauc_mrr_at_3_std": 0.007043,
        "nauc_mrr_at_3_diff1": 0.386003,
        "nauc_mrr_at_5_max": 0.316823,
        "nauc_mrr_at_5_std": 0.017103,
        "nauc_mrr_at_5_diff1": 0.363237,
        "nauc_mrr_at_10_max": 0.310292,
        "nauc_mrr_at_10_std": 0.021951,
        "nauc_mrr_at_10_diff1": 0.351562,
        "nauc_mrr_at_20_max": 0.306103,
        "nauc_mrr_at_20_std": 0.024121,
        "nauc_mrr_at_20_diff1": 0.340879,
        "nauc_mrr_at_100_max": 0.300178,
        "nauc_mrr_at_100_std": 0.02535,
        "nauc_mrr_at_100_diff1": 0.331282,
        "nauc_mrr_at_1000_max": 0.298741,
        "nauc_mrr_at_1000_std": 0.025996,
        "nauc_mrr_at_1000_diff1": 0.328679,
        "main_score": 0.01966,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 1736.7203087806702,
  "kg_co2_emissions": null
}

Are there additional configurations or preprocessing steps needed to achieve the reported performance?

@Samoed Samoed added the replication question and issues related to replication label Jan 23, 2025
Samoed (Collaborator) commented Jan 23, 2025

CC @bwanglzu

bwanglzu (Contributor) commented
Hi, most likely you didn't pass trust_remote_code. Please replace mteb.get_model with:

model = SentenceTransformer(model_name, trust_remote_code=True)
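Putting the suggestion together with the original script, a minimal corrected version might look like the sketch below (illustrative only; it assumes sentence-transformers is installed alongside mteb, and wraps the heavy loading and evaluation in a function so nothing is downloaded at import time):

```python
def run_eval(model_name: str = "jinaai/jina-embeddings-v2-small-en",
             output_folder: str = "./results"):
    """Evaluate a Hub model on MTEB HotpotQA, loading custom model code.

    Sketch based on the thread above; run_eval itself is an illustrative
    helper, not part of the mteb API.
    """
    # Imports are kept local so this module stays importable even when
    # mteb / sentence-transformers are not installed.
    import mteb
    from sentence_transformers import SentenceTransformer

    # trust_remote_code=True lets transformers execute the custom model
    # code shipped with the checkpoint; without it the weights can end up
    # randomly initialized (see the discussion below).
    model = SentenceTransformer(model_name, trust_remote_code=True)

    tasks = [mteb.get_task("HotpotQA", languages=["eng"])]
    evaluation = mteb.MTEB(tasks=tasks)
    return evaluation.run(model, output_folder=output_folder)
```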

Samoed (Collaborator) commented Jan 25, 2025

Strange that the model loads at all without trust_remote_code=True.

@Samoed Samoed mentioned this issue Jan 25, 2025
@Samoed Samoed closed this as completed Jan 26, 2025
bwanglzu (Contributor) commented
I think that's how transformers works: it allows the model to load, but the weights are randomly initialized. A warning message will show up in the terminal, but it won't interrupt the script.
