
Add MTEB retrieval results for spice #100

Closed
iamgroot42 wants to merge 3 commits

Conversation

iamgroot42

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the results files checker using make pre-push.

Adding a model checklist

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Thanks, @iamgroot42; the formatting is fine. However, some of the scores seem high. I would love some deliberation on these:

"ndcg_at_1": 0.72,
"ndcg_at_3": 0.79577,
"ndcg_at_5": 0.82586,
"ndcg_at_10": 0.84129,
Contributor

This score seems quite high. I could imagine that you might have trained on SciFact; is that true?

Author

Because of the way the retriever is trained, it is likely to perform better on some kinds of data than on others. The values are not very high (as you can see on the leaderboard), and the model's overall rank is not very good either (although we will be releasing a larger version of the same model that should perform better).

Contributor

It is among the models fine-tuned on the training set, which will be removed in v2 of the leaderboard (though it is possible to disable the zero-shot filter).

Author

As I mentioned, we cannot reveal the exact data distribution used for training, but I can confirm that none of the test or evaluation sets (or any data that the leaderboard uses) were used in training the model.

"ndcg_at_1": 0.465,
"ndcg_at_3": 0.38659,
"ndcg_at_5": 0.34364,
"ndcg_at_10": 0.40687,
Contributor

Similar here (I have not reviewed the rest).

Author

As I mentioned above, there is quite a bit of variance in performance depending on the kind of data, so performance on some datasets is better than on others.

@x-tabdeveloping
Contributor

x-tabdeveloping commented Jan 29, 2025

We have discussed this case with the team. Based on your previous submission of identical scores to voyage-3-exp for your own model, we will not accept your submission, unless you provide a detailed description of the training procedure and data.

@iamgroot42
Author

Based on your previous submission of identical scores to voyage-3-exp for your own model

@x-tabdeveloping Is there some mixup here? This is my first ever submission to the MTEB leaderboard. The model is available on HuggingFace - if you think the scores are fabricated, feel free to evaluate it for yourself.

@KennethEnevoldsen can you please help explain what just happened? I am very confused

@x-tabdeveloping
Contributor

Your model (iamgroot42/rover_nexus) had identical scores to voyage-3-exp and was in first place on the MTEB leaderboard until it was removed approximately one and a half weeks ago.

Are we missing something here?

@iamgroot42
Author

Please look at the pull request again: this is for iamgroot42/spice. The iamgroot42/rover_nexus model was just me testing how model uploading and MTEB evaluations work, since this was my first time uploading models to HuggingFace and running evaluations. It was not intended for the leaderboard, and I never submitted any entry for it to the leaderboard.

I understood that entries on the leaderboard only appear when someone submits a PR to add them, which I never did for iamgroot42/rover_nexus. If you decided to add it of your own volition, respectfully, that is not on me. It was one of the BGE models as-is, so of course it was going to have similar scores.

I still don't get the connection to voyage-3-exp, though.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jan 29, 2025

Let us continue the discussion here instead of in new issues.

Sorry for the confusion, I will try to clarify here. This submission has indeed raised concerns due to the relatively small model size and impressive retrieval results. This in itself is not a problem; it simply means that the results warrant some examination as there could be an issue (intentional or not).

Since the model might be trained on the training set of some tasks, this could account for some of these scores. Of course, novel training methods could also account for these (and such methods are not required to be public).

I understood that entries on the leaderboard only appear when someone submits a PR to add them, which I never did for iamgroot42/rover_nexus

It is correct that no PR was submitted. The leaderboard auto-fetches if you include MTEB results in your model card's frontmatter (the new leaderboard does not do this). I understand this may not have been known, and we may have been overly cautious.
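
For context, results embedded in a model card's YAML frontmatter can be read programmatically, e.g. via the huggingface_hub ModelCard API. Below is a minimal sketch of such a lookup; it is illustrative only, and the leaderboard's actual auto-fetch logic may differ:

# Illustrative only: read evaluation results from a model card's frontmatter.
# The leaderboard's own auto-fetch code may work differently.
from huggingface_hub import ModelCard

card = ModelCard.load("iamgroot42/spice")
for result in card.data.eval_results or []:
    print(result.task_type, result.dataset_name, result.metric_type, result.metric_value)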

Re. the scores of the previous model: It is naturally hard to check now since the model is no longer up, but I can find a message in my inbox noting that the scores of the model [iamgroot42/rover_nexus] are identical to voyage-3 while being only 33M params. This might be a mistake, but sadly, I can't determine if it is. A solution is to make the model public again, and then I can check it.

Re. the current model under review:
After checking the model, it seems that it might be adapted from an existing model (BAAI_bge-small-en); if so, could you fill out the key adapted_from="{source-model}"?
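
As a side note, one rough way to sanity-check this kind of adaptation is to compare the two model configs. The sketch below is illustrative only (it assumes both repos are public and uses "BAAI/bge-small-en" as the presumed HF id of the source model); it is not how the check here was necessarily done, and it does not prove or disprove adaptation on its own:

# Illustrative sketch: compare basic config fields of the two models.
from transformers import AutoConfig

spice_cfg = AutoConfig.from_pretrained("iamgroot42/spice")
bge_cfg = AutoConfig.from_pretrained("BAAI/bge-small-en")  # presumed source model id

for key in ("hidden_size", "num_hidden_layers", "num_attention_heads", "vocab_size"):
    print(key, getattr(spice_cfg, key, None), getattr(bge_cfg, key, None))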

Because of the debate, I will rerun some of the reported results and temporarily remove iamgroot42/spice from the current MTEB leaderboard until this discussion is resolved. I hope that is understandable and will calm some concerns.

I will also re-open this until the discussion is resolved.

@iamgroot42
Author

iamgroot42 commented Jan 29, 2025

@KennethEnevoldsen thank you for the clear explanation.

I made the model private after @x-tabdeveloping's accusatory messages (which came without any clear explanation of what was going on). For that project I cloned an existing model (I think some BGE model) and tried putting in some dummy values while uploading it. I was under the impression that these do not show up on the leaderboard unless a PR is made, which I can now see is not the case (which then makes me wonder what the point of submitting PRs is if results are auto-fetched, but that is an orthogonal discussion). I can make it public again if you want, but as I just mentioned, these values are arbitrary and were never meant to show up on MTEB; I keep it private only so that it does not pollute the leaderboard.

You are right that the model architecture is based on the one in BAAI_bge-small-en; I have added the key.

@KennethEnevoldsen
Contributor

I can make it public again if you want, but as I just mentioned, these values are arbitrary and were never meant to show up on MTEB

Please do, then I'll look at it (understanding that it was not intended as a submission).

@iamgroot42
Author

@KennethEnevoldsen it's public now

@KennethEnevoldsen
Contributor

Thanks @iamgroot42. I have checked the scores and they do indeed seem to align with the voyage-3-exp model as mentioned by @x-tabdeveloping.

I hope you understand that this is quite frustrating, as you explicitly state that you have no idea about the connection with the model.

I also ran the code from the spice branch, using the following:

import mteb

model = mteb.get_model("iamgroot42/spice")
task = mteb.get_task("SCIDOCS")

evaluation = mteb.MTEB(tasks=[task])
r = evaluation.run(model)

I am also unable to reproduce the stated results; instead, I get the following:

{
  "dataset_revision": "f8c2fcf00f625baaa80f62ec5bd9e1fff3b8ae88",
  "task_name": "SCIDOCS",
  "mteb_version": "1.29.10",  # version seems odd - will just resolve this and double check
  "scores": {
    "test": [
      {
        "ndcg_at_1": 0.536,
        "ndcg_at_3": 0.45036,
        "ndcg_at_5": 0.39081,
        "ndcg_at_10": 0.45652,
        ...
      }
    ],
    ...
}

This is, oddly enough, better than the stated results. The version here, though, suggested that I might have messed up the installation (it does not correspond to the version on the branch), so I resolved that and reran it. This reproduced my own scores once again:

{
  "dataset_revision": "f8c2fcf00f625baaa80f62ec5bd9e1fff3b8ae88",
  "task_name": "SCIDOCS",
  "mteb_version": "1.31.5",
  "scores": {
    "test": [
      {
        "ndcg_at_1": 0.536,
        "ndcg_at_3": 0.44936,
        "ndcg_at_5": 0.3901,
        "ndcg_at_10": 0.45568,
        "ndcg_at_20": 0.49508,
        "ndcg_at_100": 0.5485,
        "ndcg_at_1000": 0.58523,
        "map_at_1": 0.10975,
...

@iamgroot42
Author

Thanks for following up, @KennethEnevoldsen. I understand your frustration, but all I can do is assure you that I had not even heard of the voyage family of models until it was mentioned here. The fact that that model (which was just for testing, as I clarified earlier) wasn't even the one for which I submitted this PR made it even more confusing to me.

I'm not sure where the discrepancy is coming from, but I am glad that at least the results are better than what I have on the leaderboard! I am using transformers 4.44.2 (installed directly from the transformers git repository, not pip), PyTorch 2.5.1, and Python 3.12.8.

import mteb


TASK_LIST_RETRIEVAL = [
    "ArguAna",
    "ClimateFEVER",
    "CQADupstackAndroidRetrieval",
    "CQADupstackEnglishRetrieval",
    "CQADupstackGamingRetrieval",
    "CQADupstackGisRetrieval",
    "CQADupstackMathematicaRetrieval",
    "CQADupstackPhysicsRetrieval",
    "CQADupstackProgrammersRetrieval",
    "CQADupstackStatsRetrieval",
    "CQADupstackTexRetrieval",
    "CQADupstackUnixRetrieval",
    "CQADupstackWebmastersRetrieval",
    "CQADupstackWordpressRetrieval",
    "DBPedia",
    "FEVER",
    "FiQA2018",
    "HotpotQA"
    "NFCorpus",
    "NQ",
    "QuoraRetrieval",
    "SCIDOCS",
    "SciFact",
    "Touche2020",
    "TRECCOVID",
    "MSMARCO"
]


model = mteb.get_model("iamgroot42/spice")

evaluation = mteb.MTEB(tasks=TASK_LIST_RETRIEVAL, task_langs=["en"])
evaluation.run(model, output_folder="results",
               encode_kwargs={"batch_size": 128})
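
For reference, a small helper like the one below (illustrative only) can record the package versions mentioned above when comparing runs:

# Illustrative helper: print the versions of the packages discussed above.
import importlib.metadata as md
import platform

for pkg in ("mteb", "transformers", "torch"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
print("python", platform.python_version())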

@iamgroot42
Author

That being said, I understand if you want to close the PR for now if you are unable to reproduce the results :)

@iamgroot42 iamgroot42 closed this Jan 31, 2025