Add MTEB retrieval results for spice #100
Conversation
Thanks, @iamgroot42; the formatting is fine. However, some of the scores seem high, and I would love some deliberation on these:
"ndcg_at_1": 0.72, | ||
"ndcg_at_3": 0.79577, | ||
"ndcg_at_5": 0.82586, | ||
"ndcg_at_10": 0.84129, |
This score seems quite high. I could imagine that you might have trained on SciFact; is that true?
Given the way the retriever is trained, it is likely to perform better on some kinds of data than on others. The values are not very high (as you can see on the leaderboard), and the model's overall rank is not very good either (although we will be releasing a larger version of the same model that should perform better).
It is among the models fine-tuned on the training set, which will be removed in v2 of the leaderboard (it is possible to disable the zero-shot filter).
As I mentioned, we cannot reveal the exact data distribution used for training, but I can confirm that none of the test or evaluation sets (or any other data that the leaderboard uses) were used in training the model.
"ndcg_at_1": 0.465, | ||
"ndcg_at_3": 0.38659, | ||
"ndcg_at_5": 0.34364, | ||
"ndcg_at_10": 0.40687, |
Similar here (I have not reviewed the rest).
As I mentioned above, there will be quite a bit of variance in performance depending on the kind of data; performance on some datasets is better than on others.
We have discussed this case with the team. Based on your previous submission of scores identical to voyage-3-exp for your own model, we will not accept your submission unless you provide a detailed description of the training procedure and data.
@x-tabdeveloping Is there some mix-up here? This is my first ever submission to the MTEB leaderboard. The model is available on HuggingFace; if you think the scores are fabricated, feel free to evaluate it yourself. @KennethEnevoldsen, can you please help explain what just happened? I am very confused.
Your model (iamgroot42/rover_nexus) had identical scores to voyage-3-exp and was in first place on the MTEB leaderboard until it was removed approximately one and a half weeks ago. Are we missing something here?
Please look at the pull request again: this one is for `iamgroot42/spice`. I understood that entries on the leaderboard only appear when someone submits a PR to add them, which I never did for `rover_nexus`. I still don't get the connection to `voyage-3-exp`.
Let us continue the discussion here instead of in new issues. Sorry for the confusion; I will try to clarify. This submission has indeed raised concerns due to the relatively small model size and the impressive retrieval results. This in itself is not a problem; it simply means that the results warrant some examination, as there could be an issue (intentional or not). Since the model might be trained on the training set of some tasks, this could account for some of these scores. Of course, novel training methods could also account for them (and such methods are not required to be public).
It is correct that no PR was submitted. The leaderboard auto-fetches results if you include MTEB results in your model card's frontmatter (the new leaderboard does not do this). I understand this may not have been known, and we may have been overly cautious.

Re. the scores of the previous model: it is naturally hard to check now since the model is no longer up, but I can find a message in my inbox noting that the scores of the model [iamgroot42/rover_nexus] are identical to voyage-3 while the model is only 33M params. This might be a mistake, but sadly I can't determine whether it is. A solution is to make the model public again, and then I can check it.

Re. the current model under review: because of the debate, I will rerun some of the reported results and temporarily remove the results from the leaderboard. I will also re-open this until the discussion is resolved.
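As an illustration of what "auto-fetching from the model card's frontmatter" amounts to (a minimal sketch, not the leaderboard's actual code; it assumes the standard Hugging Face `model-index` metadata format and uses the model id discussed in this thread):

```python
# Sketch only: results published under `model-index` in a model card's YAML
# frontmatter can be read programmatically, which is how a leaderboard can
# pick them up without any PR being opened.
from huggingface_hub import ModelCard

card = ModelCard.load("iamgroot42/spice")  # model id from this thread
metadata = card.data.to_dict()
for entry in metadata.get("model-index", []):
    for result in entry.get("results", []):
        dataset = result.get("dataset", {}).get("name")
        for metric in result.get("metrics", []):
            print(dataset, metric.get("type"), metric.get("value"))
```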
@KennethEnevoldsen thank you for the clear explanation. I made the model private after @x-tabdeveloping's accusatory messages (which came without any clear explanation of what was going on). For that project I cloned an existing model (I think some BGE model) and tried putting in some dummy values while uploading it. I was under the impression that these do not show up on the leaderboard unless a PR is made, which I can now see is not the case (which then makes me wonder what the point of submitting PRs is if results are auto-fetched, but that is an orthogonal discussion). I can make it public again if you want, but as I just mentioned, these values are arbitrary and were never meant to show up on MTEB; I keep it private so that at least it does not pollute the leaderboard. You are right about the base model architecture being based on the one in
Please do, then I'll look at it (understanding that it was not intended as a submission).
@KennethEnevoldsen it's public now.
Thanks @iamgroot42. I have checked the scores, and they do indeed seem to align with the voyage-3-exp model, as mentioned by @x-tabdeveloping. I hope you understand that this is quite frustrating, as you explicitly state that you have no idea about the connection with that model. I also ran the code using the following:

```python
import mteb

model = mteb.get_model("iamgroot42/spice")
task = mteb.get_task("SCIDOCS")
eval = mteb.MTEB(tasks=[task])
r = eval.run(model)
```

I am also unable to reproduce the stated results, but instead get the following:

```json
{
    "dataset_revision": "f8c2fcf00f625baaa80f62ec5bd9e1fff3b8ae88",
    "task_name": "SCIDOCS",
    "mteb_version": "1.29.10", # version seems odd - will just resolve this and double check
    "scores": {
        "test": [
            {
                "ndcg_at_1": 0.536,
                "ndcg_at_3": 0.45036,
                "ndcg_at_5": 0.39081,
                "ndcg_at_10": 0.45652,
                ...
            }
        ],
        ...
    }
}
```

Which is, oddly enough, better than the stated results. The version here, though, suggested that I might have messed up the installation (it does not correspond to the version on the branch). So I resolved that and reran it. This reproduced my own scores once again:

```json
{
    "dataset_revision": "f8c2fcf00f625baaa80f62ec5bd9e1fff3b8ae88",
    "task_name": "SCIDOCS",
    "mteb_version": "1.31.5",
    "scores": {
        "test": [
            {
                "ndcg_at_1": 0.536,
                "ndcg_at_3": 0.44936,
                "ndcg_at_5": 0.3901,
                "ndcg_at_10": 0.45568,
                "ndcg_at_20": 0.49508,
                "ndcg_at_100": 0.5485,
                "ndcg_at_1000": 0.58523,
                "map_at_1": 0.10975,
                ...
```
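A quick sanity check for the kind of version mismatch described above is to print the installed versions before running the evaluation. This is a generic sketch, not code from this thread; it only assumes the listed packages are installed:

```python
# Print the versions an evaluation run will actually use, so score
# differences can be tied to a specific environment rather than guessed at.
import platform
from importlib.metadata import version

for pkg in ("mteb", "torch", "transformers"):
    print(pkg, version(pkg))
print("python", platform.python_version())
```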
Thanks for following up @KennethEnevoldsen. I understand your frustration, but all I can do is assure you that I had not even heard of the voyage family of models until it was mentioned here. The fact that that model (which was just for testing, as I clarified earlier) wasn't even the one for which I submitted the PR was even more confusing to me. I'm not sure where the discrepancy is coming from, but I am glad that at least the results are better than what I have on the leaderboard! I am using transformers 4.44.2 (installed directly from the transformers git repository, not pip), pytorch 2.5.1, and python 3.12.8.

```python
import mteb

TASK_LIST_RETRIEVAL = [
    "ArguAna",
    "ClimateFEVER",
    "CQADupstackAndroidRetrieval",
    "CQADupstackEnglishRetrieval",
    "CQADupstackGamingRetrieval",
    "CQADupstackGisRetrieval",
    "CQADupstackMathematicaRetrieval",
    "CQADupstackPhysicsRetrieval",
    "CQADupstackProgrammersRetrieval",
    "CQADupstackStatsRetrieval",
    "CQADupstackTexRetrieval",
    "CQADupstackUnixRetrieval",
    "CQADupstackWebmastersRetrieval",
    "CQADupstackWordpressRetrieval",
    "DBPedia",
    "FEVER",
    "FiQA2018",
    "HotpotQA",
    "NFCorpus",
    "NQ",
    "QuoraRetrieval",
    "SCIDOCS",
    "SciFact",
    "Touche2020",
    "TRECCOVID",
    "MSMARCO",
]

model = mteb.get_model("iamgroot42/spice")
evaluation = mteb.MTEB(tasks=TASK_LIST_RETRIEVAL, task_langs=["en"])
evaluation.run(model, output_folder="results",
               encode_kwargs={"batch_size": 128})
```
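One way to compare a fresh run like the one above against the values reported in this PR is to read the JSON files that mteb writes under the output folder. The sketch below is an assumption: the directory layout and JSON structure are inferred from the SCIDOCS excerpt quoted earlier in this conversation, not stated anywhere in the thread:

```python
# Hedged sketch: locate per-task result files produced by the run above
# (output_folder="results") and print the headline nDCG@10 for each task.
import json
from pathlib import Path

for path in sorted(Path("results").rglob("*.json")):
    data = json.loads(path.read_text())
    if "scores" not in data:
        continue  # skip metadata files that are not task results
    for split, entries in data["scores"].items():
        for entry in entries:
            if "ndcg_at_10" in entry:
                print(data.get("task_name", path.stem), split, entry["ndcg_at_10"])
```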
That being said, I understand if you want to close the PR for now, given that you are unable to reproduce the results :)
Checklist
- Run tests using `make test`.
- Run `make pre-push`.

Adding a model checklist
- The model has been added to the `mteb/models/` directory. Instructions to add a model can be found here, in the following PR: Add model spice (mteb#1884)