
Discrepancies in CoIR results #1861

Open

bclavie opened this issue Jan 23, 2025 · 5 comments
Labels: replication (question and issues related to replication)

Comments

bclavie commented Jan 23, 2025

Hi there!

When reviewing the new gte-modernbert-base model, I struggled to reproduce its CoIR results with the coir library. After a bit of digging and a pointer from the authors, it appears that the mteb library matches their reported numbers, but that those are wildly different from what coir reports!

Recently, there have also been some discussions about mismatched code retrieval results for the new SFR model vs Voyager (here), and while I haven't yet had time to test it out, the magnitude of those discrepancies appears fairly similar to what I'm seeing, so this could be the same issue.

Even more puzzling: in trying to figure out which one was correct, I whipped up an extremely simple ST + ranx notebook, and it gave me results that... matched neither library 😭 although they were much closer to mteb than to coir. It was put together very quickly late at night, so there might be a silly mistake somewhere in there causing the issue.
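For reference, the notebook boils down to something like the sketch below (toy data standing in for the real cosqa queries/corpus/qrels; the repo linked further down has the exact scripts):

```python
from sentence_transformers import SentenceTransformer
from ranx import Qrels, Run, evaluate

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")

# Toy stand-ins for the real cosqa queries/corpus/qrels.
queries = {"q1": "how to reverse a list in python"}
corpus = {"d1": "def reverse(xs): return xs[::-1]", "d2": "print('hello')"}
qrels = Qrels({"q1": {"d1": 1}})

q_emb = model.encode(list(queries.values()), normalize_embeddings=True)
d_emb = model.encode(list(corpus.values()), normalize_embeddings=True)
scores = q_emb @ d_emb.T  # dot product == cosine on normalized vectors

run = Run({
    qid: {did: float(scores[i, j]) for j, did in enumerate(corpus)}
    for i, qid in enumerate(queries)
})
print(evaluate(qrels, run, "ndcg@10"))
```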

I've put together a repository to reproduce the exact issue with minimal scripts, using exactly the code I ran.

Direct links:

Let me know if I can do anything else to help diagnose this!

cc @tomaarsen @orionw @Muennighoff

bclavie (Author) commented Jan 23, 2025

Sister issue in coir: CoIR-team/coir#14

Samoed (Collaborator) commented Jan 23, 2025

This is probably because MTEB adds the title to the task, whereas the authors' version doesn't use the title when evaluating. However, since cosqa doesn't have a title, this might not be the case. #1130 (comment)
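Roughly, the difference looks like this (an illustration only, not the exact code on either side):

```python
# BEIR-style corpora store documents as {"title": ..., "text": ...}; the two
# eval setups differ in whether the title gets prepended before encoding.
doc = {"title": "binary_search", "text": "def binary_search(arr, x): ..."}

with_title = (doc.get("title", "") + " " + doc["text"]).strip()  # MTEB-style
text_only = doc["text"]                                          # authors' style
```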

I also looked at your coir implementation: you're using YourCustomDEModel from the library, but it adds prefixes, which might cause some issues as well. https://github.com/CoIR-team/coir/blob/5cbef452f990605e9cc5cd763d4c8de11fcb092b/coir/models.py#L65

In your custom implementation you normalize your embeddings; maybe this causes mismatches too.
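Schematically, the wrapper does something like this (hypothetical prefix values; the real strings are in the linked models.py):

```python
# Illustrative sketch only, not the exact coir code.
QUERY_PREFIX = "query: "   # hypothetical value
DOC_PREFIX = "passage: "   # hypothetical value

def encode_queries(model, queries, **kwargs):
    # An instruction prefix is silently prepended before encoding, which
    # shifts the embeddings for models not trained with such prefixes.
    return model.encode([QUERY_PREFIX + q for q in queries], **kwargs)
```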

bclavie (Author) commented Jan 23, 2025

Hey, thanks for the response!

> This is probably because MTEB adds the title to the task, whereas the authors' version doesn't use the title when evaluating. However, since cosqa doesn't have a title, this might not be the case. #1130 (comment)

I thought so as well, which is why I used cosqa for the example: it avoids things being muddied by differing title behaviour.

> In your custom implementation you normalize your embeddings; maybe this causes mismatches too.

I ran it both ways to be sure. The non-normalized NDCG@10 is 0.3867, which is essentially in the same ballpark as running it with normalization.

> I also looked at your coir implementation: you're using YourCustomDEModel from the library, but it adds prefixes, which might cause some issues as well. https://github.com/CoIR-team/coir/blob/5cbef452f990605e9cc5cd763d4c8de11fcb092b/coir/models.py#L65

Oh, well spotted! I think this is the culprit. I didn't expect it to add prefixes by default, but re-running without them gets me within ~1 NDCG@10 point of the MTEB results. Still somewhat curious why the manual implementation is so far off 🤔
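For the record, the re-run swapped YourCustomDEModel for a plain wrapper along these lines (a sketch; I'm assuming coir calls encode_queries/encode_corpus the same way it does for YourCustomDEModel):

```python
from sentence_transformers import SentenceTransformer

class PlainDEModel:
    """Prefix-free dual-encoder wrapper (sketch)."""

    def __init__(self, model_name: str):
        self.model = SentenceTransformer(model_name)

    def encode_queries(self, queries, batch_size=32, **kwargs):
        # No prefix prepended, unlike YourCustomDEModel.
        return self.model.encode(queries, batch_size=batch_size)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        texts = [doc["text"] for doc in corpus]  # cosqa docs have no title
        return self.model.encode(texts, batch_size=batch_size)
```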

Samoed (Collaborator) commented Jan 23, 2025

Maybe the problem is that you're using dot as the similarity function, while MTEB uses cosine by default. However, this difference would likely result in only a ≈0.01 variation.

bclavie (Author) commented Jan 23, 2025

> Maybe the problem is that you're using dot as the similarity function, while MTEB uses cosine by default. However, this difference would likely result in only a ≈0.01 variation.

Cosine similarity is generally run on normalized vectors, in which case np.dot would be mathematically equivalent, wouldn't it? E.g., the cos_sim function normalizes the vectors itself:

"""Computes the cosine similarity cos_sim(a[i], b[j]) for all i and j.

Anyhow, this is a much smaller issue than it seemed -- the main problem is that CoIR shouldn't add prefixes by default!

@isaac-chung added the replication (question and issues related to replication) label Jan 23, 2025