
Leaderboard: What metric should we sort on? #1813

Open
Samoed opened this issue Jan 15, 2025 · 11 comments
Labels
leaderboard (issues related to the leaderboard)

Comments

@Samoed
Collaborator

Samoed commented Jan 15, 2025

I think that non-default benchmarks have sorting problems, e.g. MTEB(classic, eng).
http://mteb-leaderboard-2-demo.hf.space/?benchmark_name=MTEB%28eng%2C+classic%29
I think the ranking is not updating.
[image attachment]

Samoed added the leaderboard label on Jan 15, 2025
@x-tabdeveloping
Collaborator

Are you certain that these are incorrect? It could very well be the case that the Borda rank doesn't match with rankings based on the mean.

@Samoed
Collaborator Author

Samoed commented Jan 15, 2025

Yes, that is the case, and I think it is a bit strange.

@x-tabdeveloping
Collaborator

So what's the conclusion? Is this a bug? As far as I can tell it works as intended.
The sorting also gets recomputed (probably correctly) whenever you choose a different benchmark, so it seems to me to be working.

@Samoed
Collaborator Author

Samoed commented Jan 15, 2025

I just opened this leaderboard without any additional sorting, and the model with Rank 1 has NaN as its mean score. I don't understand what the rank means in that case.

@x-tabdeveloping
Collaborator

x-tabdeveloping commented Jan 15, 2025

I'm copying this from the explanation under the leaderboard:

> Rank(borda) is computed based on the Borda count, where each task is treated as a preference voter, which gives votes on the models in accordance with their relative performance on the task. The best model obtains the highest number of votes. The model with the highest number of votes across tasks obtains the highest rank. The Borda rank tends to prefer models that perform well broadly across tasks. However, given that it is a rank, it can be unclear whether two models perform similarly.

So models can get a Borda rank even if they haven't been run on all tasks, and a model can rank higher on Borda count than on mean performance, because a good Borda rank requires balanced performance across all tasks.

The problem with just looking at the mean is that if a model is good at a task with high variance in scores but among the worst on another task with low variance, it can still get a really high mean, even though it might at best be mediocre.
Borda rank requires models to be good on all tasks. Not necessarily the best on all, but good across the board.

It might be a bit unintuitive, but as far as I can tell things work as they were intended to work.
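
To make this concrete, here is a rough sketch of a Borda-style tally (illustrative only, not the leaderboard's actual implementation), assuming a models-by-tasks DataFrame of raw scores where a missing result counts as the worst score on that task:

```python
import numpy as np
import pandas as pd

# Hypothetical scores: task_a has much higher variance than task_b / task_c.
scores = pd.DataFrame(
    {
        "task_a": [0.90, 0.40, 0.45],
        "task_b": [0.50, 0.55, 0.56],
        "task_c": [0.50, 0.56, 0.55],
    },
    index=["model_1", "model_2", "model_3"],
)

# Each task acts as a voter: the best model on a task gets the most points.
# Missing results (NaN) are treated as the worst score, i.e. 0 points.
points = scores.fillna(-np.inf).rank(axis=0, ascending=True) - 1

borda_votes = points.sum(axis=1).sort_values(ascending=False)
print(borda_votes)          # model_3: 4, model_2: 3, model_1: 2
print(scores.mean(axis=1))  # model_1 has the highest mean (~0.633) despite the lowest Borda count
```

In this toy example model_1 tops the mean purely because of the high-variance task, while the Borda count prefers model_3, which is solid across the board.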

@Samoed
Collaborator Author

Samoed commented Jan 15, 2025

Thanks! I missed that. Maybe it should be renamed to something less confusing, like voting rank. Also, I think it would be helpful to add a column that enumerates the current order, placing it first, and making the borda rank second.

@x-tabdeveloping
Collaborator

That's how it used to be, but then @KennethEnevoldsen and I agreed to only keep one of them. We can start a discussion about this and tag people if you think that would be a good idea. I personally think this is a good solution.

@isaac-chung
Collaborator

Borda count is used in the paper and IMO is the clearest option (it even comes with its own wiki page!)

KennethEnevoldsen changed the title from "Leaderboard: Ranks computed incorrectly" to "Leaderboard: What metric should we sort on?" on Jan 16, 2025
@KennethEnevoldsen
Contributor

Borda rank has good conceptual backing. I don't think it is the best metric (by any stretch), but if we want to rank models it is better than the mean (singular tasks with more variance in score can overly influence the mean). However, it is non-continuous, making it hard to compare models (e.g., would work horribly in the viz.).

I think in general having more than one metric is a good default. For ranking (out of the measures that we have) I believe Borda is the best approach, and the user is free to sort by the mean if they want.

Better approaches could be built using e.g. per-sample information; however, we don't have that information.

@KennethEnevoldsen
Contributor

> I just opened this leaderboard without any additional sorting, and the model with Rank 1 has NaN as its mean score. I don't understand what the rank means in that case.

We could have the Borda rank produce NaN in case the results are not complete, but NaN would essentially just be 0 points for the model for that task (@x-tabdeveloping, is this true? I could imagine the sort might put NaN at the top, which would be a problem).

@x-tabdeveloping
Collaborator

No, NaN amounts to 0 as far as I know.
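
For reference, a tiny sketch (same toy tally as above, not the actual leaderboard code) showing that a NaN score simply contributes 0 points for that task, so a missing result alone cannot push a model to the top of the Borda sort:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {"task_a": [0.90, 0.85, np.nan], "task_b": [0.78, 0.75, 0.80]},
    index=["model_a", "model_b", "model_c"],
)

# NaN is treated as the worst score on that task, i.e. 0 points.
points = scores.fillna(-np.inf).rank(axis=0, ascending=True) - 1
print(points.sum(axis=1))
# model_a    3.0  (best on task_a, second on task_b)
# model_b    1.0
# model_c    2.0  (0 points for the missing task_a, best on task_b)
```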
