Leaderboard: What metric should we sort on? #1813
Comments
Are you certain that these are incorrect? It could very well be the case that the Borda rank doesn't match rankings based on the mean.
Yes, that is the case, and I think this is a bit strange.
So what's the conclusion? Is this a bug? As far as I can tell it works as intended.
I just opened this leaderboard without any additional sorting, and the model with Rank 1 has
I'm copying this from the explanation under the leaderboard:
So models can get a Borda rank even if they haven't been run on all tasks, and a model can rank higher on Borda count than based on mean performance, because it needs to balance out performance on all tasks. The problem with just looking at the mean is that if a model is good at a task that has high variance in scores, but is one of the worst on another task with low variance, it can still get a really high mean, even though it might be at best mediocre. It might be a bit unintuitive, but as far as I can tell things work as they were intended to work.
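To make the difference concrete, here is a minimal sketch (made-up scores and a simple pandas-based Borda computation, not the leaderboard's actual code) of how a model can top the mean-based ranking while placing last on Borda count:

```python
import pandas as pd

# Hypothetical per-task scores: model_a wins big on one high-variance task but
# is worst on the other two; model_b and model_c are more consistent.
scores = pd.DataFrame(
    {
        "high_variance_task": [95.0, 70.0, 60.0],
        "task_2": [50.0, 55.0, 56.0],
        "task_3": [48.0, 54.0, 53.0],
    },
    index=["model_a", "model_b", "model_c"],
)

# Mean ranking: the single inflated score makes model_a look best overall.
mean_rank = scores.mean(axis=1).rank(ascending=False)

# Borda count: on each task a model earns one point per model it beats,
# and the points are summed across tasks.
borda_points = scores.rank(axis=0, ascending=True).sub(1).sum(axis=1)
borda_rank = borda_points.rank(ascending=False)

summary = pd.DataFrame(
    {
        "mean": scores.mean(axis=1),
        "mean_rank": mean_rank,
        "borda_points": borda_points,
        "borda_rank": borda_rank,
    }
)
print(summary)
# model_a is rank 1 by mean but rank 3 by Borda count, because its advantage
# comes from a single task while it loses to everyone on the other two.
```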
Thanks! I missed that. Maybe it should be renamed to something less confusing, like
That's how it used to be, but then we agreed with @KennethEnevoldsen to only keep one of them. We can start a discussion about this and tag people if you think that would be a good idea. I personally think this is a good solution.
Borda count is used in the paper and IMO is the clearest option (it even comes with its own wiki page!).
Borda rank has good conceptual backing. I don't think it is the best metric (by any stretch), but if we want to rank models it is better than the mean (singular tasks with more variance in score can overly influence the mean). However, it is non-continuous, making it hard to compare models (e.g., it would work horribly in the viz.). I think in general having more than one metric is a good default. For ranking (out of the measures that we have) I believe Borda is the best approach, and the user is free to sort by the mean if they want. Better approaches could be built using, e.g., per-sample information; however, we don't have that information.
We could have the Borda rank produce NaN in case the results are not complete, but NaN would essentially just be 0 points for the model on that task (@x-tabdeveloping is this true? I could imagine the sort might put NaN at the top, which would be a problem).
No, NaN amounts to 0 as far as I know.
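For reference, a small sketch (hypothetical scores, using the same simple pandas-based Borda computation as above, not the leaderboard's actual code) of why a NaN result contributes 0 points, and how to keep NaN rows away from the top when sorting a column:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {
        "task_1": [70.0, 65.0, 60.0],
        "task_2": [np.nan, 55.0, 50.0],  # model_a was never run on task_2
    },
    index=["model_a", "model_b", "model_c"],
)

# rank() leaves the NaN as NaN, and sum() skips NaN by default, so the missing
# result contributes 0 points to model_a's Borda total.
borda_points = scores.rank(axis=0, ascending=True).sub(1).sum(axis=1)
print(borda_points)

# When sorting a column that can contain NaN, pin NaN rows to the bottom so an
# unscored model cannot end up at the top of the table.
print(scores.sort_values("task_2", ascending=False, na_position="last"))
```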
I think that non-default benchmarks have sorting problems, e.g. MTEB(classic, eng).
http://mteb-leaderboard-2-demo.hf.space/?benchmark_name=MTEB%28eng%2C+classic%29
I think that the ranking is not updating.