Leaderboard: What metric should we sort on? #1813
Comments
Are you certain that these are incorrect? It could very well be the case that the Borda rank doesn't match rankings based on the mean.
Yes, that is the case, and I think this is a bit strange.
So what's the conclusion? Is this a bug? As far as I can tell it works as intended.
I just opened this leaderboard without any additional sorting, and the model with Rank 1 has
I'm copying this from the explanation under the leaderboard:
So models can get a Borda rank even if they haven't been run on all tasks, and a model can rank higher on Borda count than based on mean performance, because it needs to balance out performance on all tasks. The problem with just looking at the mean is that if a model is good at a task that has high variance in scores, but is one of the worst on another task with low variance, it can still get a really high mean, even though it might be at best mediocre. It might be a bit unintuitive, but as far as I can tell things work as they were intended to work.
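To make the difference concrete, here is a minimal sketch (made-up scores and a simple pandas-based Borda computation, not the leaderboard's actual code) of how a model can top the mean-based ranking while placing last on Borda count:

```python
import pandas as pd

# Hypothetical per-task scores: model_a wins big on one high-variance task but
# is worst on the other two; model_b and model_c are more consistent.
scores = pd.DataFrame(
    {
        "high_variance_task": [95.0, 70.0, 60.0],
        "task_2": [50.0, 55.0, 56.0],
        "task_3": [48.0, 54.0, 53.0],
    },
    index=["model_a", "model_b", "model_c"],
)

# Mean ranking: the single inflated score makes model_a look best overall.
mean_rank = scores.mean(axis=1).rank(ascending=False)

# Borda count: on each task a model earns one point per model it beats,
# and the points are summed across tasks.
borda_points = scores.rank(axis=0, ascending=True).sub(1).sum(axis=1)
borda_rank = borda_points.rank(ascending=False)

summary = pd.DataFrame(
    {
        "mean": scores.mean(axis=1),
        "mean_rank": mean_rank,
        "borda_points": borda_points,
        "borda_rank": borda_rank,
    }
)
print(summary)
# model_a is rank 1 by mean but rank 3 by Borda count, because its advantage
# comes from a single task while it loses to everyone on the other two.
```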
Thanks! I missed that. Maybe it should be renamed to something less confusing, like
That's how it used to be, but then we agreed with @KennethEnevoldsen to only keep one of them. We can start a discussion about this and tag people if you think that would be a good idea. I personally think this is a good solution.
Borda count is used in the paper and IMO is the clearest option (it even comes with its own wiki page!).
Borda rank has good conceptual backing. I don't think it is the best metric (by any stretch), but if we want to rank models it is better than the mean (singular tasks with more variance in score can overly influence the mean). However, it is non-continuous, making it hard to compare models (e.g., it would work horribly in the viz.). I think in general having more than one metric is a good default. For ranking (out of the measures that we have) I believe Borda is the best approach, and the user is free to sort by the mean if they want. Better approaches could be built using, e.g., per-sample information; however, we don't have that information.
We could have the Borda rank produce NaN in case the results are not complete, but NaN would essentially just be 0 points for the model on that task (@x-tabdeveloping is this true? I could imagine the sort might put NaN at the top, which would be a problem).
No, NaN amounts to 0 as far as I know.
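For reference, a small sketch (hypothetical scores, using the same simple pandas-based Borda computation as above, not the leaderboard's actual code) of why a NaN result contributes 0 points, and how to keep NaN rows away from the top when sorting a column:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {
        "task_1": [70.0, 65.0, 60.0],
        "task_2": [np.nan, 55.0, 50.0],  # model_a was never run on task_2
    },
    index=["model_a", "model_b", "model_c"],
)

# rank() leaves the NaN as NaN, and sum() skips NaN by default, so the missing
# result contributes 0 points to model_a's Borda total.
borda_points = scores.rank(axis=0, ascending=True).sub(1).sum(axis=1)
print(borda_points)

# When sorting a column that can contain NaN, pin NaN rows to the bottom so an
# unscored model cannot end up at the top of the table.
print(scores.sort_values("task_2", ascending=False, na_position="last"))
```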
I think that non-default benchmarks have sorting problems, e.g. MTEB(classic, eng).
http://mteb-leaderboard-2-demo.hf.space/?benchmark_name=MTEB%28eng%2C+classic%29
I think that the ranking is not updating.