Improve belebele retrieval task #894
Conversation
I don't have too much experience with cross-lingual tasks (@imenelydiaker do you have time for this?)
But I would definitely say that you should:
- Create a new version of the task instead of replacing it, and add `superseded_by` to the old dataset (you can find an example in the arxivClustering tasks)
- Compare the runtime of the old and new datasets (how much additional time is added)
- Compare the scores of the old and new datasets (does it make a difference in model ranking)
- See the comment on potentially not creating a full grid of comparisons
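The versioning pattern requested above could be sketched roughly as follows. This is only an illustration of the idea (keep the old task and point it at its successor rather than replacing it in place); the class names and the exact shape of the attribute are assumptions, not the precise mteb API:

```python
# Illustrative sketch only, not the exact mteb task classes.
class BelebeleRetrieval:
    # Old task is kept so existing results stay reproducible,
    # but it declares which task supersedes it.
    superseded_by = "BelebeleRetrieval.v2"


class BelebeleRetrievalV2:
    # Current version of the task; nothing supersedes it yet.
    superseded_by = None
```

The benefit of this over in-place replacement is that scores computed against the old task remain comparable, while tooling can warn users to migrate.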
```python
def get_lang_pairs() -> dict[str, list[str]]:
    # add all possible language pairs
    lang_pairs = {}
    for x in _LANGUAGES:
        for y in _LANGUAGES:
            xx = x.replace("_", "-")
            yy = y.replace("_", "-")
            pair = f"{xx}-{yy}"
            lang_pairs[pair] = [xx, yy]
    return lang_pairs
```
All possible pairs might be unreasonable, as some pairs are in practice never seen. I believe we have another dataset (the Bible, I believe) which created a meaningful subset.
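One way to avoid the full N×N grid, sketched here under the assumption that an English-centric subset is acceptable (the `_LANGUAGES` list below is a small illustrative subset, not the task's real 122-variant list):

```python
# Illustrative subset of language codes; the real task has 122 variants.
_LANGUAGES = ["acm_Arab", "eng_Latn", "fra_Latn"]


def get_lang_pairs() -> dict[str, list[str]]:
    # Only keep pairs in which one side is English, instead of
    # enumerating every possible combination.
    lang_pairs = {}
    for x in _LANGUAGES:
        for y in _LANGUAGES:
            if x == y or "eng_Latn" not in (x, y):
                continue
            xx = x.replace("_", "-")
            yy = y.replace("_", "-")
            lang_pairs[f"{xx}-{yy}"] = [xx, yy]
    return lang_pairs
```

This cuts the pair count from N² to roughly 2N, which directly addresses the runtime concern raised above.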
Thanks for reviewing! I have pushed the requested changes.
Hello, thanks for this work! Some questions:
- How do you match a question in language XX with a passage in language YY? The order of samples on HF is not the same from one language to another (I manually checked Arabic and French and found that the first sample of each language is not the same). I can't see any IDs in the dataset to identify them.
- Why not update the old task instead of creating a new one? We'd have something like MLQA.
Ah right, I suspected that the dataset had changed due to the changes in the dataset transform. I see that it does not. We can keep it as one.
@imenelydiaker I have one question on how to represent the language-pairs - usage of hyphen vs underscore between language and script code.
I'd advise using the same standard as the FloresBitextMining task, so the second option: "acm_Arab-eng_Latn".
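Under that convention, the pair-building helper would keep the underscore between the language and script codes and join the two sides with a hyphen. A minimal sketch (the helper name and signature are illustrative, not the task's actual code):

```python
def get_lang_pairs(languages: list[str]) -> dict[str, list[str]]:
    # Keep "lang_Script" codes intact and join the two sides with a
    # hyphen, matching the FloresBitextMining naming convention,
    # e.g. "acm_Arab-eng_Latn".
    return {
        f"{x}-{y}": [x, y]
        for x in languages
        for y in languages
        if x != y
    }


pairs = get_lang_pairs(["acm_Arab", "eng_Latn"])
# e.g. the key "acm_Arab-eng_Latn" maps to ["acm_Arab", "eng_Latn"]
```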
LGTM! Thank you! Ready for merging?
@imenelydiaker Yes, please go ahead and merge it. (It shows that some tests have failed, but they are related to different tasks. I don't know what to do in this case.)
Hey @akshita-sukhlecha, can you please update your branch? @isaac-chung @KennethEnevoldsen do you know why the CI is failing? Should we merge either way?
This enhances the Belebele retrieval task, making it cross-lingual and adding all the language scripts. It will support 122 language variants (including 115 distinct languages and their scripts).
Checklist
- Ran the task with the `mteb -m {model_name} -t {task_name}` command, using the models `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` and `intfloat/multilingual-e5-small`.
- Subsampled the dataset using `self.stratified_subsampling()` under `dataset_transform()`.
- Ran `make test`.
- Ran `make lint`.