
Improve belebele retrieval task #894

Conversation

akshita-sukhlecha
Contributor

This enhances the Belebele retrieval task, making it cross-lingual and adding all the language scripts. It will support 122 language variants (covering 115 distinct languages and their scripts).

Checklist

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
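The stratified_subsampling() helper mentioned in the checklist shrinks an oversized split while keeping the label distribution intact. The core idea can be sketched in plain Python (the function name, parameters, and data below are illustrative, not mteb's actual API):

```python
import random
from collections import defaultdict

def stratified_subsample(rows, label_key, n_samples, seed=42):
    """Downsample rows to roughly n_samples, preserving label proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    rng = random.Random(seed)
    total = len(rows)
    sampled = []
    for group in by_label.values():
        # Allocate samples to each label proportionally to its frequency.
        k = max(1, round(n_samples * len(group) / total))
        sampled.extend(rng.sample(group, min(k, len(group))))
    return sampled

rows = [{"label": i % 4, "text": f"example {i}"} for i in range(10_000)]
subset = stratified_subsample(rows, "label", 2048)
```

Fixing the seed keeps the subsample reproducible across runs, which matters when the reported scores are compared between PRs.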

Contributor

@KennethEnevoldsen left a comment


I don't have too much experience with cross-lingual tasks (@imenelydiaker do you have time for this?)

But I would definitely say that you should:

  1. Create a new version of the task instead of replacing it, and then add superseded_by to the old dataset (you can find an example in the arxivClustering tasks)
  2. Compare the runtime of the old and new datasets (how much additional time is added)
  3. Compare the scores of the old and new datasets (does it make a difference in model ranking)
  4. See the comment on potentially not creating a full grid of comparisons

Comment on lines 135 to 144
def get_lang_pairs() -> dict[str, list[str]]:
    # add all possible language pairs
    lang_pairs = {}
    for x in _LANGUAGES:
        for y in _LANGUAGES:
            xx = x.replace("_", "-")
            yy = y.replace("_", "-")
            pair = f"{xx}-{yy}"
            lang_pairs[pair] = [xx, yy]
    return lang_pairs
Contributor


All possible pairs might be unreasonable as some pairs are in practice never seen. I believe we have another dataset (Bible I believe), which created a meaningful subset.
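One way to build such a meaningful subset, for example keeping only pairs that involve English plus each language paired with itself, could look like this (the _LANGUAGES values and the English-pivot policy are illustrative, not necessarily what the PR ultimately adopted):

```python
# Illustrative subset of Belebele language codes.
_LANGUAGES = ["eng_Latn", "fra_Latn", "arb_Arab", "acm_Arab"]
PIVOT = "eng_Latn"

def get_lang_pairs() -> dict[str, list[str]]:
    """Keep monolingual pairs and pairs involving the pivot language,
    instead of the full NxN grid."""
    lang_pairs = {}
    for x in _LANGUAGES:
        for y in _LANGUAGES:
            if x == y or PIVOT in (x, y):
                lang_pairs[f"{x}-{y}"] = [x, y]
    return lang_pairs

pairs = get_lang_pairs()
```

With N languages this yields roughly 3N pairs instead of N², which keeps the evaluation runtime manageable.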

@imenelydiaker imenelydiaker self-assigned this Jun 10, 2024
@akshita-sukhlecha
Contributor Author

Thanks for reviewing! I have pushed the requested changes.

  1. Create a new version (superseded) task: Done
  2. & 3. Compare runtime & scores: Runtime and scores remain the same when providing a language to the old task and a pair with the same source and target language to the new task.
  4. Do not create a full grid of comparisons: Done

Contributor

@imenelydiaker left a comment


Hello, thanks for this work! Some questions:

  • How do you match a question in language XX with a passage in language YY? The order of samples on HF is not the same from one language to another (I manually checked Arabic and French and found that the first sample of each language is not the same). I can't see any IDs in the dataset to identify them.
  • Why not update the old task instead of creating a new one? We'd have something like MLQA.

@KennethEnevoldsen
Contributor

Why not update the old task instead of creating a new one? We'd have something like MLQA.

Ah right, I suspected that the dataset had changed due to the changes in the dataset transform. I see that it does not. We can keep it as one.

@akshita-sukhlecha
Contributor Author

@imenelydiaker

  1. How do you match a question in language XX with a passage in language YY? - Using the link field. A link corresponds to a particular passage.

  2. Sure, I'll make the changes to update the old task itself.
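A minimal sketch of that link-based alignment, using toy rows (the field names link, question, and flores_passage follow the Belebele dataset on HF; the values are made up):

```python
# Each Belebele row carries a `link` identifying its source passage, so a
# question in language XX can be matched to the passage with the same link
# in language YY, even though row order differs between languages.
rows_xx = [
    {"link": "https://example.org/p1", "question": "Q1 in XX"},
    {"link": "https://example.org/p2", "question": "Q2 in XX"},
]
rows_yy = [
    {"link": "https://example.org/p2", "flores_passage": "Passage 2 in YY"},
    {"link": "https://example.org/p1", "flores_passage": "Passage 1 in YY"},
]

# Index target-language passages by link, then align questions to them.
passage_by_link = {r["link"]: r["flores_passage"] for r in rows_yy}
aligned = [(r["question"], passage_by_link[r["link"]]) for r in rows_xx]
```

This makes the alignment independent of row order on HF, which is exactly the concern raised above about Arabic and French.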

@akshita-sukhlecha
Contributor Author

@imenelydiaker I have one question about how to represent the language pairs: should we use a hyphen or an underscore between the language and script codes?
Should it be "acm-Arab-eng-Latn" or "acm_Arab-eng_Latn"?

  • For languages, "eng-Latn" (hyphen) has been used in the repo.
  • For language pairs in Bitext-mining multilingual tasks, "acm_Arab-eng_Latn" (underscore) has been used.

@imenelydiaker
Contributor

  • For language pairs in Bitext-mining multilingual tasks, "acm_Arab-eng_Latn" (underscore) has been used

I'd advise using the same standard as the FloresBitextMining task, so the second option: "acm_Arab-eng_Latn".
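With that convention the underscore stays inside each lang_script code and the hyphen only separates the two codes, so unlike the earlier draft the codes are not rewritten. A tiny illustrative sketch:

```python
# Pair naming per the FloresBitextMining convention: keep "_" inside each
# lang_script code, join the two codes with "-". Function name is illustrative.
def pair_name(source: str, target: str) -> str:
    return f"{source}-{target}"

name = pair_name("acm_Arab", "eng_Latn")
```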

Contributor

@imenelydiaker left a comment


LGTM! Thank you! Ready for merging?

@akshita-sukhlecha
Contributor Author

@imenelydiaker Yes, please go ahead and merge it. (It's showing that some tests have failed, but they are related to different tasks. I don't know what to do in this case.)

@imenelydiaker
Copy link
Contributor

@imenelydiaker Yes, please go ahead and merge it. (It's showing that some tests have failed, but they are related to different tasks. I don't know what to do in this case.)

Hey @akshita-sukhlecha can you please update your branch?

@isaac-chung @KennethEnevoldsen do you know why the CI is failing? Should we merge either way?

@isaac-chung isaac-chung merged commit 3137f96 into embeddings-benchmark:main Jun 23, 2024
7 checks passed