
Improve belebele retrieval task #894

Conversation

akshita-sukhlecha
Contributor

This enhances the Belebele retrieval task, making it cross-lingual and adding all the language scripts. It will support 122 language variants (covering 115 distinct languages and their scripts).

Checklist

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
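The stratified_subsampling() helper mentioned in the checklist shrinks an oversized split while keeping the label distribution intact. The core idea can be sketched in plain Python (the function name, parameters, and data below are illustrative, not mteb's actual API):

```python
import random
from collections import defaultdict

def stratified_subsample(rows, label_key, n_samples, seed=42):
    """Downsample rows to roughly n_samples, preserving label proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    rng = random.Random(seed)
    total = len(rows)
    sampled = []
    for group in by_label.values():
        # Allocate samples to each label proportionally to its frequency.
        k = max(1, round(n_samples * len(group) / total))
        sampled.extend(rng.sample(group, min(k, len(group))))
    return sampled

rows = [{"label": i % 4, "text": f"example {i}"} for i in range(10_000)]
subset = stratified_subsample(rows, "label", 2048)
```

Fixing the seed keeps the subsample reproducible across runs, which matters when the reported scores are compared between PRs.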

Contributor

@KennethEnevoldsen left a comment


I don't have too much experience with cross-lingual tasks (@imenelydiaker do you have time for this?)

But I would definitely say that you should:

  1. Create a new version of the task instead of replacing it, and then add superseded_by to the old dataset (you can find an example in the arxivClustering tasks)
  2. Compare the runtime of the old and new datasets (how much additional time is added)
  3. Compare the scores of the old and new datasets (does it make a difference in model ranking)
  4. See the comment on potentially not creating a full grid of comparisons

Comment on lines 135 to 144
def get_lang_pairs() -> dict[str, list[str]]:
    # add all possible language pairs
    lang_pairs = {}
    for x in _LANGUAGES:
        for y in _LANGUAGES:
            xx = x.replace("_", "-")
            yy = y.replace("_", "-")
            pair = f"{xx}-{yy}"
            lang_pairs[pair] = [xx, yy]
    return lang_pairs
Contributor


All possible pairs might be unreasonable as some pairs are in practice never seen. I believe we have another dataset (Bible I believe), which created a meaningful subset.
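One way to build such a meaningful subset, for example keeping only pairs that involve English plus each language paired with itself, could look like this (the _LANGUAGES values and the English-pivot policy are illustrative, not necessarily what the PR ultimately adopted):

```python
# Illustrative subset of Belebele language codes.
_LANGUAGES = ["eng_Latn", "fra_Latn", "arb_Arab", "acm_Arab"]
PIVOT = "eng_Latn"

def get_lang_pairs() -> dict[str, list[str]]:
    """Keep monolingual pairs and pairs involving the pivot language,
    instead of the full NxN grid."""
    lang_pairs = {}
    for x in _LANGUAGES:
        for y in _LANGUAGES:
            if x == y or PIVOT in (x, y):
                lang_pairs[f"{x}-{y}"] = [x, y]
    return lang_pairs

pairs = get_lang_pairs()
```

With N languages this yields roughly 3N pairs instead of N², which keeps the evaluation runtime manageable.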

@imenelydiaker imenelydiaker self-assigned this Jun 10, 2024
@akshita-sukhlecha
Contributor Author

Thanks for reviewing! I have pushed the requested changes.

  1. Create a new version (superseded) task: Done
  2. & 3. Compare runtime & scores: Runtime and scores remain the same when providing a language to the old task and a pair with the same source and target language to the new task.
  4. Do not create a full grid of comparisons: Done

Contributor

@imenelydiaker left a comment


Hello, thanks for this work! Some questions:

  • How do you match a question in language XX with a passage in language YY? The order of samples on HF is not the same from one language to another (I manually checked Arabic and French and found that the first sample of each language is not the same). I can't see any IDs in the dataset to identify them.
  • Why not update the old task instead of creating a new one? We'd have something like MLQA.

@KennethEnevoldsen
Contributor

Why not update the old task instead of creating a new one? We'd have something like MLQA.

Ah right, I suspected that the dataset had changed due to the changes in the dataset transform. I see that it does not. We can keep it as one.

@akshita-sukhlecha
Contributor Author

@imenelydiaker

  1. How do you match a question in language XX with a passage in language YY? - Using the link field. A link corresponds to a particular passage.

  2. Sure, I'll make the changes to update the old task itself.
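A minimal sketch of that link-based alignment, using toy rows (the field names link, question, and flores_passage follow the Belebele dataset on HF; the values are made up):

```python
# Each Belebele row carries a `link` identifying its source passage, so a
# question in language XX can be matched to the passage with the same link
# in language YY, even though row order differs between languages.
rows_xx = [
    {"link": "https://example.org/p1", "question": "Q1 in XX"},
    {"link": "https://example.org/p2", "question": "Q2 in XX"},
]
rows_yy = [
    {"link": "https://example.org/p2", "flores_passage": "Passage 2 in YY"},
    {"link": "https://example.org/p1", "flores_passage": "Passage 1 in YY"},
]

# Index target-language passages by link, then align questions to them.
passage_by_link = {r["link"]: r["flores_passage"] for r in rows_yy}
aligned = [(r["question"], passage_by_link[r["link"]]) for r in rows_xx]
```

This makes the alignment independent of row order on HF, which is exactly the concern raised above about Arabic and French.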

@akshita-sukhlecha
Contributor Author

@imenelydiaker I have one question about how to represent the language pairs: should we use a hyphen or an underscore between the language and script codes?
Should it be "acm-Arab-eng-Latn" or "acm_Arab-eng_Latn"?

  • For languages, "eng-Latn" (hyphen) has been used in the repo.
  • For language pairs in Bitext-mining multilingual tasks, "acm_Arab-eng_Latn" (underscore) has been used.

@imenelydiaker
Contributor

  • For language pairs in Bitext-mining multilingual tasks, "acm_Arab-eng_Latn" (underscore) has been used

I'd advise using the same standard as the FloresBitextMining task, so the second option: "acm_Arab-eng_Latn".
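With that convention the underscore stays inside each lang_script code and the hyphen only separates the two codes, so unlike the earlier draft the codes are not rewritten. A tiny illustrative sketch:

```python
# Pair naming per the FloresBitextMining convention: keep "_" inside each
# lang_script code, join the two codes with "-". Function name is illustrative.
def pair_name(source: str, target: str) -> str:
    return f"{source}-{target}"

name = pair_name("acm_Arab", "eng_Latn")
```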

Contributor

@imenelydiaker left a comment


LGTM! Thank you! Ready for merging?

@akshita-sukhlecha
Contributor Author

@imenelydiaker Yes, please go ahead and merge it. (It's showing that some tests have failed, but they are related to different tasks. I don't know what to do in this case.)

@imenelydiaker
Copy link
Contributor

@imenelydiaker Yes, please go ahead and merge it. (It's showing that some tests have failed, but they are related to different tasks. I don't know what to do in this case.)

Hey @akshita-sukhlecha can you please update your branch?

@isaac-chung @KennethEnevoldsen do you know why the CI is failing? Should we merge either way?

@isaac-chung isaac-chung merged commit 3137f96 into embeddings-benchmark:main Jun 23, 2024
7 checks passed