fix: Add Touche2020v3 and JMTEB #1262

Samoed · 2024-09-29T19:23:06Z

Checklist

Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition: #749 #1170

I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
- facebook/contriever
  - Touche2020v3.json
- intfloat/multilingual-e5-small
  JMTEB leaderboard for comparison. I ran MrTidyRetrieval only for Bengali, as the Japanese version has 7M documents to encode. Overall, the results are similar; the main difference is that JMTEB doesn't include titles.
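As a rough illustration of the command template quoted above, the following sketch expands `mteb -m {model_name} -t {task_name}` into one command per model/task pair. The helper `build_commands` is hypothetical and not part of mteb, and the task name used here is assumed to match the name registered in the library.

```python
import shlex

# Command template quoted in the checklist text.
COMMAND_TEMPLATE = "mteb -m {model_name} -t {task_name}"


def build_commands(model_names, task_names):
    """Return one shell-safe command string per (model, task) pair.

    Hypothetical helper for illustration only; it is not part of mteb.
    """
    commands = []
    for model_name in model_names:
        for task_name in task_names:
            commands.append(
                COMMAND_TEMPLATE.format(
                    model_name=shlex.quote(model_name),
                    task_name=shlex.quote(task_name),
                )
            )
    return commands


# Model name taken from the checklist; the task name is an assumption
# and may differ from the identifier registered in mteb.
models = ["intfloat/multilingual-e5-small"]
tasks = ["MrTidyRetrieval"]

for command in build_commands(models, tasks):
    print(command)
```

The sketch only prints the commands; running them is left to the shell so that a long encoding job is never started by accident.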
I have checked that the performance is neither trivial (both models obtain close to perfect scores) nor random (both models obtain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
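The subsampling item in the checklist can be illustrated with a minimal, standard-library sketch of stratified subsampling. This is not mteb's `self.stratified_subsampling()` implementation; the function name, its parameters, and the example data are assumptions made for illustration. The idea is to shrink a labelled dataset to roughly a target size while keeping each label's share about the same.

```python
import random
from collections import defaultdict


def stratified_subsample(examples, label_key, n_samples, seed=42):
    """Sample roughly n_samples examples while preserving label proportions.

    Illustrative sketch only; not the mteb implementation.
    """
    if len(examples) <= n_samples:
        return list(examples)

    # Group examples by their label.
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[label_key]].append(example)

    rng = random.Random(seed)
    fraction = n_samples / len(examples)
    sampled = []
    # Iterate labels in a fixed order so the result is reproducible.
    for label in sorted(by_label, key=str):
        group = by_label[label]
        # Keep at least one example per label so no class disappears.
        k = max(1, round(len(group) * fraction))
        sampled.extend(rng.sample(group, k))
    return sampled


# Hypothetical dataset: 10,000 examples spread evenly over 4 labels.
data = [{"text": f"doc {i}", "label": i % 4} for i in range(10_000)]
subset = stratified_subsample(data, "label", n_samples=2048)
print(len(subset))
```

Because each label's sample size is rounded independently, the total can differ slightly from `n_samples` when the label counts do not divide evenly.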

KennethEnevoldsen

Wonderful, thanks for doing this!

We should also add JMTEB to the benchmarks.py file.

mteb/tasks/Retrieval/eng/Touche2020Retrieval.py

Co-authored-by: Kenneth Enevoldsen <[email protected]>

Samoed · 2024-10-03T12:10:20Z

@KennethEnevoldsen the runner hung. I found this issue in the Python setup action: actions/setup-python#806

* add datasets
* fix metrics
* add Touche2020v3
* fix metadata
* Apply suggestions from code review
  Co-authored-by: Kenneth Enevoldsen <[email protected]>
* upd name and supress
* add benchmark class

---------

Co-authored-by: Kenneth Enevoldsen <[email protected]>

Samoed and others added 6 commits September 29, 2024 15:59

add datasets

948dd47

Merge branch 'embeddings-benchmark:main' into add_datasets

48a324d

fix metrics

ce5c746

Merge branch 'embeddings-benchmark:main' into add_datasets

c752de7

add Touche2020v3

456df82

fix metadata

38c0b56

isaac-chung changed the title ~~Add Touche2020v3 and JMTEB~~ fix: Add Touche2020v3 and JMTEB Sep 29, 2024

KennethEnevoldsen mentioned this pull request Oct 3, 2024

Add mteb(eng), mteb(europe) and mteb(multilingual) #1273

Open

KennethEnevoldsen requested changes Oct 3, 2024

mteb/tasks/Retrieval/eng/Touche2020Retrieval.py (outdated, resolved)

mteb/tasks/Retrieval/eng/Touche2020Retrieval.py (outdated, resolved)

mteb/tasks/Retrieval/eng/Touche2020Retrieval.py (resolved)

Samoed and others added 3 commits October 3, 2024 14:15

Apply suggestions from code review

fbe737e

Co-authored-by: Kenneth Enevoldsen <[email protected]>

upd name and supress

1a56a46

add benchmark class

049c914

Samoed requested a review from KennethEnevoldsen October 3, 2024 11:43

Samoed mentioned this pull request Oct 3, 2024

[EVAL REQUEST] jina-embeddings-v3 sbintuitions/JMTEB#77

Open

17 tasks

KennethEnevoldsen approved these changes Oct 3, 2024

KennethEnevoldsen enabled auto-merge (squash) October 3, 2024 11:57

KennethEnevoldsen merged commit 5074918 into embeddings-benchmark:main Oct 3, 2024
9 checks passed

Samoed deleted the add_datasets branch October 3, 2024 13:02
