Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: (1) Add StatcanDialogueDatasetRetrieval (2) Fix DRESModel.encode_conversations to allow list of dictionaries #779

Merged
merged 16 commits into from
May 27, 2024

Conversation

xhluca
Copy link
Contributor

@xhluca xhluca commented May 21, 2024

This pull request introduces two changes:

  1. New dataset: StatcanDialogueDatasetRetrieval
  2. Fix: Change DRESModel.encode_conversations to allow conversations composed of list of dictionaries, alongside lists of strings (e.g. Topicoqa). Also fix batch_size parameter in DRESModel.encode_conversations (b5141e8)

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), considering using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

Other

Notes

I have tested that the dataset runs with the mteb package.

To test it, I have used the following code:

from mteb import MTEB
from mteb.evaluation.evaluators import DRESModel


class DummyModel:
    def encode(self, texts, *args, **kwargs):
        return texts


evaluation = MTEB(tasks=["StatcanDialogueDatasetRetrieval"])
evaluation.load_tasks_data()
sdd = evaluation.tasks[0]

q = sdd.queries["french"]["test"]["Q12210"]

doc_ids = list(sdd.relevant_docs["french"]["test"]["Q12210"].keys())
answer_1 = sdd.corpus["french"]["test"][doc_ids[0]]
answer_2 = sdd.corpus["french"]["test"][doc_ids[1]]


dres = DRESModel(model=DummyModel())

assert isinstance(q, list) is True, "Query is not a list"

encoded = dres.encode_conversations([q], batch_size=1)
print(encoded[0])
# user: bonjour; operator: Bonjour, je m'appelle Kelly C. Comment puis-je vous aider?; user: je cherches des données sur le secteur du porc canadien; user: je suis vraiment perdu dans le site; operator: Un instant; operator: Veuillez consulter l'hyperlien suivant: Production animale (filtre: porc) (https://www150.statcan.gc.ca/n1/fr/sujets/agriculture_et_alimentati; user: avez vous des données sur les exportations?; operator: un instant; user: en fait je veux mesurer l'importance de ce secteur en termes de production, de revenus, d'exportation et d'emploi.; operator: Activité commerciale internationale pour le code 1 du SH (https://www5.statcan.gc.ca/cimt-cicm/commodities-marchandises?lang=fra&ch +Live+animals+and+animal+products.&refMonth=7&refYr=2020&freq=6&countryId=999&usaState=0&provId=1&dataTransformation=0&searchStr=&monthStr=July); operator: et Activité commerciale internationale pour le code 2 du SH (https://www5.statcan.gc.ca/cimt-cicm/commodities-marchandises?lang=fra& +Live+animals+and+animal+products.&refMonth=7&refYr=2020&freq=6&countryId=999&usaState=0&provId=1&dataTransformation=0&searchStr=&monthStr=July); operator: Pour la production, je vous invite à consulter le premier hyperlien que j'ai partagé. Il y aurait des tableaux comme 32-10-0126-01 (https://w pour les exports, veuillez consulter les hyperliens ci-dessus; operator: pour les revenus et emplois, un instant; user: pour les exports, je vois des données mensuelles; user: est il possible d'avoir des données annuelles, un cumul?; operator: Oui, vous avez l'option de changer la fréquence; operator: Est-ce que vous voyez l'option?; user: exemple pour 2019, je ferais quoi pour avoir les données de toute l'année; operator: Vous devez changer l'année à 2019, et la fréquence à annuel. Vous devez ensuite extraire les données. Exemple: Tableau 980-0002 (https://w lang=fra&getSectionId()=1&dataTransformation=0&refYr=2019&refMonth=7&freq=12&countryId=0&getUsaState()=0&provId=1&retrieve=Extraire&country=null&trad; user: et le mois, je vais choisir quoi; operator: pour le revenu, veuillez consulter le tableau suivant: 32-10-0136-01 (https://www150.statcan.gc.ca/t1/tbl1/fr/tv.action?pid=3210013601&re; operator: Vous pouvez selectionner le mois que vous voulez.; user: super merci; user: pour l'emploi; user: etes vous avec moi?; operator: oui, un instant; user: si je veux connaitre la place du canada dans le monde en matière de porc, en termes de production et d'exportation; operator: Je suis toujours à la recherche de vos données. Merci de votre patience.; user: un gros merci

I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).

Looking at the scores (recall at k):

"recall_at_1": 0.03704,
"recall_at_3": 0.07407,
"recall_at_5": 0.09877,
"recall_at_10": 0.15895,
"recall_at_20": 0.17747,
"recall_at_100": 0.39738,
"recall_at_1000": 0.72222,

we can see that the recall at k from 1 to 1000 consistently, which means it is a fairly challenging task (r@10 is only 0.15) but non-random.

@xhluca
Copy link
Contributor Author

xhluca commented May 21, 2024

@orionw @vaibhavad would love if you can review this PR! There's a few changes i want to add but feel free to leave feedback now if you wish!

@xhluca xhluca marked this pull request as ready for review May 21, 2024 16:07
@orionw
Copy link
Contributor

orionw commented May 21, 2024

The current code looks like a great start -- the remainder todos are just filling in the things marked TBD and adding results/points.

@orionw orionw self-requested a review May 21, 2024 16:21
@vaibhavad vaibhavad self-requested a review May 21, 2024 18:17
@xhluca
Copy link
Contributor Author

xhluca commented May 22, 2024

Seems like language filtering does not work, or i'm running the wrong command:

from mteb import MTEB
evaluation = MTEB(task_langs=["de"])
print(evaluation.available_tasks)  # StatcanDialogueDatasetRetrieval will appear

@vaibhavad
Copy link
Contributor

vaibhavad commented May 22, 2024

@xhluca,

I don't think this is the correct API. It is show all tasks, regardless of category, language etc. Can you try evaluation.print_selected_tasks() instead?

@xhluca
Copy link
Contributor Author

xhluca commented May 22, 2024

Thanks, seems print_selected_tasks works.

@xhluca
Copy link
Contributor Author

xhluca commented May 22, 2024

@vaibhavad @orionw As discussed in this comment, a fixed was needed for DRESModel.encode_conversations (i.e. convert_conv_history_to_query) in order to allow a conversation to be represented as a list of dict rather than a list of string. This format is directly compatible with huggingface-style conversations (see this tutorial). See this commit for the details: b5141e8

Let me know if this fix makes sense (and whether I should tag this PR as bug fix or something else). Let me know if this change to DRESModel.encode_conversations should be classified under something else in 779.json.

@xhluca xhluca changed the title [WIP] Add statcan dialogue dataset [WIP] (1) Add StatcanDialogueDatasetRetrieval (2) Fix DRESModel.encode_conversations to allow list of dictionaries May 22, 2024
@xhluca xhluca changed the title [WIP] (1) Add StatcanDialogueDatasetRetrieval (2) Fix DRESModel.encode_conversations to allow list of dictionaries fix: (1) Add StatcanDialogueDatasetRetrieval (2) Fix DRESModel.encode_conversations to allow list of dictionaries May 22, 2024
Copy link
Contributor

@orionw orionw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have some minor comments, but I like the changes.

No need to tag the PR (not sure we use that?) but definitely add points for the dataset, 1-2 for a bugfix, and also the reviewers.

description="A Dataset for Retrieving Data Tables through Conversations with Genuine Intents, available in English and French.",
dataset={
"path": "McGill-NLP/statcan-dialogue-dataset-retrieval",
"revision": "v1.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the git commit hash of the dataset, if possible.

@xhluca
Copy link
Contributor Author

xhluca commented May 24, 2024

Huh, seems tests failed after i merged changes from main to this branch...

@xhluca xhluca requested review from orionw and vaibhavad May 24, 2024 01:39
@orionw
Copy link
Contributor

orionw commented May 24, 2024

Huh, seems tests failed after i merged changes from main to this branch...

I think those are fixed by #803 if you merge in main again, agree they weren't caused by this PR

Copy link
Contributor

@orionw orionw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the tests pass and @vaibhavad approves, will enable automerge

@xhluca
Copy link
Contributor Author

xhluca commented May 25, 2024

@vaibhavad can you approve if everything has been covered?

@vaibhavad vaibhavad merged commit 7943ff0 into embeddings-benchmark:main May 27, 2024
7 checks passed
@vaibhavad
Copy link
Contributor

Approved and merged, thanks for the great work @xhluca!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants