Convert HALClustering to Fast #817

AlexeyVatolin
Contributor

To solve issue #814

Checklist for adding MMTEB dataset

Reason for dataset addition:

  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • I have added points for my submission to the points folder using the PR number as the filename (e.g. 438.jsonl).

@AlexeyVatolin AlexeyVatolin force-pushed the convert_hal_clustering_to_fast branch from 0f2fecb to a5f86e0 on May 24, 2024 18:39
@AlexeyVatolin AlexeyVatolin force-pushed the convert_hal_clustering_to_fast branch from a5f86e0 to 589a80e on May 24, 2024 18:40
Collaborator

@isaac-chung isaac-chung left a comment

Thanks for adding this. Please add review points and it's good to merge!

@isaac-chung isaac-chung self-assigned this May 24, 2024
@AlexeyVatolin
Contributor Author

I'll write this down for the record. To get the date range covered by the dataset, I re-scraped every article from https://hal.science/<hal_id> and took the publication date from each article page. hal_id is a column in the dataset.
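
For anyone who wants to reproduce that date range, here is a minimal sketch of the approach described above. It is not the script that was actually used. The meta tag name `citation_publication_date`, the two date formats, and the helper names (`extract_publication_date`, `date_range`, `fetch_html`) are all assumptions made for illustration; check the real page markup before relying on them.

```python
from datetime import date, datetime
from html.parser import HTMLParser
from typing import Callable, Iterable, Optional, Tuple

# Assumption: the article page carries its publication date in a meta tag
# with this name. Verify against the actual page markup.
DATE_META_NAME = "citation_publication_date"


class _MetaDateParser(HTMLParser):
    """Keeps the content of the first <meta name=DATE_META_NAME> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.value: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.value is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == DATE_META_NAME:
            self.value = attr_map.get("content")


def extract_publication_date(html: str) -> Optional[date]:
    """Return the publication date found in the page, or None."""
    parser = _MetaDateParser()
    parser.feed(html)
    if not parser.value:
        return None
    # Assumption: the date is written as YYYY/MM/DD or YYYY-MM-DD.
    for fmt in ("%Y/%m/%d", "%Y-%m-%d"):
        try:
            return datetime.strptime(parser.value.strip(), fmt).date()
        except ValueError:
            continue
    return None


def date_range(
    hal_ids: Iterable[str],
    fetch_html: Callable[[str], str],
) -> Optional[Tuple[date, date]]:
    """Return (earliest, latest) publication date over all hal_ids.

    fetch_html is a caller-supplied function mapping a URL to page HTML,
    so the network layer can be swapped out or mocked.
    """
    dates = []
    for hal_id in hal_ids:
        found = extract_publication_date(fetch_html(f"https://hal.science/{hal_id}"))
        if found is not None:
            dates.append(found)  # pages without a parsable date are skipped
    return (min(dates), max(dates)) if dates else None
```

`fetch_html` could be as simple as `lambda url: urllib.request.urlopen(url).read().decode("utf-8")`, ideally with rate limiting and retries when looping over the full `hal_id` column.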

Collaborator

@isaac-chung isaac-chung left a comment

Thanks again. Just a small comment I missed last time.

mteb/tasks/Clustering/fra/HALClusteringS2S.py (outdated, resolved)
Collaborator

@isaac-chung isaac-chung left a comment

Task names should follow the new name as well.

@isaac-chung isaac-chung enabled auto-merge (squash) May 25, 2024 14:47
@isaac-chung isaac-chung merged commit c7adcd8 into embeddings-benchmark:main May 25, 2024
7 checks passed

3 participants