Add some classification and ZeroShot classification tasks #1107

imenelydiaker · 2024-07-23T09:31:46Z

Added these tasks with their zero shot version from CLIP benchmark:

PatchCamelyon
Country211
GTSRB
UCF101
Imagenet1k

For the zero shot templates, I copied the ones from the CLIP Benchmark (found in datasets repositories), but we can agree on keeping only one prompt template per task for the evaluation.

I couldn't run UCF101 and Imagenet1k tasks (hardware issues), can someone help me run them please ?

Note: I ran these tasks on 1 experiment out of 5, just to confirm the task runs.

KennethEnevoldsen

Just to get into the new code I just added a PR here. It seems like the TaskMetadata updated from Main hasn't been carried over (or has accidentally been re-added).

KennethEnevoldsen · 2024-07-23T10:58:54Z

mteb/tasks/Image/ImageClassification/eng/Country211Classification.py

+        }""",
+        descriptive_stats={
+            "n_samples": {"test": 21100},
+            "avg_character_length": {"test": 0},


is this required? It should probably be conditional on the "modality"

Yeah I'm not sure avg_character_length is required, although it was filled for other image tasks. cc @isaac-chung did you update the metadata?

I didn't. If it can be removed without triggering any pydantic errors, let's remove it for pure image datasets.

KennethEnevoldsen · 2024-07-23T10:59:55Z

mteb/tasks/Image/ImageClassification/eng/Country211Classification.py

+            "2020-01-01",
+            "2021-02-26",
+        ),  # Estimated range for the collection of reviews
+        domains=["Scene"],


What is a Scene? (should we add a descript of the domains somewhere?)

Yeah we definetly should add a description of domains. I copied "Scene" from other tasks as it looked more appropriate for these tasks

KennethEnevoldsen · 2024-07-23T11:01:20Z

mteb/tasks/Image/ImageClassification/eng/GTSRBClassification.py

+        task_subtypes=["Activity recognition"],
+        domains=["Scene"],
+        license="Not specified",
+        socioeconomic_status="mixed",


This one should have been deleted from main

isaac-chung · 2024-07-23T11:59:18Z

Just to get into the new code I just added a PR here. It seems like the TaskMetadata updated from Main hasn't been carried over (or has accidentally been re-added).

I've added those changes in my previous PRs to the mieb branch. Just wondering if @imenelydiaker your base branch is up to date. Might just need a pull. Though I might have missed some changes as well (e.g. not removing all the old fields)

isaac-chung · 2024-07-23T12:00:49Z

I couldn't run UCF101 and Imagenet1k tasks (hardware issues), can someone help me run them please ?

Maybe we could run a downsampled version with 1 experiment run first? Then whoever has access to a cluster can run the full version?

imenelydiaker · 2024-07-23T12:40:45Z

Just to get into the new code I just added a PR here. It seems like the TaskMetadata updated from Main hasn't been carried over (or has accidentally been re-added).

I've added those changes in my previous PRs to the mieb branch. Just wondering if @imenelydiaker your base branch is up to date. Might just need a pull. Though I might have missed some changes as well (e.g. not removing all the old fields)

I pulled mieb and merged it into my branch earlier today (at 11:26 am). The only changes I made were adding new fields to Domains list, I guess it's not on my end..

gowitheflow-1998 · 2024-07-23T12:49:25Z

@imenelydiaker thanks for adding these! The zero-shot results look much lower than CLIP paper (see page 43). I think it's because index gets wrong after combing prompt list (label index not correspondig to prompt text index). Can you test if the results are correct with one prompt first, and leave other template commented there? I think we need to modify the evaluator a bit to make it work but could be done in the next commit!

Thanks for raising combining prompts though. CLIP paper did the same. I think this is something worth adding. To make it work, we need to change the label index in the evaluator. e.g., if label index was 0,1,2, with 3 prompt templates structured in the way you do now, the label will become [[0,1,2], [3,4,5], [6,7,8]]. For an example with groundtruth 0, two ways we can implement it: 1. as long as long it hits one of 0,1,2, it is marked as correct for that example. 2. Ensemble probabilities of 0,1,2, and see if it's higher than ensemble of other label candidates.

For those that couldn't be run, might be great to change to 1 run instead of 5 like @isaac-chung suggested on the weekend, but keeping the full set (avoid remaking the full set at the end). We can run them at once at the end.

gowitheflow-1998 · 2024-07-24T18:44:19Z

For images, we might need to change the way to process inputs and pass them into Evaluators. e.g.,

for t2i retrieval, it's now

queries = {query["id"]: {"text": query["text"]} for query in queries}
corpus = {image["id"]: image["image"] for image in corpus}
self.corpus[split], self.queries[split], self.relevant_docs[split] = (
    corpus,
    queries,
    qrels,
)

For zero-shot and linear probing ones, we have:

evaluator = ZeroshotClassificationEvaluator(
    dataset[self.image_column_name],
    dataset[self.label_column_name],
    candidate_labels,
    task_name=self.metadata.name,
    **kwargs,
)

X_sampled, y_sampled, idxs = self._undersample_data(
    train_split[self.image_column_name],  # type: ignore
    train_split[self.label_column_name],  # type: ignore
    self.samples_per_label,
    idxs,
)

operations like corpus = {image["id"]: image["image"] for image in corpus} ``dataset[self.image_column_name], train_split[self.image_column_name] are taking huge memory. I will fix these soon with iterable operations like dataloaders.

imenelydiaker · 2024-07-30T13:33:44Z

@imenelydiaker thanks for adding these! The zero-shot results look much lower than CLIP paper (see page 43). I think it's because index gets wrong after combing prompt list (label index not correspondig to prompt text index). Can you test if the results are correct with one prompt first, and leave other template commented there? I think we need to modify the evaluator a bit to make it work but could be done in the next commit!

Thanks for raising combining prompts though. CLIP paper did the same. I think this is something worth adding. To make it work, we need to change the label index in the evaluator. e.g., if label index was 0,1,2, with 3 prompt templates structured in the way you do now, the label will become [[0,1,2], [3,4,5], [6,7,8]]. For an example with groundtruth 0, two ways we can implement it: 1. as long as long it hits one of 0,1,2, it is marked as correct for that example. 2. Ensemble probabilities of 0,1,2, and see if it's higher than ensemble of other label candidates.

For those that couldn't be run, might be great to change to 1 run instead of 5 like @isaac-chung suggested on the weekend, but keeping the full set (avoid remaking the full set at the end). We can run them at once at the end.

So I managed to run everything with 1 experiment only. For ZeroShot, I put back 1 prompt only, handling multiple prompts requires more changes than what I expected and may affect the text-only version of MTEB.

If everything is good here, this PR should be ready to be merged.

imenelydiaker · 2024-07-30T13:37:14Z

Tests are failing because some dependencies are missing for FlagEmbedding, we may need to add them on mieb first and then we'll merge it into this branch.

cc @gowitheflow-1998

gowitheflow-1998 · 2024-07-30T14:38:28Z

thanks! The results look great now. I will merge it and take care of the dependency issue. Will add implementations for multiple prompts at some point at well.

* mieb ZeroshotClassification * mieb docs * mieb implementation demo * model meta; abstask column names; linear probe clf * model meta; abstask column names; linear probe clf * fix: update naming as candidate_labels * Update README.md * Update README.md * i2tretrieval * test load data ignore i2tretrieval * [MIEB] Add image clustering (#1088) * make lint * wip * add TinyImageNet and run * type hints * add accuracy * lint * remove unused & fix typos * T2I Retrieval * Any2AnyRetrieval * fix tests from merge * [MIEB] Add image text pair classification and tests (#1099) * add ImageTextPairClassification abstask and evaluator * dataset transform into sequence of images for each sample * fix processing logic; list of list images compatability * lint and docstrings * make lint * fix failing tests in TaskMetadata * add tests for mieb * skip gated repo --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [MIEB] Add image classification and zero shot classification tasks (#1101) * fix task metadata * use overrideable column names * add CIFAR datasets * add caltech101 dataset * add FGVC aircraft dataset * add food 101 dataset * add OxfordPets dataset * remove comments * correct cifar100 path * update cifar100 classification results * cifar zero shot results * add caltech101 zero shot * matching CLIP paper implementation * add aircraft and food zero shot * add oxford pets zero shot * [MIEB] Add CIFAR clustering (#1104) add CIFAR clustering * [MIEB] Add more image classification and zero shot classification datasets (#1103) * update category to i2t * add MNIST linear probe and zero shot * add FER2013 linear probe and zero shot * add stanford cars linear probe and zero shot * add birdsnap linear probe and zero shot * add eurosat linear probe and zero shot * lint * correct eurosat zero shot labels * add abstask for image multilable and voc2007 * make lint * [MIEB] Add more image classification and zero shot datasets (#1105) * add STL10 linear probe and zero shot * add RESISC45 linear probe and zeor shot * add Describable textures linear probe and zero shot * fix spacing lint * add SUN397 linear probe and zero shot * correct SUN397 zero shot captions * add baai bge vista * add e5-v * linting * memory issues for image linear probe & zeroshot * kknn linear probe arguments * del comments * Add some classification and ZeroShot classification tasks (#1107) * Add Country211 classification task * Add imagenet1k classification task * Add UCF101 classification task * Add PatchCamelyon Classification task * Add GTSRB classification task * Add GSTRB Zero Shot Classification * Add country211 zero shot classification * Add results for classification tasks * Add zero shot classification tasks * Add PatchCamelyon tasks and results * Add linting * Add results and fix prompts for zero shot * Add results * Add results and linting * fix dependency & clip mock test * [MIEB] Add jina clip (#1120) * add jina clip and mscoco i2t and t2i results * make lint * [MIEB] Update `mieb` with the `main` branch and some fixes (#1126) * fix instruction retrival (#1072) * fix instruction retrival * fix test * add points * make nested results * add test * skip instruction test * fix instruction passes * fix unions * move do_length_ablation Co-authored-by: Kenneth Enevoldsen <[email protected]> --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update points table * fix: fix bug-causing spelling error in function name of e5-mistral-instruct (#1106) found bug * 1.12.85 Automatically generated by python-semantic-release * fix: MultilingualSentimentClassification (#1109) * Update points table * fix: Avoid spaces in dataset name for CQADupstack and ignore speed tasks * 1.12.86 Automatically generated by python-semantic-release * fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended (#1112) * fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended * fix: fixed formatting for cli * docs: improve searchability in the advanced usage documentation * 1.12.87 Automatically generated by python-semantic-release * docs: improve searchability in the advanced usage documentation (#1113) * docs: improve searchability in the advanced usage documentation * docs: update based on corrections * fix: export type for `mteb create_meta` (#1114) * fix export type * fix dataset version too * 1.12.88 Automatically generated by python-semantic-release * fix: Simplify models implementations (#1085) * Merge * Adapt * Simplify * Check for rev again * Rmv cmmnt * Simplify * simplify * Rmv comment Co-authored-by: Kenneth Enevoldsen <[email protected]> * Use logging; change try except; add info * Lint * Rmv results * Update rev * format * Simplify models; Allow instructions * Jobs * Fix merge * Format * Adapt models * fix: ensure that e5 ignores the NQ * format --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * 1.12.89 Automatically generated by python-semantic-release * fix: nomic models using prefix correctly (#1125) * fix: nomic models using prefix correctly * chore: remove comment * fix: handling in case not torch tensor * Fix typo --------- Co-authored-by: Niklas Muennighoff <[email protected]> * 1.12.90 Automatically generated by python-semantic-release * refactor vista model wrapper to contain lib import * python 38 type hints --------- Co-authored-by: Roman Solomatin <[email protected]> Co-authored-by: Kenneth Enevoldsen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: anpalmak2003 <[email protected]> Co-authored-by: github-actions <[email protected]> Co-authored-by: Niklas Muennighoff <[email protected]> Co-authored-by: Zach Nussbaum <[email protected]> Co-authored-by: chenghao xiao <[email protected]> * image memoery issues for all retrieval Abstasks * Add CLEVR and SciMMIR Image-Text Understanding tasks (#1127) * Add CLEVER and SciMMIR * Update metadata * remove useless comment * Add linting * fix typo and tests * Add CLEVR count task * add linting * add fashion200k & fashionIQ test passed * clip text max seq truncation * add WebQA, NIGHTS, OVEN * any2any retrieval chunk encoding * add nomic vision model; any2any topk bug * add cv recall * add InfoSeek; VisualNews * [MIEB] Add Stanford Cars i2i Retrieval (#1147) * wip * add results * make lint * change back the order * [MIEB] Add CUB200 i2i retrieval (#1154) * add cub200 and results * add skip_first_result * skipped self and rerun results * consolidate i2t and t2i to any2any * remove abstask and evaluators * remove references from test --------- Co-authored-by: gowitheflow-1998 <[email protected]> Co-authored-by: chenghao xiao <[email protected]> Co-authored-by: Imene Kerboua <[email protected]> Co-authored-by: Roman Solomatin <[email protected]> Co-authored-by: Kenneth Enevoldsen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: anpalmak2003 <[email protected]> Co-authored-by: github-actions <[email protected]> Co-authored-by: Niklas Muennighoff <[email protected]> Co-authored-by: Zach Nussbaum <[email protected]>

imenelydiaker added 12 commits July 21, 2024 11:24

Add Country211 classification task

95e9989

Add imagenet1k classification task

f391d9a

Add UCF101 classification task

10bf074

Add PatchCamelyon Classification task

2909e5b

Add GTSRB classification task

21efc7c

Add GSTRB Zero Shot Classification

4093c79

Add country211 zero shot classification

b454dac

Add results for classification tasks

e10608f

Add zero shot classification tasks

db6e2ab

Add PatchCamelyon tasks and results

0838388

Merge branch 'mieb' into add-classification-tasks

cbecd64

Add linting

a483e3b

imenelydiaker requested review from isaac-chung and gowitheflow-1998 July 23, 2024 09:31

KennethEnevoldsen reviewed Jul 23, 2024

View reviewed changes

imenelydiaker added 4 commits July 27, 2024 12:41

Merge branch 'mieb' into add-classification-tasks

1358c38

Add results and fix prompts for zero shot

ec3c290

Add results

38e1564

Add results and linting

4980b30

gowitheflow-1998 merged commit c0f0021 into mieb Jul 30, 2024
4 of 9 checks passed

gowitheflow-1998 deleted the add-classification-tasks branch July 30, 2024 14:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add some classification and ZeroShot classification tasks #1107

Add some classification and ZeroShot classification tasks #1107

imenelydiaker commented Jul 23, 2024 •

edited

Loading

KennethEnevoldsen left a comment

KennethEnevoldsen Jul 23, 2024

imenelydiaker Jul 30, 2024

isaac-chung Jul 30, 2024

KennethEnevoldsen Jul 23, 2024

imenelydiaker Jul 30, 2024

KennethEnevoldsen Jul 23, 2024

isaac-chung commented Jul 23, 2024 •

edited

Loading

isaac-chung commented Jul 23, 2024

imenelydiaker commented Jul 23, 2024

gowitheflow-1998 commented Jul 23, 2024 •

edited

Loading

gowitheflow-1998 commented Jul 24, 2024

imenelydiaker commented Jul 30, 2024 •

edited

Loading

imenelydiaker commented Jul 30, 2024 •

edited

Loading

gowitheflow-1998 commented Jul 30, 2024

Add some classification and ZeroShot classification tasks #1107

Add some classification and ZeroShot classification tasks #1107

Conversation

imenelydiaker commented Jul 23, 2024 • edited Loading

KennethEnevoldsen left a comment

Choose a reason for hiding this comment

KennethEnevoldsen Jul 23, 2024

Choose a reason for hiding this comment

imenelydiaker Jul 30, 2024

Choose a reason for hiding this comment

isaac-chung Jul 30, 2024

Choose a reason for hiding this comment

KennethEnevoldsen Jul 23, 2024

Choose a reason for hiding this comment

imenelydiaker Jul 30, 2024

Choose a reason for hiding this comment

KennethEnevoldsen Jul 23, 2024

Choose a reason for hiding this comment

isaac-chung commented Jul 23, 2024 • edited Loading

isaac-chung commented Jul 23, 2024

imenelydiaker commented Jul 23, 2024

gowitheflow-1998 commented Jul 23, 2024 • edited Loading

gowitheflow-1998 commented Jul 24, 2024

imenelydiaker commented Jul 30, 2024 • edited Loading

imenelydiaker commented Jul 30, 2024 • edited Loading

gowitheflow-1998 commented Jul 30, 2024

imenelydiaker commented Jul 23, 2024 •

edited

Loading

isaac-chung commented Jul 23, 2024 •

edited

Loading

gowitheflow-1998 commented Jul 23, 2024 •

edited

Loading

imenelydiaker commented Jul 30, 2024 •

edited

Loading

imenelydiaker commented Jul 30, 2024 •

edited

Loading