[EVAL REQUEST] jina-embeddings-v3 #77

Open
1 of 17 tasks
kaisugi opened this issue Sep 19, 2024 · 15 comments
Comments

@kaisugi

kaisugi commented Sep 19, 2024

Basic model information

name: jina-embeddings-v3
type: XLMRoBERTa (+ LoRA Adapter)
size: 559M (572M including the LoRA adapters)
lang: multilingual

Model details

(screenshot of the model details, dated 2024-09-19)

https://arxiv.org/abs/2409.10173
https://huggingface.co/jinaai/jina-embeddings-v3
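
(For reference, a minimal sketch of loading the model through Sentence Transformers; it assumes the model's remote code accepts task and prompt_name keyword arguments on encode, as discussed in the comments below.)

```python
from sentence_transformers import SentenceTransformer

# Assumption: the model's remote code exposes `task` (LoRA adapter) and
# `prompt_name` (instruction prefix) as encode() keyword arguments.
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

queries = ["東京の観光名所はどこですか?"]
documents = ["浅草寺は東京都台東区にある観光名所です。"]

query_emb = model.encode(queries, task="retrieval.query", prompt_name="retrieval.query")
doc_emb = model.encode(documents, task="retrieval.passage", prompt_name="retrieval.passage")
print(query_emb.shape, doc_emb.shape)
```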

Seen/unseen declaration

Among the JMTEB evaluation datasets, please check the datasets whose training split was used for model training, or whose validation set was used for hyperparameter tuning or early stopping.

  • Classification
    • Amazon Review Classification
    • Amazon Counterfactual Classification
    • Massive Intent Classification
    • Massive Scenario Classification
  • Clustering
    • Livedoor News
    • MewsC-16-ja
  • STS
    • JSTS
    • JSICK
  • Pair Classification
    • PAWS-X-ja
  • Retrieval
    • JAQKET
    • Mr.TyDi-ja (The original English version seems to have been used)
    • JaGovFaqs-22k
    • NLP Journal title-abs
    • NLP Journal title-intro
    • NLP Journal abs-intro
  • Reranking
    • Esci
  • Not declaring

Evaluation script

Other information

@kaisugi
Author

kaisugi commented Sep 25, 2024

Note: I found on X (Twitter) that one of the authors (@bwanglzu) has already completed the evaluations 😳
https://x.com/bo_wangbo/status/1838919204377911477

@lsz05
Collaborator

lsz05 commented Sep 26, 2024

Thank you for the information!
I tried to run the evaluation of the model yesterday, but didn't succeed. Still debugging now.

@kaisugi
Author

kaisugi commented Sep 26, 2024

Great, looking forward to the official results 😊

@lsz05
Collaborator

lsz05 commented Sep 27, 2024

@kaisugi

I tried the model on the fast datasets, and found that the tasks other than Classification worked better without LoRA than with LoRA. In the Retrieval task, the results were better without prefixes.

The configuration most similar to https://x.com/bo_wangbo/status/1838919204377911477 is: no prefixes, no LoRA except for Classification.

My results are as follows:

  • no prefixes, no LoRA except Classification
{
    "Classification": {
        "amazon_counterfactual_classification": {
            "macro_f1": 0.7949948725329687
        },
        "massive_intent_classification": {
            "macro_f1": 0.7766347542682803
        },
        "massive_scenario_classification": {
            "macro_f1": 0.8982075621284786
        }
    },
    "Retrieval": {
        "jagovfaqs_22k": {
            "ndcg@10": 0.7449944044307708
        },
        "nlp_journal_abs_intro": {
            "ndcg@10": 0.9941946751679634
        },
        "nlp_journal_title_abs": {
            "ndcg@10": 0.9717376985433034
        },
        "nlp_journal_title_intro": {
            "ndcg@10": 0.9609029386920315
        }
    },
    "STS": {
        "jsick": {
            "spearman": 0.8146985042196159
        },
        "jsts": {
            "spearman": 0.8068520872331155
        }
    },
    "Clustering": {
        "livedoor_news": {
            "v_measure_score": 0.5036707354224619
        },
        "mewsc16": {
            "v_measure_score": 0.474391205388421
        }
    },
    "PairClassification": {
        "paws_x_ja": {
            "binary_f1": 0.623716814159292
        }
    }
}
  • no prefixes, with LoRA
{
    "Classification": {
        "amazon_counterfactual_classification": {
            "macro_f1": 0.7949948725329687
        },
        "massive_intent_classification": {
            "macro_f1": 0.7766347542682803
        },
        "massive_scenario_classification": {
            "macro_f1": 0.8982075621284786
        }
    },
    "Retrieval": {
        "jagovfaqs_22k": {
            "ndcg@10": 0.7255870901661032
        },
        "nlp_journal_abs_intro": {
            "ndcg@10": 0.9829431790599418
        },
        "nlp_journal_title_abs": {
            "ndcg@10": 0.9552122947731903
        },
        "nlp_journal_title_intro": {
            "ndcg@10": 0.9324205002364649
        }
    },
    "STS": {
        "jsick": {
            "spearman": 0.7816133481804449
        },
        "jsts": {
            "spearman": 0.8193021839272429
        }
    },
    "Clustering": {
        "livedoor_news": {
            "v_measure_score": 0.5387525923415666
        },
        "mewsc16": {
            "v_measure_score": 0.43532523021586217
        }
    },
    "PairClassification": {
        "paws_x_ja": {
            "binary_f1": 0.623716814159292
        }
    }
}
  • with prefixes, with LoRA
{
    "Classification": {
        "amazon_counterfactual_classification": {
            "macro_f1": 0.7949948725329687
        },
        "massive_intent_classification": {
            "macro_f1": 0.7766347542682803
        },
        "massive_scenario_classification": {
            "macro_f1": 0.8982075621284786
        }
    },
    "Retrieval": {
        "jagovfaqs_22k": {
            "ndcg@10": 0.7157443309160252
        },
        "nlp_journal_abs_intro": {
            "ndcg@10": 0.9849100129100982
        },
        "nlp_journal_title_abs": {
            "ndcg@10": 0.9560377251324601
        },
        "nlp_journal_title_intro": {
            "ndcg@10": 0.9372937234643258
        }
    },
    "STS": {
        "jsick": {
            "spearman": 0.7816133481804449
        },
        "jsts": {
            "spearman": 0.8193021839272429
        }
    },
    "Clustering": {
        "livedoor_news": {
            "v_measure_score": 0.5313213726075848
        },
        "mewsc16": {
            "v_measure_score": 0.43532523021586217
        }
    },
    "PairClassification": {
        "paws_x_ja": {
            "binary_f1": 0.623716814159292
        }
    }
}

LoRA settings (if used):

  • classification: Classification
  • text-matching: STS, PairClassification
  • separation: Clustering, Reranking
  • retrieval.query: Retrieval (when encoding queries)
  • retrieval.passage: Retrieval (when encoding documents)
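
(A minimal sketch of how the assignment above could be expressed in code; lora_task_for is a hypothetical helper, not part of JMTEB.)

```python
# Hypothetical helper mapping a JMTEB task type to the adapter name listed above.
def lora_task_for(jmteb_task: str, is_query: bool = False) -> str:
    if jmteb_task == "Retrieval":
        return "retrieval.query" if is_query else "retrieval.passage"
    return {
        "Classification": "classification",
        "STS": "text-matching",
        "PairClassification": "text-matching",
        "Clustering": "separation",
        "Reranking": "separation",
    }[jmteb_task]

# e.g. embeddings = model.encode(texts, task=lora_task_for("Clustering"))
```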

Prefix settings (if used):

@kaisugi
Author

kaisugi commented Sep 27, 2024

Thank you so much for your hard work!

@bwanglzu

bwanglzu commented Sep 27, 2024

Hi @lsz05 @courage, I hacked the code a bit to make it work. The things I changed:

  1. src/jmteb/embedders/base.py

In the TextEmbedder class, I added a task parameter to make sure the task is correctly sent to the encode function.

(screenshot of the change)

  2. src/jmteb/embedders/sbert_embedder.py

In the SentenceBertEmbedder class, I changed max_seq_length to 512 since some of the tasks (Mr.TyDi) are too slow, and I added prompt_name and task to the encode function. prompt_name is set to the identical task as defined here: we use 2 instructions for the retrieval adapter.

(screenshot of the change)

  3. Only for the Retrieval task, I modified /src/jmteb/evaluators/retrieval/evaluator.py to send a different task during indexing and searching:

(screenshot of the change)

I agree my code is a bit "dirty" as I only wanted to quickly check the results :) hopefully you understand. If I missed anything in your code base that results in a different eval result, please let me know :)
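
(Since the attached screenshots didn't survive, here is a rough sketch of what the described changes might look like; the class is a simplified stand-in for JMTEB's SentenceBertEmbedder, and the names are illustrative rather than the actual patch.)

```python
from sentence_transformers import SentenceTransformer
import numpy as np

class SentenceBertEmbedderSketch:
    """Simplified stand-in for the patched embedder, for illustration only."""

    def __init__(self, model_name: str, max_seq_length: int = 512) -> None:
        self.model = SentenceTransformer(model_name, trust_remote_code=True)
        self.model.max_seq_length = max_seq_length  # shortened for speed (e.g. Mr.TyDi)

    def encode(self, texts: list[str], task: str) -> np.ndarray:
        # Only the two retrieval adapters have instruction prefixes defined,
        # so prompt_name is passed only for those (see the later note that
        # the prefix applies to Retrieval only).
        prompt_name = task if task.startswith("retrieval") else None
        return self.model.encode(texts, task=task, prompt_name=prompt_name)

# In the retrieval evaluator, indexing and searching would then use different tasks:
#   corpus_emb = embedder.encode(corpus, task="retrieval.passage")
#   query_emb  = embedder.encode(queries, task="retrieval.query")
```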

@bwanglzu

bwanglzu commented Sep 27, 2024

But I'm also quite surprised (in a good way) that your score is better than what I reported lol :) Maybe there is something wrong in my code, but at least it's not worse. For mewsc16 clustering I noticed my score is higher than yours; this is what I have:

{
    "metric_name": "v_measure_score",
    "metric_value": 0.4966872142615049,
    "details": {
        "optimal_clustering_model_name": "AgglomerativeClustering",
        "val_scores": {
            "MiniBatchKMeans": {
                "v_measure_score": 0.4573582252706992,
                "homogeneity_score": 0.49434785878175236,
                "completeness_score": 0.425518738350574
            },
            "AgglomerativeClustering": {
                "v_measure_score": 0.5159727698724647,
                "homogeneity_score": 0.558382996336062,
                "completeness_score": 0.47955005205722434
            },
            "BisectingKMeans": {
                "v_measure_score": 0.45289840369081835,
                "homogeneity_score": 0.4964330478306176,
                "completeness_score": 0.4163836804409629
            },
            "Birch": {
                "v_measure_score": 0.4943869746128702,
                "homogeneity_score": 0.5396066604339305,
                "completeness_score": 0.4561602021543821
            }
        },
        "test_scores": {
            "AgglomerativeClustering": {
                "v_measure_score": 0.4966872142615049,
                "homogeneity_score": 0.5340024254176485,
                "completeness_score": 0.4642464368511074
            }
        }
    }
}

@lsz05
Collaborator

lsz05 commented Sep 27, 2024

> (quoting @bwanglzu's comment above describing the code changes)

I think I'm doing the same thing as you in #80

@lsz05
Collaborator

lsz05 commented Sep 27, 2024

> (quoting @bwanglzu's comment above with the mewsc16 clustering result)

I think I'll have to fix some randomness problems (e.g., fixing the random seed in training so that everything can be exactly reproduced) in Clustering and Classification (where training is conducted). Since the method that works best on the dev set is the one chosen, and in my case Birch worked slightly better on dev but not so well on test, the test score ends up not as high as in your eval.
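
(As an illustration of the seed-fixing point, a sketch of constructing the candidate clustering models with a fixed random_state; AgglomerativeClustering and Birch are deterministic in scikit-learn, so only the k-means variants need it. The n_clusters value is a placeholder.)

```python
from sklearn.cluster import AgglomerativeClustering, Birch, BisectingKMeans, MiniBatchKMeans

SEED = 42        # any fixed value works; the point is it stays the same across runs
N_CLUSTERS = 8   # placeholder for illustration

candidate_models = {
    "MiniBatchKMeans": MiniBatchKMeans(n_clusters=N_CLUSTERS, random_state=SEED),
    "AgglomerativeClustering": AgglomerativeClustering(n_clusters=N_CLUSTERS),  # deterministic
    "BisectingKMeans": BisectingKMeans(n_clusters=N_CLUSTERS, random_state=SEED),
    "Birch": Birch(n_clusters=N_CLUSTERS),  # deterministic
}
# The candidate with the best v-measure on the dev split is then re-used on test.
```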

My result is as follows:

{
    "metric_name": "v_measure_score",
    "metric_value": 0.474391205388421,
    "details": {
        "optimal_clustering_model_name": "Birch",
        "val_scores": {
            "MiniBatchKMeans": {
                "v_measure_score": 0.45751218122353327,
                "homogeneity_score": 0.5000149261766943,
                "completeness_score": 0.42166906571540486
            },
            "AgglomerativeClustering": {
                "v_measure_score": 0.4884748969401506,
                "homogeneity_score": 0.5211802377702618,
                "completeness_score": 0.45963186760591423
            },
            "BisectingKMeans": {
                "v_measure_score": 0.4051884446721869,
                "homogeneity_score": 0.4429226569148086,
                "completeness_score": 0.3733789195189944
            },
            "Birch": {
                "v_measure_score": 0.48868192903235214,
                "homogeneity_score": 0.529365428957467,
                "completeness_score": 0.45380546454681364
            }
        },
        "test_scores": {
            "Birch": {
                "v_measure_score": 0.474391205388421,
                "homogeneity_score": 0.5112647214750645,
                "completeness_score": 0.44247868671235824
            }
        }
    }
}

@bwanglzu

I think your PR looks good, maybe two things:

  1. I'm using model.half() to make it a bit faster.
  2. The seq length is set to 512 to make it a bit faster.

I'm not sure why using LoRA makes the performance a bit worse than without LoRA (for example, on STS). Using LoRA is always my default choice :)

One small thing to notice is that the prefix is only applied to Retrieval, not to other tasks.
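
(A minimal sketch of those two speed-ups on the Sentence Transformers side; the model name and adapter value just follow the earlier comments.)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
model.half()                 # cast weights to fp16 for faster GPU inference
model.max_seq_length = 512   # truncate long inputs (e.g. Mr.TyDi passages) earlier

embeddings = model.encode(["サンプル文です。"], task="text-matching")
```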

@bwanglzu

Btw, have you considered moving JMTEB to the official MTEB leaderboard? This would greatly simplify your work.

@lsz05
Collaborator

lsz05 commented Oct 2, 2024

@bwanglzu @kaisugi
I have updated the full results in #81 . Would you please take a look?

@lsz05
Collaborator

lsz05 commented Oct 2, 2024

> (quoting @bwanglzu's comment above about model.half() and the seq length of 512)

I used neither half precision nor a sequence length of 512, and the full evaluation took about half a day.

I examined how half precision affects the evaluation results: the scores don't change significantly, while the time is reduced to less than half. (But as it was a weekend, I didn't use half to speed things up.)

I applied your prefixes to Retrieval in the full evaluation, as written in your Hugging Face repo.

@lsz05
Collaborator

lsz05 commented Oct 2, 2024

> Btw, have you considered moving JMTEB to the official MTEB leaderboard? This would greatly simplify your work.

We are considering it, but are also concerned about some differences (e.g., usage of the dev set).

Someone has worked on it, but it's not fully finished: embeddings-benchmark/mteb#749

@Samoed

Samoed commented Oct 3, 2024

@lsz05 I'm finishing adding the rest of the datasets in embeddings-benchmark/mteb#1262
