
Which version to use #2322

Open
sorobedio opened this issue Sep 19, 2024 · 9 comments
Labels: validation (for validation of task implementations)

Comments

@sorobedio

Could you explain why the performance of the same model changes significantly depending on the version of lm_eval?

For example, with Llama-3.1-8B-Instruct and batch_size=1:
v0.4.2 (leaderboard version)

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | none | 0 | acc_norm | 0.2743 | ± 0.0129 |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | 0.2879 | ± 0.0323 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | 0.2637 | ± 0.0189 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | 0.2812 | ± 0.0213 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | none | 0 | acc_norm | 0.2743 | ± 0.0129 |

with version v0.4.3

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | 0.2879 | ± 0.0323 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | 0.3022 | ± 0.0197 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | 0.2812 | ± 0.0213 |

with version v0.4.4

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | 0.3636 | ± 0.0343 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | 0.3242 | ± 0.0200 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | 0.3214 | ± 0.0221 |

I'm unsure which version to use, as my model performs well on version 0.4.4 but is outperformed by the base LLaMA 3.1-8B on version 0.4.2.

@sorobedio changed the title from "Which version to Trust" to "Which version to use" on Sep 19, 2024
@baberabb
Contributor

baberabb commented Sep 19, 2024

Hi! I think the leaderboard tasks were only added in 0.4.4? Also I believe the leaderboard uses this fork to run their evaluations: https://github.com/huggingface/lm-evaluation-harness

@sorobedio
Author

Thank you. I checked the leaderboard; as of today they use 0.4.2. The link you shared seems to be from version 0.4.3.

@baberabb
Contributor

baberabb commented Sep 19, 2024

They link to this branch of their repo in their docs, so you should use that. There might have been some delay between when their changes were merged here and when we cut a new release.

That branch and 0.4.4 should give the same scores, but let me know if that's not the case!
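As a sanity check when comparing installs side by side, here is a minimal sketch for confirming which harness version a given environment is actually running (it assumes the package is installed under its usual distribution name, lm_eval; adjust if you installed a fork under a different name):

```python
# Print the installed harness version for the current environment.
# The distribution name "lm_eval" is assumed here.
from importlib.metadata import version

print(version("lm_eval"))  # e.g. 0.4.2, 0.4.3, or 0.4.4
```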

@sorobedio
Author

Thank you for your response. I have actually reviewed their documentation. The results in the first table reflect the outcomes based on the installation guide provided in their docs. The second table corresponds to the repository link they mention in their documentation, which is the same one you recommended to me. The third table shows results for version 0.4.4.

It seems that the leaderboard is typically using version 0.4.2, while the repository you recommended corresponds to version 0.4.3. Results from 0.4.3 are generally closer to those from 0.4.2, but depending on the task, performance can still vary significantly.

As for version 0.4.4, which is the latest on this repo, the results differ considerably from the previous versions. That's why I’ve provided the results for each version, to highlight these differences. In some cases, models perform better in the newer version, but in others, they may perform worse.

My main problem is that I run into an issue when trying to use version 0.4.2 from Python, where the model is passed as an object rather than a name string, for example:

```python
import lm_eval
from lm_eval.models.huggingface import HFLM

# `model`, `tokenizer`, and `device` are created elsewhere in my script.
lm_eval_model = HFLM(
    model,
    device=device,
    batch_size=1,
    tokenizer=tokenizer,
)

task_manager = lm_eval.tasks.TaskManager()

results = lm_eval.simple_evaluate(
    model=lm_eval_model,
    tasks=["leaderboard_gpqa"],
    num_fewshot=0,
    apply_chat_template=True,
    # fewshot_as_multiturn=True,
    # output_base_path="results_Out",
    task_manager=task_manager,
)
```

I get this error:

```
lm-evaluation-harness/lm_eval/models/huggingface.py", line 409, in model
    return self.accelerator.unwrap_model(self._model)
AttributeError: 'NoneType' object has no attribute 'unwrap_model'
```

If I can fix that problem, I think I will keep using version 0.4.2 so that I can benchmark my model against the leaderboard models.

@sorobedio
Author

The above experiments were run using the command-line interface.

@baberabb
Contributor

I'll look into this. For the error in 0.4.2, I think if you remove this condition and just return self._model, it will probably work.
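For reference, a minimal sketch of that workaround as a monkey-patch, so the installed 0.4.2 package does not need to be edited; the property and attribute names are inferred from the traceback above, so verify them against your local huggingface.py:

```python
from lm_eval.models.huggingface import HFLM

def _unwrapped_model(self):
    # When HFLM wraps a pre-instantiated model, `self.accelerator` can be None
    # (per the traceback above), so skip unwrapping in that case and return the
    # wrapped model directly.
    if getattr(self, "accelerator", None) is not None:
        return self.accelerator.unwrap_model(self._model)
    return self._model

# Override the `model` property before HFLM is used by simple_evaluate.
HFLM.model = property(_unwrapped_model)
```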

@sorobedio
Author

sorobedio commented Sep 20, 2024

Thank you for your help.
It looks like they updated the leaderboard fork to 0.4.3 today.
Let me know when you determine the reason for the performance differences between versions. In such cases, it's important to decide which version should be used for model evaluation, to maintain consistency across all published results.

@baberabb
Contributor

baberabb commented Sep 21, 2024

Hi! I made some runs on main and their adding_all_changess branch, with meta-llama/Meta-Llama-3.1-8B-Instruct. I'm getting similar results on both (and wasn't able to reproduce your numbers). What command are you using?

```
lm_eval --model hf \
    -a pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,trust_remote_code=True,dtype=float16 \
    --tasks leaderboard_gpqa \
    -b 1 \
    --device cuda:0 \
    --trust_remote_code \
    --verbosity DEBUG
```

main
hf (pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,trust_remote_code=True,dtype=float16,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | 0.3131 | ± 0.0330 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | 0.2930 | ± 0.0195 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | 0.3482 | ± 0.0225 |

HF
hf (pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,trust_remote_code=True,dtype=float16,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | none | 0 | acc_norm | 0.3171 | ± 0.0135 |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | 0.3131 | ± 0.0330 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | 0.2912 | ± 0.0195 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | 0.3504 | ± 0.0226 |

@sorobedio
Author

Ah, thanks. Here is my command:

```
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype="bfloat16" \
    --tasks leaderboard_gpqa \
    --device cuda:0 \
    --num_fewshot 0 \
    --apply_chat_template \
    --batch_size 1
```

The only differences are that I used bfloat16, the --apply_chat_template flag, and a batch size of 1, while you used a float16 model, which is different from bfloat16. This explains why the results do not match. Additionally, you used some parameters that I did not. You can simply copy and paste my command. Thank you.

@baberabb added the validation label (for validation of task implementations) on Sep 23, 2024