A list of supported tasks and task groupings can be viewed with lm-eval --tasks list.
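
The listing command above can also be driven from a script. The following is a minimal sketch, assuming the `lm-eval` executable is installed and on `PATH`; the helper name `list_tasks` is hypothetical and not part of the harness.

```python
import shutil
import subprocess

# The command named in the text above, split into argv form.
LIST_TASKS_CMD = ["lm-eval", "--tasks", "list"]


def list_tasks(cmd=LIST_TASKS_CMD):
    """Return the CLI's task listing as text, or None if the CLI is not on PATH.

    `list_tasks` is a hypothetical helper for illustration only.
    """
    if shutil.which(cmd[0]) is None:
        return None  # the executable is not installed in this environment
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

Calling `list_tasks()` returns whatever the CLI prints, so the exact layout of the listing depends on the installed version.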

For more information, including a full list of task names and their precise meanings or sources, follow the links provided to the individual files for each subfolder.

Task Family Description Language(s)
aclue Tasks focusing on ancient Chinese language understanding and cultural aspects. Ancient Chinese
aexams Tasks in Arabic related to various academic exams covering a range of subjects. Arabic
agieval Tasks drawn from human-centric standardized exams, such as college admission and professional qualification tests. English, Chinese
anli Adversarial natural language inference tasks designed to test model robustness. English
arabic_leaderboard_complete A full version of the tasks in the Open Arabic LLM Leaderboard, focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. Arabic (Some MT)
arabic_leaderboard_light A light version of the tasks in the Open Arabic LLM Leaderboard (i.e., 10% samples of the test set in the original benchmarks), focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. Arabic (Some MT)
arabicmmlu Localized Arabic version of MMLU with multiple-choice questions from 40 subjects. Arabic
arc Grade-school science questions from the AI2 Reasoning Challenge that require complex reasoning. English
arithmetic Tasks involving numerical computations and arithmetic reasoning. English
asdiv Tasks involving arithmetic and mathematical reasoning challenges. English
babi Question answering challenges based on simulated stories. English
basqueglue Tasks designed to evaluate language understanding in the Basque language. Basque
bbh Challenging tasks from BIG-Bench Hard, a subset of BIG-bench on which earlier language models did not outperform the average human rater. English, German
belebele Language understanding tasks in a variety of languages and scripts. Multiple (122 languages)
benchmarks General benchmarking tasks that test a wide range of language understanding capabilities.
bertaqa Local Basque cultural trivia QA tests in English and Basque languages. English, Basque, Basque (MT)
bigbench Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. Multiple
blimp Tasks testing grammatical phenomena to evaluate a language model's linguistic capabilities. English
ceval Tasks that evaluate language understanding and reasoning in an educational context. Chinese
cmmlu Multi-subject multiple choice question tasks for comprehensive academic assessment. Chinese
code_x_glue Tasks that involve understanding and generating code across multiple programming languages. Go, Java, JS, PHP, Python, Ruby
commonsense_qa CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. English
copal_id Indonesian causal commonsense reasoning dataset that captures local nuances. Indonesian
coqa Conversational question answering tasks to test dialog understanding. English
crows_pairs Tasks designed to test model biases in various sociodemographic groups. English, French
csatqa Tasks based on the Korean College Scholastic Ability Test (CSAT) for academic assessment. Korean
drop Tasks requiring numerical reasoning, reading comprehension, and question answering. English
eq_bench Tasks that assess emotional intelligence by asking models to rate the intensity of characters' emotions in dialogues. English
eus_exams Tasks based on various professional and academic exams in the Basque language. Basque
eus_proficiency Tasks designed to test proficiency in the Basque language across various topics. Basque
eus_reading Reading comprehension tasks specifically designed for the Basque language. Basque
eus_trivia Trivia and knowledge testing tasks in the Basque language. Basque
fda Tasks for extracting key-value pairs from FDA documents to test information extraction. English
fld Formal Logic Deduction tasks that test multi-step deductive reasoning. English
french_bench Set of tasks designed to assess language model performance in French. French
glue General Language Understanding Evaluation benchmark to test broad language abilities. English
gpqa Graduate-level, "Google-proof" multiple-choice questions written by domain experts in biology, physics, and chemistry. English
gsm8k A benchmark of grade school math problems aimed at evaluating reasoning capabilities. English
haerae Tasks focused on assessing Korean cultural and linguistic knowledge. Korean
headqa A high-level education-based question answering dataset to test specialized knowledge. Spanish, English
hellaswag Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. English
hendrycks_ethics Tasks designed to evaluate the ethical reasoning capabilities of models. English
hendrycks_math Mathematical problem-solving tasks to test numerical reasoning and problem-solving. English
ifeval Instruction-following evaluation tasks built on verifiable instructions, such as length or formatting constraints. English
inverse_scaling Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse. English
kmmlu Knowledge-based multi-subject multiple choice questions for academic evaluation. Korean
kobest A collection of tasks designed to evaluate understanding of the Korean language. Korean
kormedmcqa Medical question answering tasks in Korean to test specialized domain knowledge. Korean
lambada Tasks that require predicting the final word of a text passage, testing language prediction over long contexts. English
lambada_cloze Cloze-style LAMBADA dataset. English
lambada_multilingual Multilingual LAMBADA dataset. This is a legacy version of the multilingual dataset, and users should instead use lambada_multilingual_stablelm. German, English, Spanish, French, Italian
lambada_multilingual_stablelm Multilingual LAMBADA dataset. Users should prefer evaluating on this version of the multilingual dataset instead of on lambada_multilingual. German, English, Spanish, French, Italian, Dutch, Portuguese
leaderboard Task group used by Hugging Face's Open LLM Leaderboard v2. These tasks are static and will not change over time. English
lingoly Challenging logical reasoning benchmark in low-resource languages with controls for memorization. English, Multilingual
logiqa Logical reasoning tasks requiring advanced inference and deduction. English, Chinese
logiqa2 Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination. English, Chinese
mathqa Question answering tasks involving mathematical reasoning and problem-solving. English
mc_taco Question-answer pairs that require temporal commonsense comprehension. English
med_concepts_qa Benchmark for evaluating LLMs on their ability to interpret medical codes and distinguish between medical concepts. English
medmcqa Medical multiple choice questions assessing detailed medical knowledge. English
medqa Multiple choice question answering based on the United States Medical License Exams.
mgsm Benchmark of multilingual grade-school math problems. Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu
minerva_math Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. English
mmlu Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. English
mmlusr Variation of MMLU designed to be more rigorous. English
model_written_evals Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns.
mutual A retrieval-based dataset for multi-turn dialogue reasoning. English
nq_open Open domain question answering tasks based on the Natural Questions dataset. English
okapi/arc_multilingual Machine-translated version of the ARC reasoning benchmark. Multiple (31 languages, machine-translated)
okapi/hellaswag_multilingual Machine-translated version of the HellaSwag commonsense completion benchmark. Multiple (30 languages, machine-translated)
okapi/mmlu_multilingual Machine-translated version of the MMLU multi-subject benchmark. Multiple (34 languages, machine-translated)
okapi/truthfulqa_multilingual Machine-translated version of the TruthfulQA truthfulness benchmark. Multiple (31 languages, machine-translated)
openbookqa Open-book question answering tasks that require external knowledge and reasoning. English
paloma Paloma is a comprehensive benchmark designed to evaluate open language models across a wide range of domains, from niche artist communities to mental health forums on Reddit. English
paws-x Paraphrase Adversaries from Word Scrambling, focusing on cross-lingual capabilities. English, French, Spanish, German, Chinese, Japanese, Korean
pile Open source language modelling data set that consists of 22 smaller, high-quality datasets. English
pile_10k The first 10K elements of The Pile, useful for debugging models trained on it. English
piqa Physical Interaction Question Answering tasks to test physical commonsense reasoning. English
polemo2 Sentiment analysis and emotion detection tasks based on Polish language data. Polish
prost Physical Reasoning about Objects through Space and Time: tasks probing physical commonsense reasoning. English
pubmedqa Question answering tasks based on PubMed research articles for biomedical understanding. English
qa4mre Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. English
qasper Question Answering dataset based on academic papers, testing in-depth scientific knowledge. English
race Reading comprehension assessment tasks based on English exams in China. English
realtoxicityprompts Tasks to evaluate language models for generating text with potential toxicity.
sciq Science Question Answering tasks to assess understanding of scientific concepts. English
scrolls Tasks that involve long-form reading comprehension across various domains. English
siqa Social Interaction Question Answering to evaluate common sense and social reasoning. English
squad_completion A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. English
squadv2 Stanford Question Answering Dataset version 2, a reading comprehension benchmark. English
storycloze Tasks to predict story endings, focusing on narrative logic and coherence. English
super_glue A suite of challenging tasks designed to test a range of language understanding skills. English
swag Situations With Adversarial Generations, predicting the next event in situations drawn from video captions. English
swde Information extraction tasks from semi-structured web pages. English
tinyBenchmarks Evaluation of large language models with fewer examples using tiny versions of popular benchmarks. English
tmmluplus An extended set of tasks under the TMMLU framework for broader academic assessments. Traditional Chinese
toxigen Tasks designed to evaluate language models on their propensity to generate toxic content. English
translation Tasks focused on evaluating the language translation capabilities of models. Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese
triviaqa A large-scale dataset for trivia question answering to test general knowledge. English
truthfulqa A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. English
turkishmmlu A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. Turkish
unitxt A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. English
unscramble Tasks involving the unscrambling of words whose characters have been rearranged, such as anagrams and reversed words, to test character-level manipulation. English
webqs Question answering tasks from the WebQuestions dataset, with answers drawn from the Freebase knowledge base. English
wikitext Tasks based on text from Wikipedia articles to assess language modeling and generation. English
winogrande A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge. English
wmdp A benchmark of potentially sensitive multiple-choice knowledge questions, on which the objective is to minimize performance. English
wmt2016 Tasks from the WMT 2016 shared task, focusing on translation between multiple languages. English, Czech, German, Finnish, Russian, Romanian, Turkish
wsc273 The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution. English
xcopa Cross-lingual Choice of Plausible Alternatives, testing reasoning in multiple languages. Estonian, Haitian, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese
xnli Cross-Lingual Natural Language Inference to test understanding across different languages. Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese
xnli_eu Cross-lingual Natural Language Inference tasks in Basque. Basque
xstorycloze Cross-lingual narrative understanding tasks to predict story endings in multiple languages. Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese
xwinograd Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. English, French, Japanese, Portuguese, Russian, Chinese
portuguese_bench Collection of tasks in European Portuguese encompassing various evaluation areas. Portuguese
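
The table above can be filtered by language when choosing what to evaluate. The following is a minimal sketch that copies a handful of rows into a dictionary; `FAMILY_LANGUAGES` and `families_for_language` are hypothetical names for illustration, and a family name is not necessarily a runnable task name, so check the CLI listing before passing one to the harness.

```python
# A few rows copied from the table above: family name -> languages column.
# This is an illustrative subset, not the full table.
FAMILY_LANGUAGES = {
    "basqueglue": ["Basque"],
    "eus_exams": ["Basque"],
    "headqa": ["Spanish", "English"],
    "kmmlu": ["Korean"],
    "kobest": ["Korean"],
    "xnli_eu": ["Basque"],
}


def families_for_language(language, table=FAMILY_LANGUAGES):
    """Return the sorted family names whose language list includes `language`."""
    wanted = language.casefold()  # compare case-insensitively
    return sorted(
        name
        for name, languages in table.items()
        if any(lang.casefold() == wanted for lang in languages)
    )


print(families_for_language("Basque"))  # ['basqueglue', 'eus_exams', 'xnli_eu']
print(families_for_language("korean"))  # ['kmmlu', 'kobest']
```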