Are We Done with MMLU?

Read original: arXiv:2406.04127 - Published 6/10/2024 by Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani and 6 others

Overview

The paper "Are We Done with MMLU?" critically examines the popular Massive Multitask Language Understanding (MMLU) benchmark, which is used to evaluate the broad language understanding capabilities of large language models.
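The authors find numerous ground-truth errors in MMLU (for example, 57% of the analysed Virology questions contain errors) and introduce an error taxonomy for identifying them.
They release MMLU-Redux, a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.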
The paper also discusses alternative benchmarks, such as KMMLU for Korean language understanding and CMMU for Chinese multi-modal understanding.

Plain English Explanation

The paper looks at a well-known test called MMLU, which is used to see how well large language models (LLMs) can understand and reason about a wide range of topics. The authors argue that MMLU has issues that need to be addressed. In particular, they find that many of its questions contain errors, such as a wrong answer key, so a model's score can misrepresent what the model actually knows.

The authors propose a framework for finding such errors and release MMLU-Redux, a subset of 3,000 questions across 30 subjects that was re-annotated by hand. They also discuss other benchmarks like KMMLU and CMMU that test language understanding in different languages and modalities.

The key idea is that we need more reliable ways to evaluate how capable these powerful language models truly are. If the answer key itself is wrong, the score is wrong too; by correcting the test, we can get a clearer picture of the models' strengths and limitations.

Technical Explanation

The paper first provides an overview of the MMLU benchmark, which tests a model's ability to answer multiple-choice questions across a diverse set of 57 subjects. According to the abstract, the main issues the authors identify are:

Numerous ground-truth errors, which obscure the true capabilities of LLMs.
A high error rate in some subjects: 57% of the analysed questions in the Virology subset contain errors.
Significant discrepancies between model performance on the re-annotated questions and the metrics that were originally reported.

To address these problems, the authors propose a framework built on a novel error taxonomy that categorizes the kinds of errors found in the dataset itself, rather than the mistakes models make. They use this taxonomy to manually re-annotate 3,000 questions across 30 MMLU subjects, producing MMLU-Redux, and then re-evaluate language models on that subset.
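The abstract quoted under Related Papers reports a per-subject error rate (57% for Virology) and discrepancies between original and re-annotated scores. The following is a minimal sketch of how such numbers could be computed from annotated questions; it is not the authors' code. The `AnnotatedItem` fields, the label strings, and the toy records are all hypothetical and chosen for illustration only.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label meaning "no problem found"; any other value is treated
# as an error category. The category names used below are illustrative and
# are not claimed to match the paper's taxonomy.
OK = "ok"


@dataclass
class AnnotatedItem:
    subject: str      # e.g. "virology"
    error_type: str   # OK or an error-category label
    gold: str         # original ground-truth answer from the benchmark
    prediction: str   # answer given by the model under evaluation


def error_rate(items):
    """Fraction of items flagged with any error category."""
    if not items:
        return 0.0
    flagged = sum(1 for it in items if it.error_type != OK)
    return flagged / len(items)


def accuracy(items, clean_only=False):
    """Accuracy against the original key, optionally on error-free items only."""
    pool = [it for it in items if it.error_type == OK] if clean_only else list(items)
    if not pool:
        return 0.0
    return sum(it.prediction == it.gold for it in pool) / len(pool)


def error_breakdown(items):
    """Count of flagged items per error category."""
    return Counter(it.error_type for it in items if it.error_type != OK)


# Made-up toy data, not taken from MMLU or MMLU-Redux.
items = [
    AnnotatedItem("virology", OK, "A", "A"),
    AnnotatedItem("virology", "wrong_groundtruth", "B", "C"),
    AnnotatedItem("virology", OK, "D", "D"),
    AnnotatedItem("virology", "no_correct_answer", "A", "B"),
]

print(error_rate(items))                 # 0.5
print(accuracy(items))                   # 0.5
print(accuracy(items, clean_only=True))  # 1.0
```

In this toy case the model looks 50% accurate against the original key but is fully correct on the error-free questions, which is the kind of discrepancy the abstract describes. Scoring only the clean subset is one simple design choice; scoring against corrected answers is another, and the sketch does not assume which one the paper uses.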

The paper also discusses alternative benchmarks like KMMLU for Korean and CMMU for Chinese, which test language understanding in different languages and modalities. Additionally, the authors reference research on predicting item difficulty as a way to make MMLU more challenging.

Critical Analysis

The paper raises valid concerns about the limitations of MMLU and the need for more robust and challenging language understanding benchmarks. The authors' proposed "Error Detection Taxonomy" is a useful framework for analyzing model performance and identifying areas for improvement.

However, the paper does not provide a comprehensive solution to the issues it identifies. While discussing alternative benchmarks like KMMLU and CMMU is valuable, the authors do not delve deeply into the specific benefits and drawbacks of these alternatives.

Additionally, the paper could have explored the feasibility and potential challenges of implementing the authors' suggestions for making MMLU more difficult, such as using automated item difficulty prediction. This would have provided a more comprehensive understanding of the practical implications of their proposals.

Overall, the paper successfully highlights the need for more robust and challenging language understanding benchmarks, but could have provided a more detailed roadmap for how to address the issues it identifies.

Conclusion

The paper "Are We Done with MMLU?" makes a compelling case that the popular MMLU benchmark has significant limitations and needs to be improved to better evaluate the capabilities of large language models. By identifying specific issues with MMLU and proposing solutions, the authors contribute to the ongoing discussion on how to effectively measure and compare the language understanding abilities of advanced AI systems.

The paper's findings have important implications for the development and evaluation of language models, as well as the broader field of natural language processing. As the authors note, the ability to accurately assess language understanding is crucial for advancing the state of the art and ensuring that these powerful AI systems are being developed and deployed responsibly.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Are We Done with MMLU?

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini

Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.

6/10/2024

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen

In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.

6/26/2024

TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish

Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Şenel, Anna Korhonen, Hinrich Schütze

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs' understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language. We publicly release our code for the dataset and evaluation: https://github.com/ArdaYueksel/TurkishMMLU.

7/18/2024

Revisiting Multi-Modal LLM Evaluation

Jian Lu, Shikhar Srivastava, Junyu Chen, Robik Shrestha, Manoj Acharya, Kushal Kafle, Christopher Kanan

With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis on 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that requires identifying all image regions that satisfy a given query. Our experiments reveal the weaknesses of many MLLMs that have not previously been reported. Our code is integrated into the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs. Project webpage: https://kevinlujian.github.io/MLLM_Evaluations/

8/13/2024