MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

2406.01574

Published 6/26/2024 by Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang and 7 others

cs.CL

Abstract

In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.
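The abstract reports prompt sensitivity as a percentage but does not say how it is computed. As a rough illustration only, the sketch below assumes sensitivity means the spread of one model's accuracy across prompt styles, reported both as a max-minus-min range and as a standard deviation. The function name `prompt_sensitivity` and the example scores are hypothetical and are not taken from the paper.

```python
from statistics import pstdev


def prompt_sensitivity(accuracies):
    """Return (range, population std dev) of accuracy across prompt styles.

    Assumption: "sensitivity" is the spread of scores for a single model
    when only the prompt style changes. The paper may define it differently.
    """
    if not accuracies:
        raise ValueError("need at least one accuracy value")
    spread = max(accuracies) - min(accuracies)
    return spread, pstdev(accuracies)


# Made-up accuracies (in percent) for one model under four prompt styles.
example_scores = [70.0, 71.0, 69.0, 70.0]
spread, std = prompt_sensitivity(example_scores)
print(f"range = {spread:.1f} points, std dev = {std:.2f} points")
```

A smaller range or standard deviation would indicate that a benchmark ranks models more consistently regardless of how the question is phrased.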

Overview

This paper introduces MMLU-Pro, an enhanced version of the MMLU benchmark that is designed to be more robust and more challenging.
MMLU-Pro adds reasoning-focused questions, expands each question from four answer options to ten, and removes trivial and noisy questions found in MMLU.
The paper reports that model accuracy drops by 16% to 33% relative to MMLU, that the sensitivity of scores to prompt variations falls from 4-5% to about 2% across 24 prompt styles, and that Chain of Thought (CoT) reasoning outperforms direct answering on MMLU-Pro.

Plain English Explanation

The paper presents a new benchmark called MMLU-Pro, which is designed to test the language understanding and reasoning capabilities of AI models more thoroughly than the original MMLU. As models have improved, their MMLU scores have begun to plateau, which makes it hard to tell strong models apart, so the researchers created MMLU-Pro to restore that separation.

Like MMLU, MMLU-Pro draws its questions from diverse domains. The difference is one of emphasis: MMLU is mostly knowledge-driven, while MMLU-Pro adds more questions that require working through a problem rather than recalling a fact.

The researchers made two further changes to make the benchmark more robust. They expanded each question from four answer options to ten, which lowers the odds of a lucky guess from 25% to 10%, and they removed questions from MMLU that were trivial or noisy. Together these changes make it harder for models to score well through shortcuts rather than understanding.

When the researchers tested large language models on MMLU-Pro, accuracy fell by 16% to 33% compared to MMLU, which shows there is still significant room for improvement. Scores also became less dependent on how the prompt was worded, and models that reasoned step by step (Chain of Thought) did better than models that answered directly, in contrast to the findings on the original MMLU.

Technical Explanation

The paper introduces MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. MMLU-Pro builds upon the mostly knowledge-driven MMLU benchmark and modifies its question set and answer format to make the benchmark more robust and challenging.

The key innovations in MMLU-Pro include:

Integrating more challenging, reasoning-focused questions into the mostly knowledge-driven MMLU question set.
Expanding the choice set for each question from four options to ten.
Eliminating the trivial and noisy questions found in MMLU.
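To make the effect of a larger choice set concrete, the sketch below computes the random-guess baseline for four and ten options and formats a question with lettered choices. It is an illustration under stated assumptions: the prompt layout and the helper names `chance_accuracy` and `format_question` are hypothetical and do not reproduce the authors' prompts.

```python
import string


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("need at least one option")
    return 1.0 / num_options


def format_question(question, options):
    """Render a question with lettered options (A, B, C, ...).

    Assumption: one option per line followed by an "Answer:" cue. This is a
    common layout, not necessarily the one used in the paper.
    """
    if not 2 <= len(options) <= 26:
        raise ValueError("expected between 2 and 26 options")
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


print(chance_accuracy(4))   # 0.25
print(chance_accuracy(10))  # 0.1
print(format_question("What is 2 + 3?", [str(n) for n in range(1, 11)]))
```

Moving from four to ten options lowers the guessing floor from 25% to 10%, so a given score reflects less luck.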

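The researchers evaluate large language models on MMLU-Pro and report three main findings. Accuracy drops by 16% to 33% compared to MMLU. Across 24 prompt styles, the sensitivity of model scores to prompt variations falls from 4-5% on MMLU to about 2% on MMLU-Pro. Finally, models using Chain of Thought (CoT) reasoning outperform direct answering on MMLU-Pro, in contrast to the original MMLU, which the authors take as evidence that MMLU-Pro contains more complex reasoning questions.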
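Scoring a model on a multiple-choice benchmark usually means pulling a single answer letter out of the model's response and comparing it with the gold label. The sketch below shows one way this could work for step-by-step responses, assuming the response ends with a phrase such as "the answer is (X)" where X is one of ten letters A through J. The regular expression and the helper names `extract_choice` and `accuracy` are hypothetical and are not the authors' evaluation code.

```python
import re

# Assumption: the final answer appears as "answer is (X)" or "answer is X",
# with X in A-J. The trailing \b avoids matching words like "Going".
ANSWER_PATTERN = re.compile(r"[Aa]nswer is \(?([A-J])\b")


def extract_choice(response):
    """Return the last answer letter stated in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses, gold_letters):
    """Fraction of responses whose extracted letter equals the gold letter.

    Responses with no extractable answer count as incorrect.
    """
    if len(responses) != len(gold_letters):
        raise ValueError("responses and gold_letters must have equal length")
    if not responses:
        raise ValueError("need at least one response")
    correct = sum(
        extract_choice(resp) == gold
        for resp, gold in zip(responses, gold_letters)
    )
    return correct / len(responses)


responses = [
    "First compute the sum, then compare. The answer is (B).",
    "Eliminating options one by one, the answer is D",
    "I cannot determine this from the information given.",
]
print(accuracy(responses, ["B", "A", "C"]))
```

Taking the last match rather than the first is a design choice: a step-by-step response may mention tentative answers before settling on a final one.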

The authors conclude that MMLU-Pro is a more discriminative benchmark than MMLU and is therefore better suited to tracking progress in the field.

Critical Analysis

The paper makes a strong case for the need to develop more robust and challenging benchmarks for multi-task language understanding. The authors provide a thorough explanation of the limitations of existing benchmarks and how MMLU-Pro aims to address these shortcomings.

One potential concern is the scalability of the benchmark: creating a diverse set of high-quality questions across a wide range of domains is a significant undertaking. It remains to be seen whether the MMLU-Pro benchmark can be maintained and expanded over time to keep pace with the rapid advancements in language models.

Additionally, while expanding the choice set to ten options is a valuable change, the abstract does not describe how the additional answer options were written or checked for quality. Because weak distractors would make the extra options easy to rule out, a detailed analysis of distractor quality would strengthen confidence in the benchmark's ability to assess reasoning.

Overall, the MMLU-Pro benchmark represents an important step forward in the quest to develop more comprehensive and challenging language understanding evaluations. The insights gained from this work could inform the design of future benchmarks and drive progress in the field of multi-task language understanding.

Conclusion

The MMLU-Pro benchmark introduced in this paper aims to provide a more robust and challenging assessment of multi-task language understanding abilities. By adding reasoning-focused questions, expanding the choice set from four to ten options, and removing trivial and noisy questions, the authors have created a benchmark that separates models more clearly than MMLU does.

The evaluation of large language models on MMLU-Pro highlights the significant room for improvement in this area, underscoring the importance of continued research and development. Advances in multi-task language understanding could lead to more flexible, reliable, and value-aligned AI systems, with far-reaching implications for society.

While the MMLU-Pro benchmark is not without its limitations, it represents an important step forward in the quest to develop more comprehensive and informative language understanding assessments. The insights gained from this work can inform the design of future benchmarks and drive progress in this critical area of AI research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu, Hua Huang

Multi-modal large language models (MLLMs) have achieved remarkable progress and demonstrated powerful knowledge comprehension and reasoning abilities. However, the mastery of domain-specific knowledge, which is essential for evaluating the intelligence of MLLMs, continues to be a challenge. Current multi-modal benchmarks for domain-specific knowledge concentrate on multiple-choice questions and are predominantly available in English, which imposes limitations on the comprehensiveness of the evaluation. To this end, we introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school. The questions can be categorized into 3 types: multiple-choice, multiple-response, and fill-in-the-blank, bringing greater challenges to MLLMs. In addition, we propose an evaluation strategy called Positional Error Variance for assessing multiple-choice questions. The strategy aims to perform a quantitative analysis of position bias. We evaluate seven open-source MLLMs along with GPT4-V, Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a significant challenge to the recent MLLMs. The data and code are available at https://github.com/FlagOpen/CMMU.

5/9/2024

cs.CL cs.AI cs.MM

Reasoning or Simply Next Token Prediction? A Benchmark for Stress-Testing Large Language Models

Wentian Wang, Paul Kantor, Jacob Feldman, Lazaros Gallos, Hao Wang

We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs) by challenging their performance in question-answering tasks with modified terms. We reasoned that an agent that "truly" understands a concept can still evaluate it when key terms are replaced by suitably defined alternate terms, and sought to differentiate such comprehension from mere text replacement. In our study, we modified standardized test questions by replacing a key term with a dummy word along with its definition. The key term could be in the context of questions, answers, or both questions and answers. Notwithstanding the high scores achieved by recent popular LLMs on the MMLU leaderboard, we found a substantial reduction in model performance after such replacement, suggesting poor comprehension. This new benchmark provides a rigorous benchmark for testing true model comprehension, and poses a challenge to the broader scientific community.

6/26/2024

cs.CL cs.AI cs.LG

Are We Done with MMLU?

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini

Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.

6/10/2024

cs.CL cs.AI

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

6/14/2024

cs.CL cs.AI cs.CV