CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

arXiv:2401.14011

Published 5/9/2024 by Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu, Hua Huang

Abstract

Multi-modal large language models (MLLMs) have achieved remarkable progress and demonstrated powerful knowledge comprehension and reasoning abilities. However, the mastery of domain-specific knowledge, which is essential for evaluating the intelligence of MLLMs, continues to be a challenge. Current multi-modal benchmarks for domain-specific knowledge concentrate on multiple-choice questions and are predominantly available in English, which imposes limitations on the comprehensiveness of the evaluation. To this end, we introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school. The questions can be categorized into 3 types: multiple-choice, multiple-response, and fill-in-the-blank, bringing greater challenges to MLLMs. In addition, we propose an evaluation strategy called Positional Error Variance for assessing multiple-choice questions. The strategy aims to perform a quantitative analysis of position bias. We evaluate seven open-source MLLMs along with GPT4-V, Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a significant challenge to the recent MLLMs. The data and code are available at https://github.com/FlagOpen/CMMU.

Overview

This paper introduces CMMU, a new benchmark for evaluating Chinese multi-modal multi-type question understanding and reasoning capabilities.
The benchmark covers three question types (multiple-choice, multiple-response, and fill-in-the-blank) across 7 subjects, with questions grounded in both text and images.
The goal is to spur progress in developing AI systems that can effectively comprehend and reason about multi-modal information in the Chinese language.

Plain English Explanation

The researchers have created a new benchmark called CMMU to test how well artificial intelligence (AI) systems can understand and answer different types of questions in Chinese. These questions can involve both text and images, and they cover seven school subjects at levels ranging from primary school to high school.

The MMBench paper showed that current multi-modal AI models struggle with questions that require understanding the relationship between text and images. The Measuring Taiwanese Mandarin paper highlighted the need for better Chinese language understanding.

The CMMU benchmark aims to address these challenges by providing a comprehensive set of multi-modal, multi-type questions in Chinese. This will help researchers develop AI systems that can more effectively comprehend and reason about information that combines text and images, which is important for real-world applications like question answering and visual-language understanding.

Technical Explanation

The CMMU benchmark consists of 3,603 multi-modal questions in Chinese, spanning 7 subjects and covering knowledge from primary to high school. The authors also propose Positional Error Variance, an evaluation strategy that quantifies position bias on multiple-choice questions, and they evaluate seven open-source MLLMs along with GPT4-V, Gemini-Pro, and Qwen-VL-Plus. The questions are divided into three types:

Multiple-choice: The model selects the single correct option from those given.
Multiple-response: The model selects every correct option, where more than one may apply.
Fill-in-the-blank: The model produces the answer itself rather than choosing from listed options.
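As a rough illustration of why the three question types named in the abstract (multiple-choice, multiple-response, and fill-in-the-blank) call for different scoring rules, here is a minimal Python sketch. It is not the CMMU evaluation code: the function names, the whitespace and case normalization, and the use of exact string matching for fill-in-the-blank answers are all assumptions made for this example, and the official repository may score answers differently.

```python
from typing import Iterable


def normalize(text: str) -> str:
    """Drop all whitespace and lowercase, so trivial formatting differences are ignored."""
    return "".join(text.split()).lower()


def score_multiple_choice(predicted: str, gold: str) -> bool:
    """Correct only if the single predicted option label matches the gold label."""
    return normalize(predicted) == normalize(gold)


def score_multiple_response(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """Correct only if the predicted set of labels equals the gold set (order is ignored)."""
    return {normalize(p) for p in predicted} == {normalize(g) for g in gold}


def score_fill_in_blank(predicted: str, accepted: Iterable[str]) -> bool:
    """Assumption: exact match against a list of accepted answers after normalization."""
    return normalize(predicted) in {normalize(a) for a in accepted}


def accuracy(results: Iterable[bool]) -> float:
    """Fraction of correct results; 0.0 for an empty input."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0
```

Under these assumptions a multiple-response answer that names only some of the correct options counts as wrong, which is one reason this type is harder than plain multiple-choice.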

The researchers used a combination of crowdsourcing and expert annotation to create the CMMU dataset, ensuring high-quality and diverse questions. They also included a range of visual elements, such as charts, diagrams, and photographs, to make the task more challenging and realistic.

The MMC paper and the WorldQA paper have demonstrated the value of multi-modal benchmarks for advancing AI capabilities in areas like chart understanding and visual-language reasoning.

The CMMU benchmark provides a new testbed for evaluating the performance of Chinese multi-modal models, which can help drive progress in this important area of AI research and development.
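The abstract mentions Positional Error Variance, a strategy for quantifying position bias on multiple-choice questions, but this summary does not reproduce its formula. The Python sketch below therefore shows only one plausible way to measure position bias and should not be read as the paper's definition. It assumes every question has the same number of options, rotates the options so the correct answer visits each position, counts which positions the model wrongly picks, and takes the variance of those error shares. The names `rotate_options`, `error_share_variance`, and the `predict` callback are hypothetical.

```python
from collections import Counter
from statistics import pvariance
from typing import Callable, List, Sequence, Tuple


def rotate_options(options: Sequence[str], gold_index: int, shift: int) -> Tuple[List[str], int]:
    """Cyclically shift the options right by `shift` and return the new gold index."""
    n = len(options)
    rotated = [options[(i - shift) % n] for i in range(n)]
    return rotated, (gold_index + shift) % n


def error_share_variance(
    questions: Sequence[Tuple[Sequence[str], int]],
    predict: Callable[[Sequence[str]], int],
) -> float:
    """Variance of the share of wrong picks landing on each option position.

    questions: (options, gold_index) pairs, all with the same number of options.
    predict: hypothetical model wrapper returning the index of the chosen option.
    A value of 0.0 means errors are spread evenly (or there are no errors);
    larger values mean wrong answers cluster on particular positions.
    """
    n_positions = len(questions[0][0])
    wrong_picks: Counter = Counter()
    for options, gold_index in questions:
        for shift in range(n_positions):
            rotated, gold = rotate_options(options, gold_index, shift)
            choice = predict(rotated)
            if choice != gold:
                wrong_picks[choice] += 1
    total = sum(wrong_picks.values())
    if total == 0:
        return 0.0
    shares = [wrong_picks[p] / total for p in range(n_positions)]
    return pvariance(shares)
```

A model that always answers with the first option concentrates all of its errors on one position and scores high on this measure, while a model whose mistakes are spread evenly across positions scores close to zero.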

Critical Analysis

The CMMU benchmark is a valuable contribution to the field of multi-modal AI research, as it addresses the need for more comprehensive and challenging Chinese-language datasets. However, there are a few potential limitations and areas for further exploration:

Bias and Representation: While the dataset aims to be diverse, there may still be biases in the types of questions, images, and topics included. It would be important to carefully analyze the dataset for potential biases and ensure that it adequately represents the diversity of the Chinese language and culture.
Real-world Applicability: The researchers mention that the CMMU benchmark is designed to be relevant to real-world applications, but more work may be needed to validate the direct applicability of the benchmark to specific use cases.
Scalability and Generalization: As with any benchmark, there are questions about how well the models trained on CMMU will generalize to other multi-modal tasks and datasets, especially as the field of multi-modal AI continues to evolve rapidly.
Ethical Considerations: The use of multi-modal AI systems in real-world applications raises important ethical questions, such as fairness, transparency, and accountability, which should be carefully considered as this technology develops.

Despite these potential limitations, the CMMU benchmark represents an important step forward in advancing the state of the art in Chinese multi-modal AI. The VisualWebBench paper has highlighted the need for more diverse and challenging multi-modal benchmarks, and the CMMU dataset helps address this gap.

Conclusion

The CMMU benchmark provides a new and valuable tool for evaluating the performance of Chinese multi-modal AI models. By including a diverse range of question types and visual elements, the benchmark aims to spur progress in developing AI systems that can effectively comprehend and reason about multi-modal information in the Chinese language.

As the field of multi-modal AI continues to evolve, benchmarks like CMMU will play a crucial role in driving innovation and ensuring that the technology developed is applicable, scalable, and aligned with important ethical considerations. The insights and advancements made through the CMMU benchmark have the potential to benefit a wide range of real-world applications, from question answering to visual-language understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

6/14/2024

cs.CL cs.AI cs.CV

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

Hongyu Wang, Jiayu Xu, Senwei Xie, Ruiping Wang, Jialin Li, Zhaojie Xie, Bin Zhang, Chuyan Xiong, Xilin Chen

Multilingual multimodal reasoning is a core component in achieving human-level intelligence. However, most existing benchmarks for multilingual multimodal reasoning struggle to differentiate between models of varying performance; even language models without visual capabilities can easily achieve high scores. This leaves a comprehensive evaluation of leading multilingual multimodal models largely unexplored. In this work, we introduce M4U, a novel and challenging benchmark for assessing the capability of multi-discipline multilingual multimodal understanding and reasoning. M4U contains 8,931 samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare in Chinese, English, and German. Using M4U, we conduct extensive evaluations of 21 leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools. The evaluation results show that the state-of-the-art model, GPT-4o, achieves only 47.6% average accuracy on M4U. Additionally, we observe that the leading LMMs exhibit significant language preferences. Our in-depth analysis indicates that leading LMMs, including GPT-4o, suffer performance degradation when prompted with cross-lingual multimodal questions, such as images with key textual information in Chinese while the question is in German. We believe that M4U can serve as a crucial tool for systematically evaluating LMMs based on their multilingual multimodal reasoning capabilities and monitoring their development. The homepage, code, and data are publicly available.

5/27/2024

cs.CV cs.CL

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen

In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.

6/26/2024

cs.CL

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin

Large vision-language models have recently achieved remarkable progress, exhibiting great perception and reasoning abilities concerning visual information. However, how to effectively evaluate these large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of two elements. The first element is a meticulously curated dataset that surpasses existing similar benchmarks in terms of the number and variety of evaluation questions and abilities. The second element introduces a novel CircularEval strategy and incorporates the use of ChatGPT. This implementation is designed to convert free-form predictions into pre-defined choices, thereby facilitating a more robust evaluation of the model's predictions. MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will assist the research community in better evaluating their models and encourage future advancements in this domain. Project page: https://opencompass.org.cn/mmbench.

4/30/2024

cs.CV cs.CL