CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Read original: arXiv:2401.11944 - Published 9/10/2024 by Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo and 12 others

Overview

As large multimodal models (LMMs) become more advanced, evaluating their performance is increasingly important.
There is a significant gap in evaluating the knowledge and reasoning abilities of LMMs in non-English contexts, such as Chinese.
The paper introduces CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to assess LMMs on tasks requiring college-level subject knowledge and deliberate reasoning in a Chinese context.

Plain English Explanation

The paper discusses the growing need to evaluate the capabilities of large multimodal models (LMMs) - models that can process and understand different types of data, such as text, images, and audio. As these models become more advanced, it's important to assess how well they perform on various tasks.

One area that has been overlooked is evaluating LMMs in non-English contexts, such as Chinese. The researchers created a new benchmark called CMMMU to address this gap. CMMMU is designed to test an LMM's ability to demonstrate college-level subject knowledge and engage in complex reasoning using a Chinese-language dataset.

The benchmark includes about 12,000 multimodal questions (questions that combine text and images) from college exams, quizzes, and textbooks, covering a wide range of subjects like art, business, science, and technology. The questions test the LMM's ability to understand different types of visual information, such as charts, diagrams, and chemical structures, and apply that knowledge to answer the questions.

By evaluating LMMs on this benchmark, the researchers aim to identify areas for improvement and drive the development of more capable artificial intelligence systems that can excel in diverse language contexts.

Technical Explanation

The researchers created CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark, to evaluate the performance of large multimodal models (LMMs) in a Chinese-language context. CMMMU is inspired by and follows the same annotation and analysis pattern as the previously developed MMMU benchmark.

CMMMU includes 12,000 manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.

The researchers evaluated 11 open-source models and the proprietary GPT-4V(ision) model on the CMMMU benchmark. Even GPT-4V achieved an accuracy of only 42%, indicating significant room for improvement in the capabilities of LMMs when it comes to complex perception and reasoning with domain-specific knowledge in the Chinese context.
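To make the accuracy figure concrete, here is a minimal Python sketch of how accuracy could be computed overall and broken down by discipline or image type. The `Question` record layout, the field names, and the toy data are assumptions made for illustration only; they are not the actual CMMMU schema or the authors' evaluation code, and the sketch handles only exact matching on an answer label.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical record layout; the real CMMMU schema may differ."""
    discipline: str   # one of the six disciplines, e.g. "Science"
    image_type: str   # e.g. "chart", "chemical structure"
    answer: str       # gold answer label, e.g. "B"


def accuracy_by(questions, predictions, key):
    """Return {group: accuracy}, where the group is read from attribute `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        group = getattr(question, key)
        total[group] += 1
        # Exact match on the normalized answer label.
        correct[group] += int(predicted.strip().upper() == question.answer)
    return {group: correct[group] / total[group] for group in total}


def overall_accuracy(questions, predictions):
    """Micro-averaged accuracy over all questions."""
    hits = sum(p.strip().upper() == q.answer for q, p in zip(questions, predictions))
    return hits / len(questions)


# Toy data, invented purely for illustration (not CMMMU results).
questions = [
    Question("Science", "chart", "A"),
    Question("Science", "chemical structure", "C"),
    Question("Business", "table", "B"),
    Question("Business", "chart", "D"),
]
predictions = ["A", "B", "b", "D"]

print(overall_accuracy(questions, predictions))            # 0.75
print(accuracy_by(questions, predictions, "discipline"))   # {'Science': 0.5, 'Business': 1.0}
```

Passing `"image_type"` as the key gives the same breakdown per image type, which is one simple way to see where a model is weakest.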

Critical Analysis

The CMMMU benchmark provides a valuable tool for evaluating the advanced knowledge and reasoning abilities of large multimodal models in a non-English context. By focusing on the Chinese language, the researchers have addressed an important gap in the existing literature, which has primarily focused on English-language benchmarks.

One potential limitation of the CMMMU benchmark is the reliance on manually collected questions from college-level materials. While this approach ensures the questions are challenging and representative of real-world knowledge requirements, it may also introduce biases or inconsistencies in the dataset. The researchers acknowledge this and suggest that future work could explore automating the question generation process to create a more scalable and diverse benchmark.

Additionally, the paper does not provide a detailed analysis of the specific areas where the evaluated LMMs struggled the most, such as particular types of visual information or subject domains. A more granular understanding of the models' strengths and weaknesses could help guide future research and development efforts.

Overall, the CMMMU benchmark is a valuable contribution to the field of large multimodal model evaluation, and the researchers' findings highlight the need for continued advancements in artificial intelligence to achieve expert-level performance in diverse language and knowledge domains.

Conclusion

The introduction of the CMMMU benchmark represents a significant step forward in evaluating the capabilities of large multimodal models in a Chinese-language context. By focusing on tasks that require college-level subject knowledge and complex reasoning, the benchmark aims to push the boundaries of what these models can achieve.

The researchers' evaluation of 11 open-source models and the proprietary GPT-4V reveals that there is still a considerable gap in the models' ability to excel in the Chinese-language domain. This underscores the importance of developing more robust and adaptable artificial intelligence systems that can thrive in diverse language and cultural contexts.

The CMMMU benchmark's impact on the field could be significant, as it provides a standardized way to measure progress and inform the development of the next generation of large multimodal models. By promoting the democratization of LMMs across languages, the benchmark has the potential to bring the benefits of advanced AI to a wider global audience.

Related Papers

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo, Haoran Zhang, Xingwei Qu, Junjie Wang, Ruibin Yuan, Yizhi Li, Zekun Wang, Yudong Liu, Yu-Hsuan Tsai, Fengji Zhang, Chenghua Lin, Wenhao Huang, Jie Fu

As the capabilities of large multimodal models (LMMs) continue to advance, evaluating the performance of LMMs emerges as an increasing need. Additionally, there is an even larger gap in evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by and strictly follows the annotation and analysis pattern of MMMU. CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, like its companion, MMMU. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LLMs and one proprietary GPT-4V(ision). Even GPT-4V only achieves accuracies of 42%, indicating a large space for improvement. CMMMU will boost the community to build the next-generation LMMs towards expert artificial intelligence and promote the democratization of LMMs by providing diverse language contexts.

9/10/2024

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

6/14/2024

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu, Hua Huang

Multi-modal large language models (MLLMs) have achieved remarkable progress and demonstrated powerful knowledge comprehension and reasoning abilities. However, the mastery of domain-specific knowledge, which is essential for evaluating the intelligence of MLLMs, continues to be a challenge. Current multi-modal benchmarks for domain-specific knowledge concentrate on multiple-choice questions and are predominantly available in English, which imposes limitations on the comprehensiveness of the evaluation. To this end, we introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school. The questions can be categorized into 3 types: multiple-choice, multiple-response, and fill-in-the-blank, bringing greater challenges to MLLMs. In addition, we propose an evaluation strategy called Positional Error Variance for assessing multiple-choice questions. The strategy aims to perform a quantitative analysis of position bias. We evaluate seven open-source MLLMs along with GPT4-V, Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a significant challenge to the recent MLLMs. The data and code are available at https://github.com/FlagOpen/CMMU.

5/9/2024

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, Liang He

Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or opinions, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets.

9/9/2024