GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation

Read original: arXiv:2402.15745 - Published 8/7/2024 by Yi Zong, Xipeng Qiu


Overview

Large vision-language models (LVLMs) have shown impressive abilities in image perception and language understanding
Existing multimodal benchmarks focus on primary perception and commonsense knowledge, but don't fully reflect the comprehensive capabilities of LVLMs
Researchers propose GAOKAO-MM, a new multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO)
GAOKAO-MM tests LVLMs on a wide range of subjects and image types, including diagrams, function graphs, maps, and photos
The benchmark is designed to assess LVLMs' perception, understanding, knowledge, and reasoning abilities at a human level

Plain English Explanation

The researchers wanted to create a more comprehensive test for large vision-language models, which are AI systems that can understand both images and language. Existing tests for these models focused mainly on basic visual perception and common sense knowledge, but the researchers felt this didn't fully capture the models' true capabilities.

To address this, they developed a new benchmark called GAOKAO-MM, which is based on the Chinese college entrance exam (known as the GAOKAO). This test covers 8 different subject areas and 12 types of images, like diagrams, graphs, maps, and photos. The goal is to assess the models' abilities in perception, understanding, knowledge, and reasoning at a human level.

When the researchers evaluated 10 different large vision-language models on this new benchmark, every model scored below 50% accuracy, and the best one reached only 48.1%. This suggests these models still have significant room for improvement before reaching human-level performance across a diverse set of multimodal tasks.

Technical Explanation

The researchers developed GAOKAO-MM, a new multimodal benchmark derived from the Chinese College Entrance Examination (GAOKAO). GAOKAO-MM encompasses 8 subject areas and 12 types of images, including diagrams, function graphs, maps, and photos. This benchmark is designed to assess the comprehensive capabilities of large vision-language models in perception, understanding, knowledge, and reasoning, going beyond the primary perception and commonsense knowledge tested in existing multimodal benchmarks.

The researchers evaluated 10 different LVLMs on GAOKAO-MM and found that the highest-performing model, GPT-4-Vision, achieved an accuracy of only 48.1%. The next best-performing models were Qwen-VL-Plus at 41.2% and Gemini-Pro-Vision at 35.1%. These results indicate that current LVLMs still have significant room for improvement to reach human-level performance on this comprehensive multimodal assessment.
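To make the accuracy figures concrete, here is a minimal sketch of how an accuracy score on a benchmark of this kind might be computed. It assumes each question has a single gold answer and that a prediction counts as correct on a case-insensitive exact match. The `Prediction` class, the `accuracy` function, and the sample data are hypothetical and are not taken from the paper or its evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One model answer paired with the reference answer (hypothetical schema)."""
    question_id: str
    gold: str
    predicted: str


def accuracy(predictions):
    """Return the fraction of predictions that match the gold answer.

    Matching ignores case and surrounding whitespace; this scoring rule
    is an assumption made for illustration.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    correct = sum(
        p.predicted.strip().upper() == p.gold.strip().upper()
        for p in predictions
    )
    return correct / len(predictions)


# Made-up answers, for illustration only.
sample = [
    Prediction("q1", "A", "A"),
    Prediction("q2", "C", "b"),
    Prediction("q3", "D", " d "),
    Prediction("q4", "B", "A"),
]
print(f"{accuracy(sample):.1%}")  # 2 of 4 correct -> 50.0%
```

Under this scoring rule, a reported accuracy of 48.1% means a model answered slightly fewer than half of the questions correctly.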

The authors' multi-dimensional analysis highlights strengths and weaknesses of the evaluated LVLMs. In their reading, the results indicate that while these models have made substantial progress, they remain a moderate distance from artificial general intelligence (AGI). The findings of this study can help guide the development of more capable, multilingual large vision-language models.
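One simple way to look at results along several dimensions is to break accuracy down by an attribute of each question, such as its subject or image type. The sketch below illustrates that idea only. The record fields (`subject`, `image_type`, `correct`), the subject labels, and the data are hypothetical, and this is not a reproduction of the authors' analysis.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Group scored records by `key` and return the accuracy of each group."""
    totals = defaultdict(lambda: [0, 0])  # group name -> [correct, total]
    for record in records:
        bucket = totals[record[key]]
        bucket[0] += int(record["correct"])
        bucket[1] += 1
    return {name: correct / total for name, (correct, total) in totals.items()}


# Made-up scored records, for illustration only.
records = [
    {"subject": "math", "image_type": "function graph", "correct": True},
    {"subject": "math", "image_type": "diagram", "correct": False},
    {"subject": "geography", "image_type": "map", "correct": True},
    {"subject": "geography", "image_type": "photo", "correct": True},
]

print(accuracy_by(records, "subject"))
print(accuracy_by(records, "image_type"))
```

Comparing the per-group numbers shows where a model is relatively strong or weak, which an overall accuracy figure hides.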

Critical Analysis

The GAOKAO-MM benchmark provides a valuable new tool for evaluating the capabilities of large vision-language models. By testing a broader range of subjects and image types beyond just basic perception and commonsense, the benchmark offers a more comprehensive assessment of these models' true abilities.

However, one potential limitation is that the benchmark is based on the Chinese GAOKAO exam, which may introduce cultural and linguistic biases that could advantage or disadvantage certain models. It would be interesting to see if similar benchmarks could be developed for other educational contexts and languages to provide a more globally representative evaluation.

Additionally, while the results suggest current LVLMs still have significant room for improvement, the paper does not delve deeply into the specific strengths, weaknesses, and failure modes of the evaluated models. Further analysis in this area could yield additional insights to guide future model development.

Overall, the GAOKAO-MM benchmark represents an important step forward in multimodal evaluation and highlights the continued challenges in developing AI systems that can match human-level performance across a diverse range of tasks and modalities.

Conclusion

The researchers have proposed GAOKAO-MM, a new multimodal benchmark derived from the Chinese College Entrance Examination, to more comprehensively evaluate the capabilities of large vision-language models. By testing perception, understanding, knowledge, and reasoning abilities across a wide range of subjects and image types, GAOKAO-MM provides a more rigorous assessment than existing benchmarks.

The evaluation of 10 LVLMs on this benchmark revealed that even the best-performing model, GPT-4-Vision, reached only 48.1% accuracy, and every model stayed below 50%. This suggests these systems still have significant room for improvement before reaching human-level multimodal proficiency. The insights from this study can help guide the development of more capable, multilingual large vision-language models as the field progresses towards the long-term goal of artificial general intelligence.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation

Yi Zong, Xipeng Qiu

Large Vision-Language Models (LVLMs) have demonstrated great abilities in image perception and language understanding. However, existing multimodal benchmarks focus on primary perception abilities and commonsense knowledge, which are insufficient to reflect the comprehensive capabilities of LVLMs. We propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO), comprising 8 subjects and 12 types of images, such as diagrams, function graphs, maps and photos. GAOKAO-MM derives from native Chinese context and sets human-level requirements for the model's abilities, including perception, understanding, knowledge and reasoning. We evaluate 10 LVLMs and find that the accuracies of all of them are lower than 50%, with GPT-4-Vision (48.1%), Qwen-VL-Plus (41.2%) and Gemini-Pro-Vision (35.1%) ranking in the top three positions. The results of our multi-dimension analysis indicate that LVLMs have a moderate distance towards Artificial General Intelligence (AGI) and provide insights facilitating the development of multilingual LVLMs.

8/7/2024

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, Yu Qiao

Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have a high potential to offer substantial assistance for diagnosis and treatment. Before that, it is crucial to develop benchmarks to evaluate LVLMs' effectiveness in various medical applications. Current benchmarks are often built upon specific academic literature, mainly focusing on a single domain, and lacking varying perceptual granularities. Thus, they face specific challenges, including limited clinical relevance, incomplete evaluations, and insufficient guidance for interactive LVLMs. To address these limitations, we developed the GMAI-MMBench, the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 285 datasets across 39 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format. Additionally, we implemented a lexical tree structure that allows users to customize evaluation tasks, accommodating various assessment needs and substantially supporting medical AI research and applications. We evaluated 50 LVLMs, and the results show that even the advanced GPT-4o only achieves an accuracy of 52%, indicating significant room for improvement. Moreover, we identified five key insufficiencies in current cutting-edge LVLMs that need to be addressed to advance the development of better medical applications. We believe that GMAI-MMBench will stimulate the community to build the next generation of LVLMs toward GMAI.

8/12/2024


CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo, Haoran Zhang, Xingwei Qu, Junjie Wang, Ruibin Yuan, Yizhi Li, Zekun Wang, Yudong Liu, Yu-Hsuan Tsai, Fengji Zhang, Chenghua Lin, Wenhao Huang, Jie Fu

As the capabilities of large multimodal models (LMMs) continue to advance, evaluating the performance of LMMs emerges as an increasing need. Additionally, there is an even larger gap in evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by and strictly follows the annotation and analysis pattern of MMMU. CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, like its companion, MMMU. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LLMs and one proprietary GPT-4V(ision). Even GPT-4V only achieves accuracies of 42%, indicating a large space for improvement. CMMMU will boost the community to build the next-generation LMMs towards expert artificial intelligence and promote the democratization of LMMs by providing diverse language contexts.

9/10/2024


MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

The capability to process multiple images is crucial for Large Vision-Language Models (LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi-image LVLMs have begun to address this need. However, their evaluation has not kept pace with their development. To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions, making it the most extensive benchmark of its kind. Our evaluation of 24 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension, particularly in tasks involving spatial understanding. Even the most advanced models, such as GPT-4o, achieve only 55.7% accuracy on MMIU. Through multi-faceted analytical experiments, we identify key performance gaps and limitations, providing valuable insights for future model and data improvements. We aim for MMIU to advance the frontier of LVLM research and development, moving us toward achieving sophisticated multimodal multi-image user interactions.

8/7/2024