M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

2405.15638

Published 5/27/2024 by Hongyu Wang, Jiayu Xu, Senwei Xie, Ruiping Wang, Jialin Li, Zhaojie Xie, Bin Zhang, Chuyan Xiong, Xilin Chen

cs.CV cs.CL

Abstract

Multilingual multimodal reasoning is a core component in achieving human-level intelligence. However, most existing benchmarks for multilingual multimodal reasoning struggle to differentiate between models of varying performance; even language models without visual capabilities can easily achieve high scores. This leaves a comprehensive evaluation of leading multilingual multimodal models largely unexplored. In this work, we introduce M4U, a novel and challenging benchmark for assessing the capability of multi-discipline multilingual multimodal understanding and reasoning. M4U contains 8,931 samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare in Chinese, English, and German. Using M4U, we conduct extensive evaluations of 21 leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools. The evaluation results show that the state-of-the-art model, GPT-4o, achieves only 47.6% average accuracy on M4U. Additionally, we observe that the leading LMMs exhibit significant language preferences. Our in-depth analysis indicates that leading LMMs, including GPT-4o, suffer performance degradation when prompted with cross-lingual multimodal questions, such as images with key textual information in Chinese while the question is in German. We believe that M4U can serve as a crucial tool for systematically evaluating LMMs based on their multilingual multimodal reasoning capabilities and monitoring their development. The homepage, code, and data are publicly available.

Overview

This paper introduces the M4U benchmark, a new evaluation framework for assessing the multilingual understanding and reasoning capabilities of large multimodal models.
The M4U benchmark contains 8,931 samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare, written in Chinese, English, and German.
The authors evaluate 21 leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools on M4U and analyze their strengths, weaknesses, and language preferences on multilingual multimodal questions.

Plain English Explanation

The paper presents a new benchmark called M4U that is designed to evaluate how well large AI models can understand and reason about questions that combine text and images, across several languages and many academic disciplines. M4U contains 8,931 samples spanning 64 disciplines in science, engineering, and healthcare, written in Chinese, English, and German.

The researchers tested 21 leading models on M4U to see how they performed. Even the strongest model, GPT-4o, answered only 47.6% of the questions correctly on average. The models also performed noticeably better in some languages than in others, and they did worse when the image and the question were in different languages, for example an image containing Chinese text paired with a question in German.

By creating a comprehensive evaluation framework like M4U, the researchers aim to drive progress in building AI systems that can truly understand and reason about information in a multilingual and multimodal way, which is an important capability for many real-world applications.

Technical Explanation

The paper introduces M4U, a benchmark for assessing multi-discipline multilingual multimodal understanding and reasoning in large multimodal models. It contains 8,931 samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare, written in Chinese, English, and German.

The authors evaluate 21 leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools on M4U. The strongest model, GPT-4o, reaches only 47.6% average accuracy. Leading LMMs also show significant language preferences, and their performance degrades on cross-lingual multimodal questions, for example when an image carries key textual information in Chinese while the question is asked in German.
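To make the notions of average accuracy and language preference concrete, here is a minimal sketch of how per-language accuracy could be computed from model predictions. The record fields (`language`, `prediction`, `answer`), the toy data, and the choice to average the per-language scores with equal weight are all assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return accuracy per language.

    `records` is an iterable of dicts with the hypothetical keys
    'language', 'prediction', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["language"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def average_accuracy(per_lang):
    """Unweighted mean over languages (an assumed aggregation)."""
    return sum(per_lang.values()) / len(per_lang)


# Toy predictions, invented purely for illustration.
records = [
    {"language": "en", "prediction": "A", "answer": "A"},
    {"language": "en", "prediction": "B", "answer": "C"},
    {"language": "zh", "prediction": "D", "answer": "D"},
    {"language": "zh", "prediction": "A", "answer": "A"},
    {"language": "de", "prediction": "B", "answer": "A"},
    {"language": "de", "prediction": "C", "answer": "C"},
]

scores = per_language_accuracy(records)
average = average_accuracy(scores)
# Gap between the best and worst language: a rough signal of language preference.
gap = max(scores.values()) - min(scores.values())

print(scores)
print(average)
print(gap)
```

A large gap between the best and worst language, relative to the average, is one simple way to surface the kind of language preference the abstract describes.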

The paper also discusses the importance of developing comprehensive evaluation frameworks like M4U to drive progress in multimodal large language models and support the development of real-world applications that require robust multilingual and multimodal understanding.

Critical Analysis

The M4U benchmark provides a valuable tool for evaluating the multilingual and multimodal capabilities of large multimodal models. By testing these models across a diverse range of disciplines and three languages, the authors have identified key limitations and areas for improvement.

One potential limitation is sample density: with 8,931 samples spread over 64 disciplines and three languages, some discipline-language subsets may be small, which could affect the reliability of fine-grained results. The benchmark also covers only Chinese, English, and German, so it may not fully capture the nuances of multilingual and multimodal understanding, and further research is needed to develop more comprehensive evaluation frameworks.

The performance of the tested models may also be influenced by factors such as pretraining data, model size, and fine-tuning strategies. Exploring these factors in more depth could provide additional insights into the strengths and weaknesses of different multimodal models.

Overall, the M4U benchmark represents an important step forward in assessing the multilingual and multimodal capabilities of AI systems. The insights gained from this research can inform the development of more robust and inclusive language models that can better serve the needs of a diverse global audience.

Conclusion

The M4U benchmark introduced in this paper provides a comprehensive framework for evaluating the multilingual understanding and reasoning capabilities of large multimodal models. The authors' evaluation of 21 leading models reveals significant challenges: the best model, GPT-4o, reaches only 47.6% average accuracy, leading LMMs show significant language preferences, and performance degrades on cross-lingual multimodal questions.

The findings from this research highlight the need for continued advancements in multimodal large language models to support the development of real-world applications that require robust multilingual and multimodal understanding. The M4U benchmark can serve as a valuable tool to drive progress in this direction and ensure that AI systems are capable of effectively communicating and reasoning across diverse languages and modalities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen

We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

6/14/2024

cs.CL cs.AI cs.CV

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu, Hua Huang

Multi-modal large language models (MLLMs) have achieved remarkable progress and demonstrated powerful knowledge comprehension and reasoning abilities. However, the mastery of domain-specific knowledge, which is essential for evaluating the intelligence of MLLMs, continues to be a challenge. Current multi-modal benchmarks for domain-specific knowledge concentrate on multiple-choice questions and are predominantly available in English, which imposes limitations on the comprehensiveness of the evaluation. To this end, we introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school. The questions can be categorized into 3 types: multiple-choice, multiple-response, and fill-in-the-blank, bringing greater challenges to MLLMs. In addition, we propose an evaluation strategy called Positional Error Variance for assessing multiple-choice questions. The strategy aims to perform a quantitative analysis of position bias. We evaluate seven open-source MLLMs along with GPT4-V, Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a significant challenge to the recent MLLMs. The data and code are available at https://github.com/FlagOpen/CMMU.

5/9/2024

cs.CL cs.AI cs.MM

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria

The rising popularity of multimodal large language models (MLLMs) has sparked a significant increase in research dedicated to evaluating these models. However, current evaluation studies predominantly concentrate on the ability of models to comprehend and reason within a unimodal (vision-only) context, overlooking critical performance evaluations in complex multimodal reasoning tasks that integrate both visual and text contexts. Furthermore, tasks that demand reasoning across multiple modalities pose greater challenges and require a deep understanding of multimodal contexts. In this paper, we introduce a comprehensive assessment framework named MM-InstructEval, which integrates a diverse array of metrics to provide an extensive evaluation of the performance of various models and instructions across a broad range of multimodal reasoning tasks with vision-text contexts. MM-InstructEval enhances the research on the performance of MLLMs in complex multimodal reasoning tasks, facilitating a more thorough and holistic zero-shot evaluation of MLLMs. We firstly utilize the Best Performance metric to determine the upper performance limit of each model across various datasets. The Mean Relative Gain metric provides an analysis of the overall performance across different models and instructions, while the Stability metric evaluates their sensitivity to variations. Historically, the research has focused on evaluating models independently or solely assessing instructions, overlooking the interplay between models and instructions. To address this gap, we introduce the Adaptability metric, designed to quantify the degree of adaptability between models and instructions. Evaluations are conducted on 31 models (23 MLLMs) across 16 multimodal datasets, covering 6 tasks, with 10 distinct instructions. The extensive analysis enables us to derive novel insights.

5/14/2024

cs.MM

Spanish and LLM Benchmarks: is MMLU Lost in Translation?

Irene Plaza, Nina Melero, Cristina del Pozo, Javier Conde, Pedro Reviriego, Marina Mayor-Rocher, María Grandury

The evaluation of Large Language Models (LLMs) is a key element in their continuous improvement process and many benchmarks have been developed to assess the performance of LLMs in different tasks and topics. As LLMs become adopted worldwide, evaluating them in languages other than English is increasingly important. However, most LLM benchmarks are simply translated using an automated tool and then run in the target language. This means that the results depend not only on the LLM performance in that language but also on the quality of the translation. In this paper, we consider the case of the well-known Massive Multitask Language Understanding (MMLU) benchmark. Selected categories of the benchmark are translated into Spanish using Azure Translator and ChatGPT4 and run on ChatGPT4. Next, the results are processed to identify the test items that produce different answers in Spanish and English. Those are then analyzed manually to understand if the automatic translation caused the change. The results show that a significant fraction of the failing items can be attributed to mistakes in the translation of the benchmark. These results make a strong case for improving benchmarks in languages other than English by at least revising the translations of the items and preferably by adapting the tests to the target language by experts.

6/27/2024

cs.CL cs.AI