AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

2406.09295

Published 6/17/2024 by Yuhang Wu, Wenmeng Yu, Yean Cheng, Yan Wang, Xiaohan Zhang, Jiazheng Xu, Ming Ding, Yuxiao Dong

Abstract

Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, a comprehensive alignment benchmark specifically designed for emerging Chinese VLMs. This benchmark is meticulously curated from real-world scenarios and Chinese Internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios. Incorporating a prompt rewrite strategy, AlignMMBench encompasses 1,054 images and 4,978 question-answer pairs. To facilitate the evaluation pipeline, we propose CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4's evaluation ability. Finally, we report the performance of representative VLMs on AlignMMBench, offering insights into the capabilities and limitations of different VLM architectures. All evaluation codes and data are available on https://alignmmbench.github.io.

Overview

  • The paper introduces AlignMMBench, a new benchmark for evaluating Chinese multimodal alignment in large vision-language models.
  • The benchmark aims to assess how well these models can align visual and textual information in the Chinese language domain.
  • It comprises thirteen tasks across three categories, covers both single-turn and multi-turn dialogue, and contains 1,054 images and 4,978 question-answer pairs.

Plain English Explanation

The researchers have created a new tool called AlignMMBench to test how well large AI models that work with both images and text can understand the relationship between visual information and text in the Chinese language. These models, known as vision-language models, are trained on massive amounts of data that includes both images and the text descriptions that go with them.

The key goal of AlignMMBench is to evaluate how well these models can "align", or match up, the visual information in an image with the corresponding text in Chinese-language data. This matters because it lets us assess how these models handle Chinese multimodal (image + text) data, which has been studied far less than English.

The benchmark includes a variety of tasks that test different aspects of multimodal understanding: thirteen tasks in three categories, posed as open-ended single-turn and multi-turn dialogues rather than yes-no or multiple-choice questions. By evaluating model performance across this diverse set of tasks, the researchers aim to get a comprehensive picture of the strengths and limitations of current Chinese multimodal AI systems.

Technical Explanation

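The paper introduces AlignMMBench, a new benchmark for evaluating Chinese multimodal alignment in large vision-language models. The benchmark is curated from real-world scenarios and Chinese Internet sources and spans thirteen tasks across three categories, in both single-turn and multi-turn dialogue settings. Using a prompt rewrite strategy, it reaches 1,054 images and 4,978 question-answer pairs. Responses are scored by CritiqueVLM, a rule-calibrated evaluator that the authors report exceeds GPT-4's evaluation ability.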

The tasks are designed to assess how well these models can align visual and textual information in the Chinese language domain, complementing existing benchmarks like MVBench and VisualWebBench which focus on other aspects of multimodal understanding.

The paper describes the dataset curation process, evaluation metrics, and baseline model performance on the benchmark. Experiments show that current state-of-the-art vision-language models exhibit room for improvement on Chinese multimodal alignment tasks, suggesting further research is needed in this area.
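As a rough illustration of how per-question evaluator scores might be rolled up into the task-level and category-level results that a benchmark of this kind reports, here is a minimal Python sketch. The record layout, the category and task names, and the 1-10 score scale are assumptions made for illustration; they are not taken from the paper or its released code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (category, task, evaluator score on an assumed 1-10 scale).
records = [
    ("category_a", "task_1", 7.0),
    ("category_a", "task_1", 5.0),
    ("category_a", "task_2", 8.0),
    ("category_b", "task_3", 4.0),
    ("category_b", "task_3", 6.0),
]


def aggregate(records):
    """Average scores per task, then average the task means per category."""
    by_task = defaultdict(list)
    for category, task, score in records:
        by_task[(category, task)].append(score)
    task_means = {key: mean(scores) for key, scores in by_task.items()}

    by_category = defaultdict(list)
    for (category, _task), task_mean in task_means.items():
        by_category[category].append(task_mean)
    category_means = {c: mean(v) for c, v in by_category.items()}

    # Overall score: unweighted mean over tasks, so large tasks do not dominate.
    overall = mean(task_means.values())
    return task_means, category_means, overall


task_means, category_means, overall = aggregate(records)
print(category_means, round(overall, 2))
```

Averaging task means rather than raw scores is one possible design choice; it keeps a task with many questions from outweighing a smaller one. A benchmark may weight things differently.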

Critical Analysis

The researchers acknowledge several limitations of the AlignMMBench dataset and evaluation. The dataset, while substantial, may not fully capture the diversity of Chinese multimodal data on the open web. Additionally, the benchmark focuses on static images, leaving out dynamic multimodal understanding tasks such as video understanding.

The paper also notes that the benchmark evaluates models in a decontextualized setting, whereas real-world multimodal reasoning often relies on broader contextual cues. Extending the benchmark to more open-ended, grounded multimodal tasks could provide additional insights.

While the baseline model results provide a useful starting point, further analysis of model failure cases and cross-lingual transfer capabilities would help identify key challenges for Chinese multimodal alignment. Incorporating human evaluation and qualitative assessment could also complement the quantitative metrics.
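One simple way to relate an automatic evaluator to human judgment is to score the same responses both ways and measure agreement. The sketch below computes mean absolute error and Pearson correlation between two score lists. The scores are made-up illustrative values and the function names are hypothetical; this is not the paper's evaluation protocol.

```python
from statistics import mean


def mean_absolute_error(auto, human):
    """Average absolute gap between automatic and human scores."""
    if len(auto) != len(human) or not auto:
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(abs(a - h) for a, h in zip(auto, human))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (vx * vy) ** 0.5


# Made-up scores for five responses (assumed 1-10 scale).
auto_scores = [7, 5, 8, 4, 6]
human_scores = [8, 5, 7, 3, 6]

mae = mean_absolute_error(auto_scores, human_scores)
r = pearson(auto_scores, human_scores)
print(f"MAE={mae:.2f}, Pearson r={r:.3f}")
```

A low error together with a high correlation would suggest the automatic scores track human ratings on that sample; qualitative review of the disagreements would still be needed to understand why they occur.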

Conclusion

Overall, AlignMMBench provides an important new resource for assessing Chinese multimodal alignment in large vision-language models. By surfacing performance gaps in this domain, the benchmark can help drive further research and development to improve the multimodal understanding of AI systems working with Chinese language and visual data. As multimodal AI continues to advance, such targeted benchmarks will be crucial for ensuring these models work robustly across diverse linguistic and cultural contexts.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin

Large vision-language models have recently achieved remarkable progress, exhibiting great perception and reasoning abilities concerning visual information. However, how to effectively evaluate these large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of two elements. The first element is a meticulously curated dataset that surpasses existing similar benchmarks in terms of the number and variety of evaluation questions and abilities. The second element introduces a novel CircularEval strategy and incorporates the use of ChatGPT. This implementation is designed to convert free-form predictions into pre-defined choices, thereby facilitating a more robust evaluation of the model's predictions. MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will assist the research community in better evaluating their models and encourage future advancements in this domain. Project page: https://opencompass.org.cn/mmbench.

4/30/2024

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, Lichao Sun

Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in Pair Comparison, there is a significant divergence from human preferences in Scoring Evaluation and Batch Ranking. Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators. In light of this, we advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges. The code and dataset are publicly available at our project homepage: https://mllm-judge.github.io/.

6/12/2024

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen

The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations that include both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding. The evaluation code of MMBench-Video will be integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.

6/21/2024

Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models

Andrés Villa, Juan Carlos León Alcázar, Alvaro Soto, Bernard Ghanem

Large Vision and Language Models have enabled significant advances in fully supervised and zero-shot visual tasks. These large architectures serve as the baseline to what is currently known as Instruction Tuning Large Vision and Language models (IT-LVLMs). IT-LVLMs are general-purpose multi-modal assistants whose responses are modulated by natural language instructions and visual data. Despite this versatility, IT-LVLM effectiveness in fundamental computer vision problems remains unclear, primarily due to the absence of a standardized evaluation benchmark. This paper introduces a Multi-modal Evaluation Benchmark named MERLIM, a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal hallucination events in IT-LVLMs. Our results bring important insights on the performance of state-of-the-art IT-LVLMs including limitations at identifying fine-grained visual concepts, object hallucinations across tasks, and biases towards the language query. Our findings also suggest that these models have weak visual grounding, but manage to make adequate guesses from global visual patterns or language biases contained in the LLM component.

6/13/2024