Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models

Read original: arXiv:2408.12763 - Published 8/26/2024 by Jean Park, Kuk Jin Jang, Basam Alasaly, Sriharsha Mopidevi, Andrew Zolensky, Eric Eaton, Insup Lee, Kevin Johnson

Overview

Multimodal large language models (MLLMs) can process visual, textual, and auditory data to gain insights that complement human analysis.
Existing video question-answering (VidQA) datasets often exhibit a bias toward a single modality, despite the goal of requiring advanced reasoning skills that integrate diverse modalities.
This research introduces the modality importance score (MIS) to identify such bias and proposes a method using state-of-the-art MLLMs to estimate the modality importance.

Plain English Explanation

Large language models that can understand multiple types of data, like text, images, and audio, have the potential to provide insights that humans might miss. However, existing datasets used to test these models' abilities often focus more on a single type of information, even though the goal is to test how well the models can combine different types of information to answer questions.

The researchers in this paper developed a way to measure how important each type of information is for answering a particular question. They call this the "modality importance score" (MIS). By using the latest multimodal language models, they can estimate the MIS automatically, which can serve as a proxy for how humans perceive the importance of each modality.

Using the MIS, the researchers showed that many existing datasets have a bias towards a single type of information, rather than truly testing the models' ability to integrate diverse types of information. They also found that current models don't effectively combine information from different sources due to this modality imbalance in the datasets.

The researchers suggest that the MIS can be used to create more balanced datasets that will better assess multimodal learning and improve the capabilities of these powerful language models to understand the relationships between different types of information.

Technical Explanation

The researchers introduce the modality importance score (MIS) to quantify how strongly existing video question-answering (VidQA) datasets lean on a single modality, even though these datasets are meant to require reasoning that integrates several modalities.

They propose an innovative method using state-of-the-art multimodal large language models (MLLMs) to estimate the modality importance, which can serve as a proxy for human judgments of modality perception. By applying this MIS analysis, the researchers demonstrate the presence of unimodal bias and the scarcity of genuinely multimodal questions in existing datasets.
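To make the idea of a per-question modality score concrete, here is a minimal sketch. It is not the paper's formula: it assumes, purely for illustration, that a modality's importance is the average change in answer correctness when that modality is added to each subset of the other modalities. The function name `modality_importance` and the `answer_correct` callback (standing in for a query to an MLLM) are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Sequence

# Hypothetical callback: receives the set of modalities shown to the model and
# returns 1.0 if the question is answered correctly, else 0.0.
AnswerFn = Callable[[FrozenSet[str]], float]


def modality_importance(
    modalities: Sequence[str], answer_correct: AnswerFn
) -> Dict[str, float]:
    """Average correctness gain from adding each modality (illustrative only)."""
    scores: Dict[str, float] = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        gains = []
        # Enumerate every subset of the remaining modalities, including the empty one.
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                without = frozenset(subset)
                with_m = without | {m}
                gains.append(answer_correct(with_m) - answer_correct(without))
        scores[m] = sum(gains) / len(gains)
    return scores


if __name__ == "__main__":
    # Toy question that can be answered from the subtitles alone.
    subtitle_only = lambda shown: 1.0 if "subtitle" in shown else 0.0
    print(modality_importance(["video", "subtitle"], subtitle_only))
    # {'video': 0.0, 'subtitle': 1.0}
```

Under this assumed definition, a question that needs both modalities at once would score 0.5 for each, while a subtitle-only question puts all the weight on subtitles, which is the kind of unimodal bias the MIS is meant to expose.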

The researchers further validate the MIS with several ablation studies that measure how MLLMs perform on permuted feature sets. Their results indicate that current models do not effectively integrate information across modalities, which the authors attribute to modality imbalance in existing datasets.
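A permutation ablation of this kind can be sketched as follows. This is an assumed setup, not the authors' code: each example is a hypothetical `(video_feature, subtitle_feature, answer)` tuple with strings standing in for real features, and the permutation is a fixed rotation so the result is reproducible. If accuracy barely changes after one modality is permuted, the model was not relying on that modality.

```python
from typing import Callable, List, Tuple

# (video_feature, subtitle_feature, answer); strings stand in for real features.
Example = Tuple[str, str, str]
PredictFn = Callable[[str, str], str]


def accuracy(predict: PredictFn, examples: List[Example]) -> float:
    """Fraction of examples whose prediction matches the reference answer."""
    correct = sum(predict(v, s) == a for v, s, a in examples)
    return correct / len(examples)


def permute_modality(
    examples: List[Example], index: int, offset: int = 1
) -> List[Example]:
    """Rotate one feature column across examples, leaving the rest untouched."""
    column = [ex[index] for ex in examples]
    k = offset % len(column)
    rotated = column[k:] + column[:k]
    permuted = []
    for ex, new_value in zip(examples, rotated):
        fields = list(ex)
        fields[index] = new_value
        permuted.append(tuple(fields))
    return permuted


def permutation_drop(predict: PredictFn, examples: List[Example], index: int) -> float:
    """Accuracy lost when the feature column at `index` is permuted."""
    return accuracy(predict, examples) - accuracy(
        predict, permute_modality(examples, index)
    )


if __name__ == "__main__":
    data = [("v0", "s:a", "a"), ("v1", "s:b", "b"), ("v2", "s:c", "c")]
    # Toy model that ignores video and reads the answer from the subtitle.
    reads_subtitle = lambda video, subtitle: subtitle.split(":")[1]
    print(permutation_drop(reads_subtitle, data, index=0))  # 0.0
    print(permutation_drop(reads_subtitle, data, index=1))  # 1.0
```

A random shuffle would be the more common choice in practice; the rotation is used here only so that no example keeps its own feature and the toy output is deterministic.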

The proposed MLLM-derived MIS can guide the curation of modality-balanced datasets that advance multimodal learning and enhance MLLMs' capabilities to understand and utilize synergistic relations across modalities.
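One way such curation might work is sketched below. The category names, the 0.5 threshold, and the `categorize` and `balance` helpers are all assumptions made for illustration; the summary does not specify how the authors would build a balanced dataset. The sketch takes per-question modality scores as given, buckets questions by how many modalities matter, and keeps an equal number from each bucket.

```python
from collections import defaultdict
from typing import Dict, List


def categorize(scores: Dict[str, float], threshold: float = 0.5) -> str:
    """Label a question by which modalities reach the (assumed) threshold."""
    important = sorted(m for m, s in scores.items() if s >= threshold)
    if not important:
        return "agnostic"
    if len(important) == 1:
        return f"unimodal:{important[0]}"
    return "multimodal"


def balance(
    questions: Dict[str, Dict[str, float]], threshold: float = 0.5
) -> List[str]:
    """Keep the same number of question IDs from every observed category."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for qid, scores in questions.items():
        buckets[categorize(scores, threshold)].append(qid)
    if not buckets:
        return []
    # The smallest category sets how many questions each category contributes.
    k = min(len(ids) for ids in buckets.values())
    return sorted(qid for ids in buckets.values() for qid in ids[:k])


if __name__ == "__main__":
    questions = {
        "q1": {"video": 1.0, "subtitle": 0.0},
        "q2": {"video": 1.0, "subtitle": 0.0},
        "q3": {"video": 0.5, "subtitle": 0.5},
    }
    print(balance(questions))  # ['q1', 'q3']
```

Truncating to the smallest category is the simplest balancing rule; a real pipeline would more likely resample or write new questions for the scarce multimodal category rather than discard data.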

Critical Analysis

The researchers acknowledge several caveats and limitations in their work. They note that the MIS estimation may not fully capture human perception of modality importance, as their method relies on the performance of current MLLMs, which may not perfectly align with human judgments.

Additionally, the researchers do not provide a detailed analysis of the specific types of unimodal bias present in the examined datasets, which could limit the insights gained from their findings.

Further research could explore alternative methods for assessing modality importance, potentially involving more direct human evaluation or the development of novel MLLM architectures specifically designed for multimodal integration.

Despite these limitations, the modality importance score represents a valuable tool for identifying and mitigating biases in multimodal datasets, which is a crucial step in advancing the field of multimodal learning and enhancing the capabilities of large language models.

Conclusion

This research introduces the modality importance score (MIS) to quantify how strongly existing video question-answering (VidQA) datasets favor a single modality, even though they are intended to require reasoning that integrates several modalities.

By using state-of-the-art multimodal large language models (MLLMs) to estimate the MIS, the researchers demonstrate the presence of unimodal bias and the scarcity of genuinely multimodal questions in current datasets.
