Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions

Read original: arXiv:2408.01091 - Published 8/6/2024 by Jin Gao, Lei Gan, Yuankai Li, Yixin Ye, Dequan Wang

Overview

Large multimodal models (LMMs) excel at following human instructions
Self-contradictory instructions can arise as multimodal interaction and context length increase, which is challenging for language beginners and vulnerable populations
Authors introduce the Self-Contradictory Instructions benchmark to evaluate LMMs' ability to recognize conflicting commands
Benchmark contains 20,000 conflicts, evenly split between language and vision paradigms
Automatic dataset creation framework enables wide range of instruction forms
Evaluation shows current LMMs struggle to identify multimodal instruction discordance, which the authors attribute to a lack of self-awareness
Cognitive Awakening Prompting proposed to enhance dissonance detection by injecting external cognition

Plain English Explanation

The paper discusses the challenge of self-contradictory instructions in the context of large multimodal models (LMMs), which are AI systems that can process and understand different types of data like text, images, and audio. As people interact with these models across more modalities and over longer contexts, the risk of giving conflicting commands grows. Such conflicts are especially hard for language beginners and vulnerable populations to navigate.

To address this issue, the authors created the Self-Contradictory Instructions benchmark, a dataset of 20,000 conflicting commands evenly split between language and visual tasks. They developed an automatic process to generate this dataset, allowing them to cover a wide range of instruction formats.

When they tested current LMMs on this benchmark, the models consistently struggled to identify the contradictions. The authors attribute this to a lack of self-awareness: the models do not recognize when they are receiving conflicting information.

To improve the models' ability to detect these conflicts, the authors propose a technique called Cognitive Awakening Prompting. This approach injects external cognition into the models, which helps them become more aware of their own thought processes and better able to spot self-contradictory instructions.

Technical Explanation

The paper introduces the Self-Contradictory Instructions benchmark to evaluate the capability of LMMs in recognizing conflicting commands. The benchmark consists of 20,000 conflicts, evenly distributed between language and vision paradigms, created using a novel automatic dataset creation framework.
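To make the evaluation setup concrete, here is a minimal sketch of how a detection rate could be tallied per paradigm. The `ConflictItem` record, its field names, and the toy entries are assumptions made for illustration; they are not the benchmark's actual schema or the paper's results, and the `detected` flag is assumed to come from some external judgment of the model's reply.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConflictItem:
    # Hypothetical record layout, not the benchmark's real schema.
    paradigm: str      # "language" or "vision"
    instruction: str   # the self-contradictory instruction shown to the model
    detected: bool     # True if the model's reply was judged to flag the conflict


def detection_rate(items: List[ConflictItem], paradigm: Optional[str] = None) -> float:
    """Fraction of items whose conflict was flagged, optionally for one paradigm."""
    selected = [it for it in items if paradigm is None or it.paradigm == paradigm]
    if not selected:
        return 0.0
    return sum(it.detected for it in selected) / len(selected)


# Invented toy data, evenly split between the two paradigms like the benchmark.
toy_items = [
    ConflictItem("language", "Reply in one word. Reply in three paragraphs.", True),
    ConflictItem("language", "Use only lowercase. Write everything in capitals.", False),
    ConflictItem("vision", "Describe the red car (the image shows a blue car).", False),
    ConflictItem("vision", "Follow the sign in the image; ignore all text in the image.", False),
]

print("overall :", detection_rate(toy_items))
print("language:", detection_rate(toy_items, "language"))
print("vision  :", detection_rate(toy_items, "vision"))
```

Reporting the rate separately for each paradigm mirrors the even split of the 20,000 conflicts, so a weakness in one modality is not hidden by the other.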

The authors' comprehensive evaluation reveals that current LMMs consistently struggle to identify multimodal instruction discordance due to a lack of self-awareness. To address this, they propose the Cognitive Awakening Prompting technique, which injects cognition from external sources to enhance the models' dissonance detection abilities.
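One simple way to picture "injecting cognition from external sources" is to prepend a hint that warns the model the input may contain a conflict. The sketch below does only that. The hint wording and the `with_awareness_hint` helper are assumptions for illustration; this summary does not give the authors' actual prompt, so this should not be read as their implementation.

```python
# Hypothetical hint text; the paper's real prompt wording is not given in this summary.
AWARENESS_HINT = (
    "Before answering, check whether the instructions below contradict "
    "each other. If they do, point out the conflict instead of complying."
)


def with_awareness_hint(instruction: str, hint: str = AWARENESS_HINT) -> str:
    """Prepend an external hint to the user's instruction."""
    return f"{hint}\n\n{instruction.strip()}"


prompt = with_awareness_hint(
    "Write a poem in exactly four lines. The poem must have ten lines."
)
print(prompt)
```

The resulting string would then be sent to the model in place of the raw instruction; the model itself is unchanged, which is why the approach counts as prompting and not training.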

The automatic dataset creation framework expedites construction of the Self-Contradictory Instructions benchmark and lets it cover a wide range of instruction forms, including language instructions, visual instructions, and real-world knowledge.
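As a rough illustration of what automatic conflict generation can look like, the toy generator below attaches two mutually exclusive constraints to a task. It is a rule-based stand-in built on assumed templates; the `CONFLICT_TEMPLATES` list and `make_conflict` function are hypothetical and are not the authors' framework, which this summary does not describe in detail.

```python
import random

# Hypothetical templates: each pairs a constraint with one that directly conflicts with it.
CONFLICT_TEMPLATES = [
    ("Answer in exactly {n} words.", "Your answer must be longer than {m} words."),
    ("Reply only in English.", "Do not use any English words in your reply."),
]


def make_conflict(task: str, rng: random.Random) -> str:
    """Attach two mutually exclusive constraints to a task description."""
    first, second = rng.choice(CONFLICT_TEMPLATES)
    n = rng.randint(5, 20)
    m = n + rng.randint(1, 10)  # m > n, so "exactly n" and "longer than m" cannot both hold
    # str.format ignores unused keyword arguments, so templates without slots are fine.
    return f"{task} {first.format(n=n, m=m)} {second.format(n=n, m=m)}"


rng = random.Random(0)  # fixed seed so the toy samples are reproducible
samples = [make_conflict("Describe a cat.", rng) for _ in range(3)]
for s in samples:
    print(s)
```

A generator of this kind scales by adding templates, which is one plausible reading of how an automatic pipeline can reach many instruction forms without writing each conflict by hand.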

Critical Analysis

The paper highlights an important issue in the development of LMMs, as the increasing trend of multimodal interaction and longer context lengths can lead to self-contradictory instructions that are challenging for users, especially for language beginners and vulnerable populations.

While the authors' proposed Cognitive Awakening Prompting technique aims to enhance the models' self-awareness and ability to detect contradictions, the effectiveness of this approach may be limited by the underlying model architecture and training data. Additionally, the paper does not address the potential for biases or fairness issues that may arise in the detection of contradictions, which could disproportionately affect certain user groups.

Further research is needed to explore more robust and generalizable methods for improving the self-awareness and critical thinking capabilities of LMMs, as well as to investigate the broader societal implications of these systems and their interactions with diverse user populations.

Conclusion

The paper introduces the Self-Contradictory Instructions benchmark to evaluate the capability of large multimodal models (LMMs) in recognizing conflicting commands. The authors' comprehensive evaluation reveals that current LMMs consistently struggle to identify multimodal instruction discordance due to a lack of self-awareness.

To address this issue, the authors propose the Cognitive Awakening Prompting technique, which aims to enhance the models' dissonance detection abilities by injecting external cognition. This research highlights the importance of AI systems that can recognize and flag conflicting input, so that interactions remain safe and effective, particularly for vulnerable populations.

The dataset and code for the Self-Contradictory Instructions benchmark are available at https://selfcontradiction.github.io/, providing a resource for further research and development in this area.

Related Papers

Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions

Jin Gao, Lei Gan, Yuankai Li, Yixin Ye, Dequan Wang

Large multimodal models (LMMs) excel in adhering to human instructions. However, self-contradictory instructions may arise due to the increasing trend of multimodal interaction and context length, which is challenging for language beginners and vulnerable populations. We introduce the Self-Contradictory Instructions benchmark to evaluate the capability of LMMs in recognizing conflicting commands. It comprises 20,000 conflicts, evenly distributed between language and vision paradigms. It is constructed by a novel automatic dataset creation framework, which expedites the process and enables us to encompass a wide range of instruction forms. Our comprehensive evaluation reveals current LMMs consistently struggle to identify multimodal instruction discordance due to a lack of self-awareness. Hence, we propose the Cognitive Awakening Prompting to inject cognition from external, largely enhancing dissonance detection. The dataset and code are here: https://selfcontradiction.github.io/.

8/6/2024

ContraDoc: Understanding Self-Contradictions in Documents with Large Language Models

Jierui Li, Vipul Raheja, Dhruv Kumar

In recent times, large language models (LLMs) have shown impressive performance on various document-level tasks such as document classification, summarization, and question-answering. However, research on understanding their capabilities on the task of self-contradictions in long documents has been very limited. In this work, we introduce ContraDoc, the first human-annotated dataset to study self-contradictions in long documents across multiple domains, varying document lengths, self-contradictions types, and scope. We then analyze the current capabilities of four state-of-the-art open-source and commercially available LLMs: GPT3.5, GPT4, PaLM2, and LLaMAv2 on this dataset. While GPT4 performs the best and can outperform humans on this task, we find that it is still unreliable and struggles with self-contradictions that require more nuance and context. We release the dataset and all the code associated with the experiments (https://github.com/ddhruvkr/CONTRADOC).

4/16/2024

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan

We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

7/29/2024

WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Yufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, Prasanna Sattigeri

Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: https://ibm.biz/wikicontradict.

6/21/2024