UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models

2310.10942

Published 4/16/2024 by Yangyang Guo, Fangkai Jiao, Zhiqi Shen, Liqiang Nie, Mohan Kankanhalli

Abstract

Teaching Visual Question Answering (VQA) models to refrain from answering unanswerable questions is necessary for building a trustworthy AI system. Existing studies have explored various aspects of VQA but have largely ignored this particular attribute. This paper aims to bridge the research gap by contributing a comprehensive dataset, called UNK-VQA. The dataset is specifically designed to address the challenge of questions that models do not know. To this end, we first augment the existing data via deliberate perturbations on either the image or question. Specifically, we carefully ensure that the question-image semantics remain close to the original unperturbed distribution. By this means, the identification of unanswerable questions becomes challenging, setting our dataset apart from others that involve mere image replacement. We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models and discover their significant limitations when applied to our dataset. We also propose a straightforward method to tackle these unanswerable questions. This dataset, we believe, will serve as a valuable benchmark for enhancing the abstention capability of VQA models, thereby leading to increased trustworthiness of AI systems. We have made the dataset (https://github.com/guoyang9/UNK-VQA) available to facilitate further exploration in this area.

Overview

This paper proposes a solution for addressing unanswerable questions in visual question answering (VQA) tasks.
It introduces the UNK-VQA dataset, which contains images and questions where the correct answer is "unknown" or unanswerable.
The researchers developed a model that can distinguish between answerable and unanswerable questions, allowing it to provide more reliable responses.

Plain English Explanation

Visual question answering (VQA) is a task where an AI system is given an image and a question about that image, and it has to provide the correct answer. However, sometimes the question may not have enough information in the image to be answered reliably. This paper explores the problem of unanswerable questions in VQA.

The researchers created a new dataset called UNK-VQA, which contains images and questions where the correct answer is "unknown" or unanswerable based on the information in the image. They developed a model that can distinguish between answerable and unanswerable questions, so it can provide a more reliable response. This work builds on previous efforts to enhance visual question answering through question-driven approaches.

The key idea is to teach the model to recognize when a question cannot be answered, rather than guessing an answer that may be incorrect. This can improve the overall reliability and usefulness of VQA systems in real-world applications, where providing uncertain or incorrect answers could be problematic.
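
One simple way to picture "recognizing when a question cannot be answered" is a confidence threshold on the model's answer distribution. The sketch below is only an illustration of that general idea and is not the method from the paper; the function names, the candidate answer list, and the threshold of 0.5 are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def answer_or_abstain(logits, answers, threshold=0.5):
    """Return the top answer, or "unknown" if the model is not confident.

    `threshold` is a hypothetical tuning knob, not a value from the paper.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown"  # abstain instead of guessing
    return answers[best]

candidates = ["red", "blue", "green"]
print(answer_or_abstain([5.0, 0.0, 0.0], candidates))   # confident case
print(answer_or_abstain([0.1, 0.0, 0.05], candidates))  # near-uniform case
```

A higher threshold makes the system abstain more often, trading coverage for reliability.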

Technical Explanation

The researchers first built the UNK-VQA dataset by augmenting existing VQA data: they deliberately perturbed either the image or the question so that the question became unanswerable from the image content, while keeping the question-image semantics close to the original distribution. They also implemented quality control measures to ensure the dataset was reliable.

They then developed a VQA model that could classify questions as answerable or unanswerable. The model used a multi-task learning approach, where it was trained to both answer questions and predict whether a question was answerable or not. This builds on previous work in [multi-image VQA](https://aimodels.fyi/papers/arxiv/multi-image-visual-question-answering-unsupervised-anomaly) and [compact neural network architectures for VQA](https://aimodels.fyi/papers/arxiv/tinyvqa-compact-multimodal-deep-neural-network-visual).
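
A multi-task objective of the kind described above is commonly written as the sum of an answer-classification loss and an answerability loss. The following is a minimal sketch under that assumption; the exact loss used in the paper is not given in this summary, and the function names, the `weight` parameter, and the choice to skip the answer loss for unanswerable questions are all hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct answer."""
    return -math.log(max(probs[target_index], EPS))

def binary_cross_entropy(p_answerable, is_answerable):
    """Loss for the answerable / unanswerable prediction head."""
    p = p_answerable if is_answerable else 1.0 - p_answerable
    return -math.log(max(p, EPS))

def multitask_loss(answer_probs, answer_index, p_answerable,
                   is_answerable, weight=1.0):
    """Combine both tasks; `weight` (assumed) balances the two terms."""
    loss = weight * binary_cross_entropy(p_answerable, is_answerable)
    if is_answerable:
        # Only answerable questions have a ground-truth answer to learn from.
        loss += cross_entropy(answer_probs, answer_index)
    return loss

# Answerable example: both terms contribute.
print(multitask_loss([0.7, 0.2, 0.1], 0, p_answerable=0.9, is_answerable=True))
# Unanswerable example: only the answerability term contributes.
print(multitask_loss([0.4, 0.3, 0.3], None, p_answerable=0.2, is_answerable=False))
```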

The model was evaluated on the UNK-VQA dataset, as well as the standard VQA v2 dataset. The results showed that the model could effectively distinguish between answerable and unanswerable questions, leading to more reliable overall performance.
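
How well a model "distinguishes answerable from unanswerable questions" can be quantified with precision, recall, and F1 on the unanswerable class. The sketch below shows one such computation; it is an assumed evaluation setup for illustration, not necessarily the metric reported in the paper, and the function name and sample data are made up.

```python
def abstention_metrics(predicted_abstain, truly_unanswerable):
    """Precision, recall, and F1 for detecting unanswerable questions.

    Both arguments are equal-length lists of booleans, one per question.
    """
    pairs = list(zip(predicted_abstain, truly_unanswerable))
    tp = sum(1 for p, t in pairs if p and t)        # correctly abstained
    fp = sum(1 for p, t in pairs if p and not t)    # abstained needlessly
    fn = sum(1 for p, t in pairs if not p and t)    # answered when it should not
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical predictions for four questions.
print(abstention_metrics([True, True, False, False],
                         [True, False, True, False]))
```

Low recall here means the model still guesses on many unanswerable questions, while low precision means it refuses questions it could have answered.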

Critical Analysis

The researchers acknowledge that their approach relies on the quality and coverage of the UNK-VQA dataset, which may not capture all possible types of unanswerable questions. There may be edge cases or novel question types that the model struggles to identify as unanswerable.

Additionally, the paper does not explore how the model's ability to recognize unanswerable questions could be used to improve the overall VQA system. For example, the model could potentially provide explanations or clarifications to users when a question cannot be answered, rather than just returning an "unknown" response.

Further research could explore how to integrate this capability into more comprehensive VQA systems, or how to generalize the approach to other multimodal tasks where unanswerable inputs may be a concern.

Conclusion

This paper presents a novel approach to addressing the problem of unanswerable questions in visual question answering tasks. By developing a model that can classify questions as answerable or unanswerable, the researchers have taken an important step towards building more reliable and trustworthy VQA systems.

The introduction of the UNK-VQA dataset and the demonstrated performance of the model on both the UNK-VQA and standard VQA datasets suggest that this approach has the potential to improve the overall quality and usefulness of VQA technologies in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models

Manas Jhalani, Annervaz K M, Pushpak Bhattacharyya

In the realm of multimodal tasks, Visual Question Answering (VQA) plays a crucial role by addressing natural language questions grounded in visual content. Knowledge-Based Visual Question Answering (KBVQA) advances this concept by adding external knowledge along with images to respond to questions. We introduce an approach for KBVQA, augmenting the existing vision-language transformer encoder-decoder (OFA) model. Our main contribution involves enhancing questions by incorporating relevant external knowledge extracted from knowledge graphs, using a dynamic triple extraction method. We supply a flexible number of triples from the knowledge graph as context, tailored to meet the requirements for answering the question. Our model, enriched with knowledge, demonstrates an average improvement of 4.75% in Exact Match Score over the state-of-the-art on three different KBVQA datasets. Through experiments and analysis, we demonstrate that furnishing variable triples for each question improves the reasoning capabilities of the language model in contrast to supplying a fixed number of triples. This is illustrated even for recent large language models. Additionally, we highlight the model's generalization capability by showcasing its SOTA-beating performance on a small dataset, achieved through straightforward fine-tuning.

6/17/2024

cs.CL

Robust Few-shot Transfer Learning for Knowledge Base Question Answering with Unanswerable Questions

Riya Sawhney, Indrajit Bhattacharya, Mausam

Real-world KBQA applications require models that are (1) robust -- e.g., can differentiate between answerable and unanswerable questions, and (2) low-resource -- do not require large training data. Towards this goal, we propose the novel task of few-shot transfer for KBQA with unanswerable questions. We present FUn-FuSIC that extends the state-of-the-art (SoTA) few-shot transfer model for answerable-only KBQA to handle unanswerability. It iteratively prompts an LLM to generate logical forms for the question by providing feedback using a diverse suite of syntactic, semantic and execution guided checks, and adapts self-consistency to assess confidence of the LLM to decide answerability. Experiments over newly constructed datasets show that FUn-FuSIC outperforms suitable adaptations of the SoTA model for KBQA with unanswerability, and the SoTA model for answerable-only few-shot-transfer KBQA.

6/21/2024

cs.CL cs.AI

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song, Henok Biadglign Ademtew, Hernán Maina, Holy Lovenia, Israel Abebe Azime, Jan Christian Blaise Cruz, Jay Gala, Jiahui Geng, Jesus-German Ortiz-Barajas, Jinheon Baek, Jocelyn Dunstan, Laura Alonso Alemany, Kumaranage Ravindu Yasas Nagasinghe, Luciana Benotti, Luis Fernando D'Haro, Marcelo Viridiano, Marcos Estecha-Garitagoitia, Maria Camila Buitrago Cabrera, Mario Rodríguez-Cantelar, Mélanie Jouitteau, Mihail Mihaylov, Mohamed Fazli Mohamed Imam, Muhammad Farid Adilazuarda, Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Naome Etori, Olivier Niyomugisha, Paula Mónica Silva, Pranjal Chitale, Raj Dabre, Rendi Chevi, Ruochen Zhang, Ryandito Diandaru, Samuel Cahyawijaya, Santiago Góngora, Soyeong Jeong, Sukannya Purkayastha, Tatsuki Kuribayashi, Thanmay Jayakumar, Tiago Timponi Torrent, Toqeer Ehsan, Vladimir Araujo, Yova Kementchedjhieva, Zara Burzo, Zheng Wei Lim, Zheng Xin Yong, Oana Ignat, Joan Nwatu, Rada Mihalcea, Thamar Solorio, Alham Fikri Aji

Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.

6/11/2024

cs.CV cs.AI cs.CL cs.LG

RetinaQA: A Robust Knowledge Base Question Answering Model for both Answerable and Unanswerable Questions

Prayushi Faldu, Indrajit Bhattacharya, Mausam

An essential requirement for a real-world Knowledge Base Question Answering (KBQA) system is the ability to detect answerability of questions when generating logical forms. However, state-of-the-art KBQA models assume all questions to be answerable. Recent research has found that such models, when superficially adapted to detect answerability, struggle to satisfactorily identify the different categories of unanswerable questions, and simultaneously preserve good performance for answerable questions. Towards addressing this issue, we propose RetinaQA, a new KBQA model that unifies two key ideas in a single KBQA architecture: (a) discrimination over candidate logical forms, rather than generating these, for handling schema-related unanswerability, and (b) sketch-filling-based construction of candidate logical forms for handling data-related unanswerability. Our results show that RetinaQA significantly outperforms adaptations of state-of-the-art KBQA models in handling both answerable and unanswerable questions and demonstrates robustness across all categories of unanswerability. Notably, RetinaQA also sets a new state-of-the-art for answerable KBQA, surpassing existing models.

6/18/2024

cs.CL