Towards Multilingual Audio-Visual Question Answering

2406.09156

Published 6/14/2024 by Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma

cs.LG cs.CV cs.MM cs.SD eess.AS

Abstract

In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly revolved around English, and replicating it for other languages requires a substantial allocation of resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets for eight languages created from existing benchmark AVQA datasets. This avoids the extra human annotation effort of collecting questions and answers manually. To this end, we propose the MERA framework, which leverages state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models, namely MERA-L, MERA-C, and MERA-T, with varied model architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference benchmark for future work in multilingual AVQA.

Overview

This paper explores the development of a multilingual audio-visual question answering (MAVQA) system, which aims to enable users to ask questions about video content in their native language and receive answers in the same language.
The key idea is to leverage both audio and visual information from the video, as well as language models trained on multilingual data, to provide accurate and natural-sounding responses to users' questions.
The authors propose several approaches to address the challenges of MAVQA, including cross-lingual feature alignment, multilingual knowledge distillation, and audio-visual fusion.

Plain English Explanation

The paper discusses the challenge of building a system that can answer questions about video content in multiple languages. The goal is to create a tool that allows users to ask questions in their native language and receive responses in that same language, rather than having to use a single common language.

To achieve this, the researchers propose leveraging both the audio and visual information in the video, as well as language models trained on multilingual data. This is important because the audio and visual cues can provide complementary information to help the system understand the content and context of the video, while the multilingual language models allow the system to communicate fluently in different languages.

Some of the key innovations the authors explore include aligning features across languages, distilling knowledge from multilingual models, and combining audio and visual inputs in an effective way. These technical approaches are designed to overcome the challenges of building a truly multilingual and multimodal question answering system.

Technical Explanation

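The paper presents a framework for multilingual audio-visual question answering (MAVQA), which aims to enable users to ask questions about video content in their native language and receive answers in the same language. The abstract names this framework MERA and describes three model variants, MERA-L, MERA-C, and MERA-T, built on video, audio, and textual foundation models.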

The authors propose several key components to address the challenges of MAVQA:

Cross-Lingual Feature Alignment: The system needs to be able to effectively process and understand video content across multiple languages. The authors explore techniques to align visual and audio features in a shared multilingual representation space.
Multilingual Knowledge Distillation: To leverage multilingual language models, the researchers investigate knowledge distillation approaches to transfer knowledge from large multilingual models to more compact models optimized for the MAVQA task.
Audio-Visual Fusion: The system must combine audio and visual information from the video in an effective way to answer questions accurately. The authors experiment with different fusion strategies to integrate these multimodal inputs.
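The fusion step described above can be illustrated with a minimal late-fusion sketch. This is an assumption-laden toy, not the authors' MERA architecture: the embedding sizes, the random weights, and the function names (`linear`, `softmax`, `fuse_and_classify`) are all hypothetical, and the embeddings stand in for outputs of audio, video, and text foundation models.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(audio_emb, visual_emb, question_emb, weights, bias):
    """Late fusion: concatenate the three embeddings, then score each answer."""
    fused = audio_emb + visual_emb + question_emb  # list concatenation
    return softmax(linear(fused, weights, bias))


# Toy inputs standing in for foundation-model outputs (hypothetical sizes).
rng = random.Random(0)
audio_emb = [rng.uniform(-1, 1) for _ in range(4)]
visual_emb = [rng.uniform(-1, 1) for _ in range(6)]
question_emb = [rng.uniform(-1, 1) for _ in range(5)]

num_answers = 3  # size of a hypothetical candidate-answer vocabulary
fused_dim = len(audio_emb) + len(visual_emb) + len(question_emb)
weights = [[rng.uniform(-0.5, 0.5) for _ in range(fused_dim)] for _ in range(num_answers)]
bias = [0.0] * num_answers

probs = fuse_and_classify(audio_emb, visual_emb, question_emb, weights, bias)
predicted_answer = max(range(num_answers), key=probs.__getitem__)
```

Concatenation is only one possible fusion strategy; attention-based or gated fusion would replace the `fused = ...` line while leaving the rest of the sketch unchanged.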

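According to the abstract, the proposed models are benchmarked on the two multilingual AVQA datasets introduced in the paper, which were built by machine-translating existing AVQA benchmarks into eight languages. Related multilingual benchmarks such as CVQA, MTVQA, and ViSpeR (listed under Related Papers) target visual question answering and audio-visual speech recognition rather than AVQA. The authors position their results as a reference benchmark for future work on multilingual and multimodal question answering.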
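Multilingual benchmarks are commonly reported as one score per language. The following is a minimal sketch of that bookkeeping, assuming exact-match accuracy after lowercasing; the record fields (`lang`, `answer`, `prediction`), the language codes, and the sample data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: fraction of records whose prediction matches the answer}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["lang"]] += 1
        # Case-insensitive exact match; a real evaluation may normalize further.
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[rec["lang"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Hypothetical predictions for two illustrative languages.
records = [
    {"lang": "fr", "answer": "piano", "prediction": "piano"},
    {"lang": "fr", "answer": "deux", "prediction": "trois"},
    {"lang": "de", "answer": "ja", "prediction": "Ja"},
    {"lang": "de", "answer": "geige", "prediction": "geige"},
]
scores = accuracy_by_language(records)
```

Reporting per-language scores rather than a single pooled number makes it easier to see whether translation quality or model coverage hurts particular languages.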

Critical Analysis

The paper presents a thoughtful and well-designed approach to the challenge of Multilingual Audio-Visual Question Answering. The authors have identified key technical hurdles and proposed innovative solutions to address them.

One potential limitation is the reliance on existing benchmark datasets, which may not fully capture the nuances and complexities of real-world multilingual and multimodal interactions. Further evaluation on more diverse and representative data could help validate the system's performance in realistic scenarios.

Additionally, the paper does not delve deeply into the potential societal implications of a successful MAVQA system. Considerations around accessibility, bias, and the impact on multilingual communities would be valuable for the reader to understand.

Overall, this research represents an important step towards more inclusive and accessible multimedia interaction systems. Continued advancements in this area could have far-reaching benefits for individuals and communities around the world.

Conclusion

The paper presents a compelling approach to the challenge of Multilingual Audio-Visual Question Answering. By leveraging both audio and visual information, as well as multilingual language models, the proposed MAVQA system aims to enable users to interact with video content in their native languages.

The technical innovations, including cross-lingual feature alignment, multilingual knowledge distillation, and audio-visual fusion, demonstrate the researchers' thoughtful approach to addressing the key challenges in this domain. While further evaluation and consideration of societal implications are warranted, this work represents an important step towards more inclusive and accessible multimedia interaction systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering

Jie Ma, Min Hu, Pinghui Wang, Wangchun Sun, Lingyun Song, Hongbin Pei, Jun Liu, Youtian Du

Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task, demanding intelligent systems to accurately respond to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to overlearning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, firstly, we propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset (MUSIC-AVQA) and subsequently introducing distribution shifts to split questions. The former leads to a large, diverse test space, while the latter results in a comprehensive robustness evaluation on rare, frequent, and overall questions. Secondly, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on both datasets, especially obtaining a significant improvement of 9.32% on the proposed dataset. Extensive ablation experiments are conducted on these two datasets to validate the effectiveness of the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through the evaluation on our dataset.

5/21/2024

cs.CV

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song, Henok Biadglign Ademtew, Hernán Maina, Holy Lovenia, Israel Abebe Azime, Jan Christian Blaise Cruz, Jay Gala, Jiahui Geng, Jesus-German Ortiz-Barajas, Jinheon Baek, Jocelyn Dunstan, Laura Alonso Alemany, Kumaranage Ravindu Yasas Nagasinghe, Luciana Benotti, Luis Fernando D'Haro, Marcelo Viridiano, Marcos Estecha-Garitagoitia, Maria Camila Buitrago Cabrera, Mario Rodríguez-Cantelar, Mélanie Jouitteau, Mihail Mihaylov, Mohamed Fazli Mohamed Imam, Muhammad Farid Adilazuarda, Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Naome Etori, Olivier Niyomugisha, Paula Mónica Silva, Pranjal Chitale, Raj Dabre, Rendi Chevi, Ruochen Zhang, Ryandito Diandaru, Samuel Cahyawijaya, Santiago Góngora, Soyeong Jeong, Sukannya Purkayastha, Tatsuki Kuribayashi, Thanmay Jayakumar, Tiago Timponi Torrent, Toqeer Ehsan, Vladimir Araujo, Yova Kementchedjhieva, Zara Burzo, Zheng Wei Lim, Zheng Xin Yong, Oana Ignat, Joan Nwatu, Rada Mihalcea, Thamar Solorio, Alham Fikri Aji

Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.

6/11/2024

cs.CV cs.AI cs.CL cs.LG

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, Can Huang

Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works to expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial visual-textual misalignment problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Moreover, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. Further, by comprehensively evaluating numerous state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA dataset, it is evident that there is still a large room for performance improvement, underscoring the value of MTVQA. Additionally, we supply multilingual training data within the MTVQA dataset, demonstrating that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We aspire that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension. The project homepage is available at https://bytedance.github.io/MTVQA/.

6/12/2024

cs.CV

ViSpeR: Multilingual Audio-Visual Speech Recognition

Sanath Narayan, Yasser Abdelaziz Dahou Djilali, Ankit Singh, Eustache Le Bihan, Hakim Hacid

This work presents an extensive and detailed study on Audio-Visual Speech Recognition (AVSR) for five widely spoken languages: Chinese, Spanish, English, Arabic, and French. We have collected large-scale datasets for each language except for English, and have engaged in the training of supervised learning models. Our model, ViSpeR, is trained in a multi-lingual setting, resulting in competitive performance on newly established benchmarks for each language. The datasets and models are released to the community with an aim to serve as a foundation for triggering and feeding further research work and exploration on Audio-Visual Speech Recognition, an increasingly important area of research. Code available at https://github.com/YasserdahouML/visper.

6/4/2024

cs.CL cs.AI