Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

2405.12983

Published 5/24/2024 by Maxime Burchi, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg, Radu Timofte

🗣️

Abstract

Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow similar approach to achieve robust speech recognition in noisy conditions. In this work, we present a multilingual AVSR model incorporating several enhancements to improve performance and audio noise robustness. Notably, we adapt the recently proposed Fast Conformer model to process both audio and visual modalities using a novel hybrid CTC/RNN-T architecture. We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets (VoxCeleb2 and AVSpeech). Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%. On the recently introduced MuAViC benchmark, our model yields an absolute average-WER reduction of 11.9% in comparison to the original baseline. Finally, we demonstrate the ability of the proposed model to perform audio-only, visual-only, and audio-visual speech recognition at test time.

Create account to get full access

Overview

This paper presents a multilingual audio-visual speech recognition (AVSR) model that demonstrates state-of-the-art performance on several benchmark datasets.
The model incorporates several enhancements, including adapting the Fast Conformer architecture to process both audio and visual modalities.
The model is trained on a large amount of audio-visual data across six languages, including automatically transcribed datasets like VoxCeleb2 and AVSpeech.
The proposed model achieves new state-of-the-art results on the LRS3 dataset and significant improvements on the MuAViC benchmark.

Plain English Explanation

Humans are remarkably good at understanding speech by watching a speaker's lip movements, especially in noisy environments. This audio-visual speech recognition (AVSR) approach is the inspiration for the model presented in this paper.

The researchers have developed a multilingual AVSR model that can process both audio and visual information to achieve robust speech recognition, even in the presence of background noise. They adapted a recently proposed Fast Conformer architecture to handle both audio and visual inputs, using a novel hybrid CTC/RNN-T (Connectionist Temporal Classification / Recurrent Neural Network Transducer) architecture.

To train the model, the researchers significantly increased the amount of audio-visual data available by automatically generating transcriptions for large, unlabeled multilingual datasets like VoxCeleb2 and AVSpeech. This allowed the model to learn from a diverse range of speakers and languages.

The resulting model achieves new state-of-the-art performance on the LRS3 dataset, with a word error rate (WER) of just 0.8%. On the more recently introduced MuAViC benchmark, the model demonstrates an impressive 11.9% absolute reduction in average WER compared to the original baseline.

Importantly, the model can perform audio-only, visual-only, and audio-visual speech recognition at test time, showing its flexibility and robustness.

Technical Explanation

The researchers present a multilingual AVSR model that builds on the Fast Conformer architecture, which has shown promising results for streaming automatic speech recognition. They adapt the Conformer to process both audio and visual modalities using a novel hybrid CTC/RNN-T architecture.

To increase the amount of audio-visual training data, the researchers generate automatic transcriptions for two large, unlabeled multilingual datasets: VoxCeleb2 and AVSpeech. This allows the model to learn from a diverse range of speakers and languages.

The proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a word error rate (WER) of 0.8%. On the recently introduced MuAViC benchmark, the model yields an absolute average-WER reduction of 11.9% compared to the original baseline.

The model's flexibility is further demonstrated by its ability to perform audio-only, visual-only, and audio-visual speech recognition at test time.

Critical Analysis

The researchers have made several noteworthy contributions in this work, including the adaptation of the Fast Conformer architecture to process both audio and visual modalities, and the significant expansion of the available audio-visual training data.

However, the paper does not provide much insight into the specific architectural choices or training processes that led to the model's impressive performance. Additionally, while the results on the LRS3 and MuAViC datasets are compelling, it would be useful to see the model's performance on a wider range of multilingual and noisy speech recognition benchmarks to fully assess its capabilities.

Furthermore, the paper does not address potential fairness or bias concerns that may arise from the use of automatically transcribed datasets, which could introduce unwanted biases into the model's performance across different languages and demographic groups.

It would also be interesting to see how the proposed model compares to other state-of-the-art multi-modal speech recognition and cross-modal learning approaches, as well as to explore the model's applicability in real-world scenarios with varying levels of noise and visual quality.

Conclusion

The multilingual AVSR model presented in this paper demonstrates impressive performance on several benchmark datasets, thanks to its adaptation of the Fast Conformer architecture and the significant expansion of the available audio-visual training data.

This work highlights the potential of audio-visual speech recognition to achieve robust performance in adverse listening conditions, with potential applications in assistive technologies, human-computer interaction, and multilingual speech recognition systems. However, further research is needed to address potential fairness and bias concerns, as well as to explore the model's performance in a wider range of real-world scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

He Wang, Pengcheng Guo, Pan Zhou, Lei Xie

While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of our proposed system, achieving a concatenated minimum permutation character error rate (cpCER) of 30.57% on the Eval set and yielding up to 3.17% relative improvement compared with our previous system which ranked the second place in the challenge. Following the fusion of multiple systems, our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.

4/9/2024

cs.SD cs.AI eess.AS

🗣️

ViSpeR: Multilingual Audio-Visual Speech Recognition

Sanath Narayan, Yasser Abdelaziz Dahou Djilali, Ankit Singh, Eustache Le Bihan, Hakim Hacid

This work presents an extensive and detailed study on Audio-Visual Speech Recognition (AVSR) for five widely spoken languages: Chinese, Spanish, English, Arabic, and French. We have collected large-scale datasets for each language except for English, and have engaged in the training of supervised learning models. Our model, ViSpeR, is trained in a multi-lingual setting, resulting in competitive performance on newly established benchmarks for each language. The datasets and models are released to the community with an aim to serve as a foundation for triggering and feeding further research work and exploration on Audio-Visual Speech Recognition, an increasingly important area of research. Code available at href{https://github.com/YasserdahouML/visper}{https://github.com/YasserdahouML/visper}.

6/4/2024

cs.CL cs.AI

🗣️

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

Kun Wei, Bei Li, Hang Lv, Quan Lu, Ning Jiang, Lei Xie

Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.

4/30/2024

cs.SD cs.CL eess.AS

Towards Multilingual Audio-Visual Question Answering

Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma

In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly revolved around English and replicating it for addressing AVQA in other languages requires a substantial allocation of resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets for eight languages created from existing benchmark AVQA datasets. This prevents extra human annotation efforts of collecting questions and answers manually. To this end, we propose, MERA framework, by leveraging state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models namely MERA-L, MERA-C, MERA-T with varied model architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference benchmark for future works in multilingual AVQA.

6/14/2024

cs.LG cs.CV cs.MM cs.SD eess.AS