RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement

Read original: arXiv:2407.07825 - Published 7/11/2024 by Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic


Overview

This paper introduces RT-LA-VocE, a real-time low-SNR audio-visual speech enhancement system.
The system aims to improve speech quality in noisy environments by leveraging both audio and visual information.
It redesigns every component of LA-VocE, a non-causal audio-visual speech enhancement model, for causal inference on 40 ms input frames, so that enhancement never depends on future inputs.

Plain English Explanation

RT-LA-VocE is a system that can improve the quality of speech in noisy environments by using both audio and visual information. Imagine you're trying to have a conversation with someone in a crowded room - the background noise makes it hard to hear them clearly. RT-LA-VocE could help by using the person's lip movements and facial expressions, along with the audio signal, to better understand what they're saying and remove the background noise.

The key innovation is a causal redesign of LA-VocE, an earlier audio-visual enhancement model that relies on future inputs. RT-LA-VocE produces clean speech frame by frame from only the current and past audio and video, which lets it run on a live stream. By using both modalities, it can separate the speech signal from background noise better than approaches that rely on audio alone.

Technical Explanation

RT-LA-VocE redesigns every component of LA-VocE, a state-of-the-art non-causal audio-visual speech enhancement model, so that inference is causal and runs in real time on 40 ms input frames. New visual and audio encoders extract features using only past frames, and an Emformer replaces the Transformer encoder of the original model to capture the temporal dynamics of the speech and visual signals.
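
Real-time operation requires that feature extraction never look at future frames. As a minimal sketch of that constraint (not the authors' encoder; the function `causal_conv1d` and the example kernel are hypothetical illustrations), a causal 1-D convolution in NumPy can be written as:

```python
import numpy as np


def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[t-k+1 .. t]."""
    k = len(kernel)
    # Left-pad with zeros so that no future sample is ever read.
    padded = np.concatenate([np.zeros(k - 1), x])
    # kernel[0] weights the current sample, kernel[j] the sample j steps back.
    return np.array(
        [np.dot(kernel, padded[t:t + k][::-1]) for t in range(len(x))]
    )


# Hypothetical toy signal and two-tap averaging kernel.
x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, 0.5])
y = causal_conv1d(x, kernel)  # -> [0.5, 1.5, 2.5, 3.5]
```

Because all padding sits on the left, changing a later sample of `x` cannot alter any earlier output, which is the property a frame-by-frame system needs.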

A new causal neural vocoder, C-HiFi-GAN, then generates the enhanced speech. Each component is tuned so that the algorithmic latency stays at the theoretical minimum of 40 ms, while end-to-end processing takes 28.15 ms per frame, which makes the system suitable for real-time applications.
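
The paper's abstract reports a 40 ms input frame and an end-to-end processing latency of 28.15 ms per frame. The sketch below works through what those two figures imply for real-time operation. The helper names and the "real-time factor" framing are illustrative assumptions, not taken from the paper:

```python
FRAME_MS = 40.0        # input frame duration reported in the abstract
PROCESSING_MS = 28.15  # reported end-to-end processing latency per frame


def real_time_factor(processing_ms, frame_ms):
    """Processing time divided by frame duration (hypothetical helper)."""
    return processing_ms / frame_ms


def keeps_up(processing_ms, frame_ms):
    """A frame-by-frame system keeps pace with a live stream only if each
    frame is processed before the next one has finished arriving."""
    return processing_ms <= frame_ms


rtf = real_time_factor(PROCESSING_MS, FRAME_MS)
headroom_ms = FRAME_MS - PROCESSING_MS  # spare time per frame

print(f"real-time factor: {rtf:.3f}")
print(f"headroom per frame: {headroom_ms:.2f} ms")
print(f"keeps up with the stream: {keeps_up(PROCESSING_MS, FRAME_MS)}")
```

With the reported figures the real-time factor is about 0.70, so under this simple model each frame is finished roughly 12 ms before the next one is complete.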

Critical Analysis

The authors of the paper acknowledge that while RT-LA-VocE demonstrates promising results, there are still some limitations to address. For example, the system may not perform as well in extremely low-SNR conditions or when the speaker's face is partially occluded. Additionally, the authors suggest that incorporating language model-generated pseudo labels could further improve the speech enhancement capabilities of the system.

Another potential area for improvement is the flexibility of the system to handle multilingual speech or speakers with different accents. The authors did not explore these aspects in the current work, but they could be valuable extensions to increase the real-world applicability of RT-LA-VocE.

Conclusion

RT-LA-VocE shows that audio-visual speech enhancement can run causally and in real time: the authors report state-of-the-art results on the AVSpeech dataset in all real-time scenarios, with an algorithmic latency of 40 ms and an end-to-end processing latency of 28.15 ms per frame. The causal encoders, the Emformer, and the C-HiFi-GAN vocoder make the system suitable for practical, real-time applications. While the current work has some limitations, the authors have identified promising avenues for future research to further enhance the capabilities of the system.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement

Honglie Chen, Rodrigo Mira, Stavros Petridis, Maja Pantic

In this paper, we aim to generate clean speech frame by frame from a live video stream and a noisy audio stream without relying on future inputs. To this end, we propose RT-LA-VocE, which completely re-designs every component of LA-VocE, a state-of-the-art non-causal audio-visual speech enhancement model, to perform causal real-time inference with a 40ms input frame. We do so by devising new visual and audio encoders that rely solely on past frames, replacing the Transformer encoder with the Emformer, and designing a new causal neural vocoder C-HiFi-GAN. On the popular AVSpeech dataset, we show that our algorithm achieves state-of-the-art results in all real-time scenarios. More importantly, each component is carefully tuned to minimize the algorithm latency to the theoretical minimum (40ms) while maintaining a low end-to-end processing latency of 28.15ms per frame, enabling real-time frame-by-frame enhancement with minimal delay.

7/11/2024

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Maxime Burchi, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg, Radu Timofte

Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy conditions. In this work, we present a multilingual AVSR model incorporating several enhancements to improve performance and audio noise robustness. Notably, we adapt the recently proposed Fast Conformer model to process both audio and visual modalities using a novel hybrid CTC/RNN-T architecture. We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets (VoxCeleb2 and AVSpeech). Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%. On the recently introduced MuAViC benchmark, our model yields an absolute average-WER reduction of 11.9% in comparison to the original baseline. Finally, we demonstrate the ability of the proposed model to perform audio-only, visual-only, and audio-visual speech recognition at test time.

5/24/2024

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang

Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.

8/13/2024

DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi

Streaming voice conversion has become increasingly popular for its potential in real-time applications. The recently proposed DualVC 2 has achieved robust and high-quality streaming voice conversion with a latency of about 180 ms. Nonetheless, the recognition-synthesis framework hinders end-to-end optimization, and the instability of the automatic speech recognition (ASR) model with short chunks makes it challenging to further reduce latency. To address these issues, we propose an end-to-end model, DualVC 3. With speaker-independent semantic tokens to guide the training of the content encoder, the dependency on ASR is removed and the model can operate under extremely small chunks, with cascading errors eliminated. A language model is trained on the content encoder output to produce pseudo context by iteratively predicting future frames, providing more contextual information for the decoder to improve conversion quality. Experimental results demonstrate that DualVC 3 achieves comparable performance to DualVC 2 in subjective and objective metrics, with a latency of only 50 ms.

6/13/2024