FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching

Read original: arXiv:2406.09286 - Published 6/14/2024 by Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung


Overview

• This paper proposes FlowAVSE, a method that enhances corrupted speech signals by using both acoustic and visual cues.

• The key idea is to replace the slow, multi-step sampling of diffusion-based systems with a conditional flow matching algorithm that generates high-quality speech in a single sampling step, and to slim down the underlying U-net architecture.

• The authors report 22 times faster inference and half the model size while maintaining output quality.

Plain English Explanation

The paper describes a new way to clean up corrupted speech recordings, using video of the speaker's face as an extra cue. The technique, called FlowAVSE, takes the corrupted audio together with the visual information and generates an enhanced version of the speech.

Existing diffusion-based systems already produce high-quality output, but they are slow and computationally heavy. FlowAVSE is instead built on conditional flow matching, which lets it produce the enhanced speech in a single sampling step.

The key innovation is efficiency rather than a jump in quality. By combining single-step generation with an optimized U-net, FlowAVSE is reported to run 22 times faster with half the model size, while maintaining output quality.

Technical Explanation

FlowAVSE takes a corrupted speech signal and visual cues from the accompanying video as input and generates an enhanced speech signal. It builds on diffusion-based enhancement systems and their U-net architecture, but changes how the model is trained and sampled.

Specifically, the model is trained with a conditional flow matching algorithm. In flow matching, a network learns a velocity field that transports samples from a simple starting distribution toward the target data, here clean speech, conditioned on the model's inputs. According to the abstract, this enables the generation of high-quality speech in a single sampling step.

Training optimizes the conditional flow matching objective, which regresses the network's predicted velocity onto a known target velocity along paths between starting samples and clean speech. This is in contrast to existing diffusion-based approaches, which the abstract describes as limited by slow inference and computational complexity. The authors further increase efficiency by optimizing the U-net architecture used in those systems.
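To make the conditional flow matching objective concrete, here is a minimal NumPy sketch of the generic formulation with straight-line paths. It is an illustration under stated assumptions, not the authors' implementation: the names `cfm_training_pair`, `cfm_loss`, and `velocity_fn` are hypothetical, the random arrays are stand-ins for speech features and audio-visual conditioning, and the choice of starting distribution and path used in the paper is not specified in this summary.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Build a training point and target for linear (straight-line) paths.

    x0: starting samples, shape (batch, dim)
    x1: target samples (stand-in for clean speech), shape (batch, dim)
    t:  times in [0, 1], shape (batch,)
    """
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over feature dims
    x_t = (1.0 - t) * x0 + t * x1              # point on the straight path
    target_v = x1 - x0                         # constant velocity of that path
    return x_t, target_v

def cfm_loss(velocity_fn, x0, x1, cond, t):
    """Mean squared error between predicted and target velocity."""
    x_t, target_v = cfm_training_pair(x0, x1, t)
    pred_v = velocity_fn(x_t, t, cond)         # hypothetical network call
    return float(np.mean((pred_v - target_v) ** 2))

# Toy data (assumed shapes, purely illustrative).
rng = np.random.default_rng(0)
batch, dim = 4, 8
clean = rng.normal(size=(batch, dim))  # stand-in for clean speech features
start = rng.normal(size=(batch, dim))  # stand-in for the starting sample
cond = rng.normal(size=(batch, dim))   # stand-in for audio-visual conditioning
t = rng.uniform(size=batch)

def zero_velocity(x_t, t, cond):
    # An untrained "model" that always predicts zero velocity.
    return np.zeros_like(x_t)

def oracle_velocity(x_t, t, cond):
    # Cheats by using the true endpoints; shows the loss minimum.
    return clean - start

loss_zero = cfm_loss(zero_velocity, start, clean, cond, t)
loss_oracle = cfm_loss(oracle_velocity, start, clean, cond, t)
print(loss_zero, loss_oracle)
```

In a real system `velocity_fn` would be a neural network (a U-net, per the abstract) and the loss would be minimized by gradient descent; the oracle here only shows that the objective reaches zero when the predicted velocity matches the target.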

Experiments show that FlowAVSE achieves 22 times faster inference and halves the model size while maintaining output quality. The gains come from the two changes described in the abstract: single-step sampling via conditional flow matching and an optimized U-net architecture.
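The abstract attributes much of the efficiency to generating speech in a single sampling step. The sketch below illustrates why a flow model can sometimes be sampled this way: sampling means integrating the learned velocity field from t = 0 to t = 1, and when the trajectory is a straight line, one Euler step lands on the same point as many small steps. This is a toy under assumptions: `euler_sample` and `straight_velocity` are hypothetical names, the velocity field is an idealized closed-form one rather than a trained network, and a real model's trajectories are only approximately straight.

```python
import numpy as np

def euler_sample(velocity_fn, x0, cond, num_steps=1):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full(x.shape[0], i * dt)      # current time for each sample
        x = x + dt * velocity_fn(x, t, cond) # one Euler update
    return x

rng = np.random.default_rng(1)
batch, dim = 3, 5
start = rng.normal(size=(batch, dim))   # stand-in for the starting sample
target = rng.normal(size=(batch, dim))  # stand-in for clean speech
cond = None                             # conditioning is unused in this toy

def straight_velocity(x, t, cond):
    # Idealized field whose trajectories are straight lines ending at `target`.
    return (target - x) / (1.0 - t)[:, None]

one_step = euler_sample(straight_velocity, start, cond, num_steps=1)
many_steps = euler_sample(straight_velocity, start, cond, num_steps=30)

# Both reach the target because the path has constant velocity.
print(np.allclose(one_step, target), np.allclose(many_steps, target))
```

The cost of sampling scales with the number of network evaluations, so cutting the step count to one is where a large share of the speedup in such systems would come from.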

Critical Analysis

The paper presents a promising approach to audio-visual speech enhancement. Its key strength is efficiency: conditional flow matching allows generation in a single sampling step, addressing the slow inference of diffusion-based systems.

However, the abstract does not discuss limitations or caveats of the method. For example, FlowAVSE may struggle with noisy or low-quality video inputs, or it may require careful hyperparameter tuning to achieve optimal performance.

Additionally, while the authors demonstrate the efficiency of FlowAVSE, its relationship to other flow-based audio-visual models is not covered here. One example is Frieren, which aims to efficiently generate high-quality audio from video, a related but different task. Further research would be needed to fully assess the strengths and weaknesses of FlowAVSE relative to the state of the art.

Conclusion

FlowAVSE is an efficient approach to audio-visual speech enhancement that uses conditional flow matching to generate enhanced speech in a single sampling step. The authors report 22 times faster inference and half the model size while maintaining output quality.

While this summary does not cover potential limitations or comparisons with the latest advances in the field, single-step generation with a compact model is a promising direction for improving speech quality in a wide range of applications, from video conferencing to voice assistants.



Related Papers

FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching

Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung

This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues. While existing diffusion-based approaches have demonstrated remarkable quality, their applicability is limited by slow inference speeds and computational complexity. To address this issue, we present FlowAVSE which enhances the inference speed and reduces the number of learnable parameters without degrading the output quality. In particular, we employ a conditional flow matching algorithm that enables the generation of high-quality speech in a single sampling step. Moreover, we increase efficiency by optimizing the underlying U-net architecture of diffusion-based systems. Our experiments demonstrate that FlowAVSE achieves 22 times faster inference speed and reduces the model size by half while maintaining the output quality. The demo page is available at: https://cyongong.github.io/FlowAVSE.github.io/

6/14/2024


AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement

Ju-Chieh Chou, Chung-Ming Chien, Karen Livescu

Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), there is not as much ground-truth clean data available; most audio-visual datasets are collected in real-world environments with background noise and reverberation, hampering the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based audio-visual speech enhancement approach that can generate clean speech despite the challenges of real-world training data. We obtain a subset of nearly clean speech from an audio-visual corpus using a neural quality estimator, and then train a diffusion model on this subset to generate waveforms conditioned on continuous speech representations from AV-HuBERT with noise-robust training. We use continuous rather than discrete representations to retain prosody and speaker information. With this vocoding task alone, the model can perform speech enhancement better than a masking-based baseline. We further fine-tune the diffusion model on clean/noisy utterance pairs to improve the performance. Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test and is close in quality to the target speech in the listening test. Audio samples can be found at https://home.ttic.edu/~jcchou/demo/avse/avse_demo.html.

4/10/2024

LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement

Arnav Jain, Jasmer Singh Sanjotra, Harshvardhan Choudhary, Krish Agrawal, Rupal Shah, Rohan Jha, M. Sajid, Amir Hussain, M. Tanveer

In this paper, we propose long short term memory speech enhancement network (LSTMSE-Net), an audio-visual speech enhancement (AVSE) method. This innovative method leverages the complementary nature of visual and audio information to boost the quality of speech signals. Visual features are extracted with VisualFeatNet (VFN), and audio features are processed through an encoder and decoder. The system scales and concatenates visual and audio features, then processes them through a separator network for optimized speech enhancement. The architecture highlights advancements in leveraging multi-modal data and interpolation techniques for robust AVSE challenge systems. The performance of LSTMSE-Net surpasses that of the baseline model from the COG-MHEAR AVSE Challenge 2024 by a margin of 0.06 in scale-invariant signal-to-distortion ratio (SISDR), 0.03 in short-time objective intelligibility (STOI), and 1.32 in perceptual evaluation of speech quality (PESQ). The source code of the proposed LSTMSE-Net is available at https://github.com/mtanveer1/AVSEC-3-Challenge.

9/5/2024


VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu

Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the process of generating mel-spectrograms into an ordinary differential equation conditional on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens its sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique in VoiceFlow.

9/4/2024