CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement

2209.11112

Published 5/7/2024 by Sherif Abdulatif, Ruizhe Cao, Bin Yang

🗣️

Abstract

In this work, we further develop the conformer-based metric generative adversarial network (CMGAN) model for speech enhancement (SE) in the time-frequency (TF) domain. This paper builds on our previous work but takes a more in-depth look by conducting extensive ablation studies on model inputs and architectural design choices. We rigorously tested the generalization ability of the model to unseen noise types and distortions. We have fortified our claims through DNS-MOS measurements and listening tests. Rather than focusing exclusively on the speech denoising task, we extend this work to address the dereverberation and super-resolution tasks. This necessitated exploring various architectural changes, specifically metric discriminator scores and masking techniques. It is essential to highlight that this is among the earliest works that attempted complex TF-domain super-resolution. Our findings show that CMGAN outperforms existing state-of-the-art methods in the three major speech enhancement tasks: denoising, dereverberation, and super-resolution. For example, in the denoising task using the Voice Bank+DEMAND dataset, CMGAN notably exceeded the performance of prior models, attaining a PESQ score of 3.41 and an SSNR of 11.10 dB. Audio samples and CMGAN implementations are available online.

Create account to get full access

Overview

This work further develops the conformer-based metric generative adversarial network (CMGAN) model for speech enhancement (SE) in the time-frequency (TF) domain.
It builds on the authors' previous work and conducts extensive ablation studies on model inputs and architectural design choices.
The paper rigorously tests the generalization ability of the model to unseen noise types and distortions.
It extends the work to address dereverberation and super-resolution tasks, not just speech denoising.
The findings show that CMGAN outperforms existing state-of-the-art methods in the three major speech enhancement tasks: denoising, dereverberation, and super-resolution.

Plain English Explanation

The researchers have improved upon their previous work on a machine learning model called CMGAN, which is used for enhancing the quality of recorded speech. CMGAN works by analyzing the time and frequency components of the audio signal to identify and remove unwanted noise, reverberation, and other distortions.

In this new study, the researchers conducted a thorough investigation of how different design choices and input data affect the performance of CMGAN. They tested the model's ability to generalize and work well on a variety of noise and distortion types that it hadn't been trained on before.

The researchers also expanded the capabilities of CMGAN beyond just noise reduction. They enabled it to also tackle the problems of dereverberation (removing the echoes and reverberations in a recording) and super-resolution (increasing the clarity and detail of the audio).

The results show that CMGAN outperforms other state-of-the-art speech enhancement methods on all three of these tasks. For example, in a noise reduction test, CMGAN achieved significantly better scores on measures of speech quality and noise suppression compared to previous models.

Technical Explanation

The core of this work is the continued development and evaluation of the CMGAN model for speech enhancement in the time-frequency (TF) domain. CMGAN builds on the authors' previous research, but this paper takes a more in-depth look by conducting extensive ablation studies to understand the impact of model inputs and architectural choices.

A key focus of the study was rigorously testing CMGAN's generalization capability - its ability to perform well on unseen noise types and distortions, not just the data it was trained on. The researchers also expanded the scope of CMGAN beyond just speech denoising, exploring its use for dereverberation and super-resolution tasks as well.

This required exploring various architectural changes, such as modifying the metric discriminator scores and masking techniques used by the model. Importantly, the paper notes that this is among the earliest works to attempt complex TF-domain super-resolution for speech enhancement.

The experimental results show that CMGAN outperforms existing state-of-the-art methods across the three major speech enhancement tasks: denoising, dereverberation, and super-resolution. For example, in a denoising task using the Voice Bank+DEMAND dataset, CMGAN achieved a PESQ score of 3.41 and an SSNR of 11.10 dB, significantly exceeding prior models.

Critical Analysis

The paper provides a thorough and rigorous evaluation of the CMGAN model, highlighting its strong performance across multiple speech enhancement tasks. However, the authors do acknowledge some limitations and avenues for future work.

For instance, the paper mentions that further research is needed to fully understand CMGAN's behavior and generalization capabilities, especially when dealing with more diverse and challenging noise and distortion scenarios. The authors also note that the computational complexity of the model may be a concern for some real-world applications, particularly on edge devices.

Additionally, while the paper demonstrates CMGAN's effectiveness, it would be valuable to see comparisons to a wider range of state-of-the-art techniques beyond just the methods the authors have previously proposed. This could provide a more comprehensive understanding of how CMGAN fares against the broader landscape of speech enhancement approaches.

Overall, the research presented in this paper represents a significant advancement in speech enhancement capabilities, particularly in its ability to handle complex TF-domain tasks like dereverberation and super-resolution. However, as with any research, there remain opportunities for further refinement and investigation to fully unlock the potential of this technology.

Conclusion

This work has made valuable contributions to the field of speech enhancement by further developing the CMGAN model and rigorously evaluating its performance across a range of tasks, including denoising, dereverberation, and super-resolution.

The findings demonstrate that CMGAN outperforms existing state-of-the-art methods, offering significant improvements in speech quality and noise suppression. This research represents an important step forward in enhancing the fidelity and clarity of recorded speech, which has numerous applications in areas like remote sensing, telecommunications, and multimedia production.

As the researchers continue to refine and expand the capabilities of CMGAN, it holds great promise for further advancing the state of the art in speech enhancement and improving the user experience in a wide variety of real-world scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Unrestricted Global Phase Bias-Aware Single-channel Speech Enhancement with Conformer-based Metric GAN

Shiqi Zhang, Zheng Qiu, Daiki Takeuchi, Noboru Harada, Shoji Makino

With the rapid development of neural networks in recent years, the ability of various networks to enhance the magnitude spectrum of noisy speech in the single-channel speech enhancement domain has become exceptionally outstanding. However, enhancing the phase spectrum using neural networks is often ineffective, which remains a challenging problem. In this paper, we found that the human ear cannot sensitively perceive the difference between a precise phase spectrum and a biased phase (BP) spectrum. Therefore, we propose an optimization method of phase reconstruction, allowing freedom on the global-phase bias instead of reconstructing the precise phase spectrum. We applied it to a Conformer-based Metric Generative Adversarial Networks (CMGAN) baseline model, which relaxes the existing constraints of precise phase and gives the neural network a broader learning space. Results show that this method achieves a new state-of-the-art performance without incurring additional computational overhead.

6/5/2024

eess.AS cs.SD

Speech enhancement deep-learning architecture for efficient edge processing

Monisankha Pal, Arvind Ramanathan, Ted Wada, Ashutosh Pandey

Deep learning has become a de facto method of choice for speech enhancement tasks with significant improvements in speech quality. However, real-time processing with reduced size and computations for low-power edge devices drastically degrades speech quality. Recently, transformer-based architectures have greatly reduced the memory requirements and provided ways to improve the model performance through local and global contexts. However, the transformer operations remain computationally heavy. In this work, we introduce WaveUNet squeeze-excitation Res2 (WSR)-based metric generative adversarial network (WSR-MGAN) architecture that can be efficiently implemented on low-power edge devices for noise suppression tasks while maintaining speech quality. We utilize multi-scale features using Res2Net blocks that can be related to spectral content used in speech-processing tasks. In the generator, we integrate squeeze-excitation blocks (SEB) with multi-scale features for maintaining local and global contexts along with gated recurrent units (GRUs). The proposed approach is optimized through a combined loss function calculated over raw waveform, multi-resolution magnitude spectrogram, and objective metrics using a metric discriminator. Experimental results in terms of various objective metrics on VoiceBank+DEMAND and DNS-2020 challenge datasets demonstrate that the proposed speech enhancement (SE) approach outperforms the baselines and achieves state-of-the-art (SOTA) performance in the time domain.

5/28/2024

eess.AS

Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping

Kevin Zhang, Luka Chkhetiani, Francis McCann Ramirez, Yash Khare, Andrea Vanzo, Michael Liang, Sergio Ramirez Martin, Gabriel Oexle, Ruben Bousbib, Taufiquzzaman Peyash, Michael Nguyen, Dillon Pulliam, Domenic Donato

This paper presents Conformer-1, an end-to-end Automatic Speech Recognition (ASR) model trained on an extensive dataset of 570k hours of speech audio data, 91% of which was acquired from publicly available sources. To achieve this, we perform Noisy Student Training after generating pseudo-labels for the unlabeled public data using a strong Conformer RNN-T baseline model. The addition of these pseudo-labeled data results in remarkable improvements in relative Word Error Rate (WER) by 11.5% and 24.3% for our asynchronous and realtime models, respectively. Additionally, the model is more robust to background noise owing to the addition of these data. The results obtained in this study demonstrate that the incorporation of pseudo-labeled publicly available data is a highly effective strategy for improving ASR accuracy and noise robustness.

4/16/2024

eess.AS cs.CL cs.LG cs.SD

Listen and Move: Improving GANs Coherency in Agnostic Sound-to-Video Generation

Rafael Redondo

Deep generative models have demonstrated the ability to create realistic audiovisual content, sometimes driven by domains of different nature. However, smooth temporal dynamics in video generation is a challenging problem. This work focuses on generic sound-to-video generation and proposes three main features to enhance both image quality and temporal coherency in generative adversarial models: a triple sound routing scheme, a multi-scale residual and dilated recurrent network for extended sound analysis, and a novel recurrent and directional convolutional layer for video prediction. Each of the proposed features improves, in both quality and coherency, the baseline neural architecture typically used in the SoTA, with the video prediction layer providing an extra temporal refinement.

6/26/2024

cs.SD cs.GR eess.AS