VC-ENHANCE: Speech Restoration with Integrated Noise Suppression and Voice Conversion

Read original: arXiv:2409.06126 - Published 9/11/2024 by Kyungguen Byun, Jason Filos, Erik Visser, Sunkuk Moon


Overview

This paper presents VC-ENHANCE, a speech restoration system that follows noise suppression (NS) with a voice conversion (VC) stage.
The goal is to recover high-quality speech after noise suppression, which can damage the target speech when applied aggressively.
The restoration stage is a diffusion-based voice conversion model conditioned on a target speaker embedding and on content information extracted from the de-noised speech.

Plain English Explanation

VC-ENHANCE is a speech restoration system that cleans up noisy speech and then regenerates it with a voice conversion model. Noise suppression removes the background noise, and a diffusion-based voice conversion stage rebuilds the speech that the suppression step may have damaged.

The key idea is to take degraded speech, such as a recording with heavy background noise, and first suppress the noise. A diffusion-based voice conversion model then regenerates the speech from two inputs: the spoken content extracted from the de-noised audio and an embedding of the target speaker's voice.

This addresses a common problem: aggressive noise suppression removes the noise but can also degrade the speech itself. The authors report that the two-stage approach outperforms single-stage enhancement models on objective measures of speech quality, while scoring slightly lower on intelligibility.

Technical Explanation

VC-ENHANCE uses a two-stage process to restore degraded speech. First, a noise suppression model removes background noise from the input. Suppression that is too aggressive can damage the target speech and reduce both intelligibility and quality, which is what the second stage is designed to repair.
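To see why a suppression stage on its own can leave speech in need of restoration, here is a minimal Python sketch. It uses a simple amplitude gate as a stand-in for noise suppression. This is an illustrative assumption, not the suppression model used in the paper, and the names `noise_gate` and `kept_fraction` are hypothetical.

```python
import math

def noise_gate(samples, threshold):
    """Toy suppressor: zero every sample whose magnitude is below `threshold`."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]

def kept_fraction(clean, processed):
    """Fraction of the clean signal's energy at positions the gate left open."""
    total = sum(c * c for c in clean)
    kept = sum(c * c for c, p in zip(clean, processed) if p != 0.0)
    return kept / total if total else 0.0

# Toy "speech": a loud first half followed by a quiet second half.
clean = [(0.8 if n < 100 else 0.1) * math.sin(2 * math.pi * 5 * n / 200)
         for n in range(200)]
# Deterministic low-level interference standing in for background noise.
noise = [0.05 * math.sin(2 * math.pi * 37 * n / 200 + 1.0) for n in range(200)]
noisy = [c + x for c, x in zip(clean, noise)]

mild = noise_gate(noisy, threshold=0.06)        # removes little speech
aggressive = noise_gate(noisy, threshold=0.3)   # also wipes out the quiet half

print("speech energy kept (mild):      ", round(kept_fraction(clean, mild), 3))
print("speech energy kept (aggressive):", round(kept_fraction(clean, aggressive), 3))
```

With the higher threshold, the quiet half of the toy signal is removed entirely along with the noise. This is the kind of damage a later restoration stage would have to undo.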

Next, a diffusion-based voice conversion model restores the speech. It is conditioned on the target speaker embedding and on speech content information extracted from the de-noised speech, and it generates the final restored output. According to the authors, this restoration can also achieve enhancement effects such as bandwidth extension, de-reverberation, and in-painting.
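The data flow of the two stages can be sketched in a few lines of Python. Everything here is a toy under stated assumptions: `suppress_noise`, `extract_content`, `generate`, and `restore` are hypothetical names, the speaker is reduced to a single gain value, and the refinement loop only imitates the shape of a conditional diffusion sampler (start from noise, refine step by step toward what the conditioning implies). It is not the paper's model.

```python
import random

def suppress_noise(noisy):
    """Stage 1 stand-in: 3-tap moving average (not the paper's suppressor)."""
    out = []
    for i in range(len(noisy)):
        window = noisy[max(0, i - 1):i + 2]
        out.append(sum(window) / len(window))
    return out

def extract_content(denoised):
    """Hypothetical content encoder: keeps only the sign pattern."""
    return [1.0 if s >= 0 else -1.0 for s in denoised]

def generate(content, speaker_gain, steps=50, seed=0):
    """Stage 2 stand-in: iterative refinement conditioned on content and speaker.

    Starts from random noise and moves toward the signal implied by the
    conditioning, injecting a shrinking amount of noise at each step.
    """
    rng = random.Random(seed)
    target = [speaker_gain * c for c in content]  # what the conditioning implies
    x = [rng.gauss(0.0, 1.0) for _ in content]    # pure noise at the start
    for t in range(steps, 0, -1):
        noise_scale = 0.1 * (t - 1) / steps       # reaches 0 on the final step
        x = [xi + 0.2 * (ti - xi) + noise_scale * rng.gauss(0.0, 1.0)
             for xi, ti in zip(x, target)]
    return x

def restore(noisy, speaker_gain):
    """NS first, then regeneration from (content, speaker) conditioning."""
    denoised = suppress_noise(noisy)
    content = extract_content(denoised)
    return generate(content, speaker_gain)

noisy = [0.5, 0.7, 0.4, -0.6, -0.5, -0.8, 0.6, 0.5]
restored = restore(noisy, speaker_gain=0.9)
print([round(v, 2) for v in restored])
```

Note that the output is generated from the conditioning rather than filtered from the input. That is why errors in content extraction, not residual noise, become the main risk to intelligibility in a design like this.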

The authors report that this two-stage NS+VC framework outperforms single-stage enhancement models in output speech quality, as measured by objective metrics, while scoring slightly lower in speech intelligibility. To narrow that gap, they propose a content encoder adaptation method that makes content extraction more robust in noisy conditions.

Critical Analysis

The key strength of VC-ENHANCE is that it treats restoration as an explicit step after noise suppression. Because the final speech is regenerated by a diffusion-based voice conversion model rather than filtered from the noisy input, the system can recover from damage introduced by the suppression stage.

However, the paper does not provide a detailed analysis of the computational and memory requirements of the full VC-ENHANCE system, which could be an important practical consideration. Additionally, the evaluations are limited to a relatively narrow set of noise and voice conversion scenarios, so further research would be needed to assess the system's robustness and generalization to a wider range of real-world conditions.

Conclusion

VC-ENHANCE presents an approach to speech restoration that follows noise suppression with diffusion-based voice conversion. The two-stage method improves objective speech quality over single-stage enhancement models, at a small cost in intelligibility that the proposed content encoder adaptation is meant to reduce. This work highlights the potential of diffusion models for speech processing, with practical applications in areas like audio enhancement, virtual assistants, and accessibility.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

VC-ENHANCE: Speech Restoration with Integrated Noise Suppression and Voice Conversion

Kyungguen Byun, Jason Filos, Erik Visser, Sunkuk Moon

Noise suppression (NS) algorithms are effective in improving speech quality in many cases. However, aggressive noise suppression can damage the target speech, reducing both speech intelligibility and quality despite removing the noise. This study proposes an explicit speech restoration method using a voice conversion (VC) technique for restoration after noise suppression. We observed that high-quality speech can be restored through a diffusion-based voice conversion stage, conditioned on the target speaker embedding and speech content information extracted from the de-noised speech. This speech restoration can achieve enhancement effects such as bandwidth extension, de-reverberation, and in-painting. Our experimental results demonstrate that this two-stage NS+VC framework outperforms single-stage enhancement models in terms of output speech quality, as measured by objective metrics, while scoring slightly lower in speech intelligibility. To further improve the intelligibility of the combined system, we propose a content encoder adaptation method for robust content extraction in noisy conditions.

9/11/2024

Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

We propose noise-robust voice conversion (VC) which takes into account the recording quality and environment of noisy source speech. Conventional denoising training improves the noise robustness of a VC model by learning noisy-to-clean VC process. However, the naturalness of the converted speech is limited when the noise of the source speech is unseen during the training. To this end, our proposed training conditions a VC model on two latent variables representing the recording quality and environment of the source speech. These latent variables are derived from deep neural networks pre-trained on recording quality assessment and acoustic scene classification and calculated in an utterance-wise or frame-wise manner. As a result, the trained VC model can explicitly learn information about speech degradation during the training. Objective and subjective evaluations show that our training improves the quality of the converted speech compared to the conventional training.

6/12/2024

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie

Zero-shot voice conversion (VC) aims to transform source speech into arbitrary unseen target voice while keeping the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling process as well as training-inference mismatch still hinder conversion performance. In this paper, we propose Vec-Tok-VC+, a novel prompt-based zero-shot VC model improved from Vec-Tok Codec, achieving voice conversion given only a 3s target speaker prompt. We design a residual-enhanced K-Means decoupler to enhance the semantic content extraction with a two-layer clustering process. Besides, we employ teacher-guided refinement to simulate the conversion process to eliminate the training-inference mismatch, forming a dual-mode training strategy. Furthermore, we design a multi-codebook progressive loss function to constrain the layer-wise output of the model from coarse to fine to improve speaker similarity and content accuracy. Objective and subjective evaluations demonstrate that Vec-Tok-VC+ outperforms the strong baselines in naturalness, intelligibility, and speaker similarity.

6/17/2024


SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley

We propose SelfVC, a training strategy to iteratively improve a voice conversion model with self-synthesized examples. Previous efforts on voice conversion focus on factorizing speech into explicitly disentangled representations that separately encode speaker characteristics and linguistic content. However, disentangling speech representations to capture such attributes using task-specific loss terms can lead to information loss. In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning (SSL) and speaker verification models. First, we develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model. Next, we propose a training strategy to iteratively improve the synthesis model for voice conversion, by creating a challenging training objective using self-synthesized examples. We demonstrate that incorporating such self-synthesized examples during training improves the speaker similarity of generated speech as compared to a baseline voice conversion model trained solely on heuristically perturbed inputs. Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.

5/6/2024