Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline

Read original: arXiv:2406.04494 - Published 6/10/2024 by Ali N. Salman, Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen, Carlos Busso, Berrak Sisman

Overview

This paper introduces the NaturalVoices dataset and an automatic processing pipeline for naturalistic voice conversion.
The goal is to enable more realistic and engaging voice conversion by modeling natural speech variations.
The dataset comprises over 3,800 hours of spontaneous, expressive, and emotional speech sourced from the podcasts in the MSP-Podcast dataset, with automatic annotations such as emotion and signal-to-noise ratio (SNR).
The authors also propose an end-to-end voice conversion model that can transfer natural speaking characteristics from a source to a target speaker.

Plain English Explanation

The researchers behind this paper wanted to make voice conversion technology more realistic and lifelike. Voice conversion is the process of modifying one person's voice to sound like someone else's. This is useful for applications like text-to-speech, dubbing, and virtual assistants.

However, most existing voice conversion systems are built on scripted or acted speech, so they struggle to capture the natural variations in human speech, such as changes in pitch, tone, and speaking style. The researchers created the NaturalVoices dataset to address this challenge. It contains over 3,800 hours of spontaneous, emotionally expressive speech drawn from real podcasts, along with automatically generated annotations of the acoustic and linguistic features of the recordings.

Using this dataset, the researchers developed a new end-to-end voice conversion model that can transfer the natural speaking characteristics of a source speaker to a target speaker. This means the converted voice should sound more lifelike and expressive, rather than robotic or monotonous.

The goal is to make voice conversion technology more engaging and realistic, which could improve user experiences in applications like virtual assistants, audiobook narration, and language dubbing.

Technical Explanation

The NaturalVoices dataset contains over 3,800 hours of spontaneous, expressive, and emotional speech sourced from the original podcasts in the MSP-Podcast dataset, rather than scripted or acted recordings. An automatic pipeline built on recent deep learning methods annotates the raw podcast audio with acoustic and linguistic features, including emotion, signal-to-noise ratio (SNR), pitch, energy, speaking rate, and phoneme-level alignments.
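To make two of the acoustic annotations mentioned above more concrete, here is a minimal sketch of how frame-level energy and pitch could be computed from a waveform. It is not the authors' pipeline, which relies on deep learning methods. It uses a classical RMS energy measure and a plain autocorrelation pitch estimate, and every function name, frame size, and frequency bound below is an assumption chosen for illustration.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames (drops the ragged tail)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def rms_energy_db(frame, eps=1e-12):
    """Mean-square energy of one frame, expressed in decibels."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_sq + eps)


def autocorr_pitch_hz(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate F0 as the lag with the highest positive autocorrelation.

    Returns 0.0 when no lag in the search range has a positive score,
    which this sketch treats as "unvoiced".
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def synth_tone(freq_hz, sample_rate, duration_s):
    """Hypothetical helper: a unit-amplitude sine wave used as test input."""
    n = int(sample_rate * duration_s)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


if __name__ == "__main__":
    sr = 16000
    tone = synth_tone(200.0, sr, 0.2)
    # 40 ms frames with a 10 ms hop, a common (assumed) analysis setting.
    for frame in frame_signal(tone, frame_len=640, hop=160)[:3]:
        print(round(rms_energy_db(frame), 2), round(autocorr_pitch_hz(frame, sr), 1))
```

On real podcast audio, a plain autocorrelation estimate like this one is prone to octave errors and is sensitive to noise, which is one reason a production pipeline would more likely use a dedicated pitch tracker.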

The authors propose an end-to-end voice conversion model that can transfer the natural speaking characteristics of a source speaker to a target speaker. The model consists of an encoder that extracts linguistic and prosodic features from the source speech, and a decoder that generates the target speech while preserving the natural variations of the source.

The authors evaluate their approach on several voice conversion benchmarks, demonstrating improvements in both objective and subjective measures of speech quality and similarity to the target speaker. They also show that their model can generalize to unseen speakers and handle diverse speaking styles.

Critical Analysis

The authors acknowledge several limitations of their approach, such as the need for high-quality source recordings and the potential for gender mismatch between source and target speakers to degrade performance. They also note that the NaturalVoices dataset, while more diverse than previous voice conversion datasets, may still not capture the full range of natural speech variations.

Additionally, the benchmark datasets used to evaluate voice conversion systems may not fully reflect real-world applications, where factors like background noise, accents, and speaking styles can introduce additional challenges.

Further research is needed to address these limitations and improve the robustness and generalization of naturalistic voice conversion models. Exploring self-supervised or zero-shot learning techniques, as well as incorporating more diverse and challenging datasets, could be promising directions for future work.

Conclusion

This paper presents an important step towards more naturalistic and engaging voice conversion technology. By creating the NaturalVoices dataset and developing an end-to-end voice conversion model that can capture natural speech variations, the authors have made significant progress in addressing a key limitation of existing voice conversion systems.

The potential applications of this research are wide-ranging, from improving the user experience of virtual assistants to enhancing the accessibility and quality of audiobook narration and language dubbing. As voice conversion technology continues to advance, it will play an increasingly important role in shaping the way we interact with and consume digital media.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline

Ali N. Salman, Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen, Carlos Busso, Berrak Sisman

Voice conversion (VC) research traditionally depends on scripted or acted speech, which lacks the natural spontaneity of real-life conversations. While natural speech data is limited for VC, our study focuses on filling in this gap. We introduce a novel data-sourcing pipeline that makes the release of a natural speech dataset for VC, named NaturalVoices. The pipeline extracts rich information in speech such as emotion and signal-to-noise ratio (SNR) from raw podcast data, utilizing recent deep learning methods and providing flexibility and ease of use. NaturalVoices marks a large-scale, spontaneous, expressive, and emotional speech dataset, comprising over 3,800 hours speech sourced from the original podcasts in the MSP-Podcast dataset. Objective and subjective evaluations demonstrate the effectiveness of using our pipeline for providing natural and expressive data for VC, suggesting the potential of NaturalVoices for broader speech generation tasks.

6/10/2024

Who is Authentic Speaker

Qiang Huang

Voice conversion (VC) using deep learning technologies can now generate high quality one-to-many voices and thus has been used in some practical application fields, such as entertainment and healthcare. However, voice conversion can pose potential social issues when manipulated voices are employed for deceptive purposes. Moreover, it is a big challenge to find who are real speakers from the converted voices as the acoustic characteristics of source speakers are changed greatly. In this paper we attempt to explore the feasibility of identifying authentic speakers from converted voices. This study is conducted with the assumption that certain information from the source speakers persists, even when their voices undergo conversion into different target voices. Therefore our experiments are geared towards recognising the source speakers given the converted voices, which are generated by using FragmentVC on the randomly paired utterances from source and target speakers. To improve the robustness against converted voices, our recognition model is constructed by using hierarchical vector of locally aggregated descriptors (VLAD) in deep neural networks. The authentic speaker recognition system is mainly tested in two aspects, including the impact of quality of converted voices and the variations of VLAD. The dataset used in this work is VCTK corpus, where source and target speakers are randomly paired. The results obtained on the converted utterances show promising performances in recognising authentic speakers from converted voices.

5/2/2024

LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech

Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim

Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner's Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. Fine-tuning whisper-small.en with LearnerVoice achieves a WER of 10.26%, 44.2% lower than vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of errors from the vanilla model on LearnerVoice are attributable to L2S features, with 48.1% of them being reduced in the fine-tuned model.

7/8/2024

Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

We propose noise-robust voice conversion (VC) which takes into account the recording quality and environment of noisy source speech. Conventional denoising training improves the noise robustness of a VC model by learning noisy-to-clean VC process. However, the naturalness of the converted speech is limited when the noise of the source speech is unseen during the training. To this end, our proposed training conditions a VC model on two latent variables representing the recording quality and environment of the source speech. These latent variables are derived from deep neural networks pre-trained on recording quality assessment and acoustic scene classification and calculated in an utterance-wise or frame-wise manner. As a result, the trained VC model can explicitly learn information about speech degradation during the training. Objective and subjective evaluations show that our training improves the quality of the converted speech compared to the conventional training.

6/12/2024