DreamVoice: Text-Guided Voice Conversion

Read original: arXiv:2406.16314 - Published 6/25/2024 by Jiarui Hai, Karan Thakkar, Helin Wang, Zengyi Qin, Mounya Elhilali

Overview

This paper presents DreamVoice, a text-guided voice conversion approach that converts a voice to the timbre described in a text prompt, so no recording of the target speaker is needed at inference time.
DreamVoice leverages large language models to provide text guidance, which helps it learn more expressive and natural-sounding voice conversion compared to previous methods.
The model shows strong performance on several voice conversion benchmarks, outperforming existing approaches in terms of both audio quality and similarity to the target speaker.

Plain English Explanation

DreamVoice: Text-Guided Voice Conversion is a new AI system that can change someone's voice to sound like a different person, just by providing a text description. For example, you could use DreamVoice to make your voice sound like Morgan Freeman's voice, as long as you give it the right text instructions.

Previous voice conversion systems have struggled to create natural-sounding results. DreamVoice addresses this by using large language models to guide the voice conversion process (see also the related paper "DualVC-3: Leveraging Language Model-Generated Pseudo"). This helps DreamVoice learn more expressive and lifelike ways of changing someone's voice.

The researchers show that DreamVoice outperforms other voice conversion methods on standard benchmarks. It can make the converted voice sound very similar to the target speaker while maintaining high audio quality. This is an important advancement, as related papers such as "Who is the Authentic Speaker?" and "Converting Anyone's Voice End-to-End Expressive" have highlighted how difficult it is to achieve both high similarity and high quality in voice conversion.

Technical Explanation

DreamVoice: Text-Guided Voice Conversion is a novel voice conversion model that leverages large language models to provide text-based guidance for the conversion process. The key innovation is using text prompts to steer the model towards more expressive and natural-sounding voice transformations.

The DreamVoice architecture consists of several components. First, a speaker encoder extracts speaker-specific features from the input audio. Then, a text encoder processes the text prompt and generates corresponding acoustic features. These acoustic features are used to condition the voice conversion process, helping the model learn to transform the input voice to match the target speaker's voice and prosody.
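
The text-conditioning path described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_text_encoder` and `convert` are hypothetical names, the hash-based encoder merely stands in for a learned text encoder, and concatenation is only one simple way to condition frames on a prompt-derived vector.

```python
import hashlib
from typing import List

Vector = List[float]


def toy_text_encoder(prompt: str, dim: int = 4) -> Vector:
    """Hypothetical stand-in for a learned text encoder.

    Hashes the prompt into `dim` values in [0, 1] so that the same
    prompt always yields the same conditioning vector.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def convert(frames: List[Vector], prompt: str) -> List[Vector]:
    """Condition every input frame on the prompt-derived vector.

    Conditioning is done by plain concatenation (an assumption made
    for illustration; a real model would feed the condition into a
    trained conversion network).
    """
    condition = toy_text_encoder(prompt)
    return [list(frame) + condition for frame in frames]


if __name__ == "__main__":
    source_frames = [[0.1, 0.2], [0.3, 0.4]]  # made-up acoustic frames
    conditioned = convert(source_frames, "a deep, warm male voice")
    print(len(conditioned), len(conditioned[0]))
```

The point of the sketch is the data flow: the prompt is encoded once, and the resulting vector accompanies every frame of the source utterance through the conversion step.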

The researchers evaluate DreamVoice on several voice conversion benchmarks (related work in this area includes "DualVC-3: Leveraging Language Model-Generated Pseudo" and "Who is the Authentic Speaker?"). They find that DreamVoice outperforms existing methods in terms of both audio quality and speaker similarity, as measured by subjective listening tests and objective metrics.
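
The summary does not say which objective metrics were used. One common proxy for speaker similarity is the cosine similarity between speaker embeddings of the converted and reference audio; the sketch below assumes that metric and uses made-up embedding values purely for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical speaker embeddings (values invented for this example).
    converted = [0.2, 0.7, 0.1]
    reference = [0.25, 0.65, 0.05]
    print(cosine_similarity(converted, reference))
```

Scores closer to 1 mean the two embeddings point in nearly the same direction, which is usually read as the voices sounding more alike.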

Critical Analysis

The paper presents a promising approach to text-guided voice conversion, but there are a few potential limitations and areas for further research:

The model was trained and evaluated on a relatively limited set of speakers. It would be valuable to test its performance on a more diverse range of speakers, including non-English speakers and speakers with different accents or vocal characteristics.
The paper does not provide much detail on the computational cost or inference time of the DreamVoice model. As text-guided systems (see "DreamView: Injecting View-Specific Text Guidance into") become more widely deployed, efficiency will be an important consideration.
The authors mention that the text prompts used in the experiments were manually curated. It would be interesting to explore how well the model performs with more open-ended or less-constrained text inputs, which may be more representative of real-world use cases.

Overall, the DreamVoice approach represents an important step forward in text-guided voice conversion, and the researchers have demonstrated promising results. Further research and development in this area could lead to more accessible and expressive voice transformation technologies.

Conclusion

DreamVoice: Text-Guided Voice Conversion presents a novel approach to voice conversion that leverages large language models to provide text-based guidance. This helps the model learn more expressive and natural-sounding voice transformations, outperforming previous methods on standard benchmarks.

The ability to transform someone's voice to match a target speaker solely based on a text prompt is a significant advancement in voice conversion technology. This could have a wide range of applications, from accessibility tools to creative media production. While the current version of DreamVoice has some limitations, the researchers have demonstrated the potential of this approach and laid the groundwork for further refinements and expansions of the technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DreamVoice: Text-Guided Voice Conversion

Jiarui Hai, Karan Thakkar, Helin Wang, Zengyi Qin, Mounya Elhilali

Generative voice technologies are rapidly evolving, offering opportunities for more personalized and inclusive experiences. Traditional one-shot voice conversion (VC) requires a target recording during inference, limiting ease of usage in generating desired voice timbres. Text-guided generation offers an intuitive solution to convert voices to desired DreamVoices according to the users' needs. Our paper presents two major contributions to VC technology: (1) DreamVoiceDB, a robust dataset of voice timbre annotations for 900 speakers from VCTK and LibriTTS. (2) Two text-guided VC methods: DreamVC, an end-to-end diffusion-based text-guided VC model; and DreamVG, a versatile text-to-voice generation plugin that can be combined with any one-shot VC models. The experimental results demonstrate that our proposed methods trained on the DreamVoiceDB dataset generate voice timbres accurately aligned with the text prompt and achieve high-quality VC.

6/25/2024

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao

Text-to-speech (TTS) has undergone remarkable improvements in performance, particularly with the advent of Denoising Diffusion Probabilistic Models (DDPMs). However, the perceived quality of audio depends not solely on its content, pitch, rhythm, and energy, but also on the physical environment. In this work, we propose ViT-TTS, the first visual TTS model with scalable diffusion transformers. ViT-TTS complement the phoneme sequence with the visual information to generate high-perceived audio, opening up new avenues for practical applications of AR and VR to allow a more immersive and realistic audio experience. To mitigate the data scarcity in learning visual acoustic information, we 1) introduce a self-supervised learning framework to enhance both the visual-text encoder and denoiser decoder; 2) leverage the diffusion transformer scalable in terms of parameters and capacity to learn visual scene information. Experimental results demonstrate that ViT-TTS achieves new state-of-the-art results, outperforming cascaded systems and other baselines regardless of the visibility of the scene. With low-resource data (1h, 2h, 5h), ViT-TTS achieves comparative results with rich-resource baselines. Audio samples are available at https://ViT-TTS.github.io/.

4/23/2024

StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion

Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang

StreamVoice has recently pushed the boundaries of zero-shot voice conversion (VC) in the streaming domain. It uses a streamable language model (LM) with a context-aware approach to convert semantic features from automatic speech recognition (ASR) into acoustic features with the desired speaker timbre. Despite its innovations, StreamVoice faces challenges due to its dependency on a streaming ASR within a cascaded framework, which complicates system deployment and optimization, affects VC system's design and performance based on the choice of ASR, and struggles with conversion stability when faced with low-quality semantic inputs. To overcome these limitations, we introduce StreamVoice+, an enhanced LM-based end-to-end streaming framework that operates independently of streaming ASR. StreamVoice+ integrates a semantic encoder and a connector with the original StreamVoice framework, now trained using a non-streaming ASR. This model undergoes a two-stage training process: initially, the StreamVoice backbone is pre-trained for voice conversion and the semantic encoder for robust semantic extraction. Subsequently, the system is fine-tuned end-to-end, incorporating a LoRA matrix to activate comprehensive streaming functionality. Furthermore, StreamVoice+ mainly introduces two strategic enhancements to boost conversion quality: a residual compensation mechanism in the connector to ensure effective semantic transmission and a self-refinement strategy that leverages pseudo-parallel speech pairs generated by the conversion backbone to improve speech decoupling. Experiments demonstrate that StreamVoice+ not only achieves higher naturalness and speaker similarity in voice conversion than its predecessor but also provides versatile support for both streaming and non-streaming conversion scenarios.

8/6/2024

VoiceX: A Text-To-Speech Framework for Custom Voices

Silvan Mertes, Daksitha Withanage Don, Otto Grothe, Johanna Kuch, Ruben Schlagowski, Elisabeth André

Modern TTS systems are capable of creating highly realistic and natural-sounding speech. Despite these developments, the process of customizing TTS voices remains a complex task, mostly requiring the expertise of specialists within the field. One reason for this is the utilization of deep learning models, which are characterized by their expansive, non-interpretable parameter spaces, restricting the feasibility of manual customization. In this paper, we present a novel human-in-the-loop paradigm based on an evolutionary algorithm for directly interacting with the parameter space of a neural TTS model. We integrated our approach into a user-friendly graphical user interface that allows users to efficiently create original voices. Those voices can then be used with the backbone TTS model, for which we provide a Python API. Further, we present the results of a user study exploring the capabilities of VoiceX. We show that VoiceX is an appropriate tool for creating individual, custom voices.

8/23/2024