ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

2406.01205

Published 6/4/2024 by Shengpeng Ji, Jialong Zuo, Minghui Fang, Siqi Zheng, Qian Chen, Wen Wang, Ziyue Jiang, Hai Huang, Xize Cheng, Rongjie Huang and 1 other

eess.AS cs.LG cs.SD

Abstract

In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual style description prompt. Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and adjustment capabilities or were unrelated to speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging new task: a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture corresponding codec representations in a discrete decoupling codec space. Moreover, we discovered the issue of text style controllability in a many-to-many mapping fashion and proposed the Style Mixture Semantic Density (SMSD) model to resolve this problem. The SMSD module, which is based on Gaussian mixture density networks, is designed to enhance the fine-grained partitioning and sampling capabilities of style semantic information and generate speech with more diverse styles. In terms of experiments, we make available a controllable model toolkit called ControlToolkit with a new style controllable dataset and some replicated baseline models, and propose new metrics to evaluate both the control capability and the quality of generated audio in ControlSpeech. The relevant ablation studies validate the necessity of each component in ControlSpeech. We hope that ControlSpeech can establish the next foundation paradigm of controllable speech synthesis. The relevant code and demo are available at https://github.com/jishengpeng/ControlSpeech.
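
The abstract describes the SMSD module as based on Gaussian mixture density networks, used to sample more diverse styles from one style description. The sketch below is not the paper's implementation. It only illustrates the generic mixture-density idea in plain Python: a network would output mixture logits, means, and log standard deviations, and a style vector is drawn by first picking a component and then sampling from its diagonal Gaussian. All function names, parameter names, and the diagonal-covariance choice are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized mixture logits into mixture weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_style_vector(mix_logits, means, log_stds, rng):
    """Draw one (hypothetical) style vector from a diagonal Gaussian mixture.

    Returns the chosen component index and the sampled vector.
    """
    weights = softmax(mix_logits)
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    vector = [rng.gauss(mu, math.exp(ls)) for mu, ls in zip(means[k], log_stds[k])]
    return k, vector


def mixture_nll(target, mix_logits, means, log_stds):
    """Negative log-likelihood of `target` under the mixture.

    This is the usual training loss for a mixture density network.
    """
    weights = softmax(mix_logits)
    log_terms = []
    for w, mu, ls in zip(weights, means, log_stds):
        log_p = math.log(w)
        for t, m, s in zip(target, mu, ls):
            # log N(t; m, sigma) with s = log(sigma)
            log_p += -0.5 * math.log(2 * math.pi) - s - 0.5 * ((t - m) / math.exp(s)) ** 2
        log_terms.append(log_p)
    # log-sum-exp over components for numerical stability
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(v - top) for v in log_terms)))
```

Calling `sample_style_vector` repeatedly with the same parameters yields different vectors, and sometimes different components, which is the sense in which a mixture density output can map one description to many plausible styles.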

Overview

This paper introduces ControlSpeech, a novel approach for simultaneous zero-shot speaker cloning and zero-shot language style control using a decoupled codec.
ControlSpeech aims to enable users to adjust the speaker's voice and language style independently, without requiring any additional training data or fine-tuning.
The proposed method decouples the speaker information and language style into separate latent representations, allowing for independent control over these two aspects of speech synthesis.

Plain English Explanation

ControlSpeech is a new technology that allows you to change both the speaker's voice and the way they speak, without needing any extra training data or adjustments. Usually, if you want to change the voice or the speaking style, you'd have to provide more examples or do additional training. But ControlSpeech separates the speaker's voice and the language style into different parts, so you can adjust them independently.

For example, you could take a recording of someone speaking in a formal style, and then use ControlSpeech to make that same person sound more casual or conversational, while still keeping their original voice. Or you could keep the same speaking style but render it in a different person's voice. This kind of independent control over voice and style is relevant to related lines of work such as MobileSpeech, zero-shot text-to-speech, zero-shot speech editing, and efficient zero-shot speech synthesis.

Technical Explanation

The key innovation in ControlSpeech is the use of a decoupled codec architecture, which separates the speaker information and language style into distinct latent representations. This allows for independent control over these two aspects of the generated speech.

The ControlSpeech model is composed of three main components:

Speaker Encoder: Encodes the speaker identity (timbre) from the speech prompt.
Language Style Encoder: Encodes the speaking style from the textual style description prompt.
Speech Decoder: Generates the output speech by combining the speaker and language style representations with the text content.
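
To make the separation of inputs concrete, here is a toy sketch of the three-component layout listed above. It is not the ControlSpeech model: every function is a hypothetical stand-in (mean pooling for the speaker encoder, a bucketed word count for the style encoder, concatenation for the decoder), and, following the abstract, the style input is assumed to be a text description. The only point is that the speaker and style representations occupy separate slots, so one can change without touching the other.

```python
from typing import List

Vector = List[float]


def encode_speaker(speech_frames: List[Vector]) -> Vector:
    """Toy speaker encoder: mean-pool frame features into one timbre vector."""
    n = len(speech_frames)
    dim = len(speech_frames[0])
    return [sum(frame[i] for frame in speech_frames) / n for i in range(dim)]


def encode_style(description: str, dim: int = 4) -> Vector:
    """Toy style encoder: count words into `dim` buckets (a stand-in for a text encoder)."""
    vec = [0.0] * dim
    for word in description.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec


def decode(content_tokens: List[int], speaker: Vector, style: Vector) -> List[Vector]:
    """Toy decoder: each output frame is [content token, speaker slot, style slot]."""
    return [[float(tok)] + speaker + style for tok in content_tokens]


def synthesize(content_tokens: List[int], speech_frames: List[Vector], description: str) -> List[Vector]:
    """Run the three hypothetical components end to end."""
    speaker = encode_speaker(speech_frames)
    style = encode_style(description)
    return decode(content_tokens, speaker, style)
```

Calling `synthesize` twice with the same speech frames but different descriptions changes only the style slot of each output frame, while the speaker slot stays identical. A real system would learn this separation rather than get it by construction.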

During training, the model learns to disentangle the speaker and language style information, enabling the independent manipulation of these factors at inference time. The authors demonstrate the effectiveness of ControlSpeech through experiments on various datasets, showing its ability to perform simultaneous zero-shot speaker cloning and zero-shot language style control.
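
For inference, the abstract states that ControlSpeech relies on mask-based parallel decoding over codec tokens. The sketch below shows one common recipe for that family of decoders: confidence-based iterative unmasking with a cosine schedule. Whether ControlSpeech uses this exact schedule is an assumption, and `predict` is a hypothetical callback standing in for the model. It receives the partially filled token list and returns a `(token, confidence)` pair for every position.

```python
import math
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a not-yet-decoded position

Proposal = Tuple[int, float]  # (predicted token, confidence)


def parallel_mask_decode(
    length: int,
    predict: Callable[[List[Optional[int]]], List[Proposal]],
    steps: int,
) -> List[Optional[int]]:
    """Fill `length` token slots in at most `steps` parallel passes."""
    tokens: List[Optional[int]] = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        # One forward pass proposes a token for every position at once.
        proposals = predict(tokens)
        # Cosine schedule: how many positions should stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident proposals; the rest stay masked for the next pass.
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = proposals[i][0]
    return tokens
```

Because the schedule reaches zero at the last step, every position is committed after `steps` passes, so the number of model calls is fixed and independent of sequence length. That is the usual argument for parallel decoding over token-by-token autoregressive generation.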

The Controllable Prosody Generation from Partial Inputs approach is also related, as it explores controlling the prosody of synthetic speech using partial input information.

Critical Analysis

The ControlSpeech paper presents a promising approach for simultaneous zero-shot speaker cloning and zero-shot language style control. The decoupled codec architecture is a clever way to disentangle these two important aspects of speech synthesis, which could have significant practical applications.

However, the paper does not address some potential limitations and areas for further research. For example, it's unclear how well ControlSpeech would perform on more diverse or challenging datasets, or how it would handle cases where the speaker identity and language style are more closely intertwined.

Additionally, the paper does not delve into potential ethical considerations, such as the implications of enabling users to easily manipulate the voice and speaking style of others, or the potential for misuse in creating "deepfake" audio. These are important issues that should be carefully considered as the technology advances.

Overall, ControlSpeech represents an interesting and potentially impactful contribution to the field of speech synthesis. However, further research and thoughtful discussion around the technology's limitations and ethical implications will be crucial as it continues to develop.

Conclusion

The ControlSpeech paper introduces a novel approach for simultaneous zero-shot speaker cloning and zero-shot language style control using a decoupled codec architecture. This technology could enable a wide range of applications, from personalized text-to-speech systems to enhanced voice editing capabilities.

By separating the speaker identity and language style into distinct latent representations, ControlSpeech allows for the independent manipulation of these factors, which is a significant advancement in the field of speech synthesis. The authors demonstrate the effectiveness of their approach through extensive experiments, showcasing the potential of this technology to transform how we interact with and customize synthetic speech.

As with any emerging technology, it will be important to carefully consider the ethical implications of ControlSpeech and similar tools, particularly in terms of the potential for misuse or unintended consequences. Ongoing research and thoughtful discussion will be crucial to ensure that these advancements are developed and deployed responsibly, with a focus on benefiting society as a whole.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.

4/10/2024

cs.SD eess.AS

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

Shengpeng Ji, Ziyue Jiang, Hanting Wang, Jialong Zuo, Zhou Zhao

Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice cloning capabilities, requiring only a few seconds of unseen speaker voice prompts. However, all previous work has been developed for cloud-based systems. Taking autoregressive models as an example, although these approaches achieve high-fidelity voice cloning, they fall short in terms of inference speed, model size, and robustness. Therefore, we propose MobileSpeech, which is a fast, lightweight, and robust zero-shot text-to-speech system based on mobile devices for the first time. Specifically: 1) leveraging discrete codec, we design a parallel speech mask decoder module called SMD, which incorporates hierarchical information from the speech codec and weight mechanisms across different codec layers during the generation process. Moreover, to bridge the gap between text and speech, we introduce a high-level probabilistic mask that simulates the progression of information flow from less to more during speech generation. 2) For speaker prompts, we extract fine-grained prompt duration from the prompt speech and incorporate text, prompt speech by cross attention in SMD. We demonstrate the effectiveness of MobileSpeech on multilingual datasets at different levels, achieving state-of-the-art results in terms of generating speed and speech quality. MobileSpeech achieves RTF of 0.09 on a single A100 GPU and we have successfully deployed MobileSpeech on mobile devices. Audio samples are available at https://mobilespeech.github.io/.

6/4/2024

eess.AS cs.SD

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.

6/27/2024

eess.AS cs.CL cs.LG cs.SD

VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech

Ashishkumar Gudmalwar, Nirmesh Shah, Sai Akarsh, Pankaj Wasnik, Rajiv Ratn Shah

Despite the significant advancements in Text-to-Speech (TTS) systems, their full utilization in automatic dubbing remains limited. This task necessitates the extraction of voice identity and emotional style from a reference speech in a source language and subsequently transferring them to a target language using cross-lingual TTS techniques. While previous approaches have mainly concentrated on controlling voice identity within the cross-lingual TTS framework, there has been limited work on incorporating emotion and voice identity together. To this end, we introduce an end-to-end Voice Identity and Emotional Style Controllable Cross-Lingual (VECL) TTS system using multilingual speakers and an emotion embedding network. Moreover, we introduce content and style consistency losses to enhance the quality of synthesized speech further. The proposed system achieved an average relative improvement of 8.83% compared to the state-of-the-art (SOTA) methods on a database comprising English and three Indian languages (Hindi, Telugu, and Marathi).

6/13/2024

eess.AS cs.SD