Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

arXiv:2307.07218

Published 4/11/2024 by Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang and 3 others

eess.AS cs.SD

Abstract

Zero-shot text-to-speech (TTS) aims to synthesize voices with unseen speech prompts, which significantly reduces the data and computation requirements for voice cloning by skipping the fine-tuning process. However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) Previous works on zero-shot TTS are typically trained with single-sentence prompts, which significantly restricts their performance when the data is relatively sufficient during the inference stage. 2) The prosodic information in prompts is highly coupled with timbre, making them difficult to transfer independently of each other. This paper introduces Mega-TTS 2, a generic prompting mechanism for zero-shot TTS, to tackle the aforementioned challenges. Specifically, we design a powerful acoustic autoencoder that separately encodes the prosody and timbre information into the compressed latent space while providing high-quality reconstructions. Then, we propose a multi-reference timbre encoder and a prosody latent language model (P-LLM) to extract useful information from multi-sentence prompts. We further leverage the probabilities derived from multiple P-LLM outputs to produce transferable and controllable prosody. Experimental results demonstrate that Mega-TTS 2 can not only synthesize identity-preserving speech with a short prompt of an unseen speaker from arbitrary sources but also consistently outperforms the fine-tuning method when the volume of data ranges from 10 seconds to 5 minutes. Furthermore, our method enables the transfer of various speaking styles to the target timbre in a fine-grained and controlled manner. Audio samples can be found at https://boostprompt.github.io/boostprompt/.

Overview

This paper introduces Mega-TTS 2, a prompting mechanism for zero-shot text-to-speech (TTS) synthesis that targets two limitations of earlier prompting approaches.
Zero-shot TTS synthesizes a voice from a speech prompt of an unseen speaker without fine-tuning, which significantly reduces the data and computation required for voice cloning.
The paper focuses on two challenges: (1) models trained with single-sentence prompts cannot take full advantage of longer prompts at inference time, and (2) prosody and timbre are tightly coupled in the prompt, so neither can be transferred independently of the other.

Plain English Explanation

The paper presents a new system called Mega-TTS 2 that can generate realistic-sounding speech from text, even in the voice of a speaker it has never heard before. This matters because a voice can be cloned from a short audio prompt, without collecting extra recordings of that speaker and fine-tuning the system on them.

The key ideas behind Mega-TTS 2 are:

Separate Prosody and Timbre: The system has a clever way of separating the "rhythm and melody" (prosody) from the "voice quality" (timbre) of the speech. This allows it to independently control these two important aspects of the generated voice.
Multi-Sentence Prompts: Rather than just using a single sentence as the input prompt, Mega-TTS 2 can leverage multiple sentences. This gives the system more context to work with, leading to better performance.
Transferable Prosody: By using a "prosody language model", Mega-TTS 2 can transfer the prosody (rhythm and melody) from the prompt to the generated speech, even if the target voice sounds very different.

The end result is a system that can produce high-quality, identity-preserving speech from just a short prompt, without fine-tuning on the target speaker. This has applications in areas like voice cloning, audiobook narration, and virtual assistants.

Technical Explanation

Mega-TTS 2 is designed to address the limitations of previous zero-shot TTS approaches. The key technical components are:

Acoustic Autoencoder: Mega-TTS 2 uses a powerful acoustic autoencoder that can separately encode the prosody and timbre information into a compressed latent space, while still being able to reconstruct high-quality speech.
Multi-Reference Timbre Encoder: The system uses a multi-reference timbre encoder to extract useful information from multi-sentence prompts, rather than just relying on single-sentence inputs.
Prosody Latent Language Model (P-LLM): Mega-TTS 2 employs a prosody latent language model to capture the prosodic information in the prompts. The probabilities derived from multiple P-LLM outputs are then leveraged to produce transferable and controllable prosody.
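
To make the multi-reference idea more concrete, the sketch below shows one way information could be pooled from several prompt sentences: frames from all reference clips are concatenated along time, and each content frame attends over them. This is an illustration only. The summary does not describe the encoder's internals, so the plain dot-product attention without learned projections and the name `attend_over_references` are assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend_over_references(content_frames, reference_clips):
    """Pool timbre features for each content frame from all reference clips.

    Hypothetical helper: keys and values are the raw reference frames,
    with no learned projections (an assumption made for brevity).
    """
    # Concatenate the frames of every reference clip along the time axis.
    ref_frames = [frame for clip in reference_clips for frame in clip]
    if not ref_frames:
        raise ValueError("at least one reference frame is required")
    dim = len(ref_frames[0])
    scale = 1.0 / math.sqrt(dim)

    pooled = []
    for query in content_frames:
        # Scaled dot-product attention weights over all reference frames.
        weights = softmax([dot(query, key) * scale for key in ref_frames])
        # Weighted average of the reference frames.
        pooled.append(
            [sum(w * frame[d] for w, frame in zip(weights, ref_frames))
             for d in range(dim)]
        )
    return pooled


# Two toy reference "sentences" with 2-dimensional frames.
clips = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
content = [[2.0, 0.0], [0.0, 2.0]]
timbre = attend_over_references(content, clips)
print(len(timbre), len(timbre[0]))
```

Because the reference frames are simply concatenated, adding more prompt sentences only lengthens the sequence that is attended over; the interface stays the same whether one sentence or many are supplied.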

According to the reported experiments, Mega-TTS 2 consistently outperforms a fine-tuning baseline when the amount of target-speaker data ranges from 10 seconds to 5 minutes, and it can synthesize identity-preserving speech from a short prompt of an unseen speaker. The system also enables fine-grained control over the speaking style of the generated speech.
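
As a rough illustration of how probabilities from multiple P-LLM outputs might yield controllable prosody, the sketch below blends two next-code distributions with a weight `gamma` and samples from the result. The abstract states only that probabilities derived from multiple P-LLM outputs are used; the linear blend, the `gamma` parameter, the toy three-entry codebook, and the function names are assumptions made for this example.

```python
import random


def interpolate_distributions(p_target, p_style, gamma):
    """Blend two distributions as (1 - gamma) * p_target + gamma * p_style.

    Hypothetical helper: gamma = 0 keeps the target speaker's prosody,
    gamma = 1 follows the style prompt entirely.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be in [0, 1]")
    if len(p_target) != len(p_style):
        raise ValueError("distributions must cover the same codebook")
    mixed = [(1.0 - gamma) * a + gamma * b for a, b in zip(p_target, p_style)]
    total = sum(mixed)  # renormalize to guard against rounding drift
    return [m / total for m in mixed]


def sample_prosody_code(probs, rng):
    """Draw one prosody code index from the given distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy next-code distributions conditioned on two different prompts.
p_target = [0.7, 0.2, 0.1]
p_style = [0.1, 0.2, 0.7]
rng = random.Random(0)
for gamma in (0.0, 0.5, 1.0):
    mixed = interpolate_distributions(p_target, p_style, gamma)
    code = sample_prosody_code(mixed, rng)
    print(gamma, [round(p, 3) for p in mixed], code)
```

Sweeping `gamma` between 0 and 1 gives a continuous dial between the two prosody sources, which is one simple way to read "fine-grained and controlled" style transfer.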

Critical Analysis

The paper presents a well-designed and thorough approach to addressing the challenges in zero-shot TTS. The key strengths of Mega-TTS 2 are its ability to separately model prosody and timbre, as well as its effective use of multi-sentence prompts to improve performance.

However, the paper does not delve deeply into the limitations or potential issues with the proposed system. For example, it would be interesting to understand how Mega-TTS 2 would perform on more diverse or challenging datasets, or how it might handle regional accents or emotional speech.

Additionally, a failure analysis showing where the system struggles or produces sub-optimal output, together with a discussion of the trade-offs of the approach, would help readers assess its real-world applicability and identify areas for further research.

Conclusion

The Mega-TTS 2 system presented in this paper advances zero-shot text-to-speech synthesis. By separately modeling prosody and timbre, and by leveraging multi-sentence prompts, the system is able to generate high-quality, identity-preserving speech from just a short prompt, without fine-tuning on the target speaker.

This work has important implications for applications such as voice cloning, audiobook narration, and virtual assistants, where the ability to generate realistic speech from limited data can unlock new possibilities. As the field of zero-shot TTS continues to evolve, the innovations and insights presented in this paper will likely inform and inspire future research in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.

4/10/2024

cs.SD eess.AS

FlashSpeech: Efficient Zero-Shot Speech Synthesis

Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, Wei Xue

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in https://flashspeech.github.io/.

4/26/2024

eess.AS cs.AI cs.CL cs.LG cs.SD

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Wenbin Wang, Yang Song, Sanjay Jha

Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot or few-shot speaker-adaptive TTS approaches have been explored, they have many limitations. Zero-shot approaches tend to suffer from insufficient generalization performance to reproduce the voice of speakers with heavy accents. While few-shot methods can reproduce highly varying accents, they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches only provide either zero-shot or few-shot adaptation, constraining their utility across varied real-world scenarios with different demands. Besides, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting a vast portion of non-native speakers with diverse accents. Our proposed framework unifies both zero-shot and few-shot speaker adaptation strategies, which we term as instant and fine-grained adaptations based on their merits. To alleviate the insufficient generalization performance observed in zero-shot speaker adaptation, we designed two innovative discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce storage implications for few-shot speaker adaptation, we designed two adapters and a unique adaptation procedure.

4/30/2024

cs.SD cs.AI cs.CL

MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Xiaojiang Peng, Alexander G. Hauptmann

Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214

4/30/2024

cs.CL cs.MM