USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

2404.18094

Published 4/30/2024 by Wenbin Wang, Yang Song, Sanjay Jha

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Abstract

Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot or few-shot speaker-adaptive TTS approaches have been explored, they have many limitations. Zero-shot approaches tend to suffer from insufficient generalization performance to reproduce the voice of speakers with heavy accents. While few-shot methods can reproduce highly varying accents, they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches only provide either zero-shot or few-shot adaptation, constraining their utility across varied real-world scenarios with different demands. Besides, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting a vast portion of non-native speakers with diverse accents. Our proposed framework unifies both zero-shot and few-shot speaker adaptation strategies, which we term as instant and fine-grained adaptations based on their merits. To alleviate the insufficient generalization performance observed in zero-shot speaker adaptation, we designed two innovative discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce storage implications for few-shot speaker adaptation, we designed two adapters and a unique adaptation procedure.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper introduces USAT, a universal speaker-adaptive text-to-speech (TTS) approach that can generate high-quality speech for any target speaker using only a few audio samples.
USAT leverages a speaker-adaptive model and a speaker embedding network to enable zero-shot and few-shot learning, allowing it to adapt to new speakers with limited data.
The paper demonstrates USAT's ability to outperform existing TTS models in terms of speech quality and speaker similarity, particularly for unseen speakers.

Plain English Explanation

USAT is a new text-to-speech (TTS) system that can generate high-quality speech for any speaker, even if it hasn't been trained on that speaker's voice before. Most TTS systems require lots of audio samples from a speaker to learn their voice. But USAT uses a clever approach to adapt to new speakers using just a few audio clips.

The key idea is that USAT has two main components: a speaker-adaptive model that can generate speech, and a speaker embedding network that learns a compact representation of each speaker's voice. By combining these two parts, USAT can quickly adapt to a new speaker's voice using only a small amount of data. This allows USAT to generate natural-sounding speech for any speaker, even ones it has never heard before.

The researchers show that USAT outperforms other TTS systems, especially when it comes to mimicking the voice of speakers it hasn't been trained on. This is a significant advance, as it means USAT can be used to create custom voices for a wide range of applications, from audiobooks to virtual assistants, without requiring large datasets of training data.

Technical Explanation

The core of USAT is a speaker-adaptive TTS model that can generate high-quality speech for any target speaker. This model is built on a Transformer-based TTS architecture, with a dedicated speaker embedding network that learns a compact representation of each speaker's voice.

During training, the speaker embedding network is used to condition the TTS model on the target speaker's voice. This allows the model to learn to generate speech that matches the speaker's acoustic characteristics, such as their pitch, timbre, and speaking style.

To adapt USAT to a new speaker, the researchers first train the speaker embedding network on a small amount of audio data from the target speaker. This speaker-specific embedding is then used to fine-tune the TTS model, enabling it to generate speech that closely matches the target speaker's voice.

Experiments on several TTS benchmarks show that USAT achieves state-of-the-art performance, outperforming previous zero-shot and few-shot TTS approaches in terms of both speech quality and speaker similarity. USAT is particularly effective for generating speech for unseen speakers, demonstrating its strong speaker-adaptive capabilities.

Critical Analysis

The authors acknowledge that USAT's performance may be limited by the quality and diversity of the training data used to build the initial TTS and speaker embedding models. Larger and more diverse datasets could potentially further improve USAT's ability to adapt to new speakers.

Additionally, the paper does not extensively explore the potential biases or fairness implications of USAT, which is an important consideration for any AI system that generates human-like speech. Further research is needed to ensure USAT can be deployed equitably across different demographics and use cases.

Overall, USAT represents a significant advancement in speaker-adaptive TTS, with the potential to enable a wide range of applications that require personalized, high-quality synthetic speech. However, as with any new technology, there are still important challenges and considerations that should be carefully addressed as the research progresses.

Conclusion

The USAT approach introduces a novel speaker-adaptive TTS system that can generate high-quality speech for any target speaker using only a few audio samples. By combining a speaker-adaptive TTS model with a speaker embedding network, USAT achieves state-of-the-art performance in both speech quality and speaker similarity, particularly for unseen speakers.

This breakthrough in zero-shot and few-shot TTS has important implications for applications that require customized synthetic voices, such as audiobooks, virtual assistants, and accessibility tools. USAT's ability to adapt quickly to new speakers with limited data could make it a valuable tool for creating personalized and inclusive voice experiences.

As the field of TTS continues to evolve, research like USAT will be crucial for expanding the capabilities and accessibility of synthetic speech technologies. By addressing the challenge of speaker adaptation, USAT represents a significant step forward in making high-quality, personalized text-to-speech a reality for a wide range of users and use cases.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.

4/10/2024

cs.SD eess.AS

HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria

Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While developing TTS architectures that train and test on the same set of speakers has seen significant improvements, out-of-domain speaker performance still faces enormous limitations. Domain adaptation on a new set of speakers can be achieved by fine-tuning the whole model for each new domain, thus making it parameter-inefficient. This problem can be solved by Adapters that provide a parameter-efficient alternative to domain adaptation. Although famous in NLP, speech synthesis has not seen much improvement from Adapters. In this work, we present HyperTTS, which comprises a small learnable network, hypernetwork, that generates parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and making them dynamic. Extensive evaluations of two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS, comparing them with baselines in different studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.

4/9/2024

cs.CL cs.LG cs.SD eess.AS

🗣️

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao

Zero-shot text-to-speech (TTS) aims to synthesize voices with unseen speech prompts, which significantly reduces the data and computation requirements for voice cloning by skipping the fine-tuning process. However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) previous works of zero-shot TTS are typically trained with single-sentence prompts, which significantly restricts their performance when the data is relatively sufficient during the inference stage. 2) The prosodic information in prompts is highly coupled with timbre, making it untransferable to each other. This paper introduces Mega-TTS 2, a generic prompting mechanism for zero-shot TTS, to tackle the aforementioned challenges. Specifically, we design a powerful acoustic autoencoder that separately encodes the prosody and timbre information into the compressed latent space while providing high-quality reconstructions. Then, we propose a multi-reference timbre encoder and a prosody latent language model (P-LLM) to extract useful information from multi-sentence prompts. We further leverage the probabilities derived from multiple P-LLM outputs to produce transferable and controllable prosody. Experimental results demonstrate that Mega-TTS 2 could not only synthesize identity-preserving speech with a short prompt of an unseen speaker from arbitrary sources but consistently outperform the fine-tuning method when the volume of data ranges from 10 seconds to 5 minutes. Furthermore, our method enables to transfer various speaking styles to the target timbre in a fine-grained and controlled manner. Audio samples can be found in https://boostprompt.github.io/boostprompt/.

4/11/2024

eess.AS cs.SD

🗣️

FlashSpeech: Efficient Zero-Shot Speech Synthesis

Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, Wei Xue

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found in https://flashspeech.github.io/.

4/26/2024

eess.AS cs.AI cs.CL cs.LG cs.SD