XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

2406.04904

Published 6/10/2024 by Edresson Casanova, Kelly Davis, Eren Golge, Gorkem Goknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi and 1 other

eess.AS cs.CL cs.SD

Abstract

Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual ZS-TTS, they are limited to just a few high/medium resource languages, limiting the applications of these models in most of the low/medium resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained in 16 languages and achieved state-of-the-art (SOTA) results in most of them.

Overview

The paper proposes XTTS, a massively multilingual zero-shot multi-speaker text-to-speech (ZS-TTS) model trained on 16 languages. "Zero-shot" here means the model can reproduce the voice of a speaker it never saw during training.
XTTS builds on the Tortoise model and modifies it to enable multilingual training, improve voice cloning, and speed up training and inference.
The authors report state-of-the-art results in most of the 16 languages and have made the system publicly available.

Plain English Explanation

The researchers have developed a new artificial intelligence (AI) system that can generate human-like speech in 16 languages and can imitate a new speaker's voice without being retrained on that speaker. This is known as "zero-shot" voice cloning: the model applies what it learned from many speakers to a voice it has never encountered, given only a reference recording of that voice.

The system builds on an existing TTS model called Tortoise and changes it so that a single model can be trained on many languages at once. Rather than building a separate TTS system for each language, or collecting hours of recordings from each new speaker, one model handles all supported languages and adapts to a new voice at inference time.

This massively multilingual zero-shot text-to-speech capability has important applications, like making digital assistants and text-to-speech services accessible to people around the world in their native languages. It could also aid in language preservation efforts by enabling the creation of speech samples for endangered languages.

The researchers report that XTTS achieves state-of-the-art results in most of its 16 languages, both in objective measures of speech quality and in subjective ratings by human listeners. This suggests the system is a significant advancement in the field of zero-shot text-to-speech technology.

Technical Explanation

The XTTS model builds on the Tortoise architecture and modifies it to support multilingual training, better voice cloning, and faster training and inference across 16 languages. At a high level, the core architecture consists of a text encoder, a speech decoder, and a shared acoustic embedding space that bridges the two modalities.

The text encoder is a multilingual language model that encodes text in any of the supported languages into a rich, contextual representation. This representation is then passed to the speech decoder, which generates the corresponding speech waveform. Crucially, the acoustic embedding space lets the model share knowledge about speech production across languages and speakers, which is what enables zero-shot generalization to unseen voices.
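To make the text-in, cloned-voice-out flow concrete, here is a minimal sketch of how a publicly released XTTS checkpoint can be driven from Python. It assumes the Coqui `TTS` package and its `xtts_v2` model name, which may differ from the exact model described in the paper; the helper name `clone_voice` and the file names in the usage note are hypothetical.

```python
from pathlib import Path

# Assumption: name of the publicly released XTTS v2 checkpoint in the Coqui model zoo.
XTTS_MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def clone_voice(text: str, speaker_wav: str, language: str,
                out_path: str = "output.wav") -> Path:
    """Synthesize `text` in the voice heard in `speaker_wav` and write a WAV file.

    `clone_voice` is a hypothetical helper for illustration, not part of any library.
    """
    if not text.strip():
        raise ValueError("text must not be empty")

    # Imported lazily so this module still loads when the optional package is absent.
    # Requires `pip install TTS`; the checkpoint is downloaded on first use.
    from TTS.api import TTS

    tts = TTS(XTTS_MODEL_NAME)
    # The reference recording supplies the voice; `language` selects the output language.
    tts.tts_to_file(text=text, speaker_wav=speaker_wav,
                    language=language, file_path=out_path)
    return Path(out_path)
```

A call such as `clone_voice("Hello there.", "reference.wav", "en")` would write `output.wav`, where `reference.wav` stands for a short recording of the target speaker. No retraining happens at this step, which is the practical meaning of "zero-shot" here.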

The researchers train XTTS in a multi-task fashion, using a combination of language modeling, speech synthesis, and acoustic embedding learning objectives. This enables the model to learn powerful cross-modal associations that facilitate zero-shot transfer.

XTTS is evaluated across the languages it was trained on, which include both high-resource and lower-resource languages. The authors report state-of-the-art results in most of them, in both objective metrics of speech quality and subjective human ratings. Earlier multilingual ZS-TTS systems such as YourTTS, VALL-E X, Mega-TTS 2, and Voicebox cover only a few high- and medium-resource languages, which is the gap XTTS aims to close.
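The summary does not name the objective metrics. Two that are common in zero-shot TTS evaluation are character error rate (CER) between the input text and a speech-recognition transcript of the synthesized audio, and cosine similarity between speaker embeddings of the reference and synthesized audio. The following is a minimal sketch under that assumption; the transcripts and embeddings would come from external ASR and speaker-encoder models that are not shown, and the function names are illustrative.

```python
import math
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance divided by reference length (lower is better)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings (higher means more similar)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm
```

CER tracks intelligibility (does the audio say what the text says), while speaker-embedding similarity tracks how closely the cloned voice matches the reference. Reporting both guards against a model that is intelligible but ignores the target voice, or the reverse.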

Critical Analysis

The XTTS model represents a significant advance in multilingual text-to-speech technology, but it also has some limitations and areas for further research. One potential concern is the reliance on a shared acoustic embedding space, which may not be able to capture all the nuances of speech production across vastly different languages and phonological systems.

Additionally, while the model demonstrates impressive zero-shot performance, its generation quality may still lag behind specialized, monolingual TTS systems, especially for high-resource languages. Further research is needed to fully close this performance gap and make XTTS a truly universal TTS solution.

The authors also acknowledge that their evaluation was limited to a relatively small set of languages and suggest that more comprehensive testing is required to fully assess the model's capabilities. Extending XTTS to even more languages, including endangered and extremely low-resource ones, would be a valuable direction for future work.

Overall, the XTTS model represents an important step forward in the field of zero-shot text-to-speech, with the potential to significantly improve the accessibility of speech-based technologies for people around the world.

Conclusion

The XTTS model is a multilingual zero-shot text-to-speech system that generates high-quality speech in 16 languages and can reproduce the voices of speakers it never saw during training. By building on the Tortoise model and adapting it for multilingual training, better voice cloning, and faster training and inference, the researchers have developed a publicly available system that reports state-of-the-art results in most of the languages it supports.

This massively multilingual capability has widespread implications, from enabling more accessible digital assistants and text-to-speech services to aiding in the preservation of endangered languages. While the model still has some limitations, the results demonstrate the immense potential of zero-shot text-to-speech technology to break down language barriers and bring the power of speech-based AI to people around the world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Towards Zero-Shot Text-To-Speech for Arabic Dialects

Khai Duy Doan, Abdul Waheed, Muhammad Abdul-Mageed

Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English, however, it still lags behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune the XTTS model (https://docs.coqui.ai/en/latest/models/xtts.html, https://medium.com/machine-learns/xtts-v2-new-version-of-the-open-source-text-to-speech-model-af73914db81f, https://medium.com/@erogol/xtts-v1-techincal-notes-eb83ff05bdc), an open-source architecture. We then evaluate our models on a dataset comprising 31 unseen speakers and an in-house dialectal dataset. Our automated and human evaluation results show convincing performance while capable of generating dialectal speech. Our study highlights significant potential for improvements in this emerging area of research in Arabic.

6/26/2024

cs.CL cs.SD eess.AS

Meta Learning Text-to-Speech Synthesis in over 7000 Languages

Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuel A. P. Habets, Ngoc Thang Vu

In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.

6/11/2024

cs.CL cs.LG cs.SD eess.AS

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.

6/27/2024

eess.AS cs.CL cs.LG cs.SD

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.

4/10/2024

cs.SD eess.AS