CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Read original: arXiv:2407.05407 - Published 7/10/2024 by Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma and 2 others

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Overview

CosyVoice is a scalable, multilingual text-to-speech (TTS) synthesizer that can generate high-quality speech in many languages without needing to train a model for each language.
It uses a novel approach called "Supervised Semantic Tokens" to enable zero-shot learning, allowing the model to generate speech in languages it hasn't been explicitly trained on.
The paper introduces the CosyVoice architecture and demonstrates its ability to achieve state-of-the-art performance on several multilingual TTS benchmarks.

Plain English Explanation

CosyVoice is an advanced text-to-speech system that can generate natural-sounding speech in many different languages, even if it hasn't been trained on that language before. This is a significant advancement, as traditional TTS models need to be trained separately for each language, which can be time-consuming and resource-intensive.

The key innovation in CosyVoice is the use of "Supervised Semantic Tokens." These are a set of language-agnostic labels that the model learns to associate with specific speech sounds and linguistic concepts. By training the model to generate these tokens, it can then use that knowledge to produce speech in new languages it hasn't been exposed to before.

This "zero-shot" learning approach allows CosyVoice to be highly scalable and efficient, as the model doesn't need to be retrained from scratch for each new language. Instead, it can leverage its existing understanding of the Supervised Semantic Tokens to generate high-quality speech in a wide variety of languages.

The paper demonstrates that CosyVoice outperforms other state-of-the-art multilingual TTS systems on several benchmarks, making it a promising solution for applications that require robust, multilingual speech synthesis, such as virtual assistants, language learning tools, or audiobook narration.

Technical Explanation

The CosyVoice architecture [1] is built upon the success of previous work in high-fidelity text-to-speech synthesis, such as CLAM-TTS and MobileSpeech. However, it introduces a novel approach to enable zero-shot learning for multilingual TTS.

At the core of CosyVoice is a set of "Supervised Semantic Tokens" that the model is trained to generate. These tokens represent language-agnostic concepts, such as phonemes, prosody, and other linguistic features. By learning to associate these tokens with the corresponding speech characteristics, the model can then use this knowledge to generate speech in new languages it hasn't been trained on.

The CosyVoice architecture includes several key components:

Acoustic Model: Responsible for generating the raw audio waveform from the Supervised Semantic Tokens.
Language Model: Learns the relationship between text input and the Supervised Semantic Tokens.
Token Predictor: Predicts the appropriate Supervised Semantic Tokens given the text input.

By training this architecture end-to-end, CosyVoice is able to achieve state-of-the-art performance on several multilingual TTS benchmarks, including XTTS and Improving Language Model-based Zero-shot Text-to-Speech.

Critical Analysis

The CosyVoice paper presents a promising approach to multilingual TTS, but it also has some limitations and potential areas for further research:

The paper focuses on evaluating CosyVoice on relatively high-resource languages, such as English, Mandarin, and Spanish. It would be interesting to see how the model performs on low-resource languages, which often present greater challenges for TTS.
The authors mention that the Supervised Semantic Tokens are manually defined, which could limit the model's ability to learn more nuanced linguistic concepts. Exploring ways to learn these tokens in a more unsupervised or self-supervised manner could be an area for future work.
While CosyVoice demonstrates impressive zero-shot performance, the authors note that fine-tuning the model on a small amount of target-language data can still provide significant improvements. Understanding the trade-offs between zero-shot and fine-tuned performance could inform the practical deployment of the system.
The paper does not address potential biases or ethical considerations that may arise from a highly scalable, multilingual TTS system. Exploring these issues could be an important area for further research.

Conclusion

The CosyVoice paper presents a novel and highly scalable approach to multilingual text-to-speech synthesis, using Supervised Semantic Tokens to enable zero-shot learning. By decoupling the linguistic and acoustic components of speech, CosyVoice can generate high-quality speech in a wide range of languages without the need for extensive retraining.

This breakthrough has the potential to significantly improve the accessibility and usability of TTS technology, allowing for more natural and inclusive voice interfaces in virtual assistants, language learning tools, audiobooks, and other applications. As the field of multilingual speech synthesis continues to advance, the insights and techniques introduced in the CosyVoice paper may serve as a foundation for further innovations in this important area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models.

7/10/2024

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Cheng Gong, Xin Wang, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang, Korin Richmond, Junichi Yamagishi

Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. TTS systems are typically built using a single speaker's voices, but there is growing interest in developing systems that can synthesize voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework utilizing quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. Our proposed model has zero-shot generalization ability not only for unseen speakers but also for unseen languages. We have conducted comprehensive subjective and objective evaluations through a series of experiments. Our model has proven effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the efficiency of our method on two hypothetically low-resource languages. The results are promising, indicating that our proposed approach can synthesize audio that is intelligible and has a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.

8/28/2024

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.

4/10/2024

High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

Joun Yeop Lee, Myeonghun Jeong, Minchan Kim, Ji-Hyun Lee, Hoon-Young Cho, Nam Soo Kim

We propose a novel two-stage text-to-speech (TTS) framework with two types of discrete tokens, i.e., semantic and acoustic tokens, for high-fidelity speech synthesis. It features two core components: the Interpreting module, which processes text and a speech prompt into semantic tokens focusing on linguistic contents and alignment, and the Speaking module, which captures the timbre of the target voice to generate acoustic tokens from semantic tokens, enriching speech reconstruction. The Interpreting stage employs a transducer for its robustness in aligning text to speech. In contrast, the Speaking stage utilizes a Conformer-based architecture integrated with a Grouped Masked Language Model (G-MLM) to boost computational efficiency. Our experiments verify that this innovative structure surpasses the conventional models in the zero-shot scenario in terms of speech quality and speaker similarity.

6/26/2024