Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Read original: arXiv:2406.02430 - Published 6/5/2024 by Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao and 36 others

Overview

This paper introduces Seed-TTS, a family of high-quality and versatile speech generation models.
Seed-TTS can generate diverse and expressive speech from text, with the ability to control various aspects of the output, such as speaking style, emotion, and audio quality.
The models are built on top of a novel text encoding approach and leverage large-scale speech data for training, resulting in state-of-the-art speech quality and versatility.

Plain English Explanation

The paper presents a new set of speech generation models called Seed-TTS. These models take text as input and generate high-quality, expressive speech as output, with control over aspects of the generated speech such as speaking style, emotion, and audio quality.

The key innovation behind Seed-TTS is a novel approach to encoding the text input. This allows the models to better capture the nuances and complexities of human speech. By leveraging large-scale speech data during training, Seed-TTS is able to achieve state-of-the-art performance in terms of speech quality and versatility.

In other words, Seed-TTS can generate speech that sounds natural and human-like, and it can be customized to match different speaking styles, emotional tones, and audio characteristics. That versatility places it alongside related work such as MobileSpeech, SimpleSpeech, Phonetic-Enhanced Language Modeling, NaturalSpeech, and Evaluating Text-to-Speech Synthesis.

Technical Explanation

The core innovation of Seed-TTS is a novel text encoding approach that allows the models to better capture the nuances and complexities of human speech. This text encoding is combined with large-scale speech data to train the Seed-TTS models, resulting in state-of-the-art performance in terms of speech quality and versatility.

The Seed-TTS architecture consists of several key components:

Text Encoder: This module takes the input text and encodes it into a rich, high-dimensional representation that captures the semantic and phonetic information.
Audio Decoder: This component generates the actual speech audio based on the encoded text representation, using a powerful neural network-based synthesis model.
Conditioning Modules: These allow for the customization of the generated speech, enabling control over aspects like speaking style, emotion, and audio quality.
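
The three components above can be pictured as a simple data flow. The following is a toy Python sketch under stated assumptions, not the authors' implementation: every name in it (`Conditioning`, `encode_text`, `decode_audio`, `synthesize`) is hypothetical, and the "encoder" and "decoder" are placeholder arithmetic standing in for neural networks. It only shows where text, conditioning, and audio generation connect.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Conditioning:
    """Hypothetical control knobs (the summary names style, emotion, audio quality)."""
    style: str = "neutral"
    emotion: str = "neutral"
    sample_rate: int = 16000  # assumed stand-in for "audio quality"; unused below


def encode_text(text: str, dim: int = 4) -> List[List[float]]:
    """Placeholder text encoder: one fixed-size vector per character.

    A real encoder would be a learned network; this only derives
    deterministic numbers from character codes.
    """
    return [[((ord(ch) * (i + 1)) % 97) / 97.0 for i in range(dim)] for ch in text]


def decode_audio(encoded: List[List[float]], cond: Conditioning,
                 samples_per_token: int = 8) -> List[float]:
    """Placeholder audio decoder: emits a fixed number of samples per token.

    The emotion label only scales amplitude here, to show where
    conditioning would enter a real decoder.
    """
    gain = {"neutral": 1.0, "happy": 1.2, "sad": 0.8}.get(cond.emotion, 1.0)
    audio: List[float] = []
    for vec in encoded:
        level = gain * sum(vec) / len(vec)
        audio.extend([level] * samples_per_token)
    return audio


def synthesize(text: str, cond: Conditioning) -> List[float]:
    """Text -> encoded representation -> conditioned audio samples."""
    return decode_audio(encode_text(text), cond)


waveform = synthesize("hello", Conditioning(emotion="happy"))
print(len(waveform))  # 5 characters * 8 samples per token = 40
```

Keeping the conditioning in its own object, separate from the text, mirrors the summary's description of control as a distinct module: the same text can be re-synthesized with a different `Conditioning` without re-encoding it.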

The authors demonstrate the effectiveness of Seed-TTS through extensive experiments, comparing it to other leading text-to-speech models. The results show that Seed-TTS outperforms the competition on a range of objective and subjective metrics, while also offering greater versatility and control over the output.
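
The evaluation above refers to objective metrics without defining them. Two that are common in TTS evaluation are word error rate (WER), computed on a speech recognizer's transcript of the synthesized audio, and cosine similarity between speaker embeddings; the paper's abstract does mention speaker similarity. The sketch below assumes these two metrics and shows how each is computed. The function names and toy inputs are illustrative only; in practice the transcript would come from an ASR system and the embeddings from a speaker-verification model.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
# Toy 3-dimensional "speaker embeddings".
print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))
```

Lower WER indicates more intelligible output, and a cosine similarity closer to 1 indicates that the synthesized voice is closer to the reference speaker.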

Critical Analysis

The Seed-TTS paper presents a significant advancement in text-to-speech technology, with impressive results and a high level of versatility. However, the authors do acknowledge some limitations and areas for further research:

Dataset Bias: The performance of Seed-TTS is heavily dependent on the quality and diversity of the training data. The authors note that the current datasets may not fully capture the breadth of human speech, which could limit the models' ability to generalize to more diverse use cases.
Computational Efficiency: While Seed-TTS achieves state-of-the-art performance, the authors mention that the models can be computationally intensive, which may limit their deployment on resource-constrained platforms like mobile devices. Further research into more efficient architectures could address this issue.
Controllability Challenges: While Seed-TTS offers a high degree of control over the generated speech, the authors recognize that fine-tuning these controls can be a complex and unintuitive process for end-users. Improving the user experience and making the controls more accessible could be an important area for future work.

Overall, the Seed-TTS paper represents a significant advancement in text-to-speech technology and demonstrates the potential for highly versatile and customizable speech generation models. However, the research community should continue to address the limitations and challenges highlighted in the paper to further improve the capabilities and usability of these systems.

Conclusion

The Seed-TTS paper introduces a family of high-quality and versatile speech generation models that can produce diverse and expressive speech from text. The key innovation is a novel text encoding approach that, combined with large-scale speech data, allows Seed-TTS to achieve state-of-the-art performance in terms of speech quality and customizability.

The ability to control various aspects of the generated speech, such as speaking style, emotion, and audio quality, makes Seed-TTS a valuable tool for speech applications and places it alongside related work such as MobileSpeech, SimpleSpeech, Phonetic-Enhanced Language Modeling, NaturalSpeech, and Evaluating Text-to-Speech Synthesis. While the research community should continue to address the limitations outlined in the paper, the Seed-TTS models represent a significant step forward in text-to-speech synthesis and hold promise for real-world use.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, Junjie Pan, Xin Wang, Yuping Wang, Yuxuan Wang, Zhen Wei, Jian Wu, Chao Yao, Yifeng Yang, Yuanhao Yi, Junteng Zhang, Qidi Zhang, Shuo Zhang, Wenjie Zhang, Yang Zhang, Zilin Zhao, Dejian Zhong, Xiaobin Zhuang

We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named $\text{Seed-TTS}_\text{DiT}$, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, $\text{Seed-TTS}_\text{DiT}$ does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at https://bytedancespeech.github.io/seedtts_tech_report.

6/5/2024

SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models

Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng

Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At a high level, previous large-scale TTS models can be categorized into either Auto-regressive (AR) based (e.g., VALL-E) or Non-auto-regressive (NAR) based models (e.g., NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models are plagued by unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, thereby increasing the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods, offering the following key advantages: (1) simplified data preparation; (2) straightforward model and loss design; and (3) stable, high-quality generation performance with fast inference speed. Compared to our previous publication, we present (i) a detailed analysis of the influence of speech tokenizer and noisy label for TTS performance; (ii) four distinct types of sentence duration predictors; (iii) a novel flow-based scalar latent transformer diffusion model. With these improvements, we show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on multilingual speech datasets. Demos are available at https://dongchaoyang.top/SimpleSpeech2_demo/.

8/29/2024

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, Xiaoyang Li, Zeyang Li, Zehua Lin, Rui Liu, Shouda Liu, Lu Lu, Yizhou Lu, Jingting Ma, Shengtao Ma, Yulin Pei, Chen Shen, Tian Tan, Xiaogang Tian, Ming Tu, Bo Wang, Hao Wang, Yuping Wang, Yuxuan Wang, Hanzhang Xia, Rui Xia, Shuangyi Xie, Hongmin Xu, Meng Yang, Bihong Zhang, Jun Zhang, Wanyi Zhang, Yang Zhang, Yawei Zhang, Yijie Zheng, Ming Zou

Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.

7/11/2024

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

Keon Lee, Dong Won Kim, Jaehyeon Kim, Jaewoong Cho

Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io.

6/18/2024