Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

arXiv:2406.05551

Published 6/11/2024 by Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, Haizhou Li

Abstract

Audio language models have recently emerged as a promising approach for various audio generation tasks, relying on audio tokenizers to encode waveforms into sequences of discrete symbols. Audio tokenization often poses a necessary compromise between code bitrate and reconstruction accuracy. When dealing with low-bitrate audio codes, language models are constrained to process only a subset of the information embedded in the audio, which in turn restricts their generative capabilities. To circumvent these issues, we propose encoding audio as vector sequences in continuous space $\mathbb{R}^d$ and autoregressively generating these sequences using a decoder-only diffusion transformer (ARDiT). Our findings indicate that ARDiT excels in zero-shot text-to-speech and exhibits performance that compares to or even surpasses that of state-of-the-art models. High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing. Our experiments reveal that employing Integral Kullback-Leibler (IKL) divergence for distillation at each autoregressive step significantly boosts the perceived quality of the samples. Simultaneously, it condenses the iterative sampling process of the diffusion model into a single step. Furthermore, ARDiT can be trained to predict several continuous vectors in one step, significantly reducing latency during sampling. Impressively, one of our models can generate $170$ ms of $24$ kHz speech per evaluation step with minimal degradation in performance. Audio samples are available at http://ardit-tts.github.io/.

Overview

This paper introduces a text-to-speech synthesis model called the Autoregressive Diffusion Transformer (ARDiT).
ARDiT encodes audio as sequences of continuous vectors rather than discrete tokens and generates them autoregressively with a decoder-only diffusion transformer.
According to the abstract, the model matches or even surpasses state-of-the-art systems in zero-shot text-to-speech.

Plain English Explanation

The paper presents a new approach to text-to-speech (TTS) synthesis, which is the process of converting written text into spoken audio. The proposed model, called the Autoregressive Diffusion Transformer (ARDiT), aims to combine the advantages of two main types of TTS models: autoregressive and diffusion-based.

Autoregressive models generate speech one piece at a time, predicting each new segment from the ones before it, much as a person speaks word by word. Diffusion models, on the other hand, start with random noise and gradually refine it into a coherent speech signal. ARDiT tries to leverage the strengths of both approaches to create a TTS system that is both high-quality and efficient.

The key innovation of ARDiT is that it represents audio as a sequence of continuous vectors instead of discrete tokens and generates that sequence with a decoder-only diffusion transformer. This lets the model capture the sequential nature of speech while avoiding the information loss of low-bitrate audio codes. The authors report that ARDiT matches or even surpasses state-of-the-art models in zero-shot text-to-speech.

Technical Explanation

The Autoregressive Diffusion Transformer (ARDiT) model proposed in this paper combines the strengths of autoregressive and diffusion-based approaches for text-to-speech synthesis.

According to the abstract, ARDiT is a single decoder-only diffusion transformer that plays both roles. Autoregression captures the sequential nature of speech: each step generates the next continuous vectors in $\mathbb{R}^d$ conditioned on those already produced. Diffusion handles generation within a step: sampling starts from random noise and refines it into the next stretch of the speech representation.
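The control flow described above can be sketched as an autoregressive outer loop wrapped around an iterative denoising inner loop. This is only a toy illustration, not the authors' implementation: `toy_denoise`, `generate`, the `block_size` parameter, and the update rule are all hypothetical stand-ins, and a real system would call a trained transformer where the toy function is used.

```python
import random


def toy_denoise(block, prev_vector, strength=0.5):
    """Hypothetical stand-in for the diffusion transformer.

    Pulls every vector in the noisy block part of the way toward a target
    derived from the previously generated vector.
    """
    target = [0.5 * p + 1.0 for p in prev_vector]
    return [[x + strength * (t - x) for x, t in zip(vec, target)] for vec in block]


def generate(num_steps, block_size, dim, denoise_steps, seed=0):
    """Autoregressive outer loop with an iterative denoising inner loop."""
    rng = random.Random(seed)
    # Placeholder start vector; it stands in for text and prompt conditioning.
    sequence = [[0.0] * dim]
    for _ in range(num_steps):
        # Each autoregressive step starts from Gaussian noise...
        block = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(block_size)]
        # ...and refines it, conditioned on what has been generated so far.
        for _ in range(denoise_steps):
            block = toy_denoise(block, sequence[-1])
        sequence.extend(block)
    return sequence[1:]  # drop the placeholder


frames = generate(num_steps=3, block_size=4, dim=2, denoise_steps=8)
print(len(frames), len(frames[0]))  # 12 2
```

Setting `denoise_steps=1` mimics only the call pattern of a one-step sampler; it does not model the distillation the abstract describes. Setting `block_size` above 1 mirrors the idea of emitting several continuous vectors per autoregressive step.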

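The abstract reports that ARDiT performs on par with or better than state-of-the-art models in zero-shot text-to-speech, and that its high-bitrate continuous representation allows near-perfect reconstruction and speech editing. Distilling each autoregressive step with Integral Kullback-Leibler (IKL) divergence improves perceived sample quality and reduces the iterative diffusion sampling to a single step. The model can also be trained to predict several continuous vectors per step; one variant generates 170 ms of 24 kHz speech per evaluation step with minimal degradation in performance.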
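To put the latency figure from the abstract in perspective (170 ms of 24 kHz speech per evaluation step), the short calculation below converts it into waveform samples per step and into the number of steps for a clip of a given length. The helper names are hypothetical, and the step count assumes every step emits exactly 170 ms.

```python
SAMPLE_RATE_HZ = 24_000    # from the abstract
SPEECH_PER_STEP_MS = 170   # from the abstract


def samples_per_step(sample_rate_hz, ms_per_step):
    """Waveform samples covered by one evaluation step."""
    return sample_rate_hz * ms_per_step // 1000


def steps_needed(duration_ms, ms_per_step):
    """Evaluation steps needed to cover a clip (ceiling division)."""
    return -(-duration_ms // ms_per_step)


print(samples_per_step(SAMPLE_RATE_HZ, SPEECH_PER_STEP_MS))  # 4080
print(steps_needed(10_000, SPEECH_PER_STEP_MS))              # 59
```

Under these assumptions, a 10-second utterance would take 59 evaluation steps, each covering 4080 waveform samples.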

Critical Analysis

The paper provides a compelling approach to addressing the trade-offs between autoregressive and diffusion-based TTS models. By combining the strengths of both paradigms, ARDiT is able to generate high-quality speech while maintaining efficient and fast inference.

However, the paper could have provided more details on the specific architectural choices and hyperparameters used in the model. Additionally, the authors could have explored the generalization capabilities of ARDiT, such as its performance on out-of-domain data or its ability to handle diverse speaking styles and emotions.

Furthermore, the paper does not address potential issues with the diffusion-based component, such as the high computational cost during training or the sensitivity to hyperparameters. It would be valuable to see a more thorough discussion of the limitations and caveats of the proposed approach.

Conclusion

The Autoregressive Diffusion Transformer (ARDiT) presented in this paper offers a novel and promising approach to text-to-speech synthesis. By leveraging the strengths of both autoregressive and diffusion-based models, ARDiT is able to generate high-quality speech while maintaining efficient and fast inference.

The reported results are encouraging and suggest that ARDiT could be a valuable addition to the TTS landscape. However, further research is needed to explore the model's generalization capabilities and address its potential limitations.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

Keon Lee, Dong Won Kim, Jaehyeon Kim, Jaewoong Cho

Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io.

6/18/2024

eess.AS cs.AI cs.CL cs.LG cs.SD

Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study

Chong Zhang, Yanqing Liu, Yang Zheng, Sheng Zhao

Scaling text-to-speech (TTS) with autoregressive language model (LM) to large-scale datasets by quantizing waveform into discrete speech tokens is making great progress to capture the diversity and expressiveness in human speech, but the speech reconstruction quality from discrete speech token is far from satisfaction depending on the compressed speech token compression ratio. Generative diffusion models trained with score-matching loss and continuous normalized flow trained with flow-matching loss have become prominent in generation of images as well as speech. LM based TTS systems usually quantize speech into discrete tokens and generate these tokens autoregressively, and finally use a diffusion model to up sample coarse-grained speech tokens into fine-grained codec features or mel-spectrograms before reconstructing into waveforms with vocoder, which has a high latency and is not realistic for real time speech applications. In this paper, we systematically investigate varied diffusion models for up sampling stage, which is the main bottleneck for streaming synthesis of LM and diffusion-based architecture, we present the model architecture, objective and subjective metrics to show quality and efficiency improvement.

6/10/2024

eess.AS

SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng

In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simpleness shows in three aspects: (1) It can be trained on the speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech through an NAR way; (3) It tries to model speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model (SQ-Codec) with scalar quantization, SQ-Codec effectively maps the complex speech signal into a finite and compact latent space, named scalar latent space. Benefits from SQ-Codec, we apply a novel transformer diffusion model in the scalar latent space of SQ-Codec. We train SimpleSpeech on 4k hours of a speech-only dataset, it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it presents significant speech quality and generation speed improvement. Demos are released.

6/17/2024

cs.SD eess.AS

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

Yuchi Wang, Shuhuai Ren, Rundong Gao, Linli Yao, Qingyan Guo, Kaikai An, Jianhong Bai, Xu Sun

Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify the prior underperformance of diffusion models stemming from the absence of an effective latent space for image-text alignment, and the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.

4/17/2024

cs.AI cs.CL cs.CV