Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation

Read original: arXiv:2409.11003 - Published 9/18/2024 by Gerard I. Gállego, Roy Fejgin, Chunghsin Yeh, Xiaoyu Liu, Gautam Bhattacharya


Overview

This paper presents a novel single-stage text-to-speech (TTS) system that leverages masked audio token modeling and semantic knowledge distillation.
The key innovations include an efficient non-autoregressive (NAR) TTS architecture and a semantic knowledge distillation approach to improve performance.
The proposed method aims for high-quality speech synthesis with a more compact, streamlined architecture than the two-stage token-based systems that remain prevalent.

Plain English Explanation

The paper describes a new way to generate human-like speech from text using a single-stage TTS system. This system has two main components:

Masked Audio Token Modeling: The model is trained to predict missing parts of the audio signal, which helps it learn the underlying patterns in speech more efficiently. This allows the system to generate speech in a non-autoregressive (faster) way.
Semantic Knowledge Distillation: The model also learns from a separate "teacher" model that has strong language understanding capabilities. This "distills" semantic knowledge into the TTS model, which improves its ability to generate speech that sounds natural and coherent.

By combining these techniques, the researchers were able to create a TTS system that can produce high-quality speech while being more computationally efficient than traditional approaches. This could lead to more accessible and widely deployable speech synthesis technologies.

Technical Explanation

The paper introduces a single-stage TTS architecture that leverages masked audio token modeling and semantic knowledge distillation. The key components are:

Non-Autoregressive (NAR) TTS: The model predicts audio tokens in parallel rather than one token at a time, as in autoregressive TTS. This improves inference speed but requires dedicated training techniques to maintain high quality.
Masked Audio Token Modeling: The model is trained to predict missing parts of the input audio, similar to masked language modeling in natural language processing. This helps the model learn the underlying structure of speech more efficiently.
Semantic Knowledge Distillation: The TTS model is trained to mimic the behavior of a separate "teacher" model that has strong language understanding capabilities. This transfers semantic knowledge to the TTS model, improving its ability to generate coherent and natural-sounding speech.
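
The two training signals described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the sentinel `MASK_ID`, the function names, the cosine-distance form of the distillation term, and the `distill_weight` parameter are all hypothetical choices made for the example, since the summary does not specify the loss formulation.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel marking a masked token position


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of token positions with MASK_ID."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for pos in masked_positions:
        corrupted[pos] = MASK_ID
    return corrupted, masked_positions


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def masked_cross_entropy(logits_per_pos, targets, masked_positions):
    """Average negative log-likelihood, computed on masked positions only."""
    total = 0.0
    for pos in masked_positions:
        total -= log_softmax(logits_per_pos[pos])[targets[pos]]
    return total / len(masked_positions)


def cosine_distillation_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) per frame; assumes non-zero vectors."""
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / norm
    return total / len(student_feats)


def total_loss(logits_per_pos, targets, masked_positions,
               student_feats, teacher_feats, distill_weight=1.0):
    """Token-prediction loss plus a weighted distillation term (assumed form)."""
    token_loss = masked_cross_entropy(logits_per_pos, targets, masked_positions)
    distill = cosine_distillation_loss(student_feats, teacher_feats)
    return token_loss + distill_weight * distill


# Toy usage: 8 audio tokens from a vocabulary of 10, half of them masked.
rng = random.Random(0)
tokens = [3, 1, 4, 1, 5, 9, 2, 6]
corrupted, masked = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
uniform_logits = [[0.0] * 10 for _ in tokens]  # an untrained model's guess
teacher = [[1.0, 0.0] for _ in tokens]          # stand-in teacher features
student = [[1.0, 0.0] for _ in tokens]          # student that matches the teacher
print(total_loss(uniform_logits, tokens, masked, student, teacher))
```

In a real system the logits and student features would come from the TTS network and the teacher features from a frozen pre-trained model; here they are hard-coded so that the loss arithmetic can be followed by hand.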

The authors compare their model against a single-stage baseline and against two-stage systems. According to the abstract, it improves speech quality, intelligibility, and speaker similarity over the single-stage baseline; two-stage systems still lead in intelligibility, but the proposed model significantly narrows that gap while delivering comparable speech quality.
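
Intelligibility, one of the axes named in the paper's abstract, is commonly quantified as word error rate (WER): synthesized speech is transcribed by a speech recognizer and the transcript is compared with the input text. The sketch below shows that computation on plain strings. It is an illustration only; whether the paper computes intelligibility exactly this way is an assumption, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so
    # far and the first j hypothesis words (row-by-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower WER means the recognizer recovered more of the intended text, which is taken as a proxy for how intelligible the synthesized speech is.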

Critical Analysis

The paper presents a well-designed and thorough evaluation of the proposed TTS system. Some potential areas for further research include:

Generalization to Other Languages: The experiments focused on English, so it would be valuable to assess the system's performance on a more diverse set of languages.
Robustness to Noisy or Varied Input: The paper does not address how the system would handle real-world variations in text input, such as typos, abbreviations, or non-standard formatting.
Subjective Evaluation: While the objective metrics are promising, a more comprehensive subjective evaluation with human listeners could provide additional insights into the perceived quality and naturalness of the generated speech.

Overall, the paper makes a compelling case for the benefits of the proposed single-stage TTS approach and presents a promising step towards more efficient and high-quality speech synthesis.

Conclusion

This paper introduces a novel single-stage TTS system that combines masked audio token modeling and semantic knowledge distillation to achieve high-quality speech synthesis with improved computational efficiency. The key innovations address the challenges of non-autoregressive TTS and demonstrate the potential for more accessible and widely deployable speech technologies.



Related Papers

Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation

Gerard I. Gállego, Roy Fejgin, Chunghsin Yeh, Xiaoyu Liu, Gautam Bhattacharya

Audio token modeling has become a powerful framework for speech synthesis, with two-stage approaches employing semantic tokens remaining prevalent. In this paper, we aim to simplify this process by introducing a semantic knowledge distillation method that enables high-quality speech generation in a single stage. Our proposed model improves speech quality, intelligibility, and speaker similarity compared to a single-stage baseline. Although two-stage systems still lead in intelligibility, our model significantly narrows the gap while delivering comparable speech quality. These findings showcase the potential of single-stage models to achieve efficient, high-quality TTS with a more compact and streamlined architecture.

9/18/2024

High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

Joun Yeop Lee, Myeonghun Jeong, Minchan Kim, Ji-Hyun Lee, Hoon-Young Cho, Nam Soo Kim

We propose a novel two-stage text-to-speech (TTS) framework with two types of discrete tokens, i.e., semantic and acoustic tokens, for high-fidelity speech synthesis. It features two core components: the Interpreting module, which processes text and a speech prompt into semantic tokens focusing on linguistic contents and alignment, and the Speaking module, which captures the timbre of the target voice to generate acoustic tokens from semantic tokens, enriching speech reconstruction. The Interpreting stage employs a transducer for its robustness in aligning text to speech. In contrast, the Speaking stage utilizes a Conformer-based architecture integrated with a Grouped Masked Language Model (G-MLM) to boost computational efficiency. Our experiments verify that this innovative structure surpasses the conventional models in the zero-shot scenario in terms of speech quality and speaker similarity.

6/26/2024

Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility

Xiaoyu Liu, Xu Li, Joan Serrà, Santiago Pascual

Speech restoration aims at restoring full-band speech with high quality and intelligibility, considering a diverse set of distortions. MaskSR is a recently proposed generative model for this task. As other models of its kind, MaskSR attains high quality but, as we show, intelligibility can be substantially improved. We do so by boosting the speech encoder component of MaskSR with predictions of semantic representations of the target speech, using a pre-trained self-supervised teacher model. Then, a masked language model is conditioned on the learned semantic features to predict acoustic tokens that encode low level spectral details of the target speech. We show that, with the same MaskSR model capacity and inference time, the proposed model, MaskSR2, significantly reduces the word error rate, a typical metric for intelligibility. MaskSR2 also achieves competitive word error rate among other models, while providing superior quality. An ablation study shows the effectiveness of various semantic representations.

9/17/2024

Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation

Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix

This paper introduces a novel approach called sentence-wise speech summarization (Sen-SSum), which generates text summaries from a spoken document in a sentence-by-sentence manner. Sen-SSum combines the real-time processing of automatic speech recognition (ASR) with the conciseness of speech summarization. To explore this approach, we present two datasets for Sen-SSum: Mega-SSum and CSJ-SSum. Using these datasets, our study evaluates two types of Transformer-based models: 1) cascade models that combine ASR and strong text summarization models, and 2) end-to-end (E2E) models that directly convert speech into a text summary. While E2E models are appealing to develop compute-efficient models, they perform worse than cascade models. Therefore, we propose knowledge distillation for E2E models using pseudo-summaries generated by the cascade models. Our experiments show that this proposed knowledge distillation effectively improves the performance of the E2E model on both datasets.

8/2/2024