Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

2406.03706

Published 6/7/2024 by Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li

Abstract

Recent advances in large language models (LLMs) and the development of audio codecs have greatly propelled zero-shot TTS. Such models can synthesize personalized speech with only a 3-second speech sample of an unseen speaker as the acoustic prompt. However, they only support short speech prompts and cannot leverage longer context information, as required in audiobook and conversational TTS scenarios. In this paper, we introduce a novel audio codec-based TTS model to adapt context features with multiple enhancements. Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer (MMCE-Qformer) to utilize additional multi-modal context information. In addition, we adapt a pretrained LLM to leverage its understanding ability to predict semantic tokens, and use SoundStorm to generate acoustic tokens, thereby enhancing audio quality and speaker similarity. Extensive objective and subjective evaluations show that our proposed method outperforms baselines across various context TTS scenarios.

Overview

This paper presents a novel approach to improve audio codec-based zero-shot text-to-speech (TTS) synthesis by leveraging multi-modal context and large language models.
The method pairs audio codec tokens with a pretrained LLM to generate speech in the voice of a speaker not seen during training, given only a short acoustic prompt from that speaker.
It introduces a multi-modal context-enhanced Qformer (MMCE-Qformer) to bring in longer context, adapts the pretrained LLM to predict semantic tokens, and uses SoundStorm to generate acoustic tokens.
Objective and subjective evaluations show the method outperforms the baselines across several context TTS scenarios, such as audiobooks and conversations.

Plain English Explanation

The researchers in this paper have developed a new way to generate realistic-sounding speech from text in the voice of a speaker the system never saw during training, using only a short recording of that speaker as a prompt. This is called "zero-shot" text-to-speech (TTS) synthesis.

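Their key insight is to combine audio codec tokens - compact, discrete codes that a neural audio codec uses to represent a waveform - with a powerful language model pretrained on a large amount of text data, and to give the system the surrounding context rather than just a 3-second voice sample.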

By carefully integrating these pieces, the researchers report better audio quality and speaker similarity than their baselines. This matters for audiobooks and conversational systems, where how a sentence should sound depends on what came before it, and where collecting data and fine-tuning a model for every new speaker is impractical.

The system has three main parts. A context module, the MMCE-Qformer, condenses the multi-modal context into a form the rest of the model can use. The adapted language model then predicts semantic tokens for the text to be spoken. Finally, a SoundStorm model turns those semantic tokens into acoustic tokens, from which the audio is produced.

Technical Explanation

The core of the proposed approach is an audio codec-based TTS model that uses a pretrained LLM and longer context to improve zero-shot synthesis. A neural audio codec represents speech as discrete tokens that its decoder can turn back into a waveform. The LLM, with its language understanding ability, is adapted to predict semantic tokens, and a SoundStorm model generates the acoustic (codec) tokens.
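The data flow described in the abstract (context features, then semantic tokens from the LLM, then acoustic tokens from SoundStorm) can be sketched as below. This is only an illustration of the staging: every function, the `ContextInput` container, the toy token arithmetic, and the choice of 4 codebook levels are hypothetical stand-ins, not the authors' models or code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextInput:
    """Hypothetical container for the inputs the abstract mentions."""
    text: str                  # sentence to synthesize
    context_text: List[str]    # surrounding sentences, e.g. earlier audiobook lines
    prompt_audio: List[float]  # short acoustic prompt from the target speaker


def encode_context(ctx: ContextInput) -> List[float]:
    """Stand-in for the MMCE-Qformer: returns a fixed-size context vector."""
    # Toy features only (assumption): mean context sentence length, prompt energy.
    n_sent = max(len(ctx.context_text), 1)
    n_samp = max(len(ctx.prompt_audio), 1)
    avg_len = sum(len(s) for s in ctx.context_text) / n_sent
    energy = sum(x * x for x in ctx.prompt_audio) / n_samp
    return [avg_len, energy]


def predict_semantic_tokens(text: str, context_vec: List[float]) -> List[int]:
    """Stand-in for the adapted LLM: one toy token per character."""
    return [ord(ch) % 100 for ch in text]


def generate_acoustic_tokens(semantic: List[int], n_codebooks: int = 4) -> List[List[int]]:
    """Stand-in for SoundStorm: one row of codec tokens per codebook level."""
    return [[(tok + level) % 1024 for tok in semantic] for level in range(n_codebooks)]


def synthesize(ctx: ContextInput) -> List[List[int]]:
    context_vec = encode_context(ctx)
    semantic = predict_semantic_tokens(ctx.text, context_vec)
    # A real system would pass these tokens to the codec decoder to get a waveform.
    return generate_acoustic_tokens(semantic)


tokens = synthesize(ContextInput("hi", ["Earlier line."], [0.1, -0.2, 0.05]))
print(len(tokens), len(tokens[0]))  # codebook levels, tokens per level
```

The point of the staging is that the LLM only has to model the shorter semantic sequence, while the acoustic detail is left to a separate model.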

The model rests on three components:

MMCE-Qformer: Inspired by Qformer, a multi-modal context-enhanced Qformer extracts features from additional multi-modal context, so the model is not limited to a short acoustic prompt.
Adapted LLM: A pretrained LLM is adapted to predict semantic tokens, leveraging its language understanding ability.
SoundStorm: A SoundStorm model generates the acoustic tokens, which the authors credit with improving audio quality and speaker similarity.
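A Qformer-style module typically uses a small, fixed set of learnable query vectors that cross-attend to a variable-length sequence of features, so context of any length is condensed to a fixed-size summary. The sketch below shows only that mechanism, as a single attention head with no key/value projections. The dimensions, the random stand-in "features", and all names are assumptions for illustration; the internals of the MMCE-Qformer are not given in this summary.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: List[Vector], features: List[Vector]) -> List[Vector]:
    """Each query attends over all context features (scaled dot-product)."""
    dim = len(queries[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, f) / scale for f in features])
        # Weighted average of the features, one output vector per query.
        mixed = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
        outputs.append(mixed)
    return outputs


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # illustrative sizes, not from the paper


def random_vectors(count: int) -> List[Vector]:
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


queries = random_vectors(NUM_QUERIES)  # these would be learned in a real model
short_context = random_vectors(5)      # e.g. features of a short context
long_context = random_vectors(50)      # e.g. features of a much longer context

short_summary = cross_attend(queries, short_context)
long_summary = cross_attend(queries, long_context)
print(len(short_summary), len(long_summary))  # prints: 4 4
```

Whatever the context length, the output has one vector per query, which is what lets a downstream model consume long audiobook or dialogue context at a fixed cost.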

Objective and subjective evaluations show that the proposed method outperforms the baselines across various context TTS scenarios, with the authors reporting gains in audio quality and speaker similarity.
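The abstract lists speaker similarity among the qualities evaluated. In zero-shot TTS work, an objective proxy for it is often the cosine similarity between speaker embeddings of the prompt and of the synthesized speech, with the embeddings coming from a speaker-verification model. Whether this paper uses exactly that metric is an assumption here, and the embedding values below are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Hypothetical embeddings; a real evaluation would extract them from audio.
prompt_embedding = [0.2, 0.9, -0.1]
synth_embedding = [0.25, 0.85, -0.05]
other_speaker = [-0.7, 0.1, 0.6]

same = cosine_similarity(prompt_embedding, synth_embedding)
different = cosine_similarity(prompt_embedding, other_speaker)
print(same > different)  # prints: True
```

Higher values mean the synthesized voice sits closer to the prompt speaker in the embedding space; subjective listening tests complement this kind of score.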

Critical Analysis

The paper presents a compelling approach to improving zero-shot TTS by combining a pretrained LLM with multi-modal context. Its main contribution is addressing a gap that short-prompt systems leave open: using longer context in audiobook and conversational settings. The reported objective and subjective evaluations favor the proposed method over the baselines.

One potential limitation is the reliance on pretrained components, which could limit the flexibility and adaptability of the system. The system may also struggle with niche or specialized speech domains that are not well represented in the training data.

Further research could explore ways to make the system more robust and generalizable, such as by incorporating few-shot or transfer learning techniques to adapt the model to new domains or styles of speech. Integrating the system with mobile-friendly speech synthesis frameworks could also broaden its real-world applicability.

Overall, this work represents an important step forward in the field of zero-shot TTS, and the techniques developed here could have far-reaching implications for making high-quality speech synthesis more accessible and widely deployable.

Conclusion

This paper presents a novel approach to improving audio codec-based zero-shot text-to-speech synthesis. By combining audio codec tokens with a pretrained large language model and richer context, the researchers generate speech for unseen speakers with improved audio quality and speaker similarity, in settings such as audiobooks and conversations where context matters.

The key innovations include the MMCE-Qformer for multi-modal context, a pretrained LLM adapted to predict semantic tokens, and SoundStorm for generating acoustic tokens. Objective and subjective evaluations show this approach outperforms the baselines, demonstrating the value of pretrained language models and multi-modal context for this task.

While the current system has some limitations, the techniques developed in this work represent an important advancement in the field of zero-shot speech synthesis. With further research and refinement, this technology could pave the way for more accessible and versatile text-to-speech applications that can adapt to a wide range of use cases and domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.

4/10/2024

cs.SD eess.AS

CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho

With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complexity of modelling the multiple sequences. To mitigate these issues, we present CLaM-TTS that employs a probabilistic residual vector quantization to (1) achieve superior compression in the token length, and (2) allow a language model to generate multiple tokens at once, thereby eliminating the need for cascaded modeling to handle the number of token streams. Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models regarding naturalness, intelligibility, speaker similarity, and inference speed. In addition, we examine the impact of the pretraining extent of the language models and their text tokenization strategies on performances.

4/4/2024

eess.AS cs.SD

UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, Helen Meng

The Large Language models (LLMs) have demonstrated supreme capabilities in text understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering the frozen LLMs to achieve multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel and LLMs-driven audio codec model, LLM-Codec, to transfer the audio modality into the textual space, i.e. representing audio tokens with words or sub-words in the vocabulary of LLMs, while keeping high audio reconstruction quality. The key idea is to reduce the modality heterogeneity between text and audio by compressing the audio modality into a well-trained LLMs token space. Thus, the audio representation can be viewed as a new foreign language, and LLMs can learn the new foreign language with several demonstrations. In experiments, we investigate the performance of the proposed approach across multiple audio understanding and generation tasks, e.g. speech emotion classification, audio classification, text-to-speech generation, speech enhancement, etc. The experimental results demonstrate that the LLMs equipped with the proposed LLM-Codec, named as UniAudio 1.5, prompted by only a few examples, can achieve the expected functions in simple scenarios. It validates the feasibility and effectiveness of the proposed cross-modal in-context learning approach. To facilitate research on few-shot audio task learning and multi-modal LLMs, we have open-sourced the LLM-Codec model.

6/17/2024

cs.SD eess.AS

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.

6/27/2024

eess.AS cs.CL cs.LG cs.SD