Text-aware and Context-aware Expressive Audiobook Speech Synthesis

arXiv:2406.05672

Published 6/13/2024 by Dake Guo, Xinfa Zhu, Liumeng Xue, Yongmao Zhang, Wenjie Tian, Lei Xie

Abstract

Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space to cover diverse styles via contrastive learning with the supervision of the speech style. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we introduce the context encoder to two typical TTS models, VITS-based TTS and language model-based TTS. Experimental results demonstrate that our proposed approach can effectively capture diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis.

Overview

This paper presents a text-aware and context-aware (TACA) style modeling approach to text-to-speech (TTS) synthesis for expressive audiobook speech.
The method aims to capture the diverse styles of professional narrators without relying on manually labeled data or reference speech.
Key components are a text-aware style space, learned through contrastive learning with supervision from speech style, and a context encoder that combines cross-sentence information with the style embedding obtained from text.
The context encoder is added to two typical TTS models, a VITS-based model and a language model-based model, and the authors report improved naturalness and expressiveness in audiobook speech synthesis.

Plain English Explanation

The researchers have developed a new way to generate realistic and expressive speech for audiobook-style applications. Traditional text-to-speech systems can sound robotic or lack the natural inflection and emotion that a professional narrator would use. To address this, the researchers' approach learns to infer a speaking style from the text itself, and it also looks at the surrounding sentences so that the delivery stays consistent across a passage.

By modeling both the text of each sentence and its broader context, the system aims to produce synthetic speech whose tone, emphasis and pacing follow the narrative. This could enable more immersive and engaging audiobook experiences, as well as improve accessibility for visually impaired readers. The researchers report that their approach improves the naturalness and expressiveness of synthesized audiobook speech.

Technical Explanation

The key components in this work are a text-aware style space and a context encoder. The text-aware style space is established through contrastive learning with the supervision of speech style, so that a style embedding can be obtained from the input text. This lets the system cover diverse styles without relying on manually labeled data or reference speech.
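The abstract describes the text-aware style space as learned through contrastive learning supervised by speech style. As a rough illustration only, the sketch below uses an InfoNCE-style loss, which is one common contrastive objective. The summary does not state the paper's actual loss, so the function names, the temperature value, and the toy embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_embs, speech_embs, temperature=0.1):
    """Average InfoNCE loss over a batch.

    text_embs[i] and speech_embs[i] are treated as the positive pair;
    every other speech embedding in the batch acts as a negative.
    """
    total = 0.0
    for i, text_vec in enumerate(text_embs):
        logits = [cosine(text_vec, s) / temperature for s in speech_embs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(text_embs)


# Toy, made-up embeddings (not data from the paper).
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_speech = [[0.9, 0.1], [0.1, 0.9]]   # each text matches its own speech style
swapped_speech = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = info_nce(text_embs, aligned_speech)
loss_swapped = info_nce(text_embs, swapped_speech)
```

In this toy setup the loss is small when each text embedding sits close to its paired speech-style embedding and large when the pairs are swapped, which is the property a contrastive objective relies on to pull text-derived styles toward the styles observed in speech.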

The context encoder then incorporates cross-sentence information together with the style embedding obtained from text, which is intended to keep prosody coherent across neighboring sentences. The researchers introduce this context encoder into two typical TTS models: a VITS-based TTS model and a language model-based TTS model.
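According to the abstract, a context encoder incorporates cross-sentence information and the style embedding obtained from text. The summary does not describe its architecture, so the following is only a hypothetical sketch: it attention-pools the embeddings of neighboring sentences and concatenates the result with the current sentence's style embedding. The names `context_vector` and `condition`, the window size, and the toy vectors are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(sentence_embs, index, window=1):
    """Attention-pool the sentences within `window` positions of `index`.

    The current sentence acts as the query; scaled dot products with each
    sentence in the window give the attention weights.
    """
    start = max(0, index - window)
    stop = min(len(sentence_embs), index + window + 1)
    neighbours = sentence_embs[start:stop]
    query = sentence_embs[index]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, n)) / scale for n in neighbours]
    weights = softmax(scores)
    return [
        sum(w * n[d] for w, n in zip(weights, neighbours))
        for d in range(len(query))
    ]


def condition(sentence_embs, style_embs, index, window=1):
    """Concatenate the pooled context with the sentence's style embedding."""
    return context_vector(sentence_embs, index, window) + style_embs[index]


# Toy, made-up vectors (not data from the paper).
sentence_embs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
style_embs = [[0.2], [0.7], [0.4]]

cond = condition(sentence_embs, style_embs, index=1)
```

The resulting vector could then be passed to a TTS backbone as conditioning. How the paper actually feeds the context encoder's output into the VITS-based and language model-based systems is not specified in this summary.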

The experimental results reported by the authors indicate that this text- and context-aware approach captures diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis.
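Perceived naturalness in TTS is often reported as a mean opinion score (MOS) with a confidence interval. The summary does not say which listening-test protocol the authors used, so the sketch below only shows how such a score is typically computed. The function name, the normal-approximation interval, and the ratings are invented placeholders, not results from the paper.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of an approximate 95% interval).

    Uses a normal approximation, which is an assumption; small listening
    tests may call for a t-distribution instead.
    """
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Made-up 1-5 ratings for two hypothetical systems (placeholders only).
system_a_ratings = [3, 4, 3, 4, 3, 4, 3, 3]
system_b_ratings = [4, 5, 4, 4, 5, 4, 4, 5]

mos_a, ci_a = mos_with_ci(system_a_ratings)
mos_b, ci_b = mos_with_ci(system_b_ratings)
```

Comparing two systems then comes down to checking whether the intervals `mos_a ± ci_a` and `mos_b ± ci_b` overlap, or running a paired significance test when the same listeners rated both systems.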

Critical Analysis

While the proposed method represents a significant advance in expressive TTS, the authors acknowledge several limitations and areas for future work. The evaluations were conducted on a curated dataset, so further testing is needed to ensure the model generalizes well to more diverse real-world scenarios and speaker styles.

Additionally, the resource requirements of the context encoder and the underlying TTS backbones, particularly the language model-based one, may limit the scalability and deployability of the system on resource-constrained edge devices. Improving the efficiency of the model is a natural area for future research.

Another potential concern is the ethical implications of highly realistic synthetic voices. If not properly controlled, this technology could be misused to create deceptive audio content. Careful consideration of safeguards and responsible deployment guidelines will be crucial as this field continues to advance.

Conclusion

This work presents a TTS framework that combines text-aware and context-aware style modeling to generate expressive and natural-sounding audiobook-style speech. By learning a text-aware style space through contrastive learning and adding a context encoder for cross-sentence information, the system aims to capture the varied delivery of professional narrators without manually labeled data or reference speech.

Experimental results demonstrate significant improvements over prior TTS methods, opening up new possibilities for immersive audiobook experiences and accessible text-to-speech applications. While some limitations and ethical considerations remain, this research represents an important step forward in advancing the state of the art in expressive speech synthesis.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations

Sen Liu, Yiwei Guo, Xie Chen, Kai Yu

While acoustic expressiveness has long been studied in expressive text-to-speech (ETTS), the inherent expressiveness in text lacks sufficient attention, especially for ETTS of artistic works. In this paper, we introduce StoryTTS, a highly expressive TTS dataset that contains rich expressiveness from both acoustic and textual perspectives, from the recording of a Mandarin storytelling show. A systematic and comprehensive labeling framework is proposed for textual expressiveness. We analyze and define speech-related textual expressiveness in StoryTTS to include five distinct dimensions through linguistics, rhetoric, etc. Then we employ large language models and prompt them with a few manual annotation examples for batch annotation. The resulting corpus contains 61 hours of consecutive and highly prosodic speech equipped with accurate text transcriptions and rich textual expressiveness annotations. Therefore, StoryTTS can aid future ETTS research to fully mine the abundant intrinsic textual and acoustic features. Experiments are conducted to validate that TTS models can generate speech with improved expressiveness when integrating with the annotated textual labels in StoryTTS.

4/24/2024

cs.SD cs.CL eess.AS

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning

Xinfa Zhu, Yuke Li, Yi Lei, Ning Jiang, Guoqing Zhao, Lei Xie

This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning from different levels, i.e. utterance and category level, is leveraged to extract the disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve the data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method.

4/26/2024

eess.AS cs.SD

An efficient text augmentation approach for contextualized Mandarin speech recognition

Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Zhou Huan

Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA) technique, all while keeping computational costs minimal. In particular, to contextualize a pre-trained CIF-based ASR, we construct a codebook using limited speech-text data. By utilizing a simple codebook lookup process, we convert available text-only data into latent text embeddings. These embeddings then enhance the inputs for the contextualized ASR. Our experiments on diverse Mandarin test sets demonstrate that our TA approach significantly boosts recognition performance. The top-performing system shows relative CER improvements of up to 30% on rare words and 15% across all words in general.

6/17/2024

cs.SD cs.CL eess.AS

Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder

Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

Accent plays a significant role in speech communication, influencing one's capability to understand as well as conveying a person's identity. This paper introduces a novel and efficient framework for accented Text-to-Speech (TTS) synthesis based on a Conditional Variational Autoencoder. It has the ability to synthesize a selected speaker's voice, which is converted to any desired target accent. Our thorough experiments validate the effectiveness of the proposed framework using both objective and subjective evaluations. The results also show remarkable performance in terms of the ability to manipulate accents in the synthesized speech and provide a promising avenue for future accented TTS research.

6/4/2024

eess.AS cs.LG cs.SD