MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

2404.18398

Published 4/30/2024 by Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Xiaojiang Peng, Alexander G. Hauptmann

❗

Abstract

Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214

Create account to get full access

Overview

Presents a new system called Multimodal Emotional Text-to-Speech (MM-TTS) that generates expressive and emotionally resonant speech
Leverages emotional cues from multiple modalities (text, audio, and visual) to improve upon traditional Emotional Text-to-Speech (E-TTS) approaches
Consists of two key components: Emotion Prompt Alignment Module (EP-Align) and Emotion Embedding-Induced TTS (EMI-TTS)
Extensively evaluated on diverse datasets, showing significant improvements over existing E-TTS models

Plain English Explanation

The paper introduces a new system called Multimodal Emotional Text-to-Speech (MM-TTS) that aims to generate expressive and emotionally resonant speech. Traditional Emotional Text-to-Speech (E-TTS) approaches often struggle to capture the complexity of human emotions, relying on oversimplified emotional labels or single-modality inputs.

To address these limitations, MM-TTS leverages emotional cues from multiple modalities, including text, audio, and visual information. The system consists of two key components:

Emotion Prompt Alignment Module (EP-Align): This module uses contrastive learning to align emotional features across the different modalities, ensuring a coherent fusion of the multimodal information.
Emotion Embedding-Induced TTS (EMI-TTS): This component integrates the aligned emotional embeddings with state-of-the-art Text-to-Speech (TTS) models to synthesize speech that accurately reflects the intended emotions.

The researchers extensively evaluated MM-TTS on diverse datasets and found that it outperformed traditional E-TTS models. Objective metrics, such as Word Error Rate (WER) and Character Error Rate (CER), showed significant improvements, and subjective assessments validated that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech.

Technical Explanation

The paper presents the Multimodal Emotional Text-to-Speech (MM-TTS) system, which aims to generate highly expressive and emotionally resonant speech by leveraging emotional cues from multiple modalities.

The key components of MM-TTS are:

Emotion Prompt Alignment Module (EP-Align): This module employs contrastive learning to align emotional features across the text, audio, and visual modalities. This ensures a coherent fusion of the multimodal information, allowing the system to capture the complex nuances of human emotions.
Emotion Embedding-Induced TTS (EMI-TTS): This component integrates the aligned emotional embeddings from the EP-Align module with state-of-the-art Text-to-Speech (TTS) models to synthesize speech that accurately reflects the intended emotions. This approach leverages the expressiveness of the multimodal emotional cues to enhance the quality and emotional fidelity of the generated speech.

The researchers conducted extensive evaluations of MM-TTS across diverse datasets, including the ESD dataset. Objective metrics, such as Word Error Rate (WER) and Character Error Rate (CER), showed significant improvements over traditional E-TTS models, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validated that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech.

Critical Analysis

The paper presents a comprehensive and innovative approach to Emotional Text-to-Speech (E-TTS) synthesis by leveraging multimodal emotional cues. The proposed Multimodal Emotional Text-to-Speech (MM-TTS) system addresses the limitations of traditional E-TTS approaches, which often rely on oversimplified emotional labels or single-modality inputs.

One potential limitation of the research is the scope of the evaluation, which was primarily focused on objective metrics and subjective assessments. While these evaluations demonstrated the superior performance of MM-TTS, it would be beneficial to explore the system's generalization capabilities by testing it on a wider range of datasets and scenarios, including real-world applications of emotional speech synthesis.

Additionally, the paper does not provide a detailed analysis of the trade-offs or limitations of the Emotion Prompt Alignment Module (EP-Align) and the Emotion Embedding-Induced TTS (EMI-TTS) components. A more in-depth discussion of the strengths, weaknesses, and potential areas for improvement of these key modules would help readers understand the nuances of the system and its potential for further refinement.

Despite these minor limitations, the MM-TTS system presents a significant advancement in the field of emotional speech synthesis, with the potential to enhance human-computer interaction and improve the user experience in various applications.

Conclusion

The Multimodal Emotional Text-to-Speech (MM-TTS) system proposed in this paper represents a notable advancement in the field of emotional speech synthesis. By leveraging emotional cues from multiple modalities, including text, audio, and visual information, MM-TTS is able to generate highly expressive and emotionally resonant speech, outperforming traditional Emotional Text-to-Speech (E-TTS) approaches.

The key innovations of the MM-TTS system, the Emotion Prompt Alignment Module (EP-Align) and the Emotion Embedding-Induced TTS (EMI-TTS), demonstrate the potential of multimodal learning to capture the complex nuances of human emotions and integrate them seamlessly into synthetic speech. The promising results of the evaluation, both in terms of objective metrics and subjective assessments, suggest that MM-TTS has the potential to significantly enhance human-computer interaction and contribute to the development of more natural and engaging conversational interfaces.

As the field of emotional speech synthesis continues to evolve, the MM-TTS system presented in this paper offers a valuable contribution and a promising direction for future research in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Exploring speech style spaces with language models: Emotional TTS without emotion labels

Shreeram Suresh Chandra, Zongyang Du, Berrak Sisman

Many frameworks for emotional text-to-speech (E-TTS) rely on human-annotated emotion labels that are often inaccurate and difficult to obtain. Learning emotional prosody implicitly presents a tough challenge due to the subjective nature of emotions. In this study, we propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels or text prompts. We present TEMOTTS, a two-stage framework for E-TTS that is trained without emotion labels and is capable of inference without auxiliary inputs. Our proposed method performs knowledge transfer between the linguistic space learned by BERT and the emotional style space constructed by global style tokens. Our experimental results demonstrate the effectiveness of our proposed framework, showcasing improvements in emotional accuracy and naturalness. This is one of the first studies to leverage the emotional correlation between spoken content and expressive delivery for emotional TTS.

5/21/2024

eess.AS cs.LG

Controlling Emotion in Text-to-Speech with Natural Language Prompts

Thomas Bott, Florian Lux, Ngoc Thang Vu

In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as prompt. Thereby, a joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture. Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model. Objective and subjective evaluation results demonstrate the ability of the conditioned synthesis system to accurately transfer the emotions present in a prompt to speech. At the same time, precise tractability of speaker identities as well as overall high speech quality and intelligibility are maintained.

6/13/2024

cs.CL cs.SD eess.AS

💬

EE-TTS: Emphatic Expressive TTS with Linguistic Information

Yi Zhong, Chen Zhang, Xule Liu, Chenxi Sun, Weishan Deng, Haifeng Hu, Zhongqian Sun

While Current TTS systems perform well in synthesizing high-quality speech, producing highly expressive speech remains a challenge. Emphasis, as a critical factor in determining the expressiveness of speech, has attracted more attention nowadays. Previous works usually enhance the emphasis by adding intermediate features, but they can not guarantee the overall expressiveness of the speech. To resolve this matter, we propose Emphatic Expressive TTS (EE-TTS), which leverages multi-level linguistic information from syntax and semantics. EE-TTS contains an emphasis predictor that can identify appropriate emphasis positions from text and a conditioned acoustic model to synthesize expressive speech with emphasis and linguistic information. Experimental results indicate that EE-TTS outperforms baseline with MOS improvements of 0.49 and 0.67 in expressiveness and naturalness. EE-TTS also shows strong generalization across different datasets according to AB test results.

4/16/2024

cs.SD cs.CL eess.AS

Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Shun Lei, Yixuan Zhou, Liyang Chen, Dan Luo, Zhiyong Wu, Xixin Wu, Shiyin Kang, Tao Jiang, Yahui Zhou, Yuxing Han, Helen Meng

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.

4/10/2024

cs.SD eess.AS