MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion

Read original: arXiv:2409.09352 - Published 9/17/2024 by Sho Inoue, Shuai Wang, Wanxing Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li

Overview

Proposes MacST, a method that uses text transliteration to create multi-accent speech samples, that is, pairs of accented utterances by the same speaker
Generates the transliterated text with large language models (LLMs) and feeds it into multilingual TTS models to synthesize accented English speech
Uses the resulting synthetic parallel corpus to train a sequence-to-sequence accent conversion model as a reference system

Plain English Explanation

MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion presents a way to produce the same sentence, in the same voice, in several accents. The key idea is text transliteration: a large language model (LLM) generates a transliterated version of the text, and a multilingual text-to-speech (TTS) system reads that version aloud, producing accented English speech.

Accent conversion systems change the accent of a recording while preserving the speaker's identity and what is being said. MacST creates pairs of accented speech samples by the same speaker for training such systems. The authors use the resulting synthetic parallel corpus to train a sequence-to-sequence accent conversion model as a reference system.

Technical Explanation

MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion introduces MacST, a method for creating multi-accent speech samples through text transliteration. The goal is to produce pairs of accented speech samples by the same speaker, which serve as training data for accent conversion systems.

The pipeline has two stages. First, large language models (LLMs) generate transliterated text. Second, that text is fed into multilingual TTS models, which synthesize accented English speech. The output is a synthetic parallel corpus of accented speech.

As a reference system, the authors build a sequence-to-sequence accent conversion model on the synthetic parallel corpus that MacST produces. They validate the method for both native and non-native English speakers, and report that subjective and objective evaluations support the dataset's effectiveness for accent conversion studies.
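The data-generation flow described in the paper's abstract (LLM transliteration, then multilingual TTS, then same-speaker pairs) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `ParallelPair`, `build_parallel_corpus`, and the two toy stand-in functions are hypothetical names, the `"ja"` language code is an arbitrary example, and passing a target language to the TTS call is an assumption. A real pipeline would replace the stand-ins with an actual LLM and an actual multilingual TTS model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
#   transliterate(english_text, target_language) -> transliterated text
#   synthesize(text, language, speaker_id)       -> audio as bytes
Transliterator = Callable[[str, str], str]
Synthesizer = Callable[[str, str, str], bytes]


@dataclass(frozen=True)
class ParallelPair:
    """Two renditions of one sentence by the same synthetic speaker."""
    text: str
    speaker_id: str
    target_language: str
    source_audio: bytes    # synthesized from the original English text
    accented_audio: bytes  # synthesized from the transliterated text


def build_parallel_corpus(
    sentences: List[str],
    speaker_id: str,
    target_language: str,
    transliterate: Transliterator,
    synthesize: Synthesizer,
) -> List[ParallelPair]:
    """Create same-speaker pairs: plain English audio vs. accented audio."""
    pairs = []
    for text in sentences:
        # Stage 1: obtain transliterated text (an LLM in the paper).
        transliterated = transliterate(text, target_language)
        # Stage 2: synthesize both versions with the same speaker identity.
        source_audio = synthesize(text, "en", speaker_id)
        accented_audio = synthesize(transliterated, target_language, speaker_id)
        pairs.append(
            ParallelPair(text, speaker_id, target_language,
                         source_audio, accented_audio)
        )
    return pairs


def toy_transliterate(text: str, target_language: str) -> str:
    # Placeholder for an LLM call: only tags the text so the flow can run.
    return f"[{target_language}] {text}"


def toy_synthesize(text: str, language: str, speaker_id: str) -> bytes:
    # Placeholder for a multilingual TTS call: returns fake "audio" bytes.
    return f"{speaker_id}|{language}|{text}".encode("utf-8")


corpus = build_parallel_corpus(
    ["hello world"], "spk1", "ja", toy_transliterate, toy_synthesize
)
```

Passing the transliterator and synthesizer in as callables keeps the sketch independent of any particular LLM or TTS library. The resulting list of pairs is the kind of parallel data a sequence-to-sequence accent conversion model could be trained on.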

Critical Analysis

The MacST paper presents a promising approach to building training data for accent conversion. Text transliteration is a clever way to obtain same-speaker pairs of accented speech from multilingual TTS models.

However, some potential limitations remain open. For example, the quality of the accented speech likely depends on how well the LLM transliterates the text and on the multilingual TTS models used, so accents that these components handle poorly may be poorly represented. The training pairs are also synthetic, and it is unclear how closely they match the accents of real speakers.

Additionally, the paper does not explore how well the transliteration-based approach would generalize to less common or non-standard accents. Further research is needed to understand the system's robustness and limitations across a wider range of accents.

Overall, the MacST paper makes a valuable contribution, but there are still opportunities to extend and refine the approach to make it more widely applicable and robust.

Conclusion

MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion introduces a method for generating multi-accent speech samples through text transliteration. By combining LLM-generated transliterations with multilingual TTS models, it produces a synthetic parallel corpus of same-speaker accented speech for training accent conversion systems.

Better accent conversion has the potential to improve the accessibility and usability of speech-based interfaces, for example by letting users interact with systems in their preferred accent. The transliteration-based approach might also carry over to related tasks such as multilingual text-to-speech.

While the paper reports promising results, further research is needed to address the method's limitations and to cover a broader range of accents. Nonetheless, the MacST paper represents a useful step forward for accent conversion research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion

Sho Inoue, Shuai Wang, Wanxing Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li

In accented voice conversion or accent conversion, we seek to convert the accent in speech from one to another while preserving speaker identity and semantic content. In this study, we formulate a novel method for creating multi-accented speech samples, that is, pairs of accented speech samples by the same speaker, through text transliteration for training accent conversion systems. We begin by generating transliterated text with Large Language Models (LLMs), which is then fed into multilingual TTS models to synthesize accented English speech. As a reference system, we built a sequence-to-sequence model on the synthetic parallel corpus for accent conversion. We validated the proposed method for both native and non-native English speakers. Subjective and objective evaluations further validate our dataset's effectiveness in accent conversion studies.

9/17/2024

Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT

Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari

We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize learned speakers' voices in non-native dialects, especially in pitch-accent languages. CD-TTS is important for developing voice agents that naturally communicate with people across regions. We present a novel TTS model comprising three sub-modules to perform competitively at this task. We first train a backbone TTS model to synthesize dialect speech from a text conditioned on phoneme-level accent latent variables (ALVs) extracted from speech by a reference encoder. Then, we train an ALV predictor to predict ALVs tailored to a target dialect from input text leveraging our novel multi-dialect phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the effectiveness of our model by comparing it with a baseline derived from conventional dialect TTS methods. The results show that our model improves the dialectal naturalness of synthetic speech in CD-TTS.

9/12/2024

Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis

Xuehao Zhou, Mingyang Zhang, Yi Zhou, Zhizheng Wu, Haizhou Li

Synthesizing speech across different accents while preserving the speaker identity is essential for various real-world customer applications. However, the individual and accurate modeling of accents and speakers in a text-to-speech (TTS) system is challenging due to the complexity of accent variations and the intrinsic entanglement between the accent and speaker identity. In this paper, we present a novel approach for multi-speaker multi-accent TTS synthesis, which aims to synthesize voices of multiple speakers, each with various accents. Our proposed approach employs a multi-scale accent modeling strategy to address accent variations at different levels. Specifically, we introduce both global (utterance level) and local (phoneme level) accent modeling, supervised by individual accent classifiers to capture the overall variation within accented utterances and fine-grained variations between phonemes, respectively. To control accents and speakers separately, speaker-independent accent modeling is necessary, which is achieved by adversarial training with speaker classifiers to disentangle speaker identity within the multi-scale accent modeling. Consequently, we obtain speaker-independent and accent-discriminative multi-scale embeddings as comprehensive accent features. Additionally, we propose a local accent prediction model that allows accented speech to be generated directly from phoneme inputs. Extensive experiments are conducted on an accented English speech corpus. Both objective and subjective evaluations show the superiority of our proposed system compared to baseline systems. Detailed component analysis demonstrates the effectiveness of global and local accent modeling, and speaker disentanglement on multi-speaker multi-accent speech synthesis.

6/18/2024

Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training

Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated. Accent is an important aspect of speech that needs to be taken into consideration while building inclusive speech synthesizers. Inclusive speech technology aims to erase any biases towards specific groups, such as people with a certain accent. We note that state-of-the-art Text-to-Speech (TTS) systems may currently not be suitable for all people, regardless of their background, as they are designed to generate high-quality voices without focusing on accent. In this paper, we propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion in TTS, with a vision for more inclusive systems in the future. We evaluate the performance through both objective metrics and subjective listening tests. The results show an improvement in accent conversion ability compared to the baseline.

6/4/2024