Zero-shot Cross-lingual Voice Transfer for TTS

Read original: arXiv:2409.13910 - Published 9/24/2024 by Fadi Biadsy, Youzheng Chen, Isaac Elias, Kyle Kastner, Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran

Overview

This paper presents a zero-shot cross-lingual voice transfer technique for text-to-speech (TTS) systems.
The proposed method transfers a speaker's voice identity to synthesized speech in other languages from a reference recording of that speaker, without requiring parallel data or recordings of the speaker in the target language.
This can help improve accessibility by enabling TTS systems to generate speech in multiple languages using a single voice identity.

Plain English Explanation

The research paper describes a new technique for text-to-speech (TTS) systems that can take one person's voice and use it to generate speech in a different language. Typically, a TTS system needs training data from a speaker in order to mimic that speaker's voice. This zero-shot approach instead works from a reference recording of the speaker in one language (English in the paper's experiments) and produces speech in that voice in other languages, with no recordings of the speaker in those languages.

This is significant because it can help make TTS more accessible and versatile. Instead of needing to train a new voice model for each language, the same voice can be used across multiple languages. This could be valuable for people who rely on TTS, as they would be able to hear familiar voices speaking in different languages. The technique may also have applications in areas like language learning, dubbing, and accessibility for multilingual content.

Technical Explanation

The key components of the proposed zero-shot Voice Transfer (VT) module, as listed in the paper's abstract, are:

Speaker Encoder: A neural network that processes reference speech from the speaker whose voice is to be transferred.
Bottleneck Layer: A layer that follows the speaker encoder inside the VT module.
Residual Adapters: Adapter layers that connect the VT module to the layers of a preexisting multilingual TTS system.

The VT module is designed to be integrated into an existing multilingual TTS system rather than to replace it. During inference, reference speech from the speaker is passed through the speaker encoder, and the VT module conditions the TTS layers so that text in the target language is synthesized in the speaker's voice. Because the approach is zero-shot, it needs no parallel data and no recordings of the speaker in the target language; the paper's experiments use a single English reference speech sample per speaker.
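The paper's abstract describes the voice transfer module as a speaker-encoder, a bottleneck layer, and residual adapters connected to preexisting TTS layers. The following is a minimal NumPy sketch of how such pieces might fit together, not the authors' implementation. Everything specific in it is an assumption made for illustration: the mean-pooling speaker encoder, the layer sizes, the `tanh` and ReLU nonlinearities, the concatenation of the speaker vector with the hidden states, and the zero-initialized up-projection.

```python
import numpy as np

rng = np.random.default_rng(0)


def speaker_encoder(ref_frames):
    """Hypothetical stand-in for a speaker encoder: mean-pool the
    reference speech frames (time, features) into a single vector."""
    return ref_frames.mean(axis=0)


class Bottleneck:
    """Projects the speaker vector to a small dimension (assumed design)."""

    def __init__(self, in_dim, bottleneck_dim):
        self.w = rng.normal(scale=0.1, size=(in_dim, bottleneck_dim))

    def __call__(self, emb):
        return np.tanh(emb @ self.w)


class ResidualAdapter:
    """Down-project, ReLU, up-project, then add back to the hidden states
    of an existing TTS layer. The up-projection starts at zero, so the
    adapter initially leaves the TTS layer's output unchanged."""

    def __init__(self, hidden_dim, spk_dim, adapter_dim):
        self.w_down = rng.normal(scale=0.1, size=(hidden_dim + spk_dim, adapter_dim))
        self.w_up = np.zeros((adapter_dim, hidden_dim))

    def __call__(self, hidden, spk):
        # Repeat the speaker vector at every position of the sequence.
        spk_tiled = np.broadcast_to(spk, (hidden.shape[0], spk.shape[0]))
        x = np.concatenate([hidden, spk_tiled], axis=1)
        return hidden + np.maximum(x @ self.w_down, 0.0) @ self.w_up


# Made-up sizes: 120 reference frames of 80 features, 50 positions of a
# 256-dimensional TTS hidden layer.
ref_frames = rng.normal(size=(120, 80))
hidden = rng.normal(size=(50, 256))

spk = Bottleneck(80, 16)(speaker_encoder(ref_frames))
adapter = ResidualAdapter(hidden_dim=256, spk_dim=16, adapter_dim=32)
out = adapter(hidden, spk)
print(out.shape)
```

The residual form means the pretrained TTS layers can stay as they are while only the small added module is trained, which is the usual motivation for adapters. Whether the paper freezes the TTS layers or initializes its adapters this way is not stated in this summary.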

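The authors compare several configurations of these components and report Mean Opinion Score (MOS) and speaker similarity across languages. Using a single English reference speech sample per speaker, they report an average voice transfer similarity score of 73% across nine target languages. As a case study, they also show that the approach can restore the voices of individuals with dysarthria, even when only atypical speech samples are available.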
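To make the idea of a per-language speaker similarity score concrete, here is a small sketch of one common proxy: cosine similarity between a speaker embedding of the reference recording and embeddings of the synthesized speech, averaged over target languages. This is an assumption for illustration only. The summary does not say how the paper computes its 73% similarity score, and the function names and toy vectors below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_similarity(reference_emb, synthesized_by_language):
    """Score each target language against the reference embedding and
    return (per-language scores, mean across languages)."""
    scores = {
        lang: cosine_similarity(reference_emb, emb)
        for lang, emb in synthesized_by_language.items()
    }
    return scores, sum(scores.values()) / len(scores)


# Toy embeddings; real ones would come from a speaker-embedding model.
reference = [1.0, 0.0]
synthesized = {
    "lang_a": [1.0, 0.0],  # same direction as the reference
    "lang_b": [0.0, 1.0],  # orthogonal to the reference
}
scores, mean_score = average_similarity(reference, synthesized)
print(scores, mean_score)
```

A listener-based measure, where raters judge whether two clips sound like the same person, would be averaged across languages in the same way.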

Critical Analysis

The paper provides a compelling solution to the challenge of enabling multilingual TTS with a consistent voice identity. The zero-shot approach is a notable advancement, as it avoids the need for labor-intensive data collection and model training for each new language and speaker.

However, the authors acknowledge some limitations of their framework. The quality of the generated speech, while generally intelligible, may not match the fidelity of traditional TTS models trained on large amounts of target speaker data. Additionally, the voice conversion process can sometimes introduce subtle artifacts or distortions.

Further research could explore ways to improve the naturalness and seamlessness of the cross-lingual voice transfers, perhaps by incorporating more sophisticated voice conversion techniques or leveraging additional contextual information. Evaluation on a broader range of languages and speaker demographics would also help validate the robustness and versatility of the approach.

Conclusion

This paper presents an innovative zero-shot cross-lingual voice transfer technique for text-to-speech systems. By separating the modeling of voice identity and linguistic content, the proposed framework can generate speech in a target language while preserving the characteristics of a source speaker's voice. This capability has the potential to enhance the accessibility and versatility of TTS systems, enabling users to interact with familiar voices across multiple languages. While the current implementation has some room for improvement, this research represents an important step forward in developing more flexible and inclusive speech synthesis technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Zero-shot Cross-lingual Voice Transfer for TTS

Fadi Biadsy, Youzheng Chen, Isaac Elias, Kyle Kastner, Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran

In this paper, we introduce a zero-shot Voice Transfer (VT) module that can be seamlessly integrated into a multi-lingual Text-to-speech (TTS) system to transfer an individual's voice across languages. Our proposed VT module comprises a speaker-encoder that processes reference speech, a bottleneck layer, and residual adapters, connected to preexisting TTS layers. We compare the performance of various configurations of these components and report Mean Opinion Score (MOS) and Speaker Similarity across languages. Using a single English reference speech per speaker, we achieve an average voice transfer similarity score of 73% across nine target languages. Vocal characteristics contribute significantly to the construction and perception of individual identity. The loss of one's voice, due to physical or neurological conditions, can lead to a profound sense of loss, impacting one's core identity. As a case study, we demonstrate that our approach can not only transfer typical speech but also restore the voices of individuals with dysarthria, even when only atypical speech samples are available - a valuable utility for those who have never had typical speech or banked their voice. Cross-lingual typical audio samples, plus videos demonstrating voice restoration for dysarthric speakers are available here (google.github.io/tacotron/publications/zero_shot_voice_transfer).

9/24/2024

CrossVoice: Crosslingual Prosody Preserving Cascade-S2ST using Transfer Learning

Medha Hira, Arnav Goel, Anubha Gupta

This paper presents CrossVoice, a novel cascade-based Speech-to-Speech Translation (S2ST) system employing advanced ASR, MT, and TTS technologies with cross-lingual prosody preservation through transfer learning. We conducted comprehensive experiments comparing CrossVoice with direct-S2ST systems, showing improved BLEU scores on tasks such as Fisher Es-En, VoxPopuli Fr-En and prosody preservation on benchmark datasets CVSS-T and IndicTTS. With an average mean opinion score of 3.75 out of 4, speech synthesized by CrossVoice closely rivals human speech on the benchmark, highlighting the efficacy of cascade-based systems and transfer learning in multilingual S2ST with prosody transfer.

6/19/2024

Cross-Lingual Transfer Learning for Speech Translation

Rao Ma, Yassir Fathullah, Mengjie Qian, Siyuan Tang, Mark Gales, Kate Knill

There has been increasing interest in building multilingual foundation models for NLP and speech research. Zero-shot cross-lingual transfer has been demonstrated on a range of NLP tasks where a model fine-tuned on task-specific data in one language yields performance gains in other languages. Here, we explore whether speech-based models exhibit the same transfer capability. Using Whisper as an example of a multilingual speech foundation model, we examine the utterance representation generated by the speech encoder. Despite some language-sensitive information being preserved in the audio embedding, words from different languages are mapped to a similar semantic space, as evidenced by a high recall rate in a speech-to-speech retrieval task. Leveraging this shared embedding space, zero-shot cross-lingual transfer is demonstrated in speech translation. When the Whisper model is fine-tuned solely on English-to-Chinese translation data, performance improvements are observed for input utterances in other languages. Additionally, experiments on low-resource languages show that Whisper can perform speech translation for utterances from languages unseen during pre-training by utilizing cross-lingual representations.

7/2/2024

XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

Edresson Casanova, Kelly Davis, Eren Golge, Gorkem Goknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber

Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual ZS-TTS, they are limited to just a few high/medium resource languages, limiting the applications of these models in most of the low/medium resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained in 16 languages and achieved state-of-the-art (SOTA) results in most of them.

6/10/2024