Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

2406.10223

Published 6/17/2024 by Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi DuBois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang and 2 others

cs.LG cs.SD eess.AS

Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Abstract

We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find the diffusion-based synthesizer to improve MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5$times$ faster than real-time.

Create account to get full access

Overview

• This paper presents a novel "Diffusion Synthesizer" for efficient multilingual speech-to-speech translation.

• The approach leverages diffusion models, a type of generative AI, to translate speech from one language to another without the need for parallel speech data.

• The system can translate speech between a wide range of languages, making it a potentially valuable tool for global communication and accessibility.

Plain English Explanation

The researchers have developed a new way to translate spoken language from one language to another. Rather than relying on having parallel audio recordings of the same speech in different languages, their "Diffusion Synthesizer" uses a modern AI technique called diffusion models to generate the translated speech.

Diffusion models work by starting with random noise and gradually transforming it into more structured outputs, like speech, through a step-by-step process. This allows the system to generate natural-sounding translated speech without needing to have examples of the exact same speech in multiple languages.

The key advantage of this approach is that it can translate between a wide variety of languages, not just the few that have lots of parallel speech data available. This makes it a potentially powerful tool for global communication and accessibility, as it could help break down language barriers.

Technical Explanation

The paper proposes a "Diffusion Synthesizer" that uses diffusion models to perform efficient multilingual speech-to-speech translation. Diffusion models are a type of generative AI system that can learn to transform random noise into structured outputs like speech, without requiring parallel data between the input and output domains.

The key innovation is applying diffusion models to the speech translation task, which typically relies on having large datasets of parallel speech recordings in multiple languages. By casting translation as a diffusion process, the system can generate translated speech from a single input audio sample, without needing to explicitly model the correspondence between the source and target languages.

The authors demonstrate the effectiveness of their Diffusion Synthesizer on a range of language pairs, showing that it can achieve high-quality translation results comparable to prior work, but with significantly reduced requirements for parallel speech data. This makes the approach a promising direction for building efficient and multilingual speech translation systems.

Critical Analysis

The Diffusion Synthesizer presents an interesting and novel approach to the challenge of speech-to-speech translation. By leveraging the power of diffusion models, the system is able to sidestep the need for large parallel datasets, which is a key limitation of many existing translation methods.

However, the paper acknowledges several caveats and areas for further research. For example, the current system is limited to translating between a fixed set of languages, and its performance may degrade for language pairs that are more linguistically distant. Additionally, the quality of the generated speech, while impressive, may not yet match the fidelity of other state-of-the-art speech synthesis models.

Further research could explore ways to expand the language coverage of the Diffusion Synthesizer, as well as investigate techniques to improve the naturalness and intelligibility of the translated speech. Comparisons to alternative approaches, such as autoregressive diffusion models or zero-shot speech synthesis, could also provide additional insights into the strengths and limitations of the proposed method.

Conclusion

The Diffusion Synthesizer represents an exciting development in the field of speech-to-speech translation. By leveraging the power of diffusion models, the system is able to generate high-quality translated speech without requiring large parallel datasets, a significant limitation of many existing approaches.

This innovation has the potential to make cross-lingual communication more accessible, as the system can be applied to a wide range of language pairs, not just the few that have ample parallel data available. As the technology continues to evolve, it could play an important role in breaking down language barriers and enabling more seamless global collaboration and understanding.

While the current system has some limitations, the core ideas presented in this paper point to a promising direction for the future of efficient and multilingual speech translation. Further advancements in this area could have far-reaching implications for a variety of applications, from real-time translation services to accessibility tools for the hearing impaired.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

Keon Lee, Dong Won Kim, Jaehyeon Kim, Jaewoong Cho

Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io.

6/18/2024

eess.AS cs.AI cs.CL cs.LG cs.SD

New!DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability

Hyun Joon Park, Jin Sob Kim, Wooseok Shin, Sung Won Han

Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but there are limitations to obtaining well-represented styles and improving model generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Based on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech. Key innovations contain the differentiation of styles into time-invariant and time-variant categories for effective style extraction, as well as the design of encoders and adapters with high generalization ability. In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS. DEX-TTS yields outstanding performance in terms of objective and subjective evaluation in English multi-speaker and emotional multi-speaker datasets, without relying on pre-training strategies. Lastly, the comparison results for the general TTS on a single-speaker dataset verify the effectiveness of our enhanced diffusion backbone. Demos are available here.

6/28/2024

eess.AS cs.AI

SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models

Dongchao Yang, Dingdong Wang, Haohan Guo, Xueyuan Chen, Xixin Wu, Helen Meng

In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simpleness shows in three aspects: (1) It can be trained on the speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech through an NAR way; (3) It tries to model speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model (SQ-Codec) with scalar quantization, SQ-Codec effectively maps the complex speech signal into a finite and compact latent space, named scalar latent space. Benefits from SQ-Codec, we apply a novel transformer diffusion model in the scalar latent space of SQ-Codec. We train SimpleSpeech on 4k hours of a speech-only dataset, it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it presents significant speech quality and generation speed improvement. Demos are released.

6/17/2024

cs.SD eess.AS

Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion

Zhao Ren, Kevin Scheck, Qinhan Hou, Stefano van Gogh, Michael Wand, Tanja Schultz

Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals during silent articulations. ETS models usually consist of an EMG encoder which converts EMG signals to acoustic speech features, and a vocoder which then synthesises the speech signals. Due to an inadequate amount of available data and noisy signals, the synthesised speech often exhibits a low level of naturalness. In this work, we propose Diff-ETS, an ETS model which uses a score-based diffusion probabilistic model to enhance the naturalness of synthesised speech. The diffusion model is applied to improve the quality of the acoustic features predicted by an EMG encoder. In our experiments, we evaluated fine-tuning the diffusion model on predictions of a pre-trained EMG encoder, and training both models in an end-to-end fashion. We compared Diff-ETS with a baseline ETS model without diffusion using objective metrics and a listening test. The results indicated the proposed Diff-ETS significantly improved speech naturalness over the baseline.

5/15/2024

cs.SD eess.AS