Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Read original: arXiv:2309.07566 - Published 7/22/2024 by Yongqi Wang, Jionghao Bai, Rongjie Huang, Ruiqi Li, Zhiqing Hong, Zhou Zhao


Overview

Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved high accuracy.
However, it is unable to preserve the speaker's voice characteristics (timbre) in the translated speech.
Obtaining high-quality speaker-parallel data for learning style transfer during translation is challenging.

Plain English Explanation

Direct speech-to-speech translation is a technology that translates spoken audio in one language into spoken audio in another. Recent advances in this field have led to impressive accuracy, but one key limitation is that the translated speech doesn't sound like the original speaker.

This is because the current systems rely on discrete self-supervised representations of the speech, which capture the content but not the speaker's unique vocal characteristics. Additionally, the lack of high-quality data that pairs the original and translated speech from the same speaker (known as "speaker-parallel data") makes it difficult to learn how to transfer the speaker's style during the translation process.

To address these challenges, the researchers have designed a new S2ST pipeline that can not only translate the speech, but also preserve the original speaker's voice in the translated output. Their approach leverages discrete self-supervised speech representations and codec units to achieve this style transfer capability.

The key innovation is an acoustic language model that can learn to transfer the speaker's style without needing any speaker-parallel data. Instead, it uses self-supervised in-context learning techniques to acquire this style transfer ability. This allows the model to work well even when high-quality speaker-parallel data is scarce.

The researchers show that their model can achieve zero-shot cross-lingual style transfer, meaning it can carry a speaker's voice into the translation even when the source language was not seen during training. Experiments demonstrate that the translated speech has high fidelity and sounds very similar to the original speaker's voice.

Technical Explanation

The researchers designed an S2ST pipeline that can preserve the speaker's timbre (voice characteristics) in the translated speech. Their approach is based on discrete self-supervised speech representations and codec units.
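The data flow described above can be sketched as a chain of four stages. This is only an illustrative sketch: the function name `translate_with_style` and the four callables are hypothetical placeholders for trained models, and the exact interfaces in the paper may differ.

```python
from typing import Callable, List, Sequence

Units = List[int]  # a sequence of discrete unit IDs


def translate_with_style(
    source_audio: Sequence[float],
    speech_to_units: Callable[[Sequence[float]], Units],
    encode_codec: Callable[[Sequence[float]], Units],
    acoustic_lm: Callable[[Units, Units], Units],
    decode_codec: Callable[[Units], List[float]],
) -> List[float]:
    """Hypothetical pipeline: every callable stands in for a trained model."""
    # 1) Translation: source speech -> target-language semantic units
    #    (these carry content, not speaker identity).
    target_semantic = speech_to_units(source_audio)
    # 2) Style prompt: codec units of the source speech, which retain timbre.
    style_prompt = encode_codec(source_audio)
    # 3) Acoustic language model: produce target codec units from the
    #    translated content, conditioned on the style prompt.
    target_acoustic = acoustic_lm(target_semantic, style_prompt)
    # 4) Codec decoder: codec units -> output waveform samples.
    return decode_codec(target_acoustic)
```

Passing the models in as callables keeps the sketch independent of any particular toolkit; in practice each stage would be a separately trained network.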

The core component is an acoustic language model that can learn to transfer the speaker's style during translation. This model leverages self-supervised in-context learning techniques to acquire the style transfer capability, without relying on any speaker-parallel data.
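One common way to train such a model without speaker-parallel data is to cut a single utterance into a prompt and a continuation, so that the speaker is the same on both sides by construction. The sketch below illustrates that idea only. It assumes, for simplicity, that semantic and acoustic units are frame-aligned single streams of equal length, and the function name and dictionary keys are hypothetical; the paper's actual training recipe may differ.

```python
from typing import Dict, List


def make_incontext_example(
    semantic_units: List[int],
    acoustic_units: List[int],
    prompt_len: int,
) -> Dict[str, List[int]]:
    """Build one self-supervised training example from a single utterance.

    Assumption (not from the paper): both unit streams are frame-aligned
    and have the same length.
    """
    if len(semantic_units) != len(acoustic_units):
        raise ValueError("unit streams must be the same length in this sketch")
    if not 0 < prompt_len < len(acoustic_units):
        raise ValueError("prompt_len must leave a non-empty prompt and target")
    return {
        # Acoustic units of the first segment act as the speaker/style prompt.
        "style_prompt": acoustic_units[:prompt_len],
        # Semantic units of the remainder specify what should be said.
        "content": semantic_units[prompt_len:],
        # The model is trained to predict these acoustic units.
        "target": acoustic_units[prompt_len:],
    }
```

At inference time the prompt would instead come from the source-language speech and the content from the translated units, which is what makes the transfer cross-lingual.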

By using extensive training data, the researchers were able to achieve zero-shot cross-lingual style transfer: the model can carry a speaker's voice into the translation even when the source language was not seen during training.

Experiments show that the translated speech generated by their model has high fidelity and speaker similarity, preserving the original speaker's unique vocal characteristics.
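Speaker similarity is often quantified as the cosine similarity between speaker embeddings of the source and the translated speech. This summary does not state which metric or embedding model the paper uses, so the sketch below is an assumption about a typical setup; the embeddings are taken as given and the function name is hypothetical.

```python
import math
from typing import Sequence


def speaker_similarity(emb_a: Sequence[float], emb_b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings (range -1 to 1)."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

Values closer to 1 indicate that the two utterances sit near each other in the embedding space, which is usually read as "sounds like the same speaker".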

Critical Analysis

The researchers have addressed an important limitation of existing S2ST systems by enabling style transfer without requiring speaker-parallel data. This is a significant advancement, as obtaining high-quality paired data can be challenging in practice.

However, the paper does not delve into potential limitations or caveats of their approach. It would be helpful to understand the model's performance on edge cases, such as highly expressive or emotionally charged speech, or its ability to handle noisy or low-quality input audio.

Additionally, the researchers could have discussed the computational and memory requirements of their model, as well as any potential trade-offs between translation quality, speaker similarity, and inference speed. These factors would be crucial for real-world deployment and usability of the system.

Further research could explore the generalization of this style transfer capability to other speech-related tasks, such as voice conversion or text-to-speech synthesis, where preserving the speaker's identity is also important.

Conclusion

The researchers have developed an innovative S2ST pipeline that can preserve the speaker's timbre in the translated speech, overcoming a key limitation of existing systems. By leveraging discrete self-supervised representations and an acoustic language model with self-supervised style transfer capabilities, their approach achieves high-quality cross-lingual style transfer without relying on scarce speaker-parallel data.

This work represents a significant step forward in making speech translation systems more natural and user-friendly, as the translated output can now retain the unique vocal characteristics of the original speaker. The potential applications of this technology include multilingual communication, language learning, and accessibility for people with speech or hearing impairments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Yongqi Wang, Jionghao Bai, Rongjie Huang, Ruiqi Li, Zhiqing Hong, Zhou Zhao

Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation. We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style transfer ability without relying on any speaker-parallel data, thereby overcoming data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speeches with high fidelity and speaker similarity. Audio samples are available at http://stylelm.github.io/ .

7/22/2024

Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation

Min-Jae Hwang, Ilia Kulikov, Benjamin Peloquin, Hongyu Gong, Peng-Jen Chen, Ann Lee

In this paper, we propose a textless acoustic model with a self-supervised distillation strategy for noise-robust expressive speech-to-speech translation (S2ST). Recently proposed expressive S2ST systems have achieved impressive expressivity preservation performances by cascading a unit-to-speech (U2S) generator to the speech-to-unit translation model. However, these systems are vulnerable to the presence of noise in input speech, which is an assumption in real-world translation scenarios. To address this limitation, we propose a U2S generator that incorporates a distillation with no label (DINO) self-supervised training strategy into its pretraining process. Because the proposed method captures noise-agnostic expressivity representation, it can generate qualified speech even in noisy environments. Objective and subjective evaluation results verified that the proposed method significantly improved the performance of the expressive S2ST system in noisy environments while maintaining competitive performance in clean environments.

6/6/2024

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng

Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, the training of these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data and pretrained models, which have not been fully utilized in the development of S2ST models. Inspired by this, in this paper, we first introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model. Furthermore, to eliminate the reliance on parallel speech data, we propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data. It aligns representations in the latent space through contrastive learning, enabling the speech synthesis capability learned from the TTS data to generalize to S2ST in a zero-shot manner. Experimental results on the CVSS dataset show that when the parallel speech data is available, ComSpeech surpasses previous two-pass models like UnitY and Translatotron 2 in both translation quality and decoding speed. When there is no parallel speech data, ComSpeech-ZS lags behind ComSpeech by only 0.7 ASR-BLEU and outperforms the cascaded models.

6/12/2024

SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought

Hongyu Gong, Bandhav Veluri

Expressive speech-to-speech translation (S2ST) is a key research topic in seamless communication, which focuses on the preservation of semantics and speaker vocal style in translated speech. Early works synthesized speaker style aligned speech in order to directly learn the mapping from speech to target speech spectrogram. Without reliance on style aligned data, recent studies leverage the advances of language modeling (LM) and build cascaded LMs on semantic and acoustic tokens. This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST. We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting. The model is first guided to translate target semantic content and then transfer the speaker style to multi-stream acoustic units. Evaluated on Spanish-to-English and Hungarian-to-English translations, SeamlessExpressiveLM outperforms cascaded LMs in both semantic quality and style transfer, meanwhile achieving better parameter efficiency.

6/3/2024