Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation

Read original: arXiv:2308.01831 - Published 8/20/2024 by Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro

Overview

The paper proposes a "textless" training method for multilingual speech-to-speech translation.
It can also benefit the transfer of pre-trained knowledge to text-based systems, text-to-speech synthesis, and text-to-speech translation.
The method represents multilingual speech using discrete speech units derived from a self-supervised speech model.
These speech units are treated as "pseudo-text" to focus on the linguistic content of the speech.
An encoder-decoder model is trained in a many-to-many "Unit-to-Unit Translation" (UTUT) setting.
The trained model can be easily transferred to text-related tasks, even if it was trained in a textless manner.

Plain English Explanation

The paper describes a new approach for training speech translation systems without using any written text. Instead of working with written words, the method represents speech using special "speech units" - discrete representations of the audio features. By treating these speech units as a kind of "pseudo-text", the system can learn to translate between different languages while focusing only on the linguistic content of the speech, without needing to deal with written words.

The key idea is to train an encoder-decoder model to convert the speech units of the source language into the speech units of the target language. The encoder is conditioned on the source language, while the decoder is conditioned on the target language. Trained across many language pairs, the model learns both how to interpret each input language and how to relate it to the other languages.

Because speech units can be linked to both audio (by quantization) and text (by phonemization), the trained model can be transferred to text-related tasks, such as text-to-speech synthesis and text-to-speech translation, even though it was originally trained in a "textless" manner without using any written words.

Technical Explanation

The paper proposes a "textless" training approach for multilingual Speech-to-Speech Translation (S2ST). The core idea is to represent multilingual speech using discrete speech units derived from a self-supervised speech model. These speech units are treated as "pseudo-text" to focus on the linguistic content of the speech, which can be easily associated with both speech and text modalities at the phonetic level.
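The discretization step can be illustrated with a minimal sketch. It assumes the units are obtained by assigning each feature frame to its nearest centroid in a learned codebook and then collapsing consecutive repeats; both choices are common in textless speech work but are assumptions here, not details confirmed by this summary. The function names, the 2-D toy features, and the three centroids are all hypothetical.

```python
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float],
                     centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(frames: Sequence[Sequence[float]],
             centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map every feature frame to a discrete unit ID (its nearest centroid)."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical units (an assumed, commonly used reduction)."""
    collapsed: List[int] = []
    for unit in units:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


# Toy 2-D "features"; real features would come from a self-supervised
# speech model and the codebook would be learned from data.
toy_centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
toy_frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.1), (0.1, 0.9), (0.0, 1.0)]

toy_units = quantize(toy_frames, toy_centroids)
toy_collapsed = collapse_repeats(toy_units)
print(toy_units, toy_collapsed)
```

The resulting integer sequence is what the text calls "pseudo-text": a token stream that a sequence-to-sequence model can consume like words.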

The authors propose training an encoder-decoder model in a many-to-many "Unit-to-Unit Translation" (UTUT) setting. The encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the target language token to generate the translated speech in the target language. This allows the model to build knowledge of how languages are comprehended and how to relate them to different languages.
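The language-token conditioning can be sketched as a data-preparation step. The sketch below assumes the language token is prepended to the unit sequence on each side and that every ordered pair of languages in a set of parallel utterances becomes one training example; the token format, the `<eos>` marker, and all function names are hypothetical choices made for illustration, not the authors' implementation.

```python
from itertools import permutations
from typing import Dict, List

EOS = "<eos>"  # hypothetical end-of-sequence marker


def lang_token(lang: str) -> str:
    """Format a language code as a special token, e.g. 'en' -> '<en>'."""
    return f"<{lang}>"


def make_example(src_lang: str, src_units: List[int],
                 tgt_lang: str, tgt_units: List[int]) -> Dict[str, List[str]]:
    """Build one unit-to-unit example with language tokens on both sides."""
    tgt_tokens = [str(unit) for unit in tgt_units]
    return {
        # Encoder sees the source-language token followed by source units.
        "encoder_input": [lang_token(src_lang)] + [str(u) for u in src_units],
        # Decoder starts from the target-language token (teacher forcing).
        "decoder_input": [lang_token(tgt_lang)] + tgt_tokens,
        # Decoder is trained to emit the target units, then stop.
        "decoder_target": tgt_tokens + [EOS],
    }


def many_to_many(parallel: Dict[str, List[int]]) -> List[Dict[str, List[str]]]:
    """Create an example for every ordered (source, target) language pair."""
    return [
        make_example(src, parallel[src], tgt, parallel[tgt])
        for src, tgt in permutations(parallel, 2)
    ]


# Toy unit sequences for the "same" utterance in three languages.
toy_parallel = {"en": [5, 12, 7], "es": [9, 3], "fr": [4, 4, 8]}
toy_examples = many_to_many(toy_parallel)
print(len(toy_examples), toy_examples[0])
```

With three languages this yields six directed pairs, which is what makes the setting many-to-many rather than a collection of separate one-to-one systems.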

Since speech units can be related to both audio and text, by quantization and phonemization respectively, the trained UTUT model can be transferred to text-related tasks such as multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST), requiring only minimal fine-tuning steps on text inputs.

Critical Analysis

The paper presents a novel and compelling approach for multilingual speech translation that avoids the need for written text. By using discrete speech units as a "pseudo-text" representation, the method can learn to translate between languages while focusing solely on the linguistic content of the speech.

One potential limitation is that the performance of the UTUT model may be dependent on the quality and robustness of the underlying self-supervised speech model used to extract the speech units. If the speech units do not adequately capture the relevant linguistic information, the translation performance could suffer.

Additionally, while the paper demonstrates the ability to transfer the UTUT model to text-related tasks, it would be valuable to further investigate the potential performance gaps between the textless UTUT model and text-based systems, as well as the specific fine-tuning requirements for each task.

Overall, the research provides an intriguing direction for advancing speech translation capabilities, especially in multilingual settings where text resources may be scarce. Further exploration of the limitations and potential refinements of the UTUT approach could yield valuable insights for the field.

Conclusion

The proposed "textless" training method for multilingual speech-to-speech translation represents a significant step forward in leveraging the linguistic content of speech without relying on written text. By using discrete speech units as a "pseudo-text" representation, the method can build knowledge of how languages are comprehended and translated, while enabling easy transfer to text-based tasks like synthesis and translation.

This approach has the potential to greatly benefit multilingual speech applications, especially in settings where written text resources are limited. The ability to train effective speech translation systems without needing large text corpora opens up new possibilities for expanding access to communication technologies in underserved languages and communities.

Further research on the limitations and refinements of the UTUT model, as well as its broader implications for the field of speech and language processing, could yield valuable insights and accelerate progress in this important domain.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation

Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro

This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based systems, text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units that are the discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both speech and text modalities at the phonetic level. By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the target language token to generate the translated speech in the target language. Therefore, during training, the model can build the knowledge of how languages are comprehended and how to relate them to different languages. Since speech units can be easily associated from both audio and text by quantization and phonemization respectively, the trained model can easily be transferred to text-related tasks, even if it is trained in a textless manner. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST), requiring only minimal fine-tuning steps on text inputs. By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks.

8/20/2024

Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation

Jarod Duret (LIA), Yannick Estève (LIA), Titouan Parcollet (CAM)

Recent advancements in textless speech-to-speech translation systems have been driven by the adoption of self-supervised learning techniques. Although most state-of-the-art systems adopt a similar architecture to transform source language speech into sequences of discrete representations in the target language, the criteria for selecting these target speech units remain an open question. This work explores the selection process through a study of downstream tasks such as automatic speech recognition, speech synthesis, speaker recognition, and emotion recognition. Interestingly, our findings reveal a discrepancy in the optimization of discrete speech units: units that perform well in resynthesis do not necessarily correlate with those that enhance translation efficacy. This discrepancy underscores the nuanced complexity of target feature selection and its impact on the overall performance of speech-to-speech translation systems.

7/29/2024

Compact Speech Translation Models via Discrete Speech Units Pretraining

Tsz Kin Lam, Alexandra Birch, Barry Haddow

We propose a pretraining method that uses a Self-Supervised Speech (SSS) model to create more compact Speech-to-text Translation models. In contrast to using the SSS model for initialization, our method is more suitable for memory-constrained scenarios such as on-device deployment. Our method is based on Discrete Speech Units (DSU) extracted from the SSS model. In the first step, our method pretrains two smaller encoder-decoder models on 1) Filterbank-to-DSU (Fbk-to-DSU) and 2) DSU-to-Translation (DSU-to-Trl) data respectively. The DSU thus become the distillation inputs of the smaller models. Subsequently, the encoder from the Fbk-to-DSU model and the decoder from the DSU-to-Trl model are taken to initialise the compact model. Finally, the compact model is finetuned on the paired Fbk-Trl data. In addition to being compact, our method requires no transcripts, making it applicable to low-resource settings. It also avoids speech discretization in inference and is more robust to the DSU tokenization. Evaluation on CoVoST-2 (X-En) shows that our method has consistent improvement over the baseline in three metrics while being compact, i.e., only half the SSS model size.

6/27/2024

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Fadi Biadsy, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov

Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth). With just 15 minutes of transcribed, found data, we can reduce the intelligibility difference to 1% or less from the ground-truth, and achieve naturalness scores that match the ground-truth in several languages.

7/17/2024