Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Read original: arXiv:2407.01291 - Published 7/2/2024 by Kenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima

Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Overview

This paper introduces a novel approach called Lightweight Zero-shot Text-to-Speech with Mixture of Adapters (LZ-TTS-MoA), which enables efficient text-to-speech (TTS) for unseen speakers without the need for fine-tuning the entire model.
The key idea is to use a mixture of lightweight adapter modules that can be quickly adapted to new speakers, rather than retraining the entire TTS model.
This allows the model to generate high-quality speech for unseen speakers with a small computational footprint, making it suitable for deployment on resource-constrained devices.

Plain English Explanation

The paper presents a new way to generate human-like speech from text for speakers that the model hasn't seen before, without having to completely retrain the entire speech generation system. Instead, the researchers use a collection of small "adapter" modules that can be quickly adjusted to adapt to new speakers. This allows the model to produce high-quality speech for new speakers, while using much less computing power compared to retraining the whole system. This makes the technology more practical for use on devices with limited resources, like smartphones or smart speakers.

Technical Explanation

The researchers introduce the Lightweight Zero-shot Text-to-Speech with Mixture of Adapters (LZ-TTS-MoA) approach. The key innovation is using a "mixture of adapters" - a collection of small neural network modules that can be efficiently combined and fine-tuned to adapt the TTS model to new speakers.

Rather than retraining the entire TTS model for each new speaker, the adapters allow the model to be quickly and easily customized. The adapters are added to the base TTS model, and their parameters are the only ones updated during fine-tuning, leaving the majority of the model unchanged.

The experiments show that LZ-TTS-MoA can achieve high-quality speech generation for unseen speakers, while using significantly fewer parameters and less computational resources than traditional fine-tuning approaches. This makes the model suitable for deployment on edge devices with limited capabilities.

Critical Analysis

The paper presents a compelling solution to the challenge of efficient zero-shot text-to-speech generation. The use of adapter modules is a clever way to balance speaker adaptation and model complexity, and the results demonstrate the effectiveness of this approach.

One potential limitation is that the paper only evaluates the model on a relatively small number of target speakers (10). It would be interesting to see how well the approach scales to a larger and more diverse set of speakers.

Additionally, the paper does not directly compare LZ-TTS-MoA to other recent zero-shot TTS methods, such as USAT, MobileSpeech, or Ditto-TTS. A more comprehensive benchmarking against these related approaches could further validate the advantages of the proposed method.

Conclusion

The Lightweight Zero-shot Text-to-Speech with Mixture of Adapters (LZ-TTS-MoA) approach introduced in this paper represents an innovative solution for efficient, high-quality text-to-speech generation for unseen speakers. By leveraging a mixture of lightweight adapter modules, the model can be quickly adapted to new speakers without the need for computationally expensive retraining of the entire system.

This advance has the potential to enable more widespread deployment of text-to-speech technology, particularly on resource-constrained devices such as smartphones and smart speakers. As language models and speech synthesis continue to advance, approaches like LZ-TTS-MoA will be crucial for bringing these capabilities to a wide range of applications and users.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Kenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima

The advancements in zero-shot text-to-speech (TTS) methods, based on large-scale models, have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. These modules enhance the ability to adapt a wide variety of speakers in a zero-shot manner by selecting appropriate adapters associated with speaker characteristics on the basis of speaker embeddings. Our method achieves high-quality speech synthesis with minimal additional parameters. Through objective and subjective evaluations, we confirmed that our method achieves better performance than the baseline with less than 40% of parameters at 1.9 times faster inference speed. Audio samples are available on our demo page (https://ntt-hilab-gensp.github.io/is2024lightweightTTS/).

7/2/2024

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Wenbin Wang, Yang Song, Sanjay Jha

Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot or few-shot speaker-adaptive TTS approaches have been explored, they have many limitations. Zero-shot approaches tend to suffer from insufficient generalization performance to reproduce the voice of speakers with heavy accents. While few-shot methods can reproduce highly varying accents, they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches only provide either zero-shot or few-shot adaptation, constraining their utility across varied real-world scenarios with different demands. Besides, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting a vast portion of non-native speakers with diverse accents. Our proposed framework unifies both zero-shot and few-shot speaker adaptation strategies, which we term as instant and fine-grained adaptations based on their merits. To alleviate the insufficient generalization performance observed in zero-shot speaker adaptation, we designed two innovative discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce storage implications for few-shot speaker adaptation, we designed two adapters and a unique adaptation procedure.

4/30/2024

💬

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

Shengpeng Ji, Ziyue Jiang, Hanting Wang, Jialong Zuo, Zhou Zhao

Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice cloning capabilities, requiring only a few seconds of unseen speaker voice prompts. However, all previous work has been developed for cloud-based systems. Taking autoregressive models as an example, although these approaches achieve high-fidelity voice cloning, they fall short in terms of inference speed, model size, and robustness. Therefore, we propose MobileSpeech, which is a fast, lightweight, and robust zero-shot text-to-speech system based on mobile devices for the first time. Specifically: 1) leveraging discrete codec, we design a parallel speech mask decoder module called SMD, which incorporates hierarchical information from the speech codec and weight mechanisms across different codec layers during the generation process. Moreover, to bridge the gap between text and speech, we introduce a high-level probabilistic mask that simulates the progression of information flow from less to more during speech generation. 2) For speaker prompts, we extract fine-grained prompt duration from the prompt speech and incorporate text, prompt speech by cross attention in SMD. We demonstrate the effectiveness of MobileSpeech on multilingual datasets at different levels, achieving state-of-the-art results in terms of generating speed and speech quality. MobileSpeech achieves RTF of 0.09 on a single A100 GPU and we have successfully deployed MobileSpeech on mobile devices. Audio samples are available at url{https://mobilespeech.github.io/} .

6/4/2024

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech

Heeseung Kim, Sang-gil Lee, Jiheum Yeom, Che Hyun Lee, Sungwon Kim, Sungroh Yoon

We propose VoiceTailor, a parameter-efficient speaker-adaptive text-to-speech (TTS) system, by equipping a pre-trained diffusion-based TTS model with a personalized adapter. VoiceTailor identifies pivotal modules that benefit from the adapter based on a weight change ratio analysis. We utilize Low-Rank Adaptation (LoRA) as a parameter-efficient adaptation method and incorporate the adapter into pivotal modules of the pre-trained diffusion decoder. To achieve powerful adaptation performance with few parameters, we explore various guidance techniques for speaker adaptation and investigate the best strategies to strengthen speaker information. VoiceTailor demonstrates comparable speaker adaptation performance to existing adaptive TTS models by fine-tuning only 0.25% of the total parameters. VoiceTailor shows strong robustness when adapting to a wide range of real-world speakers, as shown in the demo.

8/29/2024