StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion

Read original: arXiv:2408.02178 - Published 8/6/2024 by Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang

StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion

Overview

Proposes StreamVoice+, an end-to-end streaming zero-shot voice conversion model
Leverages language models for parameter-efficient fine-tuning
Achieves real-time, high-fidelity voice conversion without audio samples from the target speaker

Plain English Explanation

StreamVoice+ is a new AI model that can convert one person's voice to sound like another person's voice, even if it has never heard that person's voice before. This is called "zero-shot" voice conversion. The model works in real-time, so the voice conversion happens instantly without any lag.

The key innovation is that StreamVoice+ uses a language model to help it learn how to do the voice conversion. This makes the model more efficient and easier to fine-tune for different voices, without needing lots of audio samples from the target speaker.

Instead of trying to directly map one voice to another, StreamVoice+ first converts the input voice into a more abstract "voice representation." It then uses the language model to guide the conversion of this representation into the desired target voice. This allows the model to generalize and handle voices it has never encountered before.

The result is a voice conversion system that is fast, high-quality, and can work with any target speaker, without requiring lots of training data for that speaker. This could have many applications, like making digital assistants or audiobooks sound more natural and personalized.

Technical Explanation

StreamVoice+ is an end-to-end streaming voice conversion model that leverages the power of large language models to achieve zero-shot voice conversion capabilities.

The core architecture consists of an encoder that maps the input speech into a latent voice representation, and a decoder that converts this representation into the target voice. Crucially, the model uses parameter-efficient fine-tuning of the language model to guide the voice conversion process, rather than relying on extensive training data from the target speaker.

This approach allows StreamVoice+ to achieve real-time, high-fidelity voice conversion without any audio samples from the target speaker. The residual connection and transducer-based design further enhances the model's robustness and performance.

Critical Analysis

The authors thoroughly evaluate StreamVoice+ on a range of zero-shot voice conversion benchmarks, demonstrating its superior performance compared to previous approaches. However, the paper does not extensively discuss potential limitations or failure modes of the model.

For example, it is unclear how well StreamVoice+ would handle highly accented or non-native speech, or whether the model's performance would degrade for target speakers with very distinct vocal characteristics. Additionally, the paper does not explore the model's sample efficiency or its ability to generalize to unseen languages or domains.

Further research could investigate these areas, as well as the model's interpretability and the potential fairness issues that could arise from deploying such a powerful voice conversion system in the real world.

Conclusion

StreamVoice+ represents an important step forward in the field of zero-shot voice conversion. By leveraging language models, the model can achieve real-time, high-fidelity voice conversion without requiring any target speaker audio samples. This opens up exciting possibilities for more natural and personalized voice interfaces, as well as various creative and accessibility applications.

The technical innovations behind StreamVoice+ demonstrate the value of exploring the intersection of speech and language processing, and suggest that continued advancements in this direction could yield further breakthroughs in the years to come.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion

Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang

StreamVoice has recently pushed the boundaries of zero-shot voice conversion (VC) in the streaming domain. It uses a streamable language model (LM) with a context-aware approach to convert semantic features from automatic speech recognition (ASR) into acoustic features with the desired speaker timbre. Despite its innovations, StreamVoice faces challenges due to its dependency on a streaming ASR within a cascaded framework, which complicates system deployment and optimization, affects VC system's design and performance based on the choice of ASR, and struggles with conversion stability when faced with low-quality semantic inputs. To overcome these limitations, we introduce StreamVoice+, an enhanced LM-based end-to-end streaming framework that operates independently of streaming ASR. StreamVoice+ integrates a semantic encoder and a connector with the original StreamVoice framework, now trained using a non-streaming ASR. This model undergoes a two-stage training process: initially, the StreamVoice backbone is pre-trained for voice conversion and the semantic encoder for robust semantic extraction. Subsequently, the system is fine-tuned end-to-end, incorporating a LoRA matrix to activate comprehensive streaming functionality. Furthermore, StreamVoice+ mainly introduces two strategic enhancements to boost conversion quality: a residual compensation mechanism in the connector to ensure effective semantic transmission and a self-refinement strategy that leverages pseudo-parallel speech pairs generated by the conversion backbone to improve speech decoupling. Experiments demonstrate that StreamVoice+ not only achieves higher naturalness and speaker similarity in voice conversion than its predecessor but also provides versatile support for both streaming and non-streaming conversion scenarios.

8/6/2024

StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion

Zhichao Wang, Yuanzhe Chen, Xinsheng Wang, Lei Xie, Yuping Wang

Recent language model (LM) advancements have showcased impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually apply offline conversion from source semantics to acoustic features, demanding the complete source speech and limiting their deployment to real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC, facilitating real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor, while alternately processing semantic and acoustic features at each time step of autoregression which eliminates the dependence on complete source speech. To address the potential performance degradation from the incomplete context in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, using a teacher model to summarize the present and future semantic context during training to guide the model's forecasting for missing context; 2) semantic masking strategy, promoting acoustic prediction from preceding corrupted semantic and acoustic input, enhancing context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model without any future look-ahead. Experiments demonstrate StreamVoice's streaming conversion capability while achieving zero-shot performance comparable to non-streaming VC systems.

7/22/2024

DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Ziqian Ning, Shuai Wang, Pengcheng Zhu, Zhichao Wang, Jixun Yao, Lei Xie, Mengxiao Bi

Streaming voice conversion has become increasingly popular for its potential in real-time applications. The recently proposed DualVC 2 has achieved robust and high-quality streaming voice conversion with a latency of about 180ms. Nonetheless, the recognition-synthesis framework hinders end-to-end optimization, and the instability of automatic speech recognition (ASR) model with short chunks makes it challenging to further reduce latency. To address these issues, we propose an end-to-end model, DualVC 3. With speaker-independent semantic tokens to guide the training of the content encoder, the dependency on ASR is removed and the model can operate under extremely small chunks, with cascading errors eliminated. A language model is trained on the content encoder output to produce pseudo context by iteratively predicting future frames, providing more contextual information for the decoder to improve conversion quality. Experimental results demonstrate that DualVC 3 achieves comparable performance to DualVC 2 in subjective and objective metrics, with a latency of only 50 ms.

6/13/2024

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

Linhan Ma, Xinfa Zhu, Yuanjun Lv, Zhichao Wang, Ziqian Wang, Wendi He, Hongbin Zhou, Lei Xie

Zero-shot voice conversion (VC) aims to transform source speech into arbitrary unseen target voice while keeping the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling process as well as training-inference mismatch still hinder conversion performance. In this paper, we propose Vec-Tok-VC+, a novel prompt-based zero-shot VC model improved from Vec-Tok Codec, achieving voice conversion given only a 3s target speaker prompt. We design a residual-enhanced K-Means decoupler to enhance the semantic content extraction with a two-layer clustering process. Besides, we employ teacher-guided refinement to simulate the conversion process to eliminate the training-inference mismatch, forming a dual-mode training strategy. Furthermore, we design a multi-codebook progressive loss function to constrain the layer-wise output of the model from coarse to fine to improve speaker similarity and content accuracy. Objective and subjective evaluations demonstrate that Vec-Tok-VC+ outperforms the strong baselines in naturalness, intelligibility, and speaker similarity.

6/17/2024