Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations

Read original: arXiv:2409.08039 - Published 9/14/2024 by Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, He Qu

Overview

This paper proposes a zero-shot singing voice conversion (SVC) method that leverages clustering-based phoneme representations.
The approach aims to enable singing voice conversion between any source and target singers without requiring paired training data.
The key ideas involve using clustering to learn phoneme representations and employing these representations to guide the conversion process.

Plain English Explanation

The paper presents a new way to convert one singer's voice into another singer's voice without needing training data that directly pairs the two singers. This is called "zero-shot" singing voice conversion, meaning the system can perform the conversion even if it has not been trained on the specific voices involved.

The core of the approach is using machine learning to identify the basic building blocks of speech, called "phonemes," and then representing each singer's voice in terms of these phoneme patterns. By learning these phoneme representations in a general way, the system can apply them to convert between any pair of singers, even if it's never heard their voices before.

The key insight is that phonemes - the fundamental speech sounds that make up words - provide a common language that can bridge the gap between different singing voices. Even if two singers have very different vocal styles, they're still using the same underlying phonemes. The system learns to map these phonemes in a way that allows it to transform one singer's voice into another's, without needing to be trained on that specific pairing.

This zero-shot capability is valuable because it means the voice conversion system can be applied much more broadly, without the need to collect large amounts of training data for every possible singer pair. It opens up the potential for more flexible and accessible voice conversion technology.

Technical Explanation

The proposed approach consists of two key components:

Clustering-based Phoneme Representation Learning: The method first learns general phoneme representations in an unsupervised manner by clustering acoustic features extracted from a large corpus of speech data. This results in a set of cluster centroids that serve as the phoneme representations.
Zero-Shot Singing Voice Conversion: During the conversion stage, the source singer's audio is first mapped to the learned phoneme representations. A neural network model then transforms these phoneme-level features to match the target singer's voice characteristics. This allows converting between any source and target singers without requiring parallel training data.
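
The two components above can be illustrated with a minimal sketch. It assumes plain k-means over frame-level features and nearest-centroid lookup; the summary does not say which clustering algorithm or acoustic features the paper uses, so the function names `fit_centroids` and `quantize` and the synthetic 2-D "frames" are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize(features, centroids):
    """Map each frame to the index of its nearest centroid (Euclidean distance)."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def fit_centroids(features, k, iters=20, seed=0):
    """Plain k-means (an assumed stand-in): returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames chosen at random.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        ids = quantize(features, centroids)
        for c in range(k):
            members = features[ids == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids

# Synthetic stand-in for frame-level acoustic features (not real audio).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
frames = np.concatenate([c + 0.1 * rng.standard_normal((50, 2)) for c in centres])

centroids = fit_centroids(frames, k=3)
unit_ids = quantize(frames, centroids)  # one discrete unit id per frame
content = centroids[unit_ids]           # each frame replaced by its centroid
```

Replacing every frame with its centroid discards within-cluster variation, which is the intuition behind using cluster centroids as a content representation that is shared across singers.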

The key technical innovation is the use of the clustering-based phoneme representations to bridge the gap between arbitrary singers. By learning phoneme patterns in a general way, the system can apply this knowledge to convert between new singer pairs, overcoming the limitation of needing paired training data for each conversion scenario.
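
To make the zero-shot idea concrete, here is a hedged sketch of how a decoder input might be assembled at inference time. Everything in it is an assumption for illustration: the toy codebook, the mean-pooled `singer_embedding`, and the concatenation in `build_decoder_input` are hypothetical, and the paper's actual encoder, singer representation, and decoder are not described in this summary.

```python
import numpy as np

def singer_embedding(reference_frames):
    """Hypothetical timbre summary: the mean of the reference singer's frames."""
    return reference_frames.mean(axis=0)

def build_decoder_input(unit_ids, centroids, target_embedding):
    """Look up content centroids and append the target embedding to every frame."""
    content = centroids[unit_ids]  # (T, content_dim)
    timbre = np.broadcast_to(
        target_embedding, (len(unit_ids), target_embedding.shape[0])
    )                              # (T, emb_dim), same vector on every frame
    return np.concatenate([content, timbre], axis=1)

# Toy values, assumed for illustration only.
centroids = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3-unit codebook
unit_ids = np.array([0, 0, 2, 1])                            # units from the source singing
reference = np.array([[0.2, 0.4, 0.6], [0.4, 0.6, 0.8]])     # frames from an unseen target singer

target = singer_embedding(reference)
decoder_input = build_decoder_input(unit_ids, centroids, target)
```

In this sketch the target singer enters only through an embedding computed from reference audio at inference time, so a new singer requires no retraining and no paired data.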

The paper presents experiments demonstrating the effectiveness of this zero-shot SVC approach, showing it can achieve comparable performance to supervised methods that require paired training data. This highlights the potential of the proposed technique to enable more flexible and accessible voice conversion applications.

Critical Analysis

The paper makes a compelling case for the value of zero-shot singing voice conversion, but there are a few limitations and areas for further research worth noting:

The study is limited to English speech and does not explore the approach's generalization to other languages. Extending the phoneme representation learning to multilingual settings could broaden the applicability.
The experiments focus on converting between professional singers. Assessing the method's performance on converting between amateur or non-professional voices could reveal additional challenges or considerations.
While the zero-shot capability is a key strength, the paper does not discuss the system's robustness to variability in the source and target singers' vocal characteristics, recording conditions, or singing styles. Further evaluation of these factors would be valuable.
The results are summarized here mainly in terms of objective metrics, although the paper's abstract also mentions user feedback. More detail on the subjective human evaluation would give additional insight into the perceptual fidelity of the converted voices.
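
As an example of an objective metric of the kind mentioned above, timbre similarity is often measured as the cosine similarity between speaker embeddings of two recordings. The sketch below assumes the embeddings have already been extracted by some speaker-verification model; the three vectors are made-up values, and whether the paper uses this exact metric is not stated in this summary.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

# Made-up embeddings standing in for the output of a speaker encoder.
converted = [0.9, 0.1, 0.0]
target = [1.0, 0.0, 0.0]
source = [0.0, 1.0, 0.0]

sim_to_target = cosine_similarity(converted, target)  # higher is better
sim_to_source = cosine_similarity(converted, source)  # high values suggest timbre leakage
```

A converted recording that stays close to the source singer's embedding is a sign of the timbre leakage discussed in the related abstracts below.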

Overall, the proposed zero-shot SVC approach based on clustering-based phoneme representations is a promising step forward in making voice conversion technology more flexible and accessible. Continued research into the technique's broader applicability and robustness could further enhance its real-world impact.

Conclusion

This paper presents a novel zero-shot singing voice conversion method that leverages clustering-based phoneme representations. By learning general phoneme patterns in an unsupervised manner, the approach can convert between any source and target singers without requiring paired training data.

The key contribution is the use of these phoneme representations to bridge the gap between different singing voices, enabling flexible voice conversion without the constraints of traditional supervised methods. The experiments demonstrate the effectiveness of this approach, suggesting its potential to enable more accessible and broadly applicable voice conversion applications.

While the paper focuses on specific technical aspects, the underlying idea of using learned phoneme representations to facilitate zero-shot voice conversion holds promise for advancing the field of speech and singing synthesis. Further research exploring the method's generalization, robustness, and human perceptual evaluation could help unlock the full potential of this approach.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations

Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, He Qu

This study presents an innovative Zero-Shot any-to-any Singing Voice Conversion (SVC) method, leveraging a novel clustering-based phoneme representation to effectively separate content, timbre, and singing style. This approach enables precise voice characteristic manipulation. We discovered that datasets with fewer recordings per artist are more susceptible to timbre leakage. Extensive testing on over 10,000 hours of singing and user feedback revealed our model significantly improves sound quality and timbre accuracy, aligning with our objectives and advancing voice conversion technology. Furthermore, this research advances zero-shot SVC and sets the stage for future work on discrete speech representation, emphasizing the preservation of rhyme.

9/14/2024

LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance

Shihao Chen, Yu Gu, Jie Zhang, Na Li, Rilin Chen, Liping Chen, Lirong Dai

Any-to-any singing voice conversion (SVC) is an interesting audio editing technique, aiming to convert the singing voice of one singer into that of another, given only a few seconds of singing data. However, during the conversion process, the issue of timbre leakage is inevitable: the converted singing voice still sounds like the original singer's voice. To tackle this, we propose a latent diffusion model for SVC (LDM-SVC) in this work, which attempts to perform SVC in the latent space using an LDM. We pretrain a variational autoencoder structure using the noted open-source So-VITS-SVC project based on the VITS framework, which is then used for the LDM training. Besides, we propose a singer guidance training method based on classifier-free guidance to further suppress the timbre of the original singer. Experimental results show the superiority of the proposed method over previous works in both subjective and objective evaluations of timbre similarity.

6/11/2024

SaMoye: Zero-shot Singing Voice Conversion Based on Feature Disentanglement and Synthesis

Zihao Wang, Le Ma, Yongsheng Feng, Xin Pan, Yuhang Jin, Kejun Zhang

Singing voice conversion (SVC) aims to convert a singer's voice to another singer's from a reference audio while keeping the original semantics. However, existing SVC methods can hardly perform zero-shot due to incomplete feature disentanglement or dependence on the speaker look-up table. We propose the first open-source high-quality zero-shot SVC model SaMoye that can convert singing to human and non-human timbre. SaMoye disentangles the singing voice's features into content, timbre, and pitch features, where we combine multiple ASR models and compress the content features to reduce timbre leaks. Besides, we enhance the timbre features by unfreezing the speaker encoder and mixing the speaker embedding with top-3 similar speakers. We also establish an unparalleled large-scale dataset to guarantee zero-shot performance, which comprises more than 1,815 hours of pure singing voice and 6,367 speakers. We conduct objective and subjective experiments to find that SaMoye outperforms other models in zero-shot SVC tasks even under extreme conditions like converting singing to animals' timbre. The code and weight of SaMoye are available on https://github.com/CarlWangChina/SaMoye-SVC.

9/16/2024

Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion

Xueyao Zhang, Zihao Fang, Yicheng Gu, Haopeng Chen, Lexiao Zou, Junan Zhang, Liumeng Xue, Zhizheng Wu

Singing Voice Conversion (SVC) is a technique that enables any singer to perform any song. To achieve this, it is essential to obtain speaker-agnostic representations from the source audio, which poses a significant challenge. A common solution involves utilizing a semantic-based audio pretrained model as a feature extractor. However, the degree to which the extracted features can meet the SVC requirements remains an open question. This includes their capability to accurately model melody and lyrics, the speaker-independency of their underlying acoustic information, and their robustness for in-the-wild acoustic environments. In this study, we investigate the knowledge within classical semantic-based pretrained models in much detail. We discover that the knowledge of different models is diverse and can be complementary for SVC. Based on the above, we design a Singing Voice Conversion framework based on Diverse Semantic-based Feature Fusion (DSFF-SVC). Experimental results demonstrate that DSFF-SVC can be generalized and improve various existing SVC models, particularly in challenging real-world conversion tasks. Our demo website is available at https://diversesemanticsvc.github.io/.

9/17/2024