SingIt! Singer Voice Transformation

Read original: arXiv:2405.04627 - Published 5/9/2024 by Amit Eliav, Aaron Taub, Renana Opochinsky, Sharon Gannot

Overview

This paper introduces "SingIt!", a system that generates a singing voice from a normal speech utterance using zero-shot, many-to-many style transfer.
The system uses deep learning techniques, including a modified auto-encoder, to control aspects of the voice such as timbre, pitch, and vibrato.
SingIt! is designed to empower singers and music producers by giving them the ability to creatively manipulate vocal performances.

Plain English Explanation

SingIt! is a technology for transforming the sound of a voice. It uses machine learning to give singers more control over qualities of their singing voice such as tone, pitch, and vibrato.

This is useful for singers and music producers who want to experiment with different vocal styles or characters. Maybe a singer wants to sound younger or older, or maybe they want to blend their voice with a different genre. SingIt! gives them the tools to try out these creative ideas and explore new possibilities for their vocal performances.

The key innovation of SingIt! is that it provides fine-grained control over the various elements of a singer's voice. Rather than applying a simple effect or filter, the system manipulates the underlying characteristics of the voice to achieve the desired transformation, which is intended to give more nuanced and natural-sounding results than traditional voice-changing techniques.

Technical Explanation

SingIt! is a deep learning-based system for singer voice transformation. It takes as input a recorded vocal performance and allows the user to adjust parameters like timbre, pitch, and vibrato to alter the singer's voice.
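To make the pitch and vibrato parameters concrete, here is a minimal sketch of how such adjustments act on a fundamental-frequency (f0) contour. It is not the paper's method: the function names shift_pitch and add_vibrato, the convention that 0.0 marks an unvoiced frame, and the default vibrato rate and depth are all assumptions chosen for illustration. The only fixed facts it relies on are that one semitone is a frequency ratio of 2^(1/12) and that one cent is 1/100 of a semitone.

```python
import math

def shift_pitch(f0_hz, semitones):
    """Scale every voiced f0 value by 2**(semitones/12).

    A value of 0.0 is treated as an unvoiced frame (an assumption of
    this sketch) and is left unchanged.
    """
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]

def add_vibrato(f0_hz, frame_rate_hz, rate_hz=5.5, depth_cents=50.0):
    """Modulate f0 with a sine wave.

    rate_hz is the vibrato speed; depth_cents is the peak deviation
    in cents. Both defaults are illustrative, not taken from the paper.
    """
    out = []
    for i, f in enumerate(f0_hz):
        if f <= 0:
            out.append(0.0)  # keep unvoiced frames unvoiced
            continue
        t = i / frame_rate_hz  # time of this frame in seconds
        cents = depth_cents * math.sin(2.0 * math.pi * rate_hz * t)
        out.append(f * 2.0 ** (cents / 1200.0))
    return out

# Hypothetical contour: A3 (220 Hz) with one unvoiced frame.
contour = [220.0, 220.0, 0.0, 220.0]
up_octave = shift_pitch(contour, 12)               # 12 semitones = one octave
with_vibrato = add_vibrato(contour, frame_rate_hz=100.0)
```

Shifting by 12 semitones doubles every voiced value, so the 220 Hz frames become 440 Hz while the unvoiced frame stays at zero. A learned system would predict or condition on such a contour rather than edit it by hand, but the arithmetic of the controls is the same.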

The core of the system is a neural network architecture that models the complex relationships between these voice characteristics. By learning from a large dataset of singing voice samples, the model is able to generate new vocal renditions that match the desired settings specified by the user.

A key technical innovation is the use of disentangled representations, which enable the network to independently control different aspects of the voice. This allows for fine-grained manipulation that goes beyond simple voice conversion approaches.
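The idea of a disentangled representation can be illustrated with a toy example. The sketch below is not the SingIt! network: it assumes, purely for illustration, that the "speaker" code is the time-average of an utterance's feature frames and the "content" code is whatever remains after subtracting that average. All function names and the sample numbers are hypothetical.

```python
def mean_vector(frames):
    """Average each feature dimension over time."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def encode(frames):
    """Split an utterance into (content, speaker) codes.

    speaker: time-averaged feature vector, constant over the utterance
    content: per-frame residual after removing that average
    """
    speaker = mean_vector(frames)
    content = [[x - m for x, m in zip(frame, speaker)] for frame in frames]
    return content, speaker

def decode(content, speaker):
    """Recombine a content code with a speaker code."""
    return [[c + s for c, s in zip(frame, speaker)] for frame in content]

def convert(source_frames, target_frames):
    """Keep the source content, swap in the target speaker code."""
    content, _ = encode(source_frames)
    _, target_speaker = encode(target_frames)
    return decode(content, target_speaker)

# Hypothetical 2-D feature frames for two utterances.
source = [[1.0, 4.0], [3.0, 6.0]]
target = [[10.0, 20.0], [12.0, 20.0], [14.0, 20.0]]
converted = convert(source, target)
```

Because the two codes are stored separately, one can be replaced without touching the other: the converted frames carry the target's average (its "speaker" code) while their frame-to-frame variation still matches the source. A neural model learns far richer codes than a mean and a residual, but the swap-one-factor principle is the same.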

The system is trained in a self-supervised manner, leveraging unpaired singing voice data to learn the underlying structure of vocal performances. This enables SingIt! to work with a wide range of singers and musical styles, without requiring specialized training data for each case.
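Training without labels or paired examples usually means the input itself serves as the training target. As a rough illustration only, the sketch below trains a tiny linear auto-encoder (2-D input, 1-D code) by gradient descent on a reconstruction loss. The architecture, learning rate, step count, and synthetic data are all assumptions made for this example and are not taken from the paper.

```python
def reconstruct(x, w, v):
    """Encode a 2-D input to a 1-D code, then decode it back to 2-D."""
    z = w[0] * x[0] + w[1] * x[1]      # encoder
    return [v[0] * z, v[1] * z], z     # decoder output and code

def mse(data, w, v):
    """Mean squared reconstruction error; the target is the input itself."""
    total = 0.0
    for x in data:
        x_hat, _ = reconstruct(x, w, v)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)

def train(data, steps=500, lr=0.01):
    """Plain gradient descent on the reconstruction loss."""
    w, v = [0.1, 0.1], [0.1, 0.1]
    for _ in range(steps):
        gw, gv = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            x_hat, z = reconstruct(x, w, v)
            err = [x_hat[0] - x[0], x_hat[1] - x[1]]
            back = err[0] * v[0] + err[1] * v[1]  # error propagated to the code
            for i in range(2):
                gv[i] += 2.0 * err[i] * z / len(data)
                gw[i] += 2.0 * back * x[i] / len(data)
        w = [w[i] - lr * gw[i] for i in range(2)]
        v = [v[i] - lr * gv[i] for i in range(2)]
    return w, v

# Synthetic, unlabeled data lying on a line, so a 1-D code suffices.
data = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.5, 1.0)]
loss_before = mse(data, [0.1, 0.1], [0.1, 0.1])
w, v = train(data)
loss_after = mse(data, w, v)
```

No example in the data set is paired with a "correct answer" from another singer; the model is only asked to rebuild what it was given, and the reconstruction loss falls well below its starting value as training proceeds. Voice systems built on auto-encoders follow the same pattern at much larger scale.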

Critical Analysis

The research presented in this paper demonstrates the impressive capabilities of deep learning for singer voice transformation. By providing granular control over vocal characteristics, SingIt! opens up new creative possibilities for singers and producers.

However, the paper does not address potential ethical concerns around the technology. There could be issues around the misuse of voice transformation to impersonate individuals or create synthetic media. The authors should have discussed safeguards or guidelines to ensure SingIt! is used responsibly.

Additionally, the abstract reports only a listening test with a group of 25 non-expert listeners. Assessments by trained singers or producers of the artistic and emotional expressiveness of the transformed vocals would have provided a more complete picture of the system's capabilities.

Overall, the technical advances presented in this paper are significant, but the authors could have done more to address the broader implications and limitations of their work.

Conclusion

SingIt! is a deep learning-based system that lets singers and producers transform vocal performances. By providing fine-grained control over voice characteristics, the technology opens up new creative possibilities for musical expression and experimentation.

While the technical achievements are impressive, the paper could have done more to address potential ethical considerations and subjective assessments of the system's artistic merits. Nevertheless, SingIt! represents a significant step forward in the field of voice manipulation and synthesis, with exciting implications for the future of music creation and performance.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

SingIt! Singer Voice Transformation

Amit Eliav, Aaron Taub, Renana Opochinsky, Sharon Gannot

In this paper, we propose a model which can generate a singing voice from normal speech utterance by harnessing zero-shot, many-to-many style transfer learning. Our goal is to give anyone the opportunity to sing any song in a timely manner. We present a system comprising several available blocks, as well as a modified auto-encoder, and show how this highly-complex challenge can be achieved by tailoring rather simple solutions together. We demonstrate the applicability of the proposed system using a group of 25 non-expert listeners. Samples of the data generated from our model are provided.

5/9/2024

Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations

Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, He Qu

This study presents an innovative Zero-Shot any-to-any Singing Voice Conversion (SVC) method, leveraging a novel clustering-based phoneme representation to effectively separate content, timbre, and singing style. This approach enables precise voice characteristic manipulation. We discovered that datasets with fewer recordings per artist are more susceptible to timbre leakage. Extensive testing on over 10,000 hours of singing and user feedback revealed our model significantly improves sound quality and timbre accuracy, aligning with our objectives and advancing voice conversion technology. Furthermore, this research advances zero-shot SVC and sets the stage for future work on discrete speech representation, emphasizing the preservation of rhyme.

9/14/2024

SaMoye: Zero-shot Singing Voice Conversion Based on Feature Disentanglement and Synthesis

Zihao Wang, Le Ma, Yongsheng Feng, Xin Pan, Yuhang Jin, Kejun Zhang

Singing voice conversion (SVC) aims to convert a singer's voice to another singer's from a reference audio while keeping the original semantics. However, existing SVC methods can hardly perform zero-shot due to incomplete feature disentanglement or dependence on the speaker look-up table. We propose the first open-source high-quality zero-shot SVC model SaMoye that can convert singing to human and non-human timbre. SaMoye disentangles the singing voice's features into content, timbre, and pitch features, where we combine multiple ASR models and compress the content features to reduce timbre leaks. Besides, we enhance the timbre features by unfreezing the speaker encoder and mixing the speaker embedding with top-3 similar speakers. We also establish an unparalleled large-scale dataset to guarantee zero-shot performance, which comprises more than 1,815 hours of pure singing voice and 6,367 speakers. We conduct objective and subjective experiments to find that SaMoye outperforms other models in zero-shot SVC tasks even under extreme conditions like converting singing to animals' timbre. The code and weight of SaMoye are available on https://github.com/CarlWangChina/SaMoye-SVC.

9/16/2024

Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis

Hui Li, Hongyu Wang, Zhijin Chen, Bohan Sun, Bo Li

Singing voice conversion is to convert the source singing voice into the target singing voice except for the content. Currently, flow-based models can complete the task of voice conversion, but they struggle to effectively extract latent variables in the more rhythmically rich and emotionally expressive task of singing voice conversion, while also facing issues with low efficiency in speech processing. In this paper, we propose a high-fidelity flow-based model based on multi-decoupling feature constraints called RASVC, which enhances the capture of vocal details by integrating multiple latent attribute encoders. We also use Multi-stream inverse short-time Fourier transform (MS-iSTFT) to enhance the speed of speech processing by skipping some complicated decoder processing steps. We compare the synthesized singing voice with other models from multiple dimensions, and our proposed model is highly consistent with the current state-of-the-art, with the demo which is available at https://lazycat1119.github.io/RASVC-demo/.

9/10/2024