CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition

Read original: arXiv:2404.19187 - Published 5/1/2024 by Jianzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao

Overview

Describes a system called ConTuner that can "beautify" singing voices by correcting pitch and enhancing expressiveness
Addresses limitations of existing methods that rely on paired data or only focus on pitch correction
Proposes a diffusion model-based approach that generates an improved Mel-spectrogram based on optimized pitch and expressiveness

Plain English Explanation

Singing voice beautifying is a technology that aims to improve a person's singing, by correcting pitch and adding expressiveness, without changing the singer's original timbre or the content being sung. This can be useful in people's daily lives, for example, when they want to sound more professional or expressive while singing.

Existing methods for singing voice beautifying have some drawbacks. They often rely on paired data, meaning amateur and professional recordings of the same person singing the same song, which are hard to obtain. Additionally, they tend to focus only on correcting the pitch of the voice, but singing expression also involves other aspects such as emotion and rhythm.

The researchers behind this paper have developed a system called ConTuner that overcomes these limitations. ConTuner uses a diffusion model, a type of generative model that learns to produce data by gradually removing noise, to generate an improved version of the singing voice's Mel-spectrogram (a time-frequency representation of the audio on the perceptually motivated Mel scale). The key aspects of ConTuner are:

Pitch Correction: ConTuner establishes a mapping from the MIDI note information and the voice's spectral envelope to the target pitch, allowing it to correct the pitch.
Expressiveness Enhancement: ConTuner includes an "expressiveness enhancer" that can convert an amateur vocal tone to sound more professional and expressive, beyond just pitch correction.
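
To make the pitch side concrete, here is a minimal Python sketch of the arithmetic behind score-based pitch correction. The MIDI-to-frequency formula is the standard equal-temperament relation. The `naive_correct` helper is a hypothetical baseline that simply snaps voiced frames to the score pitch; it is not ConTuner's learned mapping, which also takes the spectral envelope into account.

```python
import math

A4_HZ = 440.0   # standard concert pitch
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(note: float) -> float:
    """Convert a MIDI note number to frequency in Hz (equal temperament)."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def hz_to_midi(freq_hz: float) -> float:
    """Convert a frequency in Hz to a (possibly fractional) MIDI note number."""
    return A4_MIDI + 12.0 * math.log2(freq_hz / A4_HZ)


def cents_off(sung_hz: float, target_note: int) -> float:
    """Signed deviation of the sung pitch from the target note, in cents."""
    return 100.0 * (hz_to_midi(sung_hz) - target_note)


def naive_correct(f0_track, target_notes):
    """Hypothetical baseline: replace each voiced frame with the score pitch.

    Frames with f0 == 0 are treated as unvoiced and left at 0.
    This is an illustration only, not the method described in the paper.
    """
    return [
        midi_to_hz(note) if f0 > 0 else 0.0
        for f0, note in zip(f0_track, target_notes)
    ]
```

For example, a frame sung at 450 Hz against a target of A4 (MIDI 69) is roughly 39 cents sharp, and `naive_correct` would pull it to exactly 440 Hz. Hard snapping like this flattens natural pitch movement such as vibrato, which is one reason a learned mapping is attractive.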

ConTuner is able to achieve satisfactory beautification effects on both Mandarin and English songs. The researchers also show that the expressiveness enhancer and a generator-based acceleration method they developed are effective components of the system.

Technical Explanation

The paper presents a novel system called ConTuner that aims to "beautify" singing voices by correcting pitch and enhancing expressiveness without changing the original timbre and content.

The key innovations of ConTuner are:

Diffusion Model-based Generation: ConTuner uses a diffusion model, a type of generative model, to generate an improved Mel-spectrogram of the singing voice. The diffusion model is combined with a "modified condition" that consists of optimized pitch and expressiveness information.
Pitch Correction: The researchers establish a mapping from the MIDI note information and the voice's spectral envelope to the target pitch. This allows ConTuner to correct the pitch of the input singing voice.
Expressiveness Enhancement: To make amateur singing voices sound more expressive, ConTuner includes an "expressiveness enhancer" that operates in a latent space. This component converts the amateur vocal tone to a more professional-sounding one.
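
As background for the diffusion component, the following sketch shows the forward (noising) process of a generic DDPM-style diffusion model applied to a Mel-spectrogram-shaped array. It assumes a linear noise schedule with commonly used default values; the schedule, step count, and conditioning mechanism that ConTuner actually uses are not described in this summary, so every name and number here is illustrative.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (requires num_steps >= 2)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """alpha_bar_t: running product of (1 - beta_s) for s <= t."""
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(mel, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.

    `mel` is a list of frames, each a list of Mel-bin values (assumed layout).
    """
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [
        [signal_scale * value + noise_scale * rng.gauss(0.0, 1.0) for value in frame]
        for frame in mel
    ]


# Toy usage: 3 frames x 4 Mel bins of silence, noised at step 50 of 100.
betas = linear_beta_schedule(100)
alpha_bars = cumulative_alpha(betas)
noisy = add_noise([[0.0] * 4 for _ in range(3)], 50, alpha_bars, random.Random(0))
```

A trained model learns to reverse this process step by step. In a conditional system, the denoiser would additionally receive conditioning inputs at each step, which in ConTuner's case are the optimized pitch and expressiveness information.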

The researchers demonstrate the effectiveness of ConTuner on both Mandarin and English songs. They also perform an ablation study, which shows that the expressiveness enhancer and the generator-based acceleration method are important components of the system.

Critical Analysis

The ConTuner system addresses an important problem in the field of singing voice processing, as improving the quality and expressiveness of singing voices has many practical applications. The researchers' approach of combining a diffusion model with optimized pitch and expressiveness information is novel and shows promising results.

However, the paper does not provide much detail on the specific architecture of the diffusion model or the training process. Additionally, the evaluation is limited to only a few examples, and the researchers do not compare ConTuner's performance to other state-of-the-art singing voice beautifying systems.

It would be valuable for the researchers to conduct a more comprehensive evaluation, including subjective listening tests with a larger number of participants, to better understand the perceptual quality of the beautified singing voices. Additionally, exploring how ConTuner's capabilities transfer to languages beyond Mandarin and English, and to different genres or singing styles, could help assess its broader applicability.
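
The value of more listeners can be illustrated with a small sketch. It assumes the listening test is scored as a mean opinion score (MOS) on a 1 to 5 scale and uses a normal-approximation 95% confidence interval; the summary does not say which metric the paper uses, and the ratings below are invented purely for illustration.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) of a normal-approximation 95% interval."""
    mean = statistics.fmean(ratings)
    if len(ratings) < 2:
        return mean, float("inf")  # no spread estimate from a single rating
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem


# Hypothetical ratings: the same spread of opinions from 4 vs. 40 listeners.
few_listeners = [4, 5, 3, 4]
many_listeners = few_listeners * 10

mos_few, ci_few = mos_with_ci(few_listeners)
mos_many, ci_many = mos_with_ci(many_listeners)
```

Both groups have the same mean score, but the interval for the larger group is much narrower, so differences between systems become easier to establish.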

Further research could also investigate the potential limitations of the expressiveness enhancer, such as how it might handle diverse vocal styles or whether it could inadvertently introduce unnatural-sounding artifacts. Exploring ways to make the system more interpretable and controllable for users would also be an interesting direction for future work.

Conclusion

The ConTuner system presented in this paper represents an interesting and valuable contribution to the field of singing voice processing. By combining a diffusion model with optimized pitch and expressiveness information, the researchers have developed a novel approach to singing voice beautification that can improve the quality and expressiveness of amateur singing performances.

While the evaluation is limited, the results demonstrate the potential of this technology to enhance people's singing experiences in their daily lives. Further research and development of ConTuner could lead to even more advanced and user-friendly singing voice beautifying solutions, with applications in music production, karaoke, and other entertainment-related domains.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition

Jianzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao

Singing voice beautifying is a novel task that has application value in people's daily life, aiming to correct the pitch of the singing voice and improve the expressiveness without changing the original timbre and content. Existing methods rely on paired data or only concentrate on the correction of pitch. However, professional songs and amateur songs from the same person are hard to obtain, and singing voice beautifying involves not only pitch correction but also other aspects like emotion and rhythm. We therefore propose a fast and high-fidelity singing voice beautifying system called ConTuner, a diffusion model combined with a modified condition to generate the beautified Mel-spectrogram, where the modified condition is composed of optimized pitch and expressiveness. For pitch correction, we establish a mapping from MIDI and spectrum envelope to pitch. To make amateur singing more expressive, we propose an expressiveness enhancer in the latent space to convert amateur vocal tone to professional. ConTuner achieves a satisfactory beautification effect on both Mandarin and English songs. An ablation study demonstrates that the expressiveness enhancer and the generator-based acceleration method in ConTuner are effective.

5/1/2024

Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

Yongqi Wang, Ruofan Hu, Rongjie Huang, Zhiqing Hong, Ruiqi Li, Wenrui Liu, Fuming You, Tao Jin, Zhou Zhao

Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at http://prompt-singer.github.io .

7/10/2024

Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis

Hui Li, Hongyu Wang, Zhijin Chen, Bohan Sun, Bo Li

Singing voice conversion is to convert the source singing voice into the target singing voice except for the content. Currently, flow-based models can complete the task of voice conversion, but they struggle to effectively extract latent variables in the more rhythmically rich and emotionally expressive task of singing voice conversion, while also facing issues with low efficiency in speech processing. In this paper, we propose a high-fidelity flow-based model based on multi-decoupling feature constraints called RASVC, which enhances the capture of vocal details by integrating multiple latent attribute encoders. We also use Multi-stream inverse short-time Fourier transform (MS-iSTFT) to enhance the speed of speech processing by skipping some complicated decoder processing steps. We compare the synthesized singing voice with other models from multiple dimensions, and our proposed model is highly consistent with the current state-of-the-art, with the demo which is available at https://lazycat1119.github.io/RASVC-demo/.

9/10/2024

Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan

Expressive speech synthesis aims to generate speech that captures a wide range of para-linguistic features, including emotion and articulation, though current research primarily emphasizes emotional aspects over the nuanced articulatory features mastered by professional voice actors. Inspired by this, we explore expressive speech synthesis through the lens of articulatory phonetics. Specifically, we define a framework with three dimensions: Glottalization, Tenseness, and Resonance (GTR), to guide the synthesis at the voice production level. With this framework, we record a high-quality speech dataset named GTR-Voice, featuring 20 Chinese sentences articulated by a professional voice actor across 125 distinct GTR combinations. We verify the framework and GTR annotations through automatic classification and listening tests, and demonstrate precise controllability along the GTR dimensions on two fine-tuned expressive TTS models. We open-source the dataset and TTS models.

6/18/2024