Robust Singing Voice Transcription Serves Synthesis

Read original: arXiv:2405.09940 - Published 6/4/2024 by Ruiqi Li, Yu Zhang, Yongqi Wang, Zhiqing Hong, Rongjie Huang, Zhou Zhao


Overview

This paper presents ROSVOT, a robust note-level automatic singing voice transcription (AST) model intended to serve singing voice synthesis (SVS).
The model converts singing recordings into note sequences and is designed to stay accurate with either clean or noisy inputs, so that singing datasets can be annotated automatically for SVS training.
It combines a multi-scale framework, which captures coarse-grained note information while keeping fine-grained frame-level segmentation, with an attention-based pitch decoder.

Plain English Explanation

The paper discusses a new way to accurately convert sung vocals into a sequence of musical notes, even when the recording is noisy. This note-level representation of the singing can then be used to label singing datasets, which in turn are used to train high-quality synthetic singing voices.

The researchers developed a system that uses neural networks to analyze the audio of someone singing and work out which notes are being sung and where each note starts and ends. This is challenging because real recordings are rarely clean, and noise makes it hard to pin down note boundaries and pitch reliably.

By being able to accurately transcribe the sung notes, the researchers can automatically annotate much larger amounts of singing data than would be practical by hand, and use it to train singing synthesis models. This could be useful for applications like VIT-TTS, Text-to-Song, and FastSAG, where synthesized singing is needed.

Technical Explanation

The paper proposes ROSVOT, a note-level automatic singing voice transcription (AST) model that converts singing recordings into note sequences and remains accurate with either clean or noisy inputs. According to the abstract, the approach combines:

A multi-scale framework that captures coarse-grained note information
Fine-grained, frame-level segmentation that determines where each note begins and ends
An attention-based pitch decoder for reliable pitch prediction

By breaking down the task into these modular components, the researchers are able to address the key challenges of singing voice transcription, such as dealing with background instrumentation, vibrato, and pitch variations.
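To make the idea of note-level output concrete, here is a minimal Python sketch of how frame-level pitch values and note-boundary flags could be merged into a note sequence. It is not the paper's model: the function names, the input layout, and the rule of taking the median pitch per segment are all assumptions chosen for illustration.

```python
import math
from statistics import median


def hz_to_midi(f0_hz):
    """Convert a frequency in Hz to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def frames_to_notes(f0_hz, boundaries, hop_seconds):
    """Group frame-level pitch into notes (hypothetical helper, for illustration only).

    f0_hz:       per-frame fundamental frequency in Hz, 0.0 for unvoiced frames
    boundaries:  per-frame flags, True where a new segment starts
    hop_seconds: time between consecutive frames
    Returns a list of (onset_seconds, offset_seconds, midi_pitch) tuples.
    """
    notes = []
    start = 0
    n = len(f0_hz)
    for i in range(1, n + 1):
        # Close the current segment at the end of the input or at the next boundary.
        if i == n or boundaries[i]:
            voiced = [hz_to_midi(f) for f in f0_hz[start:i] if f > 0]
            if voiced:
                # Assumption: the median is used so that vibrato or brief
                # pitch glitches do not shift the note's pitch label.
                pitch = round(median(voiced))
                notes.append((start * hop_seconds, i * hop_seconds, pitch))
            # Segments with no voiced frames are treated as rests and skipped.
            start = i
    return notes


if __name__ == "__main__":
    f0 = [440.0, 442.0, 438.0, 0.0, 523.25, 523.25, 520.0]
    flags = [True, False, False, True, True, False, False]
    for onset, offset, pitch in frames_to_notes(f0, flags, hop_seconds=0.01):
        print(f"{onset:.2f}s to {offset:.2f}s: MIDI {pitch}")
```

In a learned system the boundaries and pitches would come from the model's predictions rather than from hand-written flags; the sketch only shows what a note sequence looks like as data.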

The system is evaluated on singing data with both clean and noisy inputs, and the abstract reports state-of-the-art transcription accuracy in both conditions. The transcribed notes can then be used to annotate training data for singing voice synthesis models like SingIt and VoiceCraft; the authors build an annotation-and-training pipeline for this purpose and report that an SVS model trained on enlarged, automatically annotated datasets outperforms its baseline.
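Transcription accuracy for note-level output is commonly scored by checking whether predicted note onsets fall close enough to reference onsets. The summary does not say which metrics the paper reports, so the following is only a simplified sketch of an onset F-measure; the 50 ms tolerance, the function name, and the greedy matching rule are assumptions.

```python
def onset_f_measure(ref_onsets, est_onsets, tolerance=0.05):
    """Score estimated note onsets against reference onsets (times in seconds).

    Each reference onset can be matched by at most one estimate, and a match
    requires the two times to differ by no more than `tolerance`.
    Returns (precision, recall, f1).
    """
    ref = sorted(ref_onsets)
    est = sorted(est_onsets)
    matched = 0
    i = j = 0
    # Walk both sorted lists together, pairing onsets that are close enough.
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1  # estimate is too early to match this or any later reference
        else:
            i += 1  # reference has no estimate within tolerance
    precision = matched / len(est) if est else 0.0
    recall = matched / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    reference = [0.0, 0.5, 1.0, 1.5]
    estimated = [0.02, 0.58, 1.01]
    print(onset_f_measure(reference, estimated))
```

Stricter variants of this kind of metric also require the pitch, and sometimes the offset, of a matched note to agree with the reference.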

Critical Analysis

The paper presents a well-designed and comprehensive approach to singing voice transcription, tackling the key challenges in this domain. The modular architecture allows for flexibility in adapting the system to different scenarios and datasets.

However, the paper does not delve into potential limitations or caveats of the proposed method. For example, it would be valuable to understand the system's performance on more diverse genres of singing, or its ability to handle regional accents and vocal styles. Additionally, the paper could have discussed potential biases in the training data and how they might affect the transcription accuracy.

Further research could also explore ways to integrate the transcription system more seamlessly with downstream singing voice synthesis models, to create a truly end-to-end pipeline for singing voice manipulation and generation.

Conclusion

This paper presents a robust singing voice transcription system that can accurately convert sung vocals into note sequences with either clean or noisy inputs. By bridging the gap between raw audio and note-level annotations, the proposed approach makes it practical to label singing data automatically and lays the foundation for advanced singing voice synthesis capabilities, with potential applications in areas like text-to-speech, music generation, and voice editing.

The technical innovations and insights shared in this work contribute to the ongoing progress in the field of music and audio processing, and could inspire further research and development in this exciting domain.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Robust Singing Voice Transcription Serves Synthesis

Ruiqi Li, Yu Zhang, Yongqi Wang, Zhiqing Hong, Rongjie Huang, Zhou Zhao

Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS, incorporating a multi-scale framework that effectively captures coarse-grained note information and ensures fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also established a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experimental findings reveal that ROSVOT achieves state-of-the-art transcription accuracy with either clean or noisy inputs. Moreover, when trained on enlarged, automatically annotated datasets, the SVS model outperforms its baseline, affirming the capability for practical application. Audio samples are available at https://rosvot.github.io.

6/4/2024

Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion

Ruiqi Li, Rongjie Huang, Yongqi Wang, Zhiqing Hong, Zhou Zhao

Speech-to-singing voice conversion (STS) task always suffers from data scarcity, because it requires paired speech and singing data. Compounding this issue are the challenges of content-pitch alignment and the suboptimal quality of generated outputs, presenting significant hurdles in STS research. This paper presents SVPT, an STS approach boosted by a self-supervised singing voice pre-training model. We leverage spoken language model techniques to tackle the rhythm alignment problem and the in-context learning capability to achieve zero-shot conversion. We adopt discrete-unit random resampling and pitch corruption strategies, enabling training with unpaired singing data and thus mitigating the issue of data scarcity. SVPT also serves as an effective backbone for singing voice synthesis (SVS), offering insights into scaling up SVS models. Experimental results indicate that SVPT delivers notable improvements in both STS and SVS endeavors. Audio samples are available at https://speech2sing.github.io.

6/5/2024

Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis

Hui Li, Hongyu Wang, Zhijin Chen, Bohan Sun, Bo Li

Singing voice conversion is to convert the source singing voice into the target singing voice except for the content. Currently, flow-based models can complete the task of voice conversion, but they struggle to effectively extract latent variables in the more rhythmically rich and emotionally expressive task of singing voice conversion, while also facing issues with low efficiency in speech processing. In this paper, we propose a high-fidelity flow-based model based on multi-decoupling feature constraints called RASVC, which enhances the capture of vocal details by integrating multiple latent attribute encoders. We also use Multi-stream inverse short-time Fourier transform (MS-iSTFT) to enhance the speed of speech processing by skipping some complicated decoder processing steps. We compare the synthesized singing voice with other models from multiple dimensions, and our proposed model is highly consistent with the current state-of-the-art, with the demo which is available at https://lazycat1119.github.io/RASVC-demo/.

9/10/2024


Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing

Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe

In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability. This study proposes a unique strategy to address the data scarcity in SVS. We employ an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatural voice synthesis. This innovative method has led to the creation of two expansive singing voice datasets, ACE-Opencpop and ACE-KiSing, which are instrumental for large-scale, multi-singer voice synthesis. Through thorough experimentation, we establish that these datasets not only serve as new benchmarks for SVS but also enhance SVS performance on other singing voice datasets when used as supplementary resources. The corpora, pre-trained models, and their related training recipes are publicly available at ESPnet-Muskits (https://github.com/espnet/espnet).

6/14/2024