SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models

Read original: arXiv:2406.08905 - Published 6/21/2024 by Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin

Overview

This paper proposes SingOMD (Singing Oriented Multi-resolution Discrete Representation Construction), a method for extracting singing-oriented, multi-resolution discrete representations from pre-trained self-supervised learning (SSL) speech models.
The motivation is twofold: applying speech SSL models directly to singing runs into a domain gap between speech and singing, and singing generation needs a more refined representation than typical speech.
The approach first adapts the speech SSL features through a resynthesis task, adds resampling-based multi-resolution modules, and then discretizes the adapted features via clustering for use in singing vocoders and singing voice synthesis.

Plain English Explanation

The researchers have developed a new technique called SingOMD that can be used to generate realistic-sounding singing voices. The key insight is to start with pre-existing speech models that have already learned a lot about how human language and speech work. By distilling the knowledge in these pre-trained speech models, the SingOMD method is able to create a multi-level discrete representation that captures the necessary information for singing, including the words, melody, rhythm, and tone of a singer's voice.

This multi-resolution representation is more flexible and efficient than trying to build a singing model from scratch. It allows the system to generate high-quality singing by piecing together the relevant linguistic, acoustic, and expressive elements, rather than having to learn all of those aspects independently. The end result is a singing voice synthesis system that can produce natural-sounding singing by building on the foundational knowledge contained in pre-trained speech models.

Technical Explanation

SingOMD leverages pre-trained speech models to create a multi-resolution discrete representation for singing voice synthesis. It builds on prior work such as TokSing, research leveraging diverse semantic-based audio pre-trained models, adversarial multi-task learning for disentangling timbre and pitch, and end-to-end singing voice synthesis systems such as VISinger 2.

The key idea is to adapt the features of a speech SSL model to singing through a resynthesis task, extend them with resampling-based multi-resolution modules, and discretize the result via clustering, so that the resulting tokens can drive efficient and flexible singing voice generation. This builds on work on encoding speaker-specific latent speech features.

The SingOMD architecture consists of several components:

A speech encoder that maps the input audio to a multi-resolution discrete representation
A discrete token predictor that generates the appropriate discrete tokens for singing
A waveform generator that converts the discrete representation back into a synthesized singing voice waveform

By using this multi-resolution discrete approach, the model is able to capture the necessary elements of singing, including the lyrics, melody, rhythm, and vocal tone, in an efficient and controllable manner. According to the abstract, experiments show that the representations are robust, efficient, and effective in both singing vocoders and singing voice synthesis.
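
As a rough illustration of the multi-resolution discretization idea (resample frame-level features to several time resolutions, then cluster each view into token IDs), here is a minimal Python sketch. It is not the authors' implementation: random vectors stand in for adapted SSL features, linear interpolation stands in for the paper's resampling-based modules, the codebook is fit on a single utterance, and the names `resample_frames`, `kmeans`, and `multires_tokens`, along with the factors and cluster count, are assumptions made for this example.

```python
import random


def resample_frames(frames, factor):
    """Linearly interpolate a list of feature vectors to round(len * factor) frames."""
    n_in = len(frames)
    n_out = max(1, round(n_in * factor))
    out = []
    for i in range(n_out):
        # Position of output frame i on the input time axis.
        pos = i * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def nearest(vec, centroids):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, cen)) for cen in centroids]
    return dists.index(min(dists))


def kmeans(frames, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in frames:
            groups[nearest(f, centroids)].append(f)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def multires_tokens(frames, factors=(0.5, 1.0, 2.0), k=4):
    """One token stream per resolution: resample, fit a codebook, assign IDs."""
    streams = {}
    for factor in factors:
        view = resample_frames(frames, factor)
        codebook = kmeans(view, k)
        streams[factor] = [nearest(f, codebook) for f in view]
    return streams


# Hypothetical input: random vectors stand in for adapted SSL features
# (40 frames, 8 dimensions each).
rng = random.Random(0)
features = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(40)]
for factor, tokens in multires_tokens(features).items():
    print(factor, len(tokens), tokens[:8])
```

Each resolution yields its own token stream, so a coarse stream has fewer tokens per second than a fine one. In a realistic setup the codebooks would presumably be fit once on a training corpus and reused at inference time rather than refit per utterance.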

Critical Analysis

The paper makes a compelling case for the SingOMD approach and provides extensive experimental results demonstrating its effectiveness. However, there are a few potential limitations and areas for further research that could be explored:

The paper focuses on leveraging pre-trained speech models, but it's unclear how well the approach would generalize to other types of pre-trained audio models, such as those trained on musical or singing-specific data. Exploring a wider range of pre-trained models could help further improve the performance.
The paper does not provide a detailed analysis of the types of errors or artifacts that may arise in the generated singing voices. A more in-depth evaluation of the model's limitations and failure cases would be helpful for understanding its real-world applicability.
While the multi-resolution discrete representation is claimed to be flexible and efficient, the paper does not provide a clear comparison to alternative end-to-end singing voice synthesis approaches in terms of factors like inference speed, sample quality, and control over the generated output.
The paper does not address potential ethical considerations around the use of such singing voice synthesis technology, such as concerns about the authenticity or misuse of synthetic voices. Discussing these issues could help readers better understand the broader implications of the research.

Overall, the SingOMD method presents a promising approach to leveraging pre-trained speech models for high-quality singing voice synthesis. Further research and evaluation to address the above points could help strengthen the real-world impact of this work.

Conclusion

The SingOMD method proposed in this paper addresses the domain gap that arises when discrete tokens from speech SSL models are applied to singing. By adapting the features of pre-trained speech models through a resynthesis task and discretizing them at multiple resolutions, the approach creates a representation intended to capture the linguistic, acoustic, and prosodic information needed for generating natural-sounding singing voices.

This efficient and flexible representation allows for greater control and customization of the generated singing, opening up new possibilities for applications in areas like music production, virtual performances, and human-computer interaction. While there are some areas for further research and evaluation, the SingOMD method demonstrates the power of leveraging pre-existing speech technologies to tackle the challenge of singing voice synthesis.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models

Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin

Discrete representation has shown advantages in speech generation tasks, wherein discrete tokens are derived by discretizing hidden features from self-supervised learning (SSL) pre-trained models. However, the direct application of speech SSL models to singing generation encounters domain gaps between speech and singing. Furthermore, singing generation necessitates a more refined representation than typical speech. To address these challenges, we introduce SingOMD, a novel method to extract singing-oriented multi-resolution discrete representations from speech SSL models. Specifically, we first adapt the features from speech SSL through a resynthesis task and incorporate multi-resolution modules based on resampling to better serve singing generation. These adapted multi-resolution features are then discretized via clustering. Extensive experiments demonstrate the robustness, efficiency, and effectiveness of these representations in singing vocoders and singing voice synthesis.

6/21/2024

TokSing: Singing Voice Synthesis based on Discrete Tokens

Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin

Recent advancements in speech synthesis witness significant benefits by leveraging discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer higher storage efficiency and greater operability in intermediate representations compared to traditional continuous Mel spectrograms. However, when it comes to singing voice synthesis (SVS), achieving higher levels of melody expression poses a great challenge for utilizing discrete tokens. In this paper, we introduce TokSing, a discrete-based SVS system equipped with a token formulator that offers flexible token blendings. We observe a melody degradation during discretization, prompting us to integrate a melody signal with the discrete token and incorporate a specially-designed melody enhancement strategy in the musical encoder. Extensive experiments demonstrate that our TokSing achieves better performance against the Mel spectrogram baselines while offering advantages in intermediate representation space cost and convergence speed.

6/21/2024

MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model

Jiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe

Speech discrete representation has proven effective in various downstream applications due to its superior compression rate of the waveform, fast convergence during training, and compatibility with other modalities. Discrete units extracted from self-supervised learning (SSL) models have emerged as a prominent approach for obtaining speech discrete representation. However, while discrete units have shown effectiveness compared to spectral features, they still lag behind continuous SSL representations. In this work, we propose MMM, a multi-layer multi-residual multi-stream discrete units extraction method from SSL. Specifically, we introduce iterative residual vector quantization with K-means for different layers in an SSL model to extract multi-stream speech discrete representation. Through extensive experiments in speech recognition, speech resynthesis, and text-to-speech, we demonstrate that the proposed MMM can surpass or be on par with the performance of neural codecs under various conditions.

6/17/2024
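
The "iterative residual vector quantization with K-means" mentioned in the MMM abstract above can be sketched roughly as follows. This is only an illustration under assumptions, not the MMM implementation: it quantizes a single feature matrix rather than several SSL layers, uses random vectors in place of real features, and the names `residual_kmeans`, `reconstruct`, and `mse` are hypothetical helpers introduced here.

```python
import random


def nearest(vec, centroids):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, cen)) for cen in centroids]
    return dists.index(min(dists))


def kmeans(frames, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in frames:
            groups[nearest(f, centroids)].append(f)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def residual_kmeans(frames, k, stages):
    """Fit one codebook per stage; each stage clusters what earlier stages missed."""
    residuals = [list(f) for f in frames]
    codebooks, streams = [], []
    for stage in range(stages):
        codebook = kmeans(residuals, k, seed=stage)
        ids = [nearest(r, codebook) for r in residuals]
        # Subtract the chosen centroid so the next stage sees only the error.
        residuals = [[x - c for x, c in zip(r, codebook[i])]
                     for r, i in zip(residuals, ids)]
        codebooks.append(codebook)
        streams.append(ids)
    return codebooks, streams


def reconstruct(codebooks, streams, n_stages):
    """Approximate each frame by summing its centroids from the first n_stages."""
    dims = len(codebooks[0][0])
    out = []
    for t in range(len(streams[0])):
        vec = [0.0] * dims
        for codebook, ids in zip(codebooks[:n_stages], streams[:n_stages]):
            vec = [v + c for v, c in zip(vec, codebook[ids[t]])]
        out.append(vec)
    return out


def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    total = sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v))
    return total / (len(a) * len(a[0]))


# Hypothetical input: 60 random 8-dimensional frames.
rng = random.Random(0)
features = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(60)]
codebooks, streams = residual_kmeans(features, k=4, stages=3)
for n in (1, 2, 3):
    print(n, "stage(s), reconstruction MSE:", mse(features, reconstruct(codebooks, streams, n)))
```

Each stage adds one token stream per frame, and using more streams should not increase the reconstruction error, which is the usual trade-off between bitrate and fidelity in residual quantization.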

SingMOS: An extensive Open-Source Singing Voice Dataset for MOS Prediction

Yuxun Tang, Jiatong Shi, Yuning Wu, Qin Jin

In speech generation tasks, human subjective ratings, usually referred to as the opinion score, are considered the gold standard for speech quality evaluation, with the mean opinion score (MOS) serving as the primary evaluation metric. Due to the high cost of human annotation, several MOS prediction systems have emerged in the speech domain, demonstrating good performance. These MOS prediction models are trained using annotations from previous speech-related challenges. However, compared to the speech domain, the singing domain faces data scarcity and stricter copyright protections, leading to a lack of high-quality MOS-annotated datasets for singing. To address this, we propose SingMOS, a high-quality and diverse MOS dataset for singing, covering a range of Chinese and Japanese datasets. These synthesized vocals are generated using state-of-the-art models in singing synthesis, conversion, or resynthesis tasks and are rated by professional annotators alongside real vocals. Data analysis demonstrates the diversity and reliability of our dataset. Additionally, we conduct further exploration on SingMOS, providing insights for singing MOS prediction and guidance for the continued expansion of SingMOS.

6/21/2024