TokSing: Singing Voice Synthesis based on Discrete Tokens

Read original: arXiv:2406.08416 - Published 6/21/2024 by Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin

Overview

This paper introduces TokSing, a novel singing voice synthesis model that generates singing audio from discrete tokens.
TokSing uses a transformer-based architecture to generate singing voice from text, melody, and other input features.
The authors report that TokSing outperforms Mel spectrogram baselines in objective and subjective evaluations, while lowering the storage cost of the intermediate representation and converging faster.

Plain English Explanation

TokSing is a new system that can generate singing voices from text, melody, and other information. It uses a type of neural network called a transformer to convert these inputs into realistic-sounding singing. This is an important advancement in the field of singing voice synthesis, which aims to create artificial singing voices that sound natural and human-like.

The key innovation in TokSing is its use of "discrete tokens": compact units, extracted from self-supervised learning (SSL) models, that stand in for the continuous Mel spectrograms used by conventional systems. Tokens are cheaper to store and easier to manipulate, but the authors find that converting audio into tokens loses some melody information. TokSing therefore feeds a separate melody signal alongside the tokens and strengthens melody handling in its encoder.

Overall, the authors show that TokSing outperforms existing singing voice synthesis systems on both objective metrics (like how closely the generated audio matches the target) and subjective evaluations (how natural and musical the singing sounds to human listeners). This represents an important step forward in the quest to create high-quality, data-efficient singing voice synthesis systems.

Technical Explanation

TokSing uses a transformer-based architecture to generate singing voice from text, melody, and other input features, with discrete tokens rather than continuous Mel spectrograms as the intermediate representation. According to the paper's abstract, the tokens are extracted from self-supervised learning (SSL) models, and a token formulator offers flexible token blendings, so the representation is not tied to a single token stream.

The authors observe that melody information degrades during discretization. To compensate, they integrate a melody signal with the discrete tokens and incorporate a specially designed melody enhancement strategy in the musical encoder. The resulting token sequence is then converted into the corresponding singing voice audio.
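The quantization step, which maps continuous feature frames to discrete token IDs, can be sketched as a nearest-codebook lookup. This is a minimal illustration and not the paper's implementation: the codebook values, the 2-dimensional toy frames, and the function names `quantize` and `dequantize` are all assumptions made for the example.

```python
import numpy as np


def quantize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each frame, the index of its nearest codebook vector.

    frames:   (T, D) array of continuous feature frames
    codebook: (K, D) array of codebook vectors (hypothetical values here)
    """
    # Pairwise squared Euclidean distances, shape (T, K).
    diffs = frames[:, None, :] - codebook[None, :, :]
    distances = np.sum(diffs ** 2, axis=-1)
    # The token ID is the index of the closest codebook entry.
    return np.argmin(distances, axis=1)


def dequantize(tokens: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up the codebook vector for each token ID."""
    return codebook[tokens]


# Toy example: 3 codebook entries, 4 frames, 2 dimensions each.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
frames = np.array([[0.1, -0.2], [0.9, 1.2], [3.5, 4.1], [0.8, 0.7]])

tokens = quantize(frames, codebook)
reconstructed = dequantize(tokens, codebook)
print(tokens)  # [0 1 2 1]
```

Each frame is replaced by a single integer, and anything not captured by the nearest codebook vector is discarded. That lossy step is the kind of place where fine detail such as melody can be degraded.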

The authors also incorporate diverse semantic-based audio pretraining to initialize the encoder and improve the model's generalization. Experiments on public singing voice datasets show that TokSing outperforms state-of-the-art baselines in both objective and subjective evaluations of singing quality.
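One advantage the paper's abstract attributes to discrete tokens is a lower storage cost than continuous Mel spectrograms. A back-of-the-envelope comparison shows why. The numbers below (80 Mel bins, 32-bit floats, a codebook of 1024 entries, and up to 3 blended token streams) are illustrative assumptions, not values taken from the paper, and the sketch assumes both representations use the same frame rate.

```python
import math


def mel_bits_per_frame(n_mels: int, bits_per_value: int) -> int:
    """Bits needed to store one Mel spectrogram frame."""
    return n_mels * bits_per_value


def token_bits_per_frame(codebook_size: int, n_streams: int = 1) -> int:
    """Bits needed to store one frame of discrete tokens.

    Each token is an index into a codebook, so it needs
    ceil(log2(codebook_size)) bits. n_streams is a hypothetical
    count of token streams stored per frame.
    """
    return n_streams * math.ceil(math.log2(codebook_size))


# Assumed, illustrative settings.
mel_bits = mel_bits_per_frame(n_mels=80, bits_per_value=32)
single_stream_bits = token_bits_per_frame(codebook_size=1024)
three_stream_bits = token_bits_per_frame(codebook_size=1024, n_streams=3)

print(mel_bits)                       # 2560
print(single_stream_bits)             # 10
print(three_stream_bits)              # 30
print(mel_bits / single_stream_bits)  # 256.0
```

Under these assumptions a single token stream is 256 times smaller per frame than the Mel representation, and even three streams stay far below it.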

Critical Analysis

The authors acknowledge several limitations of the current TokSing model. First, the training dataset used is relatively small, consisting of only about 2 hours of singing data. Scaling up the singing voice data used to train the model could lead to further improvements in synthesis quality.

Additionally, the authors note that TokSing is currently limited to generating singing in a single language (English). Extending the model to handle multilingual singing synthesis is an important area for future work. The authors also suggest that incorporating more diverse musical styles and genres into the training data could enhance the model's expressive capabilities.

Overall, TokSing represents a promising step forward in singing voice synthesis research. By leveraging discrete tokens and semantic-based pretraining, the model is able to generate more natural and musical singing voices than previous approaches. However, further research is needed to scale up the data, extend the model to new languages and genres, and fully realize the potential of this discrete token-based synthesis paradigm.

Conclusion

The TokSing paper introduces a singing voice synthesis model that generates singing audio from text, melody, and other input features. The key innovation is the use of discrete tokens extracted from self-supervised learning (SSL) models as the intermediate representation, combined with a melody signal and a melody enhancement strategy that offset the melody degradation caused by discretization.

Experiments show that TokSing outperforms existing singing voice synthesis systems on both objective and subjective evaluation metrics. This represents an important advancement in the field, bringing us closer to the goal of data-efficient, high-quality singing voice synthesis that can be widely deployed in various applications.

While the current TokSing model has some limitations in terms of dataset size and language support, the authors have outlined promising directions for future research. Scaling up the training data, extending to multilingual singing, and incorporating more diverse musical styles could further enhance the model's capabilities. Overall, this paper makes a valuable contribution to the ongoing efforts to create advanced singing voice synthesis systems that can produce natural, expressive, and human-like singing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TokSing: Singing Voice Synthesis based on Discrete Tokens

Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin

Recent advancements in speech synthesis witness significant benefits by leveraging discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer higher storage efficiency and greater operability in intermediate representations compared to traditional continuous Mel spectrograms. However, when it comes to singing voice synthesis (SVS), achieving higher levels of melody expression poses a great challenge for utilizing discrete tokens. In this paper, we introduce TokSing, a discrete-based SVS system equipped with a token formulator that offers flexible token blendings. We observe a melody degradation during discretization, prompting us to integrate a melody signal with the discrete token and incorporate a specially-designed melody enhancement strategy in the musical encoder. Extensive experiments demonstrate that our TokSing achieves better performance against the Mel spectrogram baselines while offering advantages in intermediate representation space cost and convergence speed.

6/21/2024

Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model

Siyang Wang, Éva Székely

Recent advances in generative language modeling applied to discrete speech tokens presented a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), similarly to their textual counterparts, are scalable, probabilistic, and context-aware. While they can produce diverse and natural outputs, they sometimes face issues such as unintelligibility and the inclusion of non-speech noises or hallucination. As the adoption of this innovative paradigm in speech synthesis increases, there is a clear need for an in-depth evaluation of its capabilities and limitations. In this paper, we evaluate TTS from a discrete token-based SLM, through both automatic metrics and listening tests. We examine five key dimensions: speaking style, intelligibility, speaker consistency, prosodic variation, spontaneous behaviour. Our results highlight the model's strength in generating varied prosody and spontaneous outputs. It is also rated higher in naturalness and context appropriateness in listening tests compared to a conventional TTS. However, the model's performance in intelligibility and speaker consistency lags behind traditional TTS. Additionally, we show that increasing the scale of SLMs offers a modest boost in robustness. Our findings aim to serve as a benchmark for future advancements in generative SLMs for speech synthesis.

5/17/2024

SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models

Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin

Discrete representation has shown advantages in speech generation tasks, wherein discrete tokens are derived by discretizing hidden features from self-supervised learning (SSL) pre-trained models. However, the direct application of speech SSL models to singing generation encounters domain gaps between speech and singing. Furthermore, singing generation necessitates a more refined representation than typical speech. To address these challenges, we introduce SingOMD, a novel method to extract singing-oriented multi-resolution discrete representations from speech SSL models. Specifically, we first adapt the features from speech SSL through a resynthesis task and incorporate multi-resolution modules based on resampling to better serve singing generation. These adapted multi-resolution features are then discretized via clustering. Extensive experiments demonstrate the robustness, efficiency, and effectiveness of these representations in singing vocoders and singing voice synthesis.

6/21/2024

DNN-based ensemble singing voice synthesis with interactions between singers

Hiroaki Hyodo, Shinnosuke Takamichi, Tomohiko Nakamura, Junya Koguchi, Hiroshi Saruwatari

We propose a singing voice synthesis (SVS) method for a more unified ensemble singing voice by modeling interactions between singers. Most existing SVS methods aim to synthesize a solo voice, and do not consider interactions between singers, i.e., adjusting one's own voice to the others' voices. Since the production of ensemble voices from solo singing voices ignores the interactions, it can degrade the unity of the vocal ensemble. Therefore, we propose an SVS method that reproduces the interactions. It is based on an architecture that uses musical scores of multiple voice parts, and loss functions that simulate the interactions' effect on acoustic features. Experimental results show that our methods improve the unity of the vocal ensemble.

9/17/2024