FastSAG: Towards Fast Non-Autoregressive Singing Accompaniment Generation

Read original: arXiv:2405.07682 - Published 5/14/2024 by Jianyi Chen, Wei Xue, Xu Tan, Zhen Ye, Qifeng Liu, Yike Guo

Overview

This paper presents a new method for Singing Accompaniment Generation (SAG), which aims to generate high-quality and coherent instrumental music to accompany vocal input.
The state-of-the-art method, SingSong, uses a multi-stage autoregressive (AR) model that is extremely slow, making it impractical for real-time applications.
The proposed "Fast SAG" method uses a non-AR diffusion-based framework to directly generate the Mel spectrogram of the target accompaniment, significantly simplifying the process and accelerating generation.

Plain English Explanation

The paper describes a new way to create background music that goes well with a singer's voice. The current best method, called SingSong, is very slow because it generates the music token by token, with each step depending on the ones before it. This makes it unusable for real-time applications like live performances or interactive systems.

The new "Fast SAG" approach instead uses a diffusion-based framework that generates the Mel spectrogram of the whole accompaniment directly, rather than building it up token by token. This greatly simplifies the process and speeds it up, while still producing high-quality, coherent music that fits the vocal input. The researchers designed specific techniques (semantic projection, prior projection blocks, and a set of loss functions) so that the generated accompaniment matches the voice both semantically and rhythmically.

Technical Explanation

The paper proposes a non-autoregressive, diffusion-based framework for Singing Accompaniment Generation (SAG). Unlike the previous state-of-the-art SingSong method, which uses a multi-stage autoregressive model to generate semantic and acoustic tokens recursively, the new "Fast SAG" approach generates the Mel spectrogram of the target accompaniment directly.
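To make the speed argument concrete, the following is a rough sketch that counts sequential model calls only. The token rate, number of stages, and number of denoising steps are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def ar_sequential_calls(duration_s, tokens_per_second, num_stages):
    """Token-by-token decoding: one model call per token, per stage.

    Each call must wait for the previous token, so the calls cannot run
    in parallel and the count grows with the length of the audio.
    """
    return int(duration_s * tokens_per_second) * num_stages


def non_ar_sequential_calls(num_denoising_steps):
    """Non-AR diffusion: one model call per denoising step.

    Every call updates all spectrogram frames at once, so the count does
    not depend on the length of the audio.
    """
    return num_denoising_steps


# ASSUMED numbers, chosen only for illustration.
ar_calls = ar_sequential_calls(duration_s=10, tokens_per_second=50, num_stages=2)
diffusion_calls = non_ar_sequential_calls(num_denoising_steps=50)
print(ar_calls, diffusion_calls)
```

Under these assumed settings the autoregressive count doubles when the clip length doubles, while the diffusion count stays fixed. Actual wall-clock time also depends on the cost of each call, which this sketch ignores.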

This simplification is enabled by carefully designing the conditioning information inferred from the vocal signals. The researchers also incorporate semantic projection, prior projection blocks, and a set of loss functions to ensure the generated accompaniment has semantic and rhythm coherence with the vocal input.
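As a minimal sketch of what conditional, non-autoregressive sampling looks like, the toy below runs a deterministic DDIM-style reverse loop in plain Python. It is not the paper's sampler: the linear noise schedule, the step count, and `toy_denoiser` (a stand-in that pretends the clean target equals the conditioning vector) are all assumptions made for illustration.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level per step for a linear beta schedule; index 0 is 1.0."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def toy_denoiser(x_t, alpha_bar_t, condition):
    """Stand-in for a trained network that predicts the noise in x_t.

    ASSUMPTION: the clean target is taken to be the conditioning vector
    itself, which is only true in this toy.
    """
    scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - math.sqrt(alpha_bar_t) * c) / scale for x, c in zip(x_t, condition)]


def sample(condition, num_steps=50, seed=0):
    """Deterministic DDIM-style reverse process over all frames at once."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    # Every frame starts from noise simultaneously; there is no left-to-right order.
    x = [rng.gauss(0.0, 1.0) for _ in condition]
    for t in range(num_steps, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = toy_denoiser(x, ab_t, condition)
        # Estimate the clean signal, then move to the previous noise level.
        x0_hat = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t) for xi, e in zip(x, eps)]
        x = [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e for x0, e in zip(x0_hat, eps)]
    return x


vocal_condition = [0.2, -0.5, 1.0, 0.0]  # hypothetical per-frame conditioning values
print(sample(vocal_condition))
```

The loop runs `num_steps` times regardless of how many frames `condition` holds, which is the property that separates this style of generation from token-by-token decoding.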

Through extensive experiments, the paper demonstrates that the proposed Fast SAG method can generate better samples than SingSong, while accelerating the generation process by at least 30 times. This makes the method much more practical for real-time applications like interactive music generation systems or controllable music generation.
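Whether a 30-times speedup is enough for real-time use depends on how slow the baseline was to begin with. The sketch below works through the arithmetic using the real-time factor (generation time divided by audio duration). The baseline timing is a made-up number for illustration, not a measurement from the paper.

```python
def real_time_factor(generation_time_s, audio_duration_s):
    """Generation time per second of audio; values at or below 1.0 keep up with playback."""
    return generation_time_s / audio_duration_s


def sped_up_time(baseline_time_s, speedup):
    """Generation time after applying a multiplicative speedup."""
    return baseline_time_s / speedup


# ASSUMED baseline: 300 s to generate a 10 s clip (hypothetical, not from the paper).
baseline_rtf = real_time_factor(300.0, 10.0)
fast_rtf = real_time_factor(sped_up_time(300.0, 30.0), 10.0)
print(baseline_rtf, fast_rtf)
```

With this assumed baseline, a 30-times speedup brings the real-time factor from 30 down to 1, which is just at the threshold. A slower baseline would still leave the faster system short of real time.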

Critical Analysis

The paper presents a compelling approach to improving the speed and quality of Singing Accompaniment Generation (SAG) systems. By moving away from the complex, recursive autoregressive model used in SingSong, the researchers have significantly simplified the process while maintaining coherent, semantically aligned musical accompaniment.

However, the paper does not fully address the potential limitations of the diffusion-based framework. Diffusion models, while powerful, can be computationally intensive and may struggle with fine-grained control or real-time responsiveness, especially for high-fidelity audio generation. Music style transfer using diffusion models has shown promising results, but there may be challenges in adapting this approach for interactive, low-latency applications.

Additionally, the paper focuses on generating the Mel spectrogram of the accompaniment, but does not provide details on how the final audio waveform is reconstructed. The use of Mel spectrograms may limit the quality or fidelity of the generated music, and the paper could benefit from a more in-depth discussion of this aspect.
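For readers unfamiliar with the representation, the sketch below shows why a Mel spectrogram is a lossy view of audio. It uses the common HTK-style Mel formula, which is one convention among several. The band count and frequency range are illustrative assumptions, not the paper's settings.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style conversion from frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bands, f_min_hz, f_max_hz):
    """Edges (in Hz) of bands that are evenly spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    return [mel_to_hz(lo + (hi - lo) * i / num_bands) for i in range(num_bands + 1)]


# ASSUMED settings for illustration: 8 bands covering 0 to 8000 Hz.
edges = mel_band_edges(8, 0.0, 8000.0)
widths = [b - a for a, b in zip(edges, edges[1:])]
print([round(w) for w in widths])
```

The bands get wider as frequency rises, so high-frequency detail is averaged together. A Mel spectrogram also stores magnitudes only, with no phase. Both are reasons a separate vocoder stage is needed to turn the generated spectrogram back into a waveform.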

Overall, the Fast SAG method represents an interesting and potentially impactful advance in the field of singing-based music generation. Further research exploring the limitations and real-world performance of this approach would be valuable for advancing the state of the art in human-AI symbiotic art creation.

Conclusion

This paper presents a new "Fast SAG" method for Singing Accompaniment Generation that uses a non-autoregressive, diffusion-based framework to directly generate high-quality, coherent musical accompaniment for vocal input. By carefully designing the conditioning information and incorporating specialized techniques, the researchers have significantly simplified the process compared to the previous state-of-the-art SingSong method, while also accelerating the generation by at least 30 times.

This advancement has the potential to enable more practical, real-time applications of singing-based music generation systems, which could be valuable for interactive art creation, live performances, and other human-AI collaborative experiences. However, further research is needed to address potential limitations of the diffusion-based approach, such as computational cost and fine-grained control, which the paper does not fully explore. Overall, the Fast SAG method represents an important step toward efficient systems for symbiotic human-AI art creation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FastSAG: Towards Fast Non-Autoregressive Singing Accompaniment Generation

Jianyi Chen, Wei Xue, Xu Tan, Zhen Ye, Qifeng Liu, Yike Guo

Singing Accompaniment Generation (SAG), which generates instrumental music to accompany input vocals, is crucial to developing human-AI symbiotic art creation systems. The state-of-the-art method, SingSong, utilizes a multi-stage autoregressive (AR) model for SAG, however, this method is extremely slow as it generates semantic and acoustic tokens recursively, and this makes it impossible for real-time applications. In this paper, we aim to develop a Fast SAG method that can create high-quality and coherent accompaniments. A non-AR diffusion-based framework is developed, which by carefully designing the conditions inferred from the vocal signals, generates the Mel spectrogram of the target accompaniment directly. With diffusion and Mel spectrogram modeling, the proposed method significantly simplifies the AR token-based SingSong framework, and largely accelerates the generation. We also design semantic projection, prior projection blocks as well as a set of loss functions, to ensure the generated accompaniment has semantic and rhythm coherence with the vocal signal. By intensive experimental studies, we demonstrate that the proposed method can generate better samples than SingSong, and accelerate the generation by at least 30 times. Audio samples and code are available at https://fastsag.github.io/.

5/14/2024

Parallel Synthesis for Autoregressive Speech Generation

Po-chun Hsu, Da-rong Liu, Andy T. Liu, Hung-yi Lee

Autoregressive neural vocoders have achieved outstanding performance in speech synthesis tasks such as text-to-speech and voice conversion. An autoregressive vocoder predicts a sample at some time step conditioned on those at previous time steps. Though it synthesizes natural human speech, the iterative generation inevitably makes the synthesis time proportional to the utterance length, leading to low efficiency. Many works were dedicated to generating the whole speech sequence in parallel and proposed GAN-based, flow-based, and score-based vocoders. This paper proposed a new thought for the autoregressive generation. Instead of iteratively predicting samples in a time sequence, the proposed model performs frequency-wise autoregressive generation (FAR) and bit-wise autoregressive generation (BAR) to synthesize speech. In FAR, a speech utterance is split into frequency subbands, and a subband is generated conditioned on the previously generated one. Similarly, in BAR, an 8-bit quantized signal is generated iteratively from the first bit. By redesigning the autoregressive method to compute in domains other than the time domain, the number of iterations in the proposed model is no longer proportional to the utterance length but to the number of subbands/bits, significantly increasing inference efficiency. Besides, a post-filter is employed to sample signals from output posteriors; its training objective is designed based on the characteristics of the proposed methods. Experimental results show that the proposed model can synthesize speech faster than real-time without GPU acceleration. Compared with baseline vocoders, the proposed model achieves better MUSHRA results and shows good generalization ability for unseen speakers and 44 kHz speech.

6/6/2024

Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment

Zhiqing Hong, Rongjie Huang, Xize Cheng, Yongqi Wang, Ruiqi Li, Fuming You, Zhou Zhao, Zhimeng Zhang

A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently. Little attention has been paid to exploring song synthesis. In this work, we propose a novel task called text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representation for controllable V2A synthesis. A Chinese song dataset mined from a music website is built up to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found in https://text2songMelodist.github.io/Sample/.

5/21/2024

Efficient Autoregressive Audio Modeling via Next-Scale Prediction

Kai Qiu, Xiang Li, Hao Chen, Jie Sun, Jinglu Wang, Zhe Lin, Marios Savvides, Bhiksha Raj

Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models. However, due to the naturally significant sequence length of audio, the efficiency of audio generation remains an essential issue to be addressed, especially for AR models that are incorporated in large language models (LLMs). In this paper, we analyze the token length of audio tokenization and propose a novel Scale-level Audio Tokenizer (SAT), with improved residual quantization. Based on SAT, a scale-level Acoustic AutoRegressive (AAR) modeling framework is further proposed, which shifts the next-token AR prediction to next-scale AR prediction, significantly reducing the training cost and inference time. To validate the effectiveness of the proposed approach, we comprehensively analyze design choices and demonstrate the proposed AAR framework achieves a remarkable 35× faster inference speed and +1.33 Fréchet Audio Distance (FAD) against baselines on the AudioSet benchmark. Code: https://github.com/qiuk2/AAR.

8/20/2024