Singer separation for karaoke content generation

Read original: arXiv:2110.06707 - Published 8/20/2024 by Hsuan-Yu Lin, Xuanjun Chen, Jyh-Shing Roger Jang

🛸

Overview

The rapid development of deep learning has enabled the successful separation of singing voice from mono audio music.
However, this separation can only extract human voices from other musical instruments, which is not ideal for karaoke content generation applications that only require the separation of lead singers.
The authors propose a singer separation system that generates karaoke content for one or two separated lead singers.
They introduce three models for the singer separation task and design an automatic model selection scheme to distinguish how many lead singers are in the song.
The authors also collected a large dataset, MIR-SingerSeparation, which has been publicly released to advance research in this area.
The singer separation system is most suitable for sentimental ballads and can be directly applied to karaoke content generation.

Plain English Explanation

The paper discusses a singer separation system that can separate the lead vocals from musical accompaniment, which is particularly useful for karaoke applications.

Traditional voice separation methods can extract human voices from other instruments, but this is not enough for karaoke, where you only want the lead singer's voice. The authors' system can take a song with multiple vocalists (like a duet) and isolate just the lead vocals, or extract a single lead vocal from a song with vocal harmony.

To do this, the researchers developed three different machine learning models and designed a way for the system to automatically figure out how many lead singers are in a song. They also collected a large dataset called MIR-SingerSeparation, which they've made publicly available to help advance research in this area.

The singer separation system works best on emotional ballads and can be directly used to generate karaoke versions of songs. As far as the authors know, this is the first work on singer separation specifically for real-world karaoke applications.

Technical Explanation

The paper proposes a singer separation system that can isolate the lead vocal tracks from musical accompaniment, which is crucial for generating karaoke content.

Traditional voice separation techniques can extract human voices from other instruments, but this is not sufficient for karaoke, where the goal is to isolate just the lead singer's voice. The authors introduce three different deep learning models for the singer separation task:

A model that can separate a single lead vocal from a song
A model that can separate two lead vocals (e.g. a male-female duet)
An automatic model selection scheme to determine how many lead singers are present in a song

The authors also collected a large dataset called MIR-SingerSeparation, which contains songs with annotated lead vocals. This dataset has been released publicly to spur further research in this area.

The proposed singer separation system is most effective on sentimental ballads and can be directly applied to generate karaoke content. To the best of the authors' knowledge, this is the first work on singer separation specifically targeting real-world karaoke applications.

Critical Analysis

The paper presents a novel approach to singer separation that is tailored for karaoke applications. The use of multiple models and an automatic model selection scheme is an interesting technical contribution that allows the system to handle a variety of real-world scenarios.

However, the paper does not provide a detailed discussion of the limitations of the proposed approach. For example, it's unclear how well the system would perform on more complex musical arrangements with dense instrumentation or vocal harmonies. Additionally, the authors do not address potential biases in the dataset, such as an overrepresentation of certain music genres or a lack of diversity in the lead vocalists.

Furthermore, the paper could have provided a more thorough analysis of the potential societal implications of this technology. While the focus on karaoke is a practical application, the ability to isolate and manipulate individual vocalists could raise ethical concerns around the use of such systems, such as the potential for misuse in deepfake or voice cloning applications.

Overall, the paper presents a promising technical approach, but could have benefited from a more critical and comprehensive examination of the potential limitations and broader implications of the proposed singer separation system.

Conclusion

This paper introduces a novel singer separation system that can isolate lead vocals from musical accompaniment, a crucial capability for generating karaoke content. The authors developed three deep learning models and an automatic model selection scheme to handle a variety of real-world scenarios, such as separating lead vocals from duets or extracting a single lead vocal from songs with vocal harmony.

The authors also collected and released a large dataset, MIR-SingerSeparation, to advance research in this area. While the paper presents a promising technical approach, it could have benefited from a more critical examination of the system's limitations and potential broader implications, such as ethical concerns around the use of such technology.

Overall, the proposed singer separation system represents an important step forward in enabling more sophisticated karaoke content generation, and the publicly available dataset will undoubtedly spur further research and development in this domain.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🛸

Singer separation for karaoke content generation

Hsuan-Yu Lin, Xuanjun Chen, Jyh-Shing Roger Jang

Due to the rapid development of deep learning, we can now successfully separate singing voice from mono audio music. However, this separation can only extract human voices from other musical instruments, which is undesirable for karaoke content generation applications that only require the separation of lead singers. For this karaoke application, we need to separate the music containing male and female duets into two vocals, or extract a single lead vocal from the music containing vocal harmony. For this reason, we propose in this article to use a singer separation system, which generates karaoke content for one or two separated lead singers. In particular, we introduced three models for the singer separation task and designed an automatic model selection scheme to distinguish how many lead singers are in the song. We also collected a large enough data set, MIR-SingerSeparation, which has been publicly released to advance the frontier of this research. Our singer separation is most suitable for sentimental ballads and can be directly applied to karaoke content generation. As far as we know, this is the first singer-separation work for real-world karaoke applications.

8/20/2024

Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife

Cinematic audio source separation (CASS), as a standalone problem of extracting individual stems from their mixture, is a fairly new subtask of audio source separation. A typical setup of CASS is a three-stem problem, with the aim of separating the mixture into the dialogue (DX), music (MX), and effects (FX) stems. Given the creative nature of cinematic sound production, however, several edge cases exist; some sound sources do not fit neatly in any of these three stems, necessitating the use of additional auxiliary stems in production. One very common edge case is the singing voice in film audio, which may belong in either the DX or MX or neither, depending heavily on the cinematic context. In this work, we demonstrate a very straightforward extension of the dedicated-decoder Bandit and query-based single-decoder Banquet models to a four-stem problem, treating non-musical dialogue, instrumental music, singing voice, and effects as separate stems. Interestingly, the query-based Banquet model outperformed the dedicated-decoder Bandit model. We hypothesized that this is due to a better feature alignment at the bottleneck as enforced by the band-agnostic FiLM layer. Dataset and model implementation will be made available at https://github.com/kwatcharasupat/source-separation-landing.

8/27/2024

From Real to Cloned Singer Identification

Dorian Desblancs, Gabriel Meseguer-Brocal, Romain Hennequin, Manuel Moussallam

Cloned voices of popular singers sound increasingly realistic and have gained popularity over the past few years. They however pose a threat to the industry due to personality rights concerns. As such, methods to identify the original singer in synthetic voices are needed. In this paper, we investigate how singer identification methods could be used for such a task. We present three embedding models that are trained using a singer-level contrastive learning scheme, where positive pairs consist of segments with vocals from the same singers. These segments can be mixtures for the first model, vocals for the second, and both for the third. We demonstrate that all three models are highly capable of identifying real singers. However, their performance deteriorates when classifying cloned versions of singers in our evaluation set. This is especially true for models that use mixtures as an input. These findings highlight the need to understand the biases that exist within singer identification systems, and how they can influence the identification of voice deepfakes in music.

7/12/2024

SongCreator: Lyrics-based Universal Song Generation

Shun Lei, Yixuan Zhou, Boshi Tang, Max W. Y. Lam, Feng Liu, Hangyu Liu, Jingcheng Wu, Shiyin Kang, Zhiyong Wu, Helen Meng

Music is an integral part of human culture, embodying human intelligence and creativity, of which songs compose an essential part. While various aspects of song generation have been explored by previous works, such as singing voice, vocal composition and instrumental arrangement, etc., generating songs with both vocals and accompaniment given lyrics remains a significant challenge, hindering the application of music generation models in the real world. In this light, we propose SongCreator, a song-generation system designed to tackle this challenge. The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) to capture the information of vocals and accompaniment for song generation, and an additional attention mask strategy for DSLM, which allows our model to understand, generate and edit songs, making it suitable for various song-related generation tasks. Extensive experiments demonstrate the effectiveness of SongCreator by achieving state-of-the-art or competitive performances on all eight tasks. Notably, it surpasses previous works by a large margin in lyrics-to-song and lyrics-to-vocals. Additionally, it is able to independently control the acoustic conditions of the vocals and accompaniment in the generated song through different prompts, exhibiting its potential applicability. Our samples are available at https://songcreator.github.io/.

9/11/2024