Toward Deep Drum Source Separation

Read original: arXiv:2312.09663 - Published 5/21/2024 by Alessandro Ilic Mezza, Riccardo Giampiccolo, Alberto Bernardini, Augusto Sarti

Overview

This paper introduces StemGMD, a large-scale audio dataset of isolated single-instrument drum stems, and LarsNet, a deep learning model for drum source separation developed with that dataset.
Drum source separation is the process of splitting a drum mixture into its individual instruments; it has applications in music production, music information retrieval, and other audio signal processing domains.
LarsNet uses a bank of dedicated U-Nets to separate five stems from a stereo drum mixture faster than real time, and it is shown to significantly outperform state-of-the-art nonnegative spectro-temporal factorization methods.

Plain English Explanation

Imagine you have a drum recording in which the kick, snare, hi-hat, and cymbals are all mixed together on one track. Drum source separation is the process of taking that mixture and pulling each drum out into its own track. This is useful for a few reasons:

In music production, it allows engineers to adjust the volume and effects of each drum individually to get the desired sound.
In music information retrieval applications, isolated drum stems can be used for tasks like beat detection or rhythmic analysis.
It also has potential uses in audio classification and other audio signal processing domains.

The researchers in this paper contributed two things. StemGMD is a dataset: 1224 hours of isolated drum audio, synthesized from MIDI recordings of expressive drum performances using ten real-sounding acoustic drum kits. LarsNet is a model developed with that data: it uses a bank of dedicated neural networks (U-Nets) to split a stereo drum mixture into five stems. Before StemGMD, limited data availability had held back the use of deep learning for this task.

Technical Explanation

The paper makes two contributions to the task of drum source separation, a dataset and a model:

StemGMD: A large-scale audio dataset of isolated single-instrument drum stems. Each clip is synthesized from MIDI recordings of expressive drum performances using ten real-sounding acoustic drum kits. At 1224 hours, it is the largest audio dataset of drums to date and the first to comprise isolated audio clips for every instrument in a canonical nine-piece drum kit.
LarsNet: A deep drum source separation model developed with StemGMD. It consists of a bank of dedicated U-Nets and separates five stems from a stereo drum mixture faster than real time.
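To get a feel for the dataset's scale, the figures quoted in the abstract (1224 hours, ten kits, nine instruments) can be combined with a little arithmetic. The sketch below assumes that every performance is rendered once per kit and per instrument; that reading is an assumption made here for illustration, not something the abstract states.

```python
TOTAL_HOURS = 1224    # total audio, from the abstract
NUM_KITS = 10         # acoustic drum kits, from the abstract
NUM_INSTRUMENTS = 9   # canonical nine-piece kit, from the abstract

# Assumption: each performance is rendered once per kit and per instrument.
renders_per_performance = NUM_KITS * NUM_INSTRUMENTS
hours_of_performances = TOTAL_HOURS / renders_per_performance
hours_per_kit = TOTAL_HOURS / NUM_KITS

print(f"{renders_per_performance} renders per performance")
print(f"{hours_of_performances:.1f} h of underlying performances")
print(f"{hours_per_kit:.1f} h of isolated stems per kit")
```

Under that assumption, most of the dataset's size comes from re-rendering the same performances across kits and instruments rather than from new playing.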

The authors motivate the dataset by noting that limited data availability had previously hindered the adoption of deep learning in drum source separation, even as such methods found success in related audio applications. With StemGMD available, LarsNet is shown to significantly outperform state-of-the-art nonnegative spectro-temporal factorization methods.
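The abstract describes LarsNet as a bank of dedicated U-Nets that produces five stems. A minimal, dependency-free sketch of that "bank of dedicated models" idea follows. It is not the authors' implementation: the stem names, the mask-based formulation, and the toy `band_mask` stand-in for a trained U-Net are all assumptions chosen for illustration, and a single channel is shown where a stereo input would be handled per channel.

```python
from typing import Callable, Dict, List

Spectrogram = List[List[float]]  # [frequency_bin][time_frame] magnitudes
MaskModel = Callable[[Spectrogram], Spectrogram]  # returns a mask in [0, 1]


def band_mask(lo: int, hi: int) -> MaskModel:
    """Toy stand-in for a trained U-Net: passes bins lo..hi-1, blocks the rest."""
    def model(mix: Spectrogram) -> Spectrogram:
        return [[1.0 if lo <= f < hi else 0.0 for _ in row]
                for f, row in enumerate(mix)]
    return model


def separate(mix: Spectrogram, bank: Dict[str, MaskModel]) -> Dict[str, Spectrogram]:
    """Run every dedicated model on the same mixture and apply its mask."""
    stems = {}
    for name, model in bank.items():
        mask = model(mix)
        stems[name] = [[m * x for m, x in zip(mask_row, mix_row)]
                       for mask_row, mix_row in zip(mask, mix)]
    return stems


# Hypothetical stem names and band edges, chosen only for illustration.
bank = {
    "kick": band_mask(0, 1),
    "snare": band_mask(1, 2),
    "toms": band_mask(2, 3),
    "hihat": band_mask(3, 4),
    "cymbals": band_mask(4, 5),
}

mix = [[float(f + 1)] * 3 for f in range(5)]  # 5 bins x 3 frames
stems = separate(mix, bank)
```

The point of the design is that each model sees the same mixture and is responsible for only one stem, so the models can be trained and run independently of one another.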

Critical Analysis

The abstract reports that LarsNet significantly outperforms nonnegative spectro-temporal factorization baselines. A few limitations are still worth keeping in mind:

StemGMD is synthesized from MIDI performances rather than recorded with microphones, so performance on real-world drum recordings may differ.
LarsNet separates five stems, while StemGMD provides isolated clips for all nine instruments of the kit; separation at that finer granularity could be an area for future research.
While the reported results are strong, there may still be room for improvement, especially in preserving the transient and timbral characteristics of the drum sounds.

Additionally, although the abstract states that LarsNet runs faster than real time, it does not specify the hardware or latency involved, which are important factors for real-world applications, especially in edge computing scenarios.
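Inference speed for separation models is commonly summarized by the real-time factor: processing time divided by the duration of the audio processed, with values below 1 meaning faster than real time. A minimal sketch of how one might measure it is shown below; the function names and the placeholder separator are hypothetical and not taken from the paper.

```python
import time
from typing import Callable, List


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(separate: Callable[[List[float]], object],
                samples: List[float], sample_rate: int) -> float:
    """Time one call to `separate` and relate it to the clip's duration."""
    start = time.perf_counter()
    separate(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


# Placeholder "separator" that only copies its input; swap in a real model.
one_second_clip = [0.0] * 44100
rtf = measure_rtf(lambda x: list(x), one_second_clip, 44100)
```

Because the result depends heavily on hardware and clip length, a real-time factor is only meaningful alongside the conditions under which it was measured.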

Conclusion

This paper presents StemGMD, a 1224-hour dataset of isolated drum stems, and LarsNet, a deep drum source separation model developed with it. Using a bank of dedicated U-Nets, LarsNet separates five stems from a stereo drum mixture faster than real time and significantly outperforms state-of-the-art nonnegative spectro-temporal factorization methods, with potential applications in music production, music information retrieval, and other audio signal processing domains. Open questions include how well a model trained on synthesized data handles real-world drum recordings and whether separation can be extended to more stems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Toward Deep Drum Source Separation

Alessandro Ilic Mezza, Riccardo Giampiccolo, Alberto Bernardini, Augusto Sarti

In the past, the field of drum source separation faced significant challenges due to limited data availability, hindering the adoption of cutting-edge deep learning methods that have found success in other related audio applications. In this manuscript, we introduce StemGMD, a large-scale audio dataset of isolated single-instrument drum stems. Each audio clip is synthesized from MIDI recordings of expressive drums performances using ten real-sounding acoustic drum kits. Totaling 1224 hours, StemGMD is the largest audio dataset of drums to date and the first to comprise isolated audio clips for every instrument in a canonical nine-piece drum kit. We leverage StemGMD to develop LarsNet, a novel deep drum source separation model. Through a bank of dedicated U-Nets, LarsNet can separate five stems from a stereo drum mixture faster than real-time and is shown to significantly outperform state-of-the-art nonnegative spectro-temporal factorization methods.

5/21/2024

A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems

Karn N. Watcharasupat, Alexander Lerch

Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached the performance level of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs. Implementation is available at https://github.com/kwatcharasupat/query-bandit.

8/27/2024

Improving Real-Time Music Accompaniment Separation with MMDenseNet

Chun-Hsiang Wang, Chung-Che Wang, Jun-You Wang, Jyh-Shing Roger Jang, Yen-Hsun Chu

Music source separation aims to separate polyphonic music into different types of sources. Most existing methods focus on enhancing the quality of separated results by using a larger model structure, rendering them unsuitable for deployment on edge devices. Moreover, these methods may produce low-quality output when the input duration is short, making them impractical for real-time applications. Therefore, the goal of this paper is to enhance a lightweight model, MMDenseNet, to strike a balance between separation quality and latency for real-time applications. Different directions of improvement are explored or proposed in this paper, including complex ideal ratio mask, self-attention, band-merge-split method, and feature look back. Source-to-distortion ratio, real-time factor, and optimal latency are employed to evaluate the performance. To align with our application requirements, the evaluation process in this paper focuses on the separation performance of the accompaniment part. Experimental results demonstrate that our improvement achieves low real-time factor and optimal latency while maintaining acceptable separation quality.

7/2/2024

Benchmarks and leaderboards for sound demixing tasks

Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva

Music demixing is the task of separating different tracks from the given single audio signal into components, such as drums, bass, and vocals from the rest of the accompaniment. Separation of sources is useful for a range of areas, including entertainment and hearing aids. In this paper, we introduce two new benchmarks for the sound source separation tasks and compare popular models for sound demixing, as well as their ensembles, on these benchmarks. For the models' assessments, we provide the leaderboard at https://mvsep.com/quality_checker/, giving a comparison for a range of models. The new benchmark datasets are available for download. We also develop a novel approach for audio separation, based on the ensembling of different models that are suited best for the particular stem. The proposed solution was evaluated in the context of the Music Demixing Challenge 2023 and achieved top results in different tracks of the challenge. The code and the approach are open-sourced on GitHub.

5/8/2024