Pre-training Music Classification Models via Music Source Separation

Read original: arXiv:2310.15845 - Published 4/24/2024 by Christos Garoufis, Athanasia Zlatintsi, Petros Maragos


Overview

This paper investigates whether music source separation can serve as a pre-training strategy for learning music representations that are later used in classification tasks.
The researchers pre-trained U-Net networks on various music source separation objectives, such as isolating vocals or instruments from a musical piece.
They then attached a classification "tail" network to the pre-trained U-Net and jointly fine-tuned the whole network.
Experiments on two music datasets showed that pre-training the U-Nets with source separation tasks can outperform training the whole network from scratch or using the tail network alone for music auto-tagging and genre classification.

Plain English Explanation

The researchers wanted to see if a technique called music source separation could be used to help train machine learning models for classifying music. Music source separation is the process of isolating individual instruments or vocals from a musical recording.

The researchers first trained U-Net neural networks to perform various music source separation tasks, such as pulling out just the vocals or just the instruments from a song. This pre-training step allowed the U-Nets to learn useful features about the different components of music.
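To make the pre-training objective concrete, here is a minimal sketch of what a vocal separation target looks like. It assumes toy magnitude spectrograms, treats the mixture as the sum of its sources (an approximation), and uses a mean absolute error loss. The summary does not state which loss the paper uses, so the loss choice and all names below are illustrative assumptions.

```python
# Toy magnitude "spectrograms": rows are time frames, columns are frequency bins.
# All values are made up for illustration.
vocals = [[0.0, 0.8, 0.1], [0.0, 0.6, 0.3]]
accompaniment = [[0.5, 0.1, 0.0], [0.4, 0.2, 0.0]]


def add(a, b):
    """Element-wise sum of two equally sized spectrograms."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def l1_loss(estimate, target):
    """Mean absolute error between two equally sized spectrograms."""
    diffs = [
        abs(e - t)
        for row_e, row_t in zip(estimate, target)
        for e, t in zip(row_e, row_t)
    ]
    return sum(diffs) / len(diffs)


# Assumption: magnitudes of the sources add up to the mixture (approximation).
mixture = add(vocals, accompaniment)

# A perfect separator would output the vocals exactly.
loss_perfect = l1_loss(vocals, vocals)

# A network that simply echoes its input is penalised by the accompaniment energy.
loss_identity = l1_loss(mixture, vocals)
```

During pre-training, the network sees only the mixture and is updated to lower a loss of this kind against the isolated source, which is what pushes it to learn features that distinguish vocals from instruments.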

Next, the researchers took these pre-trained U-Nets and attached a "tail" network on top of them. They then fine-tuned the entire system, allowing the tail network to learn from the features already captured by the U-Net during the source separation pre-training.

When the researchers tested this approach on two music classification tasks - auto-tagging songs and classifying music genres - they found that pre-training with source separation helped the model perform better than either training the whole system from scratch or using the tail network alone. The features learned during the source separation pre-training seemed to give the model a helpful head start for the classification tasks.

Technical Explanation

The researchers pre-trained U-Net networks on various music source separation objectives, including isolating vocal and instrumental sources from musical pieces. They then attached a classification "tail" network to the pre-trained U-Nets and jointly fine-tuned the whole system. The features learned by the separation network are propagated to the tail network through a convolutional feature adaptation module. The authors also report that the framework can be integrated into both convolutional and Transformer-based backends.
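The data flow described above can be sketched in a few lines. This is a toy illustration, not the authors' architecture: single linear maps stand in for the U-Net, the feature adaptation module, and the tail network, the weights are random rather than pre-trained, and the way the tail combines the separator output with the adapted features (simple concatenation) is an assumption. All names and sizes are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical sizes: frequency bins, separator features, adapted features, tags.
N_BINS, N_FEAT, N_ADAPT, N_TAGS = 4, 3, 2, 5


def rand_matrix(n_out, n_in):
    """Random weight matrix stored as a list of rows."""
    return [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]


def linear(vec, weights):
    """Multiply a vector by a weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


W_enc = rand_matrix(N_FEAT, N_BINS)    # stand-in for the separation network (encoder side)
W_dec = rand_matrix(N_BINS, N_FEAT)    # stand-in for the separation network (decoder side)
W_adapt = rand_matrix(N_ADAPT, N_FEAT)  # stand-in for the feature adaptation module
W_tail = rand_matrix(N_TAGS, N_BINS + N_ADAPT)  # stand-in for the tail classifier


def separator(frame):
    """Return the source estimate and the intermediate features for one frame."""
    features = [math.tanh(x) for x in linear(frame, W_enc)]
    estimate = linear(features, W_dec)
    return estimate, features


def classify(spectrogram):
    """Run the separator, adapt its features, and predict tag probabilities."""
    tail_inputs = []
    for frame in spectrogram:
        estimate, features = separator(frame)
        adapted = linear(features, W_adapt)
        # Assumption: the tail sees the estimate concatenated with adapted features.
        tail_inputs.append(estimate + adapted)
    # Average over time, then apply the tail and a per-tag sigmoid.
    pooled = [sum(col) / len(col) for col in zip(*tail_inputs)]
    logits = linear(pooled, W_tail)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


probs = classify([[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]])
```

In the pre-training setup, `W_enc` and `W_dec` would be initialised from the separation stage and then updated together with `W_adapt` and `W_tail` during fine-tuning. In the from-scratch baseline, all four would start from random values.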

Experiments were conducted on two widely used music datasets: one for music auto-tagging and one for music genre classification. The results showed that pre-training the U-Nets with source separation objectives can improve performance compared to training the entire network from scratch or using the tail network alone. For the auto-tagging task, vocal separation pre-training performed best, while for genre classification, multi-source separation pre-training was most effective.
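The two downstream tasks differ in their output structure: auto-tagging is usually treated as a multi-label problem, where a track can carry several tags at once, whereas genre classification usually assigns a single class per track. The sketch below shows the standard loss for each case. The summary does not say which losses the paper uses, so treat these as common defaults rather than the authors' exact setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(logits, targets):
    """Multi-label loss: each tag is an independent yes/no decision."""
    losses = [
        -(t * math.log(sigmoid(z)) + (1 - t) * math.log(1.0 - sigmoid(z)))
        for z, t in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def cross_entropy(logits, target_index):
    """Single-label loss: exactly one class is correct (softmax over classes)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


# Auto-tagging: two hypothetical tags, the first present and the second absent.
tag_loss = binary_cross_entropy([0.0, 0.0], [1, 0])

# Genre classification: four hypothetical genres, the third one is correct.
genre_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
```

With uninformative logits of zero, the tagging loss equals log 2 per tag and the genre loss equals log 4, which are the values a model must beat to do better than chance on each task.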

Critical Analysis

The paper provides a compelling demonstration of how pre-training on music source separation can benefit downstream music classification tasks. However, the authors acknowledge that the specific pre-training objectives and network architectures may need to be tailored to the target classification task for optimal performance.

Additionally, the paper does not explore the interpretability of the learned representations or whether they capture musically meaningful features. Further analysis of the internal representations could shed light on what the model is learning and why the pre-training approach is effective.

The researchers also note that their experiments were limited to a relatively small number of datasets and tasks. Evaluating the approach on a wider range of music classification problems would help establish the broader applicability and robustness of the method.

Conclusion

This paper demonstrates that music source separation can be an effective pre-training strategy for learning music representations aimed at classification tasks. By first training U-Net networks to isolate different components of music, the researchers were able to leverage those learned features to improve the performance of downstream auto-tagging and genre classification models.

The findings suggest that pre-training on relevant auxiliary tasks can be a powerful technique for enhancing the performance of music AI systems, opening up new possibilities for applications in areas like music recommendation, analysis, and generation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Pre-training Music Classification Models via Music Source Separation

Christos Garoufis, Athanasia Zlatintsi, Petros Maragos

In this paper, we study whether music source separation can be used as a pre-training strategy for music representation learning, targeted at music classification tasks. To this end, we first pre-train U-Net networks under various music source separation objectives, such as the isolation of vocal or instrumental sources from a musical piece; afterwards, we attach a classification network to the pre-trained U-Net and jointly finetune the whole network. The features learned by the separation network are also propagated to the tail network through a convolutional feature adaptation module. Experimental results in two widely used and publicly available datasets indicate that pre-training the U-Nets with a music source separation objective can improve performance compared to both training the whole network from scratch and using the tail network as a standalone in two music classification tasks, music auto-tagging and music genre classification. We also show that our proposed framework can be successfully integrated into both convolutional and Transformer-based backends, highlighting its modularity.

4/24/2024

An Experimental Comparison Of Multi-view Self-supervised Methods For Music Tagging

Gabriel Meseguer-Brocal, Dorian Desblancs, Romain Hennequin

Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data. It is particularly compelling in the music domain, where obtaining labeled data is time-consuming, error-prone, and ambiguous. During the self-supervised process, models are trained on pretext tasks, with the primary objective of acquiring robust and informative features that can later be fine-tuned for specific downstream tasks. The choice of the pretext task is critical as it guides the model to shape the feature space with meaningful constraints for information encoding. In the context of music, most works have relied on contrastive learning or masking techniques. In this study, we expand the scope of pretext tasks applied to music by investigating and comparing the performance of new self-supervised methods for music tagging. We open-source a simple ResNet model trained on a diverse catalog of millions of tracks. Our results demonstrate that, although most of these pre-training methods result in similar downstream results, contrastive learning consistently results in better downstream performance compared to other self-supervised pre-training methods. This holds true in a limited-data downstream context.

4/16/2024


The Whole Is Greater than the Sum of Its Parts: Improving Music Source Separation by Bridging Network

Ryosuke Sawata, Naoya Takahashi, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in computational cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL takes advantage of both the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source; hence, we called it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. Bridging operation does not increase the number of learnable parameters in the network. Experimental results showed the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net) and convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX.

8/7/2024

Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation

Adam Sorrenti

Separating vocal elements from musical tracks is a longstanding challenge in audio signal processing. This study tackles the distinct separation of vocal components from musical spectrograms. We employ the Short Time Fourier Transform (STFT) to convert audio waveforms into detailed frequency-time spectrograms, utilizing the benchmark MUSDB18 dataset for music separation. Subsequently, we implement a UNet neural network to segment the spectrogram image, aiming to delineate and extract singing voice components accurately. We achieved noteworthy results in audio source separation using our U-Net-based models. The combination of frequency-axis normalization with Min/Max scaling and the Mean Absolute Error (MAE) loss function achieved the highest Source-to-Distortion Ratio (SDR) of 7.1 dB, indicating a high level of accuracy in preserving the quality of the original signal during separation. This setup also recorded impressive Source-to-Interference Ratio (SIR) and Source-to-Artifact Ratio (SAR) scores of 25.2 dB and 7.2 dB, respectively. These values significantly outperformed other configurations, particularly those using Quantile-based normalization or a Mean Squared Error (MSE) loss function. Our source code, model weights, and demo material can be found at the project's GitHub repository: https://github.com/mbrotos/SoundSeg

5/31/2024