Improved symbolic drum style classification with grammar-based hierarchical representations

Read original: arXiv:2407.17536 - Published 7/26/2024 by Léo Géré (CNAM Paris, CEDRIC - VERTIGO), Philippe Rigaux (CEDRIC - VERTIGO, CNAM Paris), Nicolas Audebert (CEDRIC - VERTIGO, CNAM, IGN, LaSTIG)

Overview

Deep learning models have become essential for analyzing and classifying musical data.
These models can work with either audio signals (e.g., waveforms or spectrograms) or symbolic representations (e.g., MIDI).
Existing symbolic representations often rely on basic features like durations, pitches, and velocities, or generic tokenization strategies from natural language processing.
This work examines how enriched representations of symbolic data can impact the performance of deep learning models, such as Transformers and RNNs, for music style classification.

Plain English Explanation

Deep learning models have become a powerful tool for working with musical data. These models can analyze and categorize music in different ways - they can look at the actual audio signal (the waveform or spectrogram) or they can work with a symbolic representation of the music, like MIDI files.

Most existing approaches that use symbolic data, like MIDI, tend to reduce the musical information down to simple features like note lengths, pitches, and volumes. Or they use generic techniques from natural language processing to represent the data. This work evaluates how using a more detailed, "enriched" representation of the symbolic data can improve the performance of deep learning models for classifying different music styles.
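To make the "simple features" idea concrete, here is a minimal sketch of a flat, NLP-style tokenization of drum notes. The `Note` class, the `tokenize` function, and the token names (`Shift_`, `Pitch_`, `Vel_`, `Dur_`) are all hypothetical and chosen for illustration; they are not the encodings evaluated in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    onset: float     # onset time in beats
    duration: float  # length in beats
    pitch: int       # MIDI note number (General MIDI drums: 36 kick, 38 snare, 42 closed hi-hat)
    velocity: int    # MIDI velocity, 0-127


def tokenize(notes, velocity_bins=4, steps_per_beat=4):
    """Flatten notes into a token list: time shift, pitch, velocity bin, duration.

    The token vocabulary is an illustrative assumption, not a standard scheme.
    """
    tokens = []
    previous_step = 0
    for note in sorted(notes, key=lambda n: (n.onset, n.pitch)):
        step = round(note.onset * steps_per_beat)
        if step > previous_step:
            # Encode elapsed time since the previous onset on a fixed grid.
            tokens.append(f"Shift_{step - previous_step}")
            previous_step = step
        tokens.append(f"Pitch_{note.pitch}")
        tokens.append(f"Vel_{note.velocity * velocity_bins // 128}")
        tokens.append(f"Dur_{max(1, round(note.duration * steps_per_beat))}")
    return tokens


groove = [
    Note(0.0, 0.25, 36, 100),
    Note(0.0, 0.25, 42, 80),
    Note(1.0, 0.25, 38, 110),
    Note(1.5, 0.25, 42, 60),
]
tokens = tokenize(groove)
print(tokens)
```

In a sequence like this, the rhythmic organization (which hits fall on the beat, which subdivide it) is only implicit in the `Shift_` values, which is the limitation the summary describes.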

The key idea is that the way the symbolic music data is represented can have a big impact on how well deep learning models can learn and understand the underlying musical information. The researchers introduce a new way to represent MIDI data using a tree-based structure based on a musical grammar. This representation can more effectively capture the high-level rhythmic organization of the music compared to simpler approaches.

The researchers show that this new representation outperforms existing encodings at classifying drumming styles in the GrooveMIDI dataset. It is also more compact, and the models that use it need fewer parameters.

Technical Explanation

The paper evaluates how enriching the representation of symbolic music data (e.g., MIDI) affects the performance of deep learning models on music style classification.

The authors introduce a new tree-based representation of MIDI data that is built upon a context-free musical grammar. This representation is designed to explicitly capture high-level rhythmic information that is often implicit in basic MIDI encodings.
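This summary does not spell out the grammar itself, so the following is only a toy illustration of the general idea: a bar is recursively split into two or three equal parts until every onset sits at the start of a leaf. The production rules (`split2`, `split3`, `hit`, `rest`), the function names, and the "smallest tree wins" tie-break are all assumptions made for this sketch, not the authors' method.

```python
from fractions import Fraction


def size(tree):
    """Number of nodes in a tree (leaves are strings, inner nodes are (k, children))."""
    return 1 if isinstance(tree, str) else 1 + sum(size(c) for c in tree[1])


def build_tree(onsets, start=Fraction(0), length=Fraction(1), max_depth=4):
    """Parse onsets (fractions of a bar) into a tree under a toy rhythm grammar.

    Returns "rest", "hit", a (k, children) tuple with k in {2, 3},
    or None if no parse exists within max_depth.
    """
    inside = [t for t in onsets if start <= t < start + length]
    if not inside:
        return "rest"
    if all(t == start for t in inside):
        return "hit"
    if max_depth == 0:
        return None
    candidates = []
    for k in (2, 3):  # binary or ternary subdivision
        child_len = length / k
        children = [
            build_tree(inside, start + i * child_len, child_len, max_depth - 1)
            for i in range(k)
        ]
        if all(c is not None for c in children):
            candidates.append((k, children))
    # Prefer the most compact parse; ties go to the binary split.
    return min(candidates, key=size) if candidates else None


def linearize(tree):
    """Depth-first (prefix) serialization of the tree into tokens for a sequence model."""
    if isinstance(tree, str):
        return [tree]
    k, children = tree
    tokens = [f"split{k}"]
    for child in children:
        tokens.extend(linearize(child))
    return tokens


straight = linearize(build_tree([Fraction(0), Fraction(1, 4), Fraction(1, 2)]))
triplet = linearize(build_tree([Fraction(0), Fraction(1, 3), Fraction(2, 3)]))
print(straight)  # ['split2', 'split2', 'hit', 'hit', 'hit']
print(triplet)   # ['split3', 'hit', 'hit', 'hit']
```

The point of the sketch is that binary versus ternary subdivision becomes an explicit symbol in the sequence rather than something the model must infer from time offsets.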

The researchers compare this grammar-based representation against more generic tokenization strategies and matrix representations (e.g. piano rolls) when used as input to Transformer and RNN-based models. Experiments on the GrooveMIDI dataset for drumming style classification show that the grammar-based representation outperforms existing approaches while being more compact and parameter-efficient.
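For reference, a piano roll is a pitch-by-time binary matrix. The sketch below builds one in plain Python for a short drum pattern; the `piano_roll` helper, the 16-step grid, and the example hits are illustrative assumptions rather than the paper's preprocessing.

```python
def piano_roll(hits, pitches, steps):
    """Build a len(pitches) x steps binary matrix from (step, pitch) pairs."""
    row_of = {pitch: i for i, pitch in enumerate(pitches)}
    roll = [[0] * steps for _ in pitches]
    for step, pitch in hits:
        roll[row_of[pitch]][step] = 1
    return roll


# One bar on a 16-step grid: kick (36), snare (38), closed hi-hat (42).
hits = [(0, 36), (0, 42), (4, 38), (6, 42)]
roll = piano_roll(hits, pitches=[36, 38, 42], steps=16)

total_cells = sum(len(row) for row in roll)
active_cells = sum(sum(row) for row in roll)
for row in roll:
    print("".join("x" if cell else "." for cell in row))
print(f"{active_cells} active cells out of {total_cells}")
```

Most cells are empty, and the input size grows with the grid resolution rather than with the number of events, which is one reason sequence or tree encodings can be more compact.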

The key insight is that leveraging musical knowledge to create enriched data representations can lead to significant performance gains for deep learning models operating on symbolic music data. The grammar-based encoding is able to better capture the hierarchical rhythmic structure of the music compared to flatter, feature-based representations.

Critical Analysis

The paper provides a compelling demonstration of how the choice of data representation can impact the performance of deep learning models for music analysis tasks. The introduction of the grammar-based MIDI encoding is a novel contribution that effectively captures high-level musical structure.

However, the evaluation is limited to a single dataset and classification task. It would be valuable to see how the representations generalize to other symbolic music datasets and a wider range of modeling objectives, such as generation or music transcription.

Additionally, the paper does not provide a detailed analysis of the computational efficiency of the different representations in terms of training time, inference speed, or memory usage. This type of evaluation would help quantify the practical benefits of the more compact grammar-based encoding.

Further research could also explore combining the grammar-based representation with other musical features, such as pitch, dynamics, or higher-level musical attributes. Integrating multiple modalities of musical information may lead to even greater performance improvements for deep learning models.

Overall, this work highlights the importance of thoughtful data representation design when applying deep learning to symbolic music processing tasks. The grammar-based MIDI encoding is a promising direction for enhancing the musical understanding of these models.

Conclusion

This paper demonstrates that enriching the representation of symbolic music data such as MIDI can improve the performance of deep learning models on music style classification.

The researchers introduce a novel tree-based representation of MIDI data that is built upon a context-free musical grammar. This representation is able to more effectively capture the high-level rhythmic structure of the music compared to more generic tokenization or matrix-based approaches.

Experiments on the GrooveMIDI dataset show that deep learning models, including Transformers and RNNs, achieve better classification accuracy when using the grammar-based MIDI encoding as input. Importantly, this representation is also more compact and parameter-efficient than existing encodings.

The key takeaway is that thoughtful data representation design, grounded in musical knowledge, can be a critical factor in developing effective deep learning systems for symbolic music processing. This work provides a valuable contribution in this direction and suggests promising avenues for future research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improved symbolic drum style classification with grammar-based hierarchical representations

Léo Géré (CNAM Paris, CEDRIC - VERTIGO), Philippe Rigaux (CEDRIC - VERTIGO, CNAM Paris), Nicolas Audebert (CEDRIC - VERTIGO, CNAM, IGN, LaSTIG)

Deep learning models have become a critical tool for analysis and classification of musical data. These models operate either on the audio signal, e.g. waveform or spectrogram, or on a symbolic representation, such as MIDI. In the latter, musical information is often reduced to basic features, i.e. durations, pitches and velocities. Most existing works then rely on generic tokenization strategies from classical natural language processing, or matrix representations, e.g. piano roll. In this work, we evaluate how enriched representations of symbolic data can impact deep models, i.e. Transformers and RNN, for music style classification. In particular, we examine representations that explicitly incorporate musical information implicitly present in MIDI-like encodings, such as rhythmic organization, and show that they outperform generic tokenization strategies. We introduce a new tree-based representation of MIDI data built upon a context-free musical grammar. We show that this grammar representation accurately encodes high-level rhythmic information and outperforms existing encodings on the GrooveMIDI Dataset for drumming style classification, while being more compact and parameter-efficient.

7/26/2024

Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

Seungyeon Rhyu, Kichang Yang, Sungjun Cho, Jaehyeon Kim, Kyogu Lee, Moontae Lee

Music generation introduces challenging complexities to large language models. Symbolic structures of music often include vertical harmonization as well as horizontal counterpoint, urging various adaptations and enhancements for large-scale Transformers. However, existing works share three major drawbacks: 1) their tokenization requires domain-specific annotations, such as bars and beats, that are typically missing in raw MIDI data; 2) the pure impact of enhancing token embedding methods is hardly examined without domain-specific annotations; and 3) existing works to overcome the aforementioned drawbacks, such as MuseNet, lack reproducibility. To tackle such limitations, we develop a MIDI-based music generation framework inspired by MuseNet, empirically studying two structural embeddings that do not rely on domain-specific annotations. We provide various metrics and insights that can guide suitable encoding to deploy. We also verify that multiple embedding configurations can selectively boost certain musical aspects. By providing open-source implementations via HuggingFace, our findings shed light on leveraging large language models toward practical and reproducible music generation.

7/30/2024

BERT-like Pre-training for Symbolic Piano Music Classification Tasks

Yi-Hui Chou, I-Chun Chen, Chin-Jui Chang, Joann Ching, Yi-Hsuan Yang

This article presents a benchmark study of symbolic piano music classification using the masked language modelling approach of the Bidirectional Encoder Representations from Transformers (BERT). Specifically, we consider two types of MIDI data: MIDI scores, which are musical scores rendered directly into MIDI with no dynamics and precisely aligned with the metrical grid notated by their composer; and MIDI performances, which are MIDI encodings of human performances of musical scoresheets. With five public-domain datasets of single-track piano MIDI files, we pre-train two 12-layer Transformer models using the BERT approach, one for MIDI scores and the other for MIDI performances, and fine-tune them for four downstream classification tasks. These include two note-level classification tasks (melody extraction and velocity prediction) and two sequence-level classification tasks (style classification and emotion classification). Our evaluation shows that the BERT approach leads to higher classification accuracy than recurrent neural network (RNN)-based baselines.

4/16/2024

Hierarchical Symbolic Pop Music Generation with Graph Neural Networks

Wen Qing Lim, Jinhua Liang, Huan Zhang

Music is inherently made up of complex structures, and representing them as graphs helps to capture multiple levels of relationships. While music generation has been explored using various deep generation techniques, research on graph-related music generation is sparse. Earlier graph-based music generation worked only on generating melodies, and recent works to generate polyphonic music do not account for longer-term structure. In this paper, we explore a multi-graph approach to represent both the rhythmic patterns and phrase structure of Chinese pop music. Consequently, we propose a two-step approach that aims to generate polyphonic music with coherent rhythm and long-term structure. We train two Variational Auto-Encoder networks - one on a MIDI dataset to generate 4-bar phrases, and another on song structure labels to generate full song structure. Our work shows that the models are able to learn most of the structural nuances in the training dataset, including chord and pitch frequency distributions, and phrase attributes.

9/13/2024