Hierarchical Symbolic Pop Music Generation with Graph Neural Networks

Read original: arXiv:2409.08155 - Published 9/13/2024 by Wen Qing Lim, Jinhua Liang, Huan Zhang

Overview

This research paper explores using graph neural networks to generate symbolic pop music with hierarchical structure.
The authors propose a novel architecture that models the hierarchical structure of music and generates coherent and musically plausible compositions.
The system is evaluated on a dataset of pop songs and shown to outperform baseline methods in terms of both objective and subjective measures.

Plain English Explanation

The paper presents a new approach to generating symbolic music using graph neural networks. Music has a hierarchical structure, with high-level elements like melody and harmony, and lower-level elements like individual notes. The authors' system aims to capture this hierarchy in order to generate more coherent and musically plausible compositions.

The key idea is to represent the musical structure as a graph, with nodes for musical elements like notes, chords, and sections, and edges connecting related elements. A graph neural network is then used to learn the relationships between these elements and generate new music that follows the same hierarchical structure.
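
The graph representation described above can be illustrated with a small sketch. It is not taken from the paper: the `MusicGraph` class, the node kinds ("section", "chord", "note"), and the relation names ("contains", "next") are all hypothetical choices made here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MusicGraph:
    """A tiny typed graph: nodes carry a kind and attributes, edges carry a relation."""
    nodes: dict = field(default_factory=dict)  # node_id -> (kind, attrs)
    edges: list = field(default_factory=list)  # (src_id, relation, dst_id)

    def add_node(self, node_id, kind, **attrs):
        self.nodes[node_id] = (kind, attrs)

    def add_edge(self, src, relation, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id, relation=None):
        """Targets of edges leaving node_id, optionally filtered by relation."""
        return [d for s, r, d in self.edges
                if s == node_id and (relation is None or r == relation)]


def build_example():
    """Hypothetical fragment: a verse containing a C major chord with two notes."""
    g = MusicGraph()
    g.add_node("verse", "section")
    g.add_node("C_major", "chord", root="C", quality="maj")
    g.add_node("n0", "note", pitch=60, onset=0.0, duration=1.0)
    g.add_node("n1", "note", pitch=64, onset=1.0, duration=1.0)
    g.add_edge("verse", "contains", "C_major")   # hierarchy: section -> chord
    g.add_edge("C_major", "contains", "n0")      # hierarchy: chord -> note
    g.add_edge("C_major", "contains", "n1")
    g.add_edge("n0", "next", "n1")               # temporal order between notes
    return g
```

Here the "contains" edges encode the hierarchy and the "next" edge encodes time order, so a single structure holds both kinds of relationship.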

The system is evaluated on a dataset of pop songs and shown to outperform baseline methods in terms of both objective measures (e.g., harmonic coherence) and subjective measures (e.g., human ratings of musical quality). This suggests that the hierarchical approach can indeed capture important aspects of musical structure and lead to more musically meaningful compositions.

Technical Explanation

The authors propose a hierarchical symbolic music generation model that uses a graph neural network to capture the relationships between musical elements at different levels of the hierarchy.

The input to the system is a sequence of musical events, such as notes, chords, and bar lines, represented as a graph. The graph neural network then learns to predict the next event in the sequence, conditioned on the current state of the graph.
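
In graph neural networks, a node's state is usually computed by message passing, where each node updates its vector from its neighbors' vectors. The sketch below shows one such round with plain mean aggregation. It is a generic illustration, not the authors' network: the function name, the fixed `self_weight` blend, and the toy feature vectors are assumptions, and a trained model would use learned weight matrices and a nonlinearity instead.

```python
def message_passing_step(features, edges, self_weight=0.5):
    """One round of mean-neighbor aggregation.

    features: dict mapping node_id -> list of floats
    edges: list of directed (src, dst) pairs; messages flow src -> dst
    Each node's new vector blends its own vector with the mean of the
    vectors sent by its in-neighbors.
    """
    incoming = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(features[src])

    updated = {}
    for node, vec in features.items():
        msgs = incoming[node]
        if msgs:
            mean = [sum(col) / len(msgs) for col in zip(*msgs)]
        else:
            mean = vec  # nodes without in-neighbors keep their own vector
        updated[node] = [self_weight * v + (1 - self_weight) * m
                         for v, m in zip(vec, mean)]
    return updated
```

Stacking several such rounds lets information travel further across the graph, which is one common way a prediction for the next event can depend on more than its immediate neighbors.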

The key innovation is the use of a hierarchical architecture, where the network is composed of multiple layers that operate at different levels of the musical hierarchy. The lower layers focus on modeling local relationships between individual notes and chords, while the higher layers capture longer-range dependencies between larger musical structures, such as motifs and sections.
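
One simple way to move from a lower level of the hierarchy to a higher one is to pool the vectors of child elements into a vector for their parent. The following sketch uses mean pooling for this. It is only an assumption about how levels might be connected, not the mechanism described in the paper, and `pool_by_parent` and the example identifiers are hypothetical.

```python
def pool_by_parent(child_vectors, parent_of):
    """Average child vectors into one vector per parent.

    child_vectors: dict mapping child_id -> list of floats
    parent_of: dict mapping child_id -> parent_id
    """
    groups = {}
    for child, vec in child_vectors.items():
        groups.setdefault(parent_of[child], []).append(vec)
    return {parent: [sum(col) / len(vecs) for col in zip(*vecs)]
            for parent, vecs in groups.items()}


# Hypothetical two-level example: notes -> bars -> section.
note_vectors = {"n0": [1.0, 0.0], "n1": [3.0, 2.0], "n2": [5.0, 4.0]}
bar_vectors = pool_by_parent(note_vectors, {"n0": "bar1", "n1": "bar1", "n2": "bar2"})
section_vectors = pool_by_parent(bar_vectors, {"bar1": "verse", "bar2": "verse"})
```

Applying the same function twice yields bar-level and then section-level summaries, mirroring the idea that higher layers see coarser and longer-range structure.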

The authors evaluate their system on a dataset of pop songs and compare it to a number of baseline methods, including a traditional sequence-to-sequence model and a grammar-based model. The results show that the hierarchical graph neural network outperforms these baselines on both objective and subjective measures of musical quality, suggesting that it is able to effectively capture the hierarchical structure of music.
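
As an example of what an objective measure can look like, the sketch below compares the pitch-class distribution of a generated piece with that of a reference piece. This is a generic metric chosen for illustration. The function names are hypothetical, and the summary does not state that the authors compute their measures this way.

```python
from collections import Counter


def pitch_class_histogram(midi_pitches):
    """Normalized frequency of each of the 12 pitch classes (C=0 ... B=11)."""
    if not midi_pitches:
        raise ValueError("need at least one pitch")
    counts = Counter(p % 12 for p in midi_pitches)
    total = len(midi_pitches)
    return [counts.get(pc, 0) / total for pc in range(12)]


def histogram_overlap(h1, h2):
    """Sum of element-wise minima: 1.0 for identical distributions, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Toy MIDI pitch lists (made up for this example, not data from the paper).
reference = pitch_class_histogram([60, 64, 67, 72])   # C, E, G, C
generated = pitch_class_histogram([60, 62, 64, 67])   # C, D, E, G
score = histogram_overlap(reference, generated)
```

A score closer to 1.0 means the generated piece uses pitch classes in proportions similar to the reference.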

Critical Analysis

The paper presents a novel and promising approach to symbolic music generation, but there are a few potential limitations and areas for further research:

The authors only evaluate their system on a dataset of pop songs, which may not fully capture the diversity of musical styles and structures. It would be interesting to see how the system performs on other genres, such as classical or jazz music.
The hierarchical architecture of the model is a key contribution, but the authors do not provide a detailed analysis of how the different layers of the network contribute to the overall performance. A more in-depth investigation of the inner workings of the model could yield additional insights.
The authors mention that the system is able to generate coherent and musically plausible compositions, but they do not provide a comprehensive evaluation of the aesthetic or creative quality of the generated music. Developing more sophisticated evaluation metrics for creative output could be an important area for future research.
The system is trained on a relatively small dataset of pop songs, which may limit its ability to generalize to a wider range of musical styles and structures. Exploring ways to scale up the training data or leverage other sources of musical knowledge could be a fruitful direction for future work.

Overall, the paper presents an interesting and promising approach to symbolic music generation that leverages the hierarchical structure of music. Further research and development in this area could lead to significant advancements in the field of computational creativity and music generation.

Conclusion

This paper introduces a novel graph neural network-based approach to generating hierarchical symbolic pop music. The key innovation is the use of a hierarchical architecture that captures the multi-level structure of music, from individual notes to larger-scale musical elements like motifs and sections.

The system is evaluated on a dataset of pop songs and shown to outperform baseline methods in terms of both objective and subjective measures of musical quality. This suggests that the hierarchical approach can effectively model the complex relationships that underlie musical structure and lead to more coherent and musically plausible compositions.

The paper is a promising step forward in symbolic music generation, but several areas remain open: handling a wider range of musical styles, analyzing the inner workings of the hierarchical architecture, and developing more sophisticated evaluation metrics for creative output. Overall, the work advances the use of graph neural networks for music generation and highlights the potential of exploiting the hierarchical structure of music to produce more musically meaningful compositions.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Hierarchical Symbolic Pop Music Generation with Graph Neural Networks

Wen Qing Lim, Jinhua Liang, Huan Zhang

Music is inherently made up of complex structures, and representing them as graphs helps to capture multiple levels of relationships. While music generation has been explored using various deep generation techniques, research on graph-related music generation is sparse. Earlier graph-based music generation worked only on generating melodies, and recent works to generate polyphonic music do not account for longer-term structure. In this paper, we explore a multi-graph approach to represent both the rhythmic patterns and phrase structure of Chinese pop music. Consequently, we propose a two-step approach that aims to generate polyphonic music with coherent rhythm and long-term structure. We train two Variational Auto-Encoder networks - one on a MIDI dataset to generate 4-bar phrases, and another on song structure labels to generate full song structure. Our work shows that the models are able to learn most of the structural nuances in the training dataset, including chord and pitch frequency distributions, and phrase attributes.

9/13/2024

Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models

Ziyu Wang, Lejun Min, Gus Xia

Recent deep music generation studies have put much emphasis on long-term generation with structures. However, we are yet to see high-quality, well-structured whole-song generation. In this paper, we make the first attempt to model a full music piece under the realization of compositional hierarchy. With a focus on symbolic representations of pop songs, we define a hierarchical language, in which each level of hierarchy focuses on the semantics and context dependency at a certain music scope. The high-level languages reveal whole-song form, phrase, and cadence, whereas the low-level languages focus on notes, chords, and their local patterns. A cascaded diffusion model is trained to model the hierarchical language, where each level is conditioned on its upper levels. Experiments and analysis show that our model is capable of generating full-piece music with recognizable global verse-chorus structure and cadences, and the music quality is higher than the baselines. Additionally, we show that the proposed model is controllable in a flexible way. By sampling from the interpretable hierarchical languages or adjusting pre-trained external representations, users can control the music flow via various features such as phrase harmonic structures, rhythmic patterns, and accompaniment texture.

5/17/2024

Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

Seungyeon Rhyu, Kichang Yang, Sungjun Cho, Jaehyeon Kim, Kyogu Lee, Moontae Lee

Music generation introduces challenging complexities to large language models. Symbolic structures of music often include vertical harmonization as well as horizontal counterpoint, urging various adaptations and enhancements for large-scale Transformers. However, existing works share three major drawbacks: 1) their tokenization requires domain-specific annotations, such as bars and beats, that are typically missing in raw MIDI data; 2) the pure impact of enhancing token embedding methods is hardly examined without domain-specific annotations; and 3) existing works to overcome the aforementioned drawbacks, such as MuseNet, lack reproducibility. To tackle such limitations, we develop a MIDI-based music generation framework inspired by MuseNet, empirically studying two structural embeddings that do not rely on domain-specific annotations. We provide various metrics and insights that can guide suitable encoding to deploy. We also verify that multiple embedding configurations can selectively boost certain musical aspects. By providing open-source implementations via HuggingFace, our findings shed light on leveraging large language models toward practical and reproducible music generation.

7/30/2024

Improved symbolic drum style classification with grammar-based hierarchical representations

Léo Géré (CNAM Paris, CEDRIC - VERTIGO), Philippe Rigaux (CEDRIC - VERTIGO, CNAM Paris), Nicolas Audebert (CEDRIC - VERTIGO, CNAM, IGN, LaSTIG)

Deep learning models have become a critical tool for analysis and classification of musical data. These models operate either on the audio signal, e.g. waveform or spectrogram, or on a symbolic representation, such as MIDI. In the latter, musical information is often reduced to basic features, i.e. durations, pitches and velocities. Most existing works then rely on generic tokenization strategies from classical natural language processing, or matrix representations, e.g. piano roll. In this work, we evaluate how enriched representations of symbolic data can impact deep models, i.e. Transformers and RNN, for music style classification. In particular, we examine representations that explicitly incorporate musical information implicitly present in MIDI-like encodings, such as rhythmic organization, and show that they outperform generic tokenization strategies. We introduce a new tree-based representation of MIDI data built upon a context-free musical grammar. We show that this grammar representation accurately encodes high-level rhythmic information and outperforms existing encodings on the GrooveMIDI Dataset for drumming style classification, while being more compact and parameter-efficient.

7/26/2024