SYMPLEX: Controllable Symbolic Music Generation using Simplex Diffusion with Vocabulary Priors

Read original: arXiv:2405.12666 - Published 5/22/2024 by Nicolas Jonason, Luca Casini, Bob L. T. Sturm


Overview

Presents a new approach for fast and controllable generation of symbolic music based on simplex diffusion
Applies this technique to generating 4-bar multi-instrument music loops using an orderless representation
Shows that the model can be steered with vocabulary priors, allowing for control over the music generation process

Plain English Explanation

This paper introduces a new way to generate musical compositions quickly and controllably using a technique called simplex diffusion. Simplex diffusion is a diffusion process that works on probabilities over musical symbols (for example, which pitch or instrument a note has) rather than directly on the signal itself.

The researchers applied this method to generate short, 4-bar loops of music that can use multiple instruments. A key benefit of the approach is the degree of control it gives the user over the generated music. For example, you can influence the choice of instruments and the pitch and timing of the notes, all without adapting the underlying model to each task.

This level of control is achieved by using "vocabulary priors," which express preferences over the symbols the model is allowed to choose from and thereby steer the generation process. So the model isn't just producing arbitrary music; it generates loops that align with the preferences you specify.

Overall, this research demonstrates a way to quickly create music that can be customized to your liking without retraining the model for each kind of control. It is a development that could make music composition more accessible to a wider range of users.

Technical Explanation

The paper presents an approach to generating symbolic music based on simplex diffusion, a diffusion process that operates on probabilities (points on the probability simplex over a vocabulary) rather than in the signal space. The same objective has been applied in domains such as natural language processing; here it is used for fast, controllable generation of 4-bar multi-instrument music loops using an orderless representation.
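To make "operating on probabilities" concrete, here is a minimal sketch of how a discrete token can be mapped to a noisy point on the probability simplex. It assumes the near-one-hot logit convention used by simplex diffusion for text; the paper's exact formulation may differ. The names `token_to_logits`, `noisy_simplex_point`, `k`, and `noise_scale` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Map a list of logits to a probability vector (a point on the simplex)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_to_logits(token_id, vocab_size, k=5.0):
    """Near-one-hot logits: +k for the true token, -k elsewhere (assumed convention)."""
    return [k if i == token_id else -k for i in range(vocab_size)]


def noisy_simplex_point(token_id, vocab_size, noise_scale, rng, k=5.0):
    """Add Gaussian noise in logit space, then project onto the simplex."""
    logits = token_to_logits(token_id, vocab_size, k)
    noisy = [x + noise_scale * rng.gauss(0.0, 1.0) for x in logits]
    return softmax(noisy)


rng = random.Random(0)
clean = noisy_simplex_point(2, 4, 0.0, rng)   # no noise: mass concentrates on token 2
noisy = noisy_simplex_point(2, 4, 10.0, rng)  # heavy noise: still a valid distribution
```

Whatever the noise level, the result is a valid probability distribution over the vocabulary, so a denoising model trained on such inputs always reads and writes distributions rather than raw symbols.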

The key innovation is the use of vocabulary priors, which provide a way to steer the music generation process. These priors let the user influence factors such as instrumentation, pitch, and timing without task-specific model adaptation and without applying extrinsic control mechanisms.
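One plausible reading of a vocabulary prior is a non-negative weight per vocabulary entry that is multiplied into the model's predicted distribution and renormalized. The sketch below illustrates that reading only; it is an assumption, not the paper's confirmed mechanism, and the function name `apply_prior` and the toy instrument vocabulary are hypothetical.

```python
def apply_prior(probs, prior):
    """Reweight a probability vector by a non-negative prior and renormalize."""
    if len(probs) != len(prior):
        raise ValueError("probs and prior must have the same length")
    weighted = [p * w for p, w in zip(probs, prior)]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("prior excludes every token with nonzero probability")
    return [w / total for w in weighted]


# Hypothetical instrument vocabulary, for illustration only.
instruments = ["piano", "drums", "bass", "guitar"]
model_probs = [0.1, 0.2, 0.3, 0.4]

# A prior that allows only piano and bass.
prior = [1.0, 0.0, 1.0, 0.0]
steered = apply_prior(model_probs, prior)
```

Because the prior acts on the distribution itself, the same trained model can serve many control settings; only the prior vector changes.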

The researchers evaluate the approach qualitatively and quantitatively, reporting that it generates music consistent with the specified priors. The demonstrated tasks include infilling in time and pitch, as well as controlling the choice of instruments.
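The infilling tasks can be viewed as different ways of building priors for each note attribute: a known attribute gets a one-hot prior, an unknown one gets a uniform prior, and a partial constraint gets a range. The sketch below assumes a note is described by pitch, onset, and instrument attributes; the vocabulary sizes and all function names are hypothetical and do not come from the paper.

```python
# Hypothetical vocabulary sizes, for illustration only.
N_PITCH = 128   # MIDI pitch numbers
N_ONSET = 64    # assumed grid: 4 bars x 16 steps per bar
N_INSTR = 4     # toy instrument set


def uniform_prior(size):
    """No constraint: every value is allowed."""
    return [1.0] * size


def range_prior(size, low, high):
    """Allow only indices in the half-open range [low, high)."""
    return [1.0 if low <= i < high else 0.0 for i in range(size)]


def one_hot_prior(size, index):
    """Fix the attribute to a single known value."""
    return range_prior(size, index, index + 1)


def pitch_infill_priors(known_onset, known_instr):
    """Timing and instrument are given; the model fills in the pitch."""
    return {
        "pitch": uniform_prior(N_PITCH),
        "onset": one_hot_prior(N_ONSET, known_onset),
        "instrument": one_hot_prior(N_INSTR, known_instr),
    }


def bar_infill_priors(bar_index, steps_per_bar=16):
    """Generate a free note whose onset must fall inside one bar."""
    start = bar_index * steps_per_bar
    return {
        "pitch": uniform_prior(N_PITCH),
        "onset": range_prior(N_ONSET, start, start + steps_per_bar),
        "instrument": uniform_prior(N_INSTR),
    }


pitch_task = pitch_infill_priors(known_onset=8, known_instr=2)
bar_task = bar_infill_priors(bar_index=1)
```

Under this view, switching between tasks means changing which attributes are pinned and which are left open, with no change to the model.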

Critical Analysis

The paper presents a promising new technique for controllable symbolic music generation, but there are a few potential limitations worth considering:

The experiments focus on relatively short 4-bar loops, so it's unclear how well the approach would scale to generating longer, more complex musical pieces. Further research may be needed to explore hierarchical or long-form music generation.
While the vocabulary priors provide a useful mechanism for control, they may require careful crafting to achieve desired musical outcomes. The paper does not explore how sensitive the model is to the specific prior configurations.
The orderless representation used in this work may limit the ability to capture long-range musical structure and dependencies. Incorporating more sophisticated musical representations could be an area for future research.

Overall, the simplex diffusion approach represents an intriguing step forward in controllable music generation. However, further exploration of its scalability, robustness, and musical expressiveness would be valuable to fully assess its potential impact on the field.

Conclusion

This paper introduces a method for fast and controllable generation of symbolic music using simplex diffusion. By operating on probabilities rather than in the signal space, the approach generates 4-bar music loops with a high degree of user control over factors like instrumentation, pitch, and timing.

The key innovation is the use of vocabulary priors, which provide a way to steer the music generation process without requiring task-specific model adaptation. This opens up new possibilities for making music composition more accessible to a wide range of users, as it enables customization and creative exploration without deep musical expertise.

While the current work focuses on short musical snippets, the underlying principles could potentially be extended to generate longer, more complex compositions. Further research to address limitations around scalability and musical expressiveness could help unlock the full potential of this approach and advance the state of the art in generative music systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


SYMPLEX: Controllable Symbolic Music Generation using Simplex Diffusion with Vocabulary Priors

Nicolas Jonason, Luca Casini, Bob L. T. Sturm

We present a new approach for fast and controllable generation of symbolic music based on simplex diffusion, which is essentially a diffusion process operating on probabilities rather than the signal space. This objective has been applied in domains such as natural language processing, but here we apply it to generating 4-bar multi-instrument music loops using an orderless representation. We show that our model can be steered with vocabulary priors, which affords a considerable level of control over the music generation process, for instance, infilling in time and pitch and choice of instrumentation -- all without task-specific model adaptation or applying extrinsic control.

5/22/2024

Flexible Control in Symbolic Music Generation via Musical Metadata

Sangjun Han, Jiwon Ham, Chaeeun Lee, Heejin Kim, Soojong Do, Sihyuk Yi, Jun Seo, Seoyoon Kim, Yountae Jung, Woohyung Lim

In this work, we introduce the demonstration of symbolic music generation, focusing on providing short musical motifs that serve as the central theme of the narrative. For the generation, we adopt an autoregressive model which takes musical metadata as inputs and generates 4 bars of multitrack MIDI sequences. During training, we randomly drop tokens from the musical metadata to guarantee flexible control. It provides users with the freedom to select input types while maintaining generative performance, enabling greater flexibility in music composition. We validate the effectiveness of the strategy through experiments in terms of model capacity, musical fidelity, diversity, and controllability. Additionally, we scale up the model and compare it with other music generation models through a subjective test. Our results indicate its superiority in both control and music quality. We provide a URL link https://www.youtube.com/watch?v=-0drPrFJdMQ to our demonstration video.

9/14/2024

SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints

Haonan Chen, Jordan B. L. Smith, Janne Spijkervet, Ju-Chiang Wang, Pei Zou, Bochen Li, Qiuqiang Kong, Xingjian Du

Progress in the task of symbolic music generation may be lagging behind other tasks like audio and text generation, in part because of the scarcity of symbolic training data. In this paper, we leverage the greater scale of audio music data by applying pre-trained MIR models (for transcription, beat tracking, structure analysis, etc.) to extract symbolic events and encode them into token sequences. To the best of our knowledge, this work is the first to demonstrate the feasibility of training symbolic generation models solely from auto-transcribed audio data. Furthermore, to enhance the controllability of the trained model, we introduce SymPAC (Symbolic Music Language Model with Prompting And Constrained Generation), which is distinguished by using (a) prompt bars in encoding and (b) a technique called Constrained Generation via Finite State Machines (FSMs) during inference time. We show the flexibility and controllability of this approach, which may be critical in making music AI useful to creators and users.

9/11/2024

Why Perturbing Symbolic Music is Necessary: Fitting the Distribution of Never-used Notes through a Joint Probabilistic Diffusion Model

Shipei Liu, Xiaoya Fan, Guowei Wu

Existing music generation models are mostly language-based, neglecting the frequency continuity property of notes, resulting in inadequate fitting of rare or never-used notes and thus reducing the diversity of generated samples. We argue that the distribution of notes can be modeled by translational invariance and periodicity, especially using diffusion models to generalize notes by injecting frequency-domain Gaussian noise. However, due to the low-density nature of music symbols, estimating the distribution of notes latent in the high-density solution space poses significant challenges. To address this problem, we introduce the Music-Diff architecture, which fits a joint distribution of notes and accompanying semantic information to generate symbolic music conditionally. We first enhance the fragmentation module for extracting semantics by using event-based notations and the structural similarity index, thereby preventing boundary blurring. As a prerequisite for multivariate perturbation, we introduce a joint pre-training method to construct the progressions between notes and musical semantics while avoiding direct modeling of low-density notes. Finally, we recover the perturbed notes by a multi-branch denoiser that fits multiple noise objectives via Pareto optimization. Our experiments suggest that in contrast to language models, joint probability diffusion models perturbing at both note and semantic levels can provide more sample diversity and compositional regularity. The case study highlights the rhythmic advantages of our model over language- and DDPMs-based models by analyzing the hierarchical structure expressed in the self-similarity metrics.

8/6/2024