Subtractive Training for Music Stem Insertion using Latent Diffusion Models

2406.19328

Published 6/28/2024 by Ivan Villa-Renteria, Mason L. Wang, Zachary Shah, Zhe Li, Soohyun Kim, Neelesh Ramachandran, Mert Pilanci

cs.SD cs.LG eess.AS

🏋️

Abstract

We present Subtractive Training, a simple and novel method for synthesizing individual musical instrument stems given other instruments as context. This method pairs a dataset of complete music mixes with 1) a variant of the dataset lacking a specific stem, and 2) LLM-generated instructions describing how the missing stem should be reintroduced. We then fine-tune a pretrained text-to-audio diffusion model to generate the missing instrument stem, guided by both the existing stems and the text instruction. Our results demonstrate Subtractive Training's efficacy in creating authentic drum stems that seamlessly blend with the existing tracks. We also show that we can use the text instruction to control the generation of the inserted stem in terms of rhythm, dynamics, and genre, allowing us to modify the style of a single instrument in a full song while keeping the remaining instruments the same. Lastly, we extend this technique to MIDI formats, successfully generating compatible bass, drum, and guitar parts for incomplete arrangements.

Create account to get full access

Overview

This paper presents a novel approach for generating musical accompaniment in collaboration with human users, called Diff-Riff.
The researchers also introduce MusicMagus, a zero-shot text-to-music editing system, and InstructMusicGen, a system that allows for text-based editing of music.
Additionally, the paper discusses MelFusion, a framework for synthesizing music from image and language cues, as well as Arrange, Inpaint, Refine, a system for long-term music generation and editing.

Plain English Explanation

The researchers have developed several exciting AI-powered tools to help people create and edit music. Diff-Riff allows users to collaborate with an AI system to generate musical accompaniment. MusicMagus enables users to edit music just by describing what they want, without needing any musical expertise. InstructMusicGen takes this a step further, allowing users to provide detailed instructions to edit music.

The researchers also developed MelFusion, which can generate music based on images and text descriptions. And Arrange, Inpaint, Refine is a system that can create and edit music over longer time periods.

Overall, these tools aim to make music creation and editing more accessible to people without formal musical training, opening up new creative possibilities.

Technical Explanation

The Diff-Riff system uses a diffusion model to generate musical accompaniment that complements a user-provided melody. MusicMagus is a zero-shot text-to-music editing model that can modify music based on textual descriptions. InstructMusicGen builds on this by allowing users to provide more detailed, step-by-step instructions for editing music.

MelFusion uses a multimodal neural network to generate music from image and language cues. Arrange, Inpaint, Refine is a system that can create and edit music over longer time periods, allowing for more complex and coherent compositions.

Critical Analysis

The researchers acknowledge that their systems still have limitations, such as the need for further improvements in music quality and coherence over long time periods. They also note that the text-based editing capabilities, while powerful, may not fully capture the nuance and expressiveness of traditional musical composition.

Additionally, there are concerns about the potential for these systems to be used to generate inauthentic or plagiarized music, which could have ethical implications. Further research is needed to address these challenges and ensure the responsible development of these technologies.

Conclusion

This paper presents a suite of innovative AI-powered tools that aim to democratize music creation and editing. By leveraging techniques like diffusion models, text-to-music translation, and multimodal synthesis, these systems have the potential to empower more people to express themselves musically, regardless of their formal training.

The researchers have made significant strides in this direction, but there is still work to be done to improve the quality, coherence, and ethical considerations of these technologies. As the field of AI-assisted music continues to evolve, it will be important to carefully evaluate the impact and implications of these systems to ensure they are developed and deployed in a responsible manner.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🤷

Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

Javier Nistal, Marco Pasini, Cyran Aouameur, Maarten Grachten, Stefan Lattner

Recent advancements in deep generative models present new opportunities for music production but also pose challenges, such as high computational demands and limited audio quality. Moreover, current systems frequently rely solely on text input and typically focus on producing complete musical pieces, which is incompatible with existing workflows in music production. To address these issues, we introduce Diff-A-Riff, a Latent Diffusion Model designed to generate high-quality instrumental accompaniments adaptable to any musical context. This model offers control through either audio references, text prompts, or both, and produces 48kHz pseudo-stereo audio while significantly reducing inference time and memory usage. We demonstrate the model's capabilities through objective metrics and subjective listening tests, with extensive examples available on the accompanying website: sonycslparis.github.io/diffariff-companion/

6/13/2024

cs.SD cs.AI eess.AS

MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

Yixiao Zhang, Yukara Ikemiya, Gus Xia, Naoki Murata, Marco A. Mart'inez-Ram'irez, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, music generation usually involves iterative refinements, and how to edit the generated music remains a significant challenge. This paper introduces a novel approach to the editing of music generated by such models, enabling the modification of specific attributes, such as genre, mood and instrument, while maintaining other aspects unchanged. Our method transforms text editing to textit{latent space manipulation} while adding an extra constraint to enforce consistency. It seamlessly integrates with existing pretrained text-to-music diffusion models without requiring additional training. Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations. Additionally, we showcase the practical applicability of our approach in real-world music editing scenarios.

5/29/2024

cs.SD cs.AI cs.MM eess.AS

Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Mart'inez-Ram'irez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

Recent advances in text-to-music editing, which employ text queries to modify music (e.g. by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. To Combine the strengths and address these limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach involves a modification of the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which allow the model to process instruction texts and audio inputs concurrently and yield the desired edited music. Remarkably, Instruct-MusicGen only introduces 8% new parameters to the original MusicGen model and only trains for 5K steps, yet it achieves superior performance across all tasks compared to existing baselines, and demonstrates performance comparable to the models trained for specific tasks. This advancement not only enhances the efficiency of text-to-music editing but also broadens the applicability of music language models in dynamic music production environments.

5/30/2024

cs.SD cs.AI cs.LG cs.MM eess.AS

MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

Sanjoy Chowdhury, Sayan Nag, K J Joseph, Balaji Vasan Srinivasan, Dinesh Manocha

Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel visual synapse, which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will gather attention to this pragmatic, yet relatively under-explored research area.

6/10/2024

cs.CV cs.AI cs.MM eess.AS