MidiCaps -- A large-scale MIDI dataset with text captions

2406.02255

Published 6/5/2024 by Jan Melechovsky, Abhinaba Roy, Dorien Herremans

MidiCaps -- A large-scale MIDI dataset with text captions

Abstract

Generative models guided by text prompts are increasingly becoming more popular. However, no text-to-MIDI models currently exist, mostly due to the lack of a captioned MIDI dataset. This work aims to enable research that combines LLMs with symbolic music by presenting the first large-scale MIDI dataset with text captions that is openly available: MidiCaps. MIDI (Musical Instrument Digital Interface) files are a widely used format for encoding musical information. Their structured format captures the nuances of musical composition and has practical applications by music producers, composers, musicologists, as well as performers. Inspired by recent advancements in captioning techniques applied to various domains, we present a large-scale curated dataset of over 168k MIDI files accompanied by textual descriptions. Each MIDI caption succinctly describes the musical content, encompassing tempo, chord progression, time signature, instruments present, genre and mood; thereby facilitating multi-modal exploration and analysis. The dataset contains a mix of various genres, styles, and complexities, offering a rich source for training and evaluating models for tasks such as music information retrieval, music understanding and cross-modal translation. We provide detailed statistics about the dataset and have assessed the quality of the captions in an extensive listening study. We anticipate that this resource will stimulate further research in the intersection of music and natural language processing, fostering advancements in both fields.

Create account to get full access

Overview

• This paper presents MidiCaps, a large-scale dataset of MIDI music files paired with text captions. • The dataset aims to bridge the gap between music and language, enabling research into tasks like music retrieval, generation, and understanding. • The dataset contains over 1 million MIDI files matched with high-quality captions, making it the largest of its kind.

Plain English Explanation

MidiCaps is a new dataset that connects music and language. It contains over 1 million MIDI music files, which are a type of digital music notation, paired with detailed text descriptions. This allows researchers to explore new ways of working with music and text together.

For example, MusiLinGo is a model that can take a piece of music and generate a caption describing it in natural language. Or CapsaFusion could match a MIDI file to relevant images or videos. The dataset could even help build systems that can understand the meaning and emotions conveyed through music, similar to how NES connects video and music.

The large scale and high quality of the MidiCaps dataset makes it a powerful new tool for music and language AI research. Compared to previous datasets, MidiCaps provides much richer and more diverse data to work with. This could lead to significant advances, like enabling MICap models that can generate more natural and expressive music-based descriptions.

Technical Explanation

The MidiCaps dataset was constructed by matching over 1 million MIDI music files with high-quality text captions. The MIDI files were sourced from online repositories, while the captions were collected from music-related websites and forums.

The researchers used advanced natural language processing techniques to filter and clean the caption data, ensuring a high level of quality and relevance to the corresponding MIDI files. This resulted in a dataset with rich, diverse, and well-curated music-text pairs.

To enable research into areas like music retrieval and generation, the dataset is structured with a variety of metadata, including information about the MIDI files' musical properties, emotion, and genre. The researchers also provide baseline models and benchmarks to help other researchers get started with the dataset.

The large scale and comprehensive nature of MidiCaps make it a significant advancement over previous music-text datasets, which tended to be smaller and less diverse. This opens up new possibilities for BERT-like pre-training on symbolic piano music and other innovative applications of music-language AI.

Critical Analysis

The MidiCaps dataset represents an impressive effort to bridge the gap between music and language, but it does have some limitations. The researchers acknowledge that the dataset may contain biases, as the MIDI files and captions were drawn from online sources that may not be fully representative of all musical genres and styles.

Additionally, the quality and relevance of the captions, while high, may still vary depending on the source. The researchers suggest further work is needed to validate the accuracy and consistency of the caption-MIDI pairings.

Another potential issue is the reliance on MIDI files, which capture musical notation but may not fully convey the nuances and expressiveness of live performance. Incorporating other music formats, such as audio recordings, could provide a more complete picture of the music-language relationship.

Despite these caveats, the MidiCaps dataset represents a significant step forward in music-language AI research. By making this resource publicly available, the researchers have opened the door to a wide range of innovative applications and advancements in the field.

Conclusion

The MidiCaps dataset is a groundbreaking contribution to the field of music-language AI research. By providing a large-scale, high-quality collection of MIDI files paired with detailed text captions, the dataset enables new possibilities for tasks like music retrieval, generation, and understanding.

The dataset's scale and comprehensiveness make it a valuable tool for researchers, potentially leading to significant breakthroughs in areas like BERT-like pre-training on symbolic piano music and unified models for identity-aware movie descriptions. As the field of music-language AI continues to evolve, the MidiCaps dataset will undoubtedly play a crucial role in driving innovation and advancing our understanding of the complex relationship between music and language.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MusicScore: A Dataset for Music Score Modeling and Generation

Yuheng Lin, Zheqi Dai, Qiuqiang Kong

Music scores are written representations of music and contain rich information about musical components. The visual information on music scores includes notes, rests, staff lines, clefs, dynamics, and articulations. This visual information in music scores contains more semantic information than audio and symbolic representations of music. Previous music score datasets have limited sizes and are mainly designed for optical music recognition (OMR). There is a lack of research on creating a large-scale benchmark dataset for music modeling and generation. In this work, we propose MusicScore, a large-scale music score dataset collected and processed from the International Music Score Library Project (IMSLP). MusicScore consists of image-text pairs, where the image is a page of a music score and the text is the metadata of the music. The metadata of MusicScore is extracted from the general information section of the IMSLP pages. The metadata includes rich information about the composer, instrument, piece style, and genre of the music pieces. MusicScore is curated into small, medium, and large scales of 400, 14k, and 200k image-text pairs with varying diversity, respectively. We build a score generation system based on a UNet diffusion model to generate visually readable music scores conditioned on text descriptions to benchmark the MusicScore dataset for music score generation. MusicScore is released to the public at https://huggingface.co/datasets/ZheqiDAI/MusicScore.

6/18/2024

cs.MM cs.GR cs.SD eess.AS

💬

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, Emmanouil Benetos

Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains not well-explored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps datasets, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs. Our introduced dataset enables notable advancements beyond previous ones.

4/3/2024

eess.AS cs.AI cs.CL cs.MM cs.SD

🤯

Improving Text-To-Audio Models with Synthetic Captions

Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro

It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged textit{text-only language models} to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an textit{audio language model} to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named texttt{AF-AudioSet}, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements on audio generation quality, achieving a new textit{state-of-the-art}.

6/26/2024

cs.CL cs.LG cs.SD eess.AS

Taming Data and Transformers for Audio Generation

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Sergey Tulyakov, Vicente Ordonez

Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available with the audio modality, we can substantially improve the quality of captions. AutoCap reaches CIDEr score of 83.2, marking a 3.2% improvement from the best available captioning model at four times faster inference speed. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality captions, forming the largest available audio-text dataset. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters and train with our new dataset. When compared to state-of-the-art audio generators, GenAu obtains significant improvements of 15.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. This shows that the quality of data is often as important as its quantity. Besides, since AutoCap is fully automatic, new audio samples can be added to the training dataset, unlocking the training of even larger generative models for audio synthesis.

6/28/2024

cs.SD cs.CL cs.CV cs.MM eess.AS