The NES Video-Music Database: A Dataset of Symbolic Video Game Music Paired with Gameplay Videos

2404.04420

Published 4/9/2024 by Igor Cardoso, Rubens O. Moraes, Lucas N. Ferreira

Abstract

Neural models are one of the most popular approaches for music generation, yet there aren't standard large datasets tailored for learning music directly from game data. To address this research gap, we introduce a novel dataset named NES-VMDB, containing 98,940 gameplay videos from 389 NES games, each paired with its original soundtrack in symbolic format (MIDI). NES-VMDB is built upon the Nintendo Entertainment System Music Database (NES-MDB), encompassing 5,278 music pieces from 397 NES games. Our approach involves collecting long-play videos for 389 games of the original dataset, slicing them into 15-second-long clips, and extracting the audio from each clip. Subsequently, we apply an audio fingerprinting algorithm (similar to Shazam) to automatically identify the corresponding piece in the NES-MDB dataset. Additionally, we introduce a baseline method based on the Controllable Music Transformer to generate NES music conditioned on gameplay clips. We evaluated this approach with objective metrics, and the results showed that the conditional CMT improves musical structural quality when compared to its unconditional counterpart. Moreover, we used a neural classifier to predict the game genre of the generated pieces. Results showed that the CMT generator can learn correlations between gameplay videos and game genres, but further research has to be conducted to achieve human-level performance.

Overview

This paper introduces the NES Video-Music Database (NES-VMDB), a new dataset that pairs 98,940 gameplay videos from 389 Nintendo Entertainment System (NES) games with their original soundtracks in symbolic format (MIDI).
The dataset aims to enable research on tasks like automatic music generation, music-video synchronization, and understanding the relationship between video game music and visuals.
The paper describes the dataset creation process, introduces a baseline generator based on the Controllable Music Transformer (CMT), and evaluates that baseline with objective metrics.

Plain English Explanation

The researchers have created a new dataset that combines video game music and the corresponding gameplay videos from classic NES games. This dataset, called the NES Video-Music Database, is designed to help researchers study the connection between music and visuals in video games.

The SoundingActions paper may be relevant here, as it explores how actions in videos can be linked to their associated sounds.

By having access to both the music and the video footage, researchers can work on tasks like automatically generating new video game music, synchronizing music to match the on-screen action, and better understanding how the music and visuals in games are designed to work together. This could lead to insights that help improve the audio-visual experience in future video games.

Technical Explanation

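The researchers built NES-VMDB on top of the Nintendo Entertainment System Music Database (NES-MDB), which contains 5,278 music pieces in symbolic format (MIDI) from 397 NES games. They collected long-play videos for 389 of those games, sliced each video into 15-second clips, and extracted the audio from each clip. An audio fingerprinting algorithm (similar to Shazam) then automatically identified the NES-MDB piece corresponding to each clip, yielding 98,940 gameplay videos paired with their soundtracks.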
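The abstract describes slicing long-play videos into 15-second clips. A minimal sketch of that step is shown below, computing clip boundaries only and assuming the video duration is already known. The `Clip` record, the `slice_video` function, and the choice to drop a trailing partial clip by default are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass

CLIP_SECONDS = 15.0  # clip length stated in the abstract


@dataclass(frozen=True)
class Clip:
    """One fixed-length slice of a long-play video (hypothetical record type)."""
    game: str
    index: int
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def slice_video(game: str, duration: float,
                clip_len: float = CLIP_SECONDS,
                keep_partial: bool = False) -> list[Clip]:
    """Return consecutive, non-overlapping clip boundaries for one video.

    Assumption: a trailing clip shorter than clip_len is dropped unless
    keep_partial is True. The paper's actual handling is not stated here.
    """
    if duration < 0 or clip_len <= 0:
        raise ValueError("duration must be >= 0 and clip_len must be > 0")
    clips = []
    index = 0
    start = 0.0
    while start + clip_len <= duration:
        clips.append(Clip(game, index, start, start + clip_len))
        index += 1
        start = index * clip_len  # multiply rather than accumulate to limit drift
    if keep_partial and start < duration:
        clips.append(Clip(game, index, start, duration))
    return clips


clips = slice_video("example_game", duration=50.0)
for clip in clips:
    print(clip)
```

In practice, each boundary pair would be handed to a video tool to cut the clip and extract its audio; that part is left out because the abstract does not say which tools were used.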

The MusiLingo paper may be relevant here, as it explores ways to connect music and text, which could be extended to connecting music and video.

The dataset includes information about the game, level, and time-aligned music and video data. The researchers analyzed the dataset to understand characteristics like the distribution of game genres, the diversity of musical styles, and the relationship between music and video features.
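The abstract says each clip's audio is matched to an NES-MDB piece with an audio fingerprinting algorithm similar to Shazam, but gives no further detail. As a rough illustration only, the sketch below shows the offset-voting idea commonly associated with that family of methods: hashes shared between a clip and a piece count as evidence only when they agree on the same time offset. The integer hashes, frame indices, `build_index`, `best_match`, and the `min_votes` threshold are all hypothetical, and the step that derives hashes from real audio is omitted.

```python
from collections import Counter, defaultdict

# A fingerprint is a list of (hash, frame) pairs. The values used here are
# made up; deriving them from audio is outside this sketch.


def build_index(pieces):
    """Map each hash to the (piece_id, frame) positions where it occurs."""
    index = defaultdict(list)
    for piece_id, fingerprint in pieces.items():
        for h, frame in fingerprint:
            index[h].append((piece_id, frame))
    return index


def best_match(index, clip_fingerprint, min_votes=2):
    """Return (piece_id, offset, votes) for the best-aligned piece, or None.

    Each hash shared by the clip and a piece votes for the pair
    (piece_id, piece_frame - clip_frame). A true match piles many votes
    onto a single offset, while chance collisions scatter across offsets.
    """
    votes = Counter()
    for h, clip_frame in clip_fingerprint:
        for piece_id, piece_frame in index.get(h, ()):
            votes[(piece_id, piece_frame - clip_frame)] += 1
    if not votes:
        return None
    (piece_id, offset), count = votes.most_common(1)[0]
    if count < min_votes:
        return None
    return piece_id, offset, count


# Toy database with two pieces (hypothetical identifiers and hashes).
pieces = {
    "piece_a": [(101, 0), (202, 1), (303, 2), (404, 3)],
    "piece_b": [(101, 5), (909, 6), (808, 7)],
}
index = build_index(pieces)

# A clip that starts one frame into piece_a.
clip = [(202, 0), (303, 1), (404, 2)]
print(best_match(index, clip))
```

Requiring a minimum number of aligned votes is one simple way to reject clips that contain no music or only sound effects; whether the authors applied such a threshold is not stated in the abstract.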

The SonicVisionLM paper is relevant here, as it demonstrates how language models can be used to connect sound, vision, and text, which could be applied to the connection between video game music and visuals.

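The paper also introduces a baseline based on the Controllable Music Transformer (CMT) that generates NES music conditioned on gameplay clips. According to the abstract, objective metrics showed that the conditional CMT improves musical structural quality compared to its unconditional counterpart. A neural classifier that predicts the game genre of generated pieces indicated that the generator can learn correlations between gameplay videos and game genres, although further research is needed to reach human-level performance. The dataset could also support research directions such as music-video synchronization and understanding the interplay between video game music and visuals.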

Critical Analysis

The NES Video-Music Database is a valuable resource for researchers interested in the intersection of video game music and visuals. The dataset covers a wide range of classic NES games, providing a diverse set of musical styles and gameplay footage.

However, the dataset covers a single console and time period, which may restrict its broader applicability. Additionally, the researchers note that the dataset does not include information about the game's narrative or player interaction, which could be important factors in understanding the relationship between music and visuals.

The Positive Risky Message Assessment paper is relevant here, as it discusses the importance of considering context when analyzing the impact of media, which could be applied to the study of video game music and visuals.

Further research could explore expanding the dataset to include a wider range of game consoles and genres, as well as incorporating additional contextual information about the games. Additionally, the researchers could investigate whether the insights gained from this dataset can be applied to more modern video games with increasingly complex audio-visual integration.

Conclusion

The NES Video-Music Database provides a valuable resource for researchers interested in exploring the relationship between video game music and visuals. By combining symbolic music data and corresponding gameplay footage, the dataset enables the study of tasks like automatic music generation, music-video synchronization, and understanding the interplay between audio and visual elements in classic video games.

The Insights from Use of Previously Unseen Neural Architecture paper may be relevant here, as it discusses the potential for novel architectures to unlock new research directions, which could be applied to the study of video game music and visuals.

The dataset's focus on classic NES games provides a rich and diverse set of musical styles and gameplay, offering researchers a valuable starting point for exploring these topics. While the dataset has some limitations, the researchers have highlighted several promising research directions that could lead to advancements in our understanding of the role of music in video games and its broader implications for interactive media.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
