Machine Learning Techniques in Automatic Music Transcription: A Systematic Survey

Read original: arXiv:2406.15249 - Published 6/24/2024 by Fatemeh Jamshidi, Gary Pike, Amit Das, Richard Chapman

Overview

This paper presents a systematic survey of machine learning techniques used in automatic music transcription (AMT).
AMT is the process of converting an acoustic music signal into a symbolic music representation, such as sheet music or MIDI.
The survey covers a wide range of machine learning approaches, including frame-level transcription, note-level transcription, and end-to-end transcription.
The paper also discusses the evaluation of AMT systems, including challenges and potential solutions.

Plain English Explanation

Automatic music transcription (AMT) is the process of converting a recorded music performance into a written or symbolic form, such as sheet music or a MIDI file. This is a challenging task because the system has to work out which instruments are playing, which notes they play, and when each note starts and stops.

The researchers in this paper looked at how machine learning techniques can be used to automate the music transcription process. Machine learning is a type of artificial intelligence that allows computers to learn and improve from data, without being explicitly programmed.

The paper covers a wide range of machine learning approaches for AMT, including methods that work at the individual note level, the overall frame or time-step level, and even end-to-end systems that try to transcribe the entire piece of music at once. The researchers also discussed how these AMT systems can be evaluated and compared to each other.

Overall, this survey provides a comprehensive look at the current state of machine learning for automatic music transcription, which could have important applications in music education, music production, and music information retrieval.

Technical Explanation

The paper begins by introducing the task of automatic music transcription (AMT) and its importance in various music-related applications. AMT involves converting an acoustic music signal into a symbolic music representation, such as sheet music or MIDI.
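
To make the idea of a symbolic representation concrete, the sketch below converts between a frequency in hertz and a MIDI note number. It assumes standard equal-temperament tuning with A4 (MIDI note 69) at 440 Hz; the function names are illustrative and do not come from the paper.

```python
import math


def midi_to_hz(midi_note: float, a4_hz: float = 440.0) -> float:
    """Frequency of a MIDI note number, assuming equal temperament."""
    # Each semitone multiplies the frequency by 2 ** (1 / 12).
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)


def hz_to_midi(frequency_hz: float, a4_hz: float = 440.0) -> int:
    """Nearest MIDI note number for a frequency in hertz."""
    return round(69 + 12 * math.log2(frequency_hz / a4_hz))


if __name__ == "__main__":
    print(midi_to_hz(60))     # middle C, roughly 261.63 Hz
    print(hz_to_midi(880.0))  # A5
```

A real recording rarely contains clean, isolated frequencies like these, which is one reason the task calls for learned models rather than a direct lookup.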

The main focus of the paper is on reviewing the machine learning techniques that have been applied to tackle different aspects of the AMT problem. The authors categorize these approaches into three main types:

Frame-Level Transcription: These methods operate at the individual time-frame level, predicting the notes or chords present at each time step. The paper covers a range of neural network architectures, such as convolutional neural networks and recurrent neural networks, that have been used for frame-level AMT.
Note-Level Transcription: These approaches aim to directly identify the individual notes present in the music, including their onset times, pitches, and durations. The paper discusses how deep learning and generative models have been applied to this note-level transcription task.
End-to-End Transcription: Some recent work has explored training single, integrated models to perform the entire AMT task in an end-to-end manner, without relying on intermediate representations or separate sub-modules. The paper examines the advantages and challenges of these end-to-end approaches.
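
The difference between frame-level and note-level output can be sketched in a few lines. The example below assumes a frame-level model has already produced, for each time frame, the set of active MIDI pitches, and that frames are 10 ms apart. Both the data layout and the hop size are assumptions made for illustration, not details from the paper. It then merges runs of consecutive active frames into note events.

```python
from typing import Dict, List, Set, Tuple

Note = Tuple[int, float, float]  # (MIDI pitch, onset seconds, offset seconds)


def frames_to_notes(frames: List[Set[int]], hop_seconds: float = 0.01) -> List[Note]:
    """Merge runs of consecutive active frames into note events."""
    notes: List[Note] = []
    active: Dict[int, int] = {}  # pitch -> frame index where its current run began
    # A trailing empty frame closes any notes still sounding at the end.
    for index, frame in enumerate(list(frames) + [set()]):
        # Close the run of every pitch that stopped being active in this frame.
        for pitch in sorted(p for p in active if p not in frame):
            start = active.pop(pitch)
            notes.append((pitch, start * hop_seconds, index * hop_seconds))
        # Open a run for every pitch that became active in this frame.
        for pitch in frame:
            active.setdefault(pitch, index)
    return sorted(notes, key=lambda note: (note[1], note[0]))


if __name__ == "__main__":
    piano_roll = [{60}, {60, 64}, {64}, set(), {60}]
    for note in frames_to_notes(piano_roll, hop_seconds=0.5):
        print(note)
```

This simple merge cannot tell one long note from the same pitch struck twice without a gap, which hints at why note-level methods model onsets explicitly instead of relying on frame activity alone.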

In addition to surveying the technical approaches, the paper also discusses the evaluation of AMT systems, including commonly used metrics and the challenges in assessing transcription quality, especially for polyphonic music, in which several notes sound at the same time.
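
This summary does not name the specific metrics, but one widely used in the AMT literature is note-level precision, recall, and F1, where an estimated note counts as correct if its pitch matches a reference note and its onset falls within a small tolerance (50 ms is a common choice). The sketch below is a simplified version that assumes notes are given as (pitch, onset) pairs and uses greedy matching; established toolkits typically use an optimal one-to-one matching instead.

```python
from typing import List, Tuple

Note = Tuple[int, float]  # (MIDI pitch, onset seconds)


def note_onset_f1(
    reference: List[Note], estimated: List[Note], tolerance: float = 0.05
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for note onsets, using greedy matching."""
    unmatched = list(reference)  # each reference note may be matched only once
    true_positives = 0
    for pitch, onset in estimated:
        for candidate in unmatched:
            if candidate[0] == pitch and abs(candidate[1] - onset) <= tolerance:
                unmatched.remove(candidate)
                true_positives += 1
                break
    precision = true_positives / len(estimated) if estimated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    truth = [(60, 0.0), (64, 0.5), (67, 1.0)]
    guess = [(60, 0.02), (64, 0.7), (67, 1.01), (72, 1.5)]
    print(note_onset_f1(truth, guess))
```

A score like this treats every note as equally important and ignores durations, dynamics, and musical context, which illustrates why assessing transcription quality remains an open challenge.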

Critical Analysis

The paper provides a comprehensive and up-to-date review of the machine learning techniques used for automatic music transcription. By categorizing the approaches into frame-level, note-level, and end-to-end methods, the authors give the reader a clear understanding of the different strategies and their relative trade-offs.

One limitation mentioned in the paper is the difficulty in directly comparing the performance of different AMT systems due to the lack of standardized evaluation protocols and datasets. The authors suggest that the community would benefit from the development of more robust and musically-informed evaluation metrics.

Another potential issue is the reliance of many AMT systems on large, high-quality training datasets, which may not be readily available for certain musical genres or instrument combinations. The authors note that more research is needed to improve the data efficiency and generalization capabilities of these models.

Additionally, the paper does not delve deeply into the potential biases and ethical considerations of AMT systems, such as how accurately they transcribe music from diverse cultural backgrounds. As these technologies become more widely adopted, researchers will need to address these societal implications.

Conclusion

This systematic survey provides a comprehensive overview of the state of the art in machine learning techniques for automatic music transcription. By categorizing the various approaches and discussing their strengths, limitations, and evaluation challenges, the paper offers a valuable resource for researchers and practitioners working in this field.

The advances in AMT highlighted in this paper have the potential to greatly improve music education, music production workflows, and music information retrieval systems. As the technology continues to evolve, it will be crucial to address the remaining challenges and ensure that these tools are developed and deployed in an ethical and inclusive manner.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Machine Learning Techniques in Automatic Music Transcription: A Systematic Survey

Fatemeh Jamshidi, Gary Pike, Amit Das, Richard Chapman

In the domain of Music Information Retrieval (MIR), Automatic Music Transcription (AMT) emerges as a central challenge, aiming to convert audio signals into symbolic notations like musical notes or sheet music. This systematic review accentuates the pivotal role of AMT in music signal analysis, emphasizing its importance due to the intricate and overlapping spectral structure of musical harmonies. Through a thorough examination of existing machine learning techniques utilized in AMT, we explore the progress and constraints of current models and methodologies. Despite notable advancements, AMT systems have yet to match the accuracy of human experts, largely due to the complexities of musical harmonies and the need for nuanced interpretation. This review critically evaluates both fully automatic and semi-automatic AMT systems, emphasizing the importance of minimal user intervention and examining various methodologies proposed to date. By addressing the limitations of prior techniques and suggesting avenues for improvement, our objective is to steer future research towards fully automated AMT systems capable of accurately and efficiently translating intricate audio signals into precise symbolic representations. This study not only synthesizes the latest advancements but also lays out a road-map for overcoming existing challenges in AMT, providing valuable insights for researchers aiming to narrow the gap between current systems and human-level transcription accuracy.

6/24/2024

Annotation-free Automatic Music Transcription with Scalable Synthetic Data and Adversarial Domain Confusion

Gakusei Sato, Taketo Akama

Automatic Music Transcription (AMT) is a vital technology in the field of music information processing. Despite recent enhancements in performance due to machine learning techniques, current methods typically attain high accuracy in domains where abundant annotated data is available. Addressing domains with low or no resources continues to be an unresolved challenge. To tackle this issue, we propose a transcription model that does not require any MIDI-audio paired data through the utilization of scalable synthetic audio for pre-training and adversarial domain confusion using unannotated real audio. In experiments, we evaluate methods under the real-world application scenario where training datasets do not include the MIDI annotation of audio in the target data domain. Our proposed method achieved competitive performance relative to established baseline methods, despite not utilizing any real datasets of paired MIDI-audio. Additionally, ablation studies have provided insights into the scalability of this approach and the forthcoming challenges in the field of AMT research.

7/4/2024

Quantifying the Corpus Bias Problem in Automatic Music Transcription Systems

Lukáš Samuel Marták, Patricia Hu, Gerhard Widmer

Automatic Music Transcription (AMT) is the task of recognizing notes in audio recordings of music. The State-of-the-Art (SotA) benchmarks have been dominated by deep learning systems. Due to the scarcity of high quality data, they are usually trained and evaluated exclusively or predominantly on classical piano music. Unfortunately, that hinders our ability to understand how they generalize to other music. Previous works have revealed several aspects of memorization and overfitting in these systems. We identify two primary sources of distribution shift: the music, and the sound. Complementing recent results on the sound axis (i.e. acoustics, timbre), we investigate the musical one (i.e. note combinations, dynamics, genre). We evaluate the performance of several SotA AMT systems on two new experimental test sets which we carefully construct to emulate different levels of musical distribution shift. Our results reveal a stark performance gap, shedding further light on the Corpus Bias problem, and the extent to which it continues to trouble these systems.

8/12/2024

Development of Large Annotated Music Datasets using HMM-based Forced Viterbi Alignment

S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan

Datasets are essential for any machine learning task. Automatic Music Transcription (AMT) is one such task, where a considerable amount of data is required depending on the way the solution is achieved. Considering the fact that a music dataset, complete with audio and its time-aligned transcriptions, would require the effort of people with musical experience, it could be stated that the task becomes even more challenging. Musical experience is required in playing the musical instrument(s), and in annotating and verifying the transcriptions. We propose a method that would help in streamlining this process, making the task of obtaining a dataset from a particular instrument easy and efficient. We use predefined guitar exercises and hidden Markov model (HMM) based forced Viterbi alignment to accomplish this. The guitar exercises are designed to be simple. Since the note sequences are already defined, HMM-based forced Viterbi alignment provides time-aligned transcriptions of these audio files. The onsets of the transcriptions are manually verified and the labels are accurate up to 10ms, averaging at 5ms. The contributions of the proposed work are twofold: i) a well streamlined and efficient method for generating datasets for any instrument, especially monophonic ones, and ii) an acoustic plectrum guitar dataset containing wave files and transcriptions in the form of label files. This method will aid as a preliminary step towards building concrete datasets for building AMT systems for different instruments.

8/28/2024