Quantifying the Corpus Bias Problem in Automatic Music Transcription Systems

Read original: arXiv:2408.04737 - Published 8/12/2024 by Lukáš Samuel Marták, Patricia Hu, Gerhard Widmer

Overview

This paper examines the problem of "corpus bias" in automatic music transcription (AMT) systems.
Corpus bias refers to the tendency of AMT models to perform well on data that resembles their training data and noticeably worse on data that differs from it.
The researchers quantify the extent of this problem and propose ways to address it.

Plain English Explanation

The paper looks at a challenge facing automatic music transcription (AMT) systems - the "corpus bias" problem. This means that these systems tend to work well on music that is similar to the data they were trained on, but struggle with music that is different.

For example, an AMT model trained mostly on classical piano music may do a great job transcribing new classical piano pieces, but perform poorly on rock or jazz recordings. The researchers wanted to understand the extent of this problem and find ways to address it.

They ran experiments to quantify how much performance degrades when AMT models are applied to music outside their training distribution. The results showed that corpus bias is a significant issue, with model accuracy dropping sharply when tested on unfamiliar music genres or instrumentation.

To help fix this, the researchers proposed strategies like using more diverse training data and developing evaluation methods that better reflect real-world usage. These approaches could make AMT systems more robust and applicable across a wider range of musical styles.

Technical Explanation

The researchers designed experiments to quantify the corpus bias problem in automatic music transcription (AMT). They took several state-of-the-art AMT systems, which are typically trained exclusively or predominantly on classical piano music, and evaluated them on two new test sets constructed to emulate different levels of musical distribution shift.

The results showed that model accuracy dropped significantly when tested on music outside the training distribution. For example, a model trained on classical piano scored over 50% lower on note-level F1 when evaluated on jazz recordings.
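To make the metric concrete, here is a minimal sketch of how a note-level F1 score and a relative performance drop could be computed. It is an illustration, not the paper's evaluation code: the function names `note_f1` and `relative_drop`, the greedy matching strategy, and the 50 ms onset tolerance are assumptions chosen for this example (established toolkits typically use an optimal one-to-one matching instead of a greedy one).

```python
from typing import List, Tuple

# A note is (onset time in seconds, MIDI pitch). Offsets are ignored here.
Note = Tuple[float, int]


def note_f1(reference: List[Note], estimated: List[Note],
            onset_tol: float = 0.05) -> float:
    """Onset-only note-level F1 using greedy one-to-one matching (assumed scheme)."""
    unmatched = list(reference)
    true_pos = 0
    for onset, pitch in sorted(estimated):
        # Match each estimated note to the first unused reference note
        # with the same pitch and an onset within the tolerance.
        for i, (ref_onset, ref_pitch) in enumerate(unmatched):
            if ref_pitch == pitch and abs(ref_onset - onset) <= onset_tol:
                true_pos += 1
                del unmatched[i]
                break
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(estimated)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)


def relative_drop(in_dist_f1: float, out_dist_f1: float) -> float:
    """Fraction of the in-distribution score lost on out-of-distribution data."""
    return (in_dist_f1 - out_dist_f1) / in_dist_f1


# Hypothetical toy data: the third estimated note is 200 ms late, so it misses.
reference = [(0.0, 60), (0.5, 64), (1.0, 67)]
estimated = [(0.01, 60), (0.52, 64), (1.2, 67)]
score = note_f1(reference, estimated)
```

Note that "50% lower" can be read as a relative drop (as computed by `relative_drop`) or as an absolute difference in F1 points; the sketch uses the relative reading, which is an assumption.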

To understand the reasons for this, the researchers analyzed factors like timbre, rhythm, and polyphony across the datasets. They found that differences in these low-level musical characteristics contributed to the corpus bias effect.
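As an illustration of what comparing polyphony across datasets might look like, the sketch below computes a time-weighted mean polyphony from note intervals. This particular statistic and the function name `mean_polyphony` are assumptions made for this example; the summary does not say which measures the researchers used.

```python
from typing import List, Tuple

# A note interval is (onset, offset) in seconds.
Interval = Tuple[float, float]


def mean_polyphony(notes: List[Interval]) -> float:
    """Average number of simultaneously sounding notes, weighted by time
    and measured only over spans where at least one note is sounding."""
    events = []
    for onset, offset in notes:
        events.append((onset, 1))    # a note starts
        events.append((offset, -1))  # a note ends
    # At equal times, ends (-1) sort before starts (+1).
    events.sort()

    active = 0        # notes currently sounding
    weighted = 0.0    # sum of (active count * duration)
    sounding = 0.0    # total time with at least one active note
    prev_time = None
    for time, delta in events:
        if prev_time is not None and active > 0:
            weighted += active * (time - prev_time)
            sounding += time - prev_time
        active += delta
        prev_time = time
    return weighted / sounding if sounding > 0 else 0.0


# Hypothetical toy inputs: two overlapping notes, and a three-note chord.
overlap = mean_polyphony([(0.0, 2.0), (1.0, 3.0)])
chord = mean_polyphony([(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)])
```

Computing a statistic like this for the training corpus and for each test set would give one rough indication of how far apart the datasets are on the musical axis.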

Based on these insights, the researchers proposed several strategies to mitigate the corpus bias problem: training on more diverse data, developing specialized model architectures, and designing evaluation metrics that better capture real-world usage scenarios.

Critical Analysis

The paper provides a thoughtful analysis of an important and under-studied problem in automatic music transcription (AMT). Corpus bias is a significant limitation of current AMT systems, and the researchers do a good job of quantifying its impact.

However, the paper could be strengthened by further discussion of the limitations and caveats of the work. For example, the experiments only looked at a handful of AMT models and datasets - it's unclear how generalizable the findings are. There may also be other factors beyond timbre, rhythm, and polyphony that contribute to corpus bias.

Additionally, while the proposed mitigation strategies seem promising, the paper does not provide a detailed implementation plan or evaluation of their effectiveness. More research would be needed to assess the practical feasibility and impact of these approaches.

Overall, this is a valuable contribution that highlights a critical challenge facing AMT technology. But there is still much work to be done to fully understand and overcome the corpus bias problem.

Conclusion

This paper sheds important light on the "corpus bias" issue in automatic music transcription (AMT) systems. The researchers demonstrated that current AMT models struggle significantly when applied to music outside their training distribution, with performance degrading by over 50% in some cases.

By analyzing factors like timbre, rhythm, and polyphony, the paper provides insights into the underlying causes of this problem. The proposed mitigation strategies, such as using more diverse training data and specialized evaluation metrics, offer promising directions for making AMT systems more robust and widely applicable.

Overall, this work highlights a crucial challenge that must be addressed for AMT technology to reach its full potential. As machine learning continues to transform the field of music analysis, understanding and overcoming corpus bias will be essential for developing reliable, generalizable systems that can handle the rich diversity of human music.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Quantifying the Corpus Bias Problem in Automatic Music Transcription Systems

Lukáš Samuel Marták, Patricia Hu, Gerhard Widmer

Automatic Music Transcription (AMT) is the task of recognizing notes in audio recordings of music. The State-of-the-Art (SotA) benchmarks have been dominated by deep learning systems. Due to the scarcity of high quality data, they are usually trained and evaluated exclusively or predominantly on classical piano music. Unfortunately, that hinders our ability to understand how they generalize to other music. Previous works have revealed several aspects of memorization and overfitting in these systems. We identify two primary sources of distribution shift: the music, and the sound. Complementing recent results on the sound axis (i.e. acoustics, timbre), we investigate the musical one (i.e. note combinations, dynamics, genre). We evaluate the performance of several SotA AMT systems on two new experimental test sets which we carefully construct to emulate different levels of musical distribution shift. Our results reveal a stark performance gap, shedding further light on the Corpus Bias problem, and the extent to which it continues to trouble these systems.

8/12/2024

Machine Learning Techniques in Automatic Music Transcription: A Systematic Survey

Fatemeh Jamshidi, Gary Pike, Amit Das, Richard Chapman

In the domain of Music Information Retrieval (MIR), Automatic Music Transcription (AMT) emerges as a central challenge, aiming to convert audio signals into symbolic notations like musical notes or sheet music. This systematic review accentuates the pivotal role of AMT in music signal analysis, emphasizing its importance due to the intricate and overlapping spectral structure of musical harmonies. Through a thorough examination of existing machine learning techniques utilized in AMT, we explore the progress and constraints of current models and methodologies. Despite notable advancements, AMT systems have yet to match the accuracy of human experts, largely due to the complexities of musical harmonies and the need for nuanced interpretation. This review critically evaluates both fully automatic and semi-automatic AMT systems, emphasizing the importance of minimal user intervention and examining various methodologies proposed to date. By addressing the limitations of prior techniques and suggesting avenues for improvement, our objective is to steer future research towards fully automated AMT systems capable of accurately and efficiently translating intricate audio signals into precise symbolic representations. This study not only synthesizes the latest advancements but also lays out a road-map for overcoming existing challenges in AMT, providing valuable insights for researchers aiming to narrow the gap between current systems and human-level transcription accuracy.

6/24/2024

Annotation-free Automatic Music Transcription with Scalable Synthetic Data and Adversarial Domain Confusion

Gakusei Sato, Taketo Akama

Automatic Music Transcription (AMT) is a vital technology in the field of music information processing. Despite recent enhancements in performance due to machine learning techniques, current methods typically attain high accuracy in domains where abundant annotated data is available. Addressing domains with low or no resources continues to be an unresolved challenge. To tackle this issue, we propose a transcription model that does not require any MIDI-audio paired data through the utilization of scalable synthetic audio for pre-training and adversarial domain confusion using unannotated real audio. In experiments, we evaluate methods under the real-world application scenario where training datasets do not include the MIDI annotation of audio in the target data domain. Our proposed method achieved competitive performance relative to established baseline methods, despite not utilizing any real datasets of paired MIDI-audio. Additionally, ablation studies have provided insights into the scalability of this approach and the forthcoming challenges in the field of AMT research.

7/4/2024

Development of Large Annotated Music Datasets using HMM-based Forced Viterbi Alignment

S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan

Datasets are essential for any machine learning task. Automatic Music Transcription (AMT) is one such task, where a considerable amount of data is required depending on the way the solution is achieved. Considering the fact that a music dataset, complete with audio and its time-aligned transcriptions, would require the effort of people with musical experience, it could be stated that the task becomes even more challenging. Musical experience is required in playing the musical instrument(s), and in annotating and verifying the transcriptions. We propose a method that would help in streamlining this process, making the task of obtaining a dataset from a particular instrument easy and efficient. We use predefined guitar exercises and hidden Markov model (HMM) based forced Viterbi alignment to accomplish this. The guitar exercises are designed to be simple. Since the note sequences are already defined, HMM-based forced Viterbi alignment provides time-aligned transcriptions of these audio files. The onsets of the transcriptions are manually verified and the labels are accurate up to 10 ms, averaging at 5 ms. The contributions of the proposed work are twofold: i) a well-streamlined and efficient method for generating datasets for any instrument, especially monophonic ones, and ii) an acoustic plectrum guitar dataset containing wave files and transcriptions in the form of label files. This method will serve as a preliminary step towards building concrete datasets for building AMT systems for different instruments.

8/28/2024