Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription

Read original: arXiv:2402.07596 - Published 4/30/2024 by Antonio Ríos-Vila, Jorge Calvo-Zaragoza, Thierry Paquet

Overview

This paper presents a novel model called the "Sheet Music Transformer" for end-to-end optical music recognition (OMR) that can handle polyphonic music transcription beyond just monophonic (single-voice) transcription.
The model leverages transformer-based neural networks to perform a sequence-to-sequence task, taking in scanned sheet music images and outputting a symbolic music representation.
The authors evaluate their approach on two polyphonic music datasets, demonstrating its effectiveness in transcribing complex polyphonic scores.

Plain English Explanation

The paper describes a new artificial intelligence system called the "Sheet Music Transformer" that can automatically convert scanned images of sheet music into a digital format that computers can understand and work with. This is known as optical music recognition (OMR).

Previous OMR systems have typically been limited to handling simple, single-line melodies (monophonic music). In contrast, the Sheet Music Transformer is designed to work with more complex, multi-part music (polyphonic music) such as piano scores, which have multiple musical lines playing at the same time.

The key innovation of this work is the use of a transformer-based neural network, which is a type of deep learning model that has shown great success in various language processing tasks. The authors adapt this transformer architecture to the problem of OMR, allowing the system to take in an image of sheet music and output a symbolic music representation that computers can understand and work with.

By evaluating their model on polyphonic music datasets, the researchers demonstrate that the Sheet Music Transformer can effectively handle the complexities of polyphonic music transcription, going beyond the limitations of previous OMR systems. This advancement has the potential to enable a wide range of applications, such as automated music score digitization, music information retrieval, and interactive music analysis tools.

Technical Explanation

The authors propose a transformer-based end-to-end optical music recognition (OMR) system, called the Sheet Music Transformer, that can handle polyphonic music transcription beyond just monophonic (single-voice) transcription.

The key components of the Sheet Music Transformer are:

Image Encoder: A convolutional neural network (CNN) that encodes the input sheet music image into a sequence of feature representations.
Transformer Decoder: A transformer-based decoder that takes the encoded image features and generates a sequence of symbolic music tokens, representing the transcribed music.
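The hand-off between the two components can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the encoder produces a 2D feature map that is unrolled row by row into a sequence, and that each element keeps its (row, column) index so the decoder can still tell where on the page a feature came from. The names `flatten_feature_map` and `positions_2d` are hypothetical.

```python
from typing import List, Tuple

# Assumed layout of the encoder output: [height][width][channels].
FeatureMap = List[List[List[float]]]


def flatten_feature_map(fmap: FeatureMap) -> List[List[float]]:
    """Unroll an H x W x C feature map into H*W feature vectors (row-major)."""
    return [cell for row in fmap for cell in row]


def positions_2d(height: int, width: int) -> List[Tuple[int, int]]:
    """Return the (row, col) index of each element of the flattened sequence."""
    return [(r, c) for r in range(height) for c in range(width)]


# Toy 2 x 3 feature map with 2 channels; each cell stores its own coordinates.
fmap = [[[float(r), float(c)] for c in range(3)] for r in range(2)]
seq = flatten_feature_map(fmap)
pos = positions_2d(2, 3)
```

In this sketch the decoder would attend over `seq`, with `pos` used to build whatever positional encoding the model applies.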

The model is trained end-to-end on polyphonic sheet music images and their corresponding symbolic music representations. During inference, the Sheet Music Transformer takes a scanned sheet music image as input and outputs the transcribed polyphonic music in a symbolic format.
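Inference with an image-to-sequence model of this kind is usually autoregressive: the decoder emits one token at a time, conditioned on the image features and on the tokens produced so far. The sketch below assumes greedy decoding, which the summary does not specify, and replaces the trained decoder with a toy scoring function. The token names, `greedy_decode`, and `toy_score_fn` are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

BOS, EOS = "<bos>", "<eos>"

# A score function maps (image features, tokens so far) to a score per token.
ScoreFn = Callable[[Sequence[Sequence[float]], List[str]], Dict[str, float]]


def greedy_decode(features: Sequence[Sequence[float]],
                  score_fn: ScoreFn,
                  max_len: int = 512) -> List[str]:
    """Pick the highest-scoring token at each step until EOS or max_len."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = score_fn(features, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start marker


def toy_score_fn(features, prefix):
    """Stand-in for a trained decoder: replays a fixed token script."""
    script = ["clef-G2", "note-C4_quarter", "note-E4_quarter", EOS]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return {tok: (1.0 if tok == target else 0.0) for tok in set(script)}


decoded = greedy_decode([[0.0]], toy_score_fn)
```

A real system would swap `toy_score_fn` for the decoder's output distribution; beam search is a common alternative to the greedy choice made here.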

The authors evaluate their approach on polyphonic datasets, including GrandStaff, which contains pianoform scores. They compare the performance of the Sheet Music Transformer to previous OMR systems and demonstrate its effectiveness in handling complex polyphonic music transcription.
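The summary does not name the evaluation metric. As an assumption, the sketch below shows an edit-distance-based symbol error rate, a measure commonly used to compare a predicted token sequence against a reference transcription. The function names are hypothetical, and the paper's exact metric definitions may differ.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def symbol_error_rate(refs: List[Sequence[str]],
                      hyps: List[Sequence[str]]) -> float:
    """Total edits divided by total reference length, as a percentage."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_len = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_len if total_len else 0.0
```

Lower values are better: a rate of 0 means every predicted sequence matches its reference exactly.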

Critical Analysis

The authors acknowledge several limitations and areas for future research in their work:

The model's performance may be limited by the quality and diversity of the training data, as the GrandStaff dataset focuses on a specific type of score (pianoform music).
The current model architecture does not explicitly model the hierarchical structure of music, such as measures, voices, and chords. Incorporating such structural information could potentially improve the model's accuracy and interpretability.
The authors suggest exploring efficient real-time transcription approaches that could enable interactive applications, such as live music capture and digitization.

While the Sheet Music Transformer represents a significant advancement in polyphonic OMR, further research is needed to address these limitations and expand the model's capabilities to handle a wider range of music styles and applications.

Conclusion

The Sheet Music Transformer introduced in this paper represents a notable step forward in optical music recognition (OMR) by addressing the challenge of polyphonic music transcription. By leveraging transformer-based neural networks, the model can effectively convert scanned sheet music images into symbolic music representations, going beyond the limitations of previous OMR systems that were restricted to monophonic (single-voice) transcription.

The successful evaluation of the Sheet Music Transformer on polyphonic music datasets demonstrates its potential to enable a wide range of applications, such as automated music score digitization, music information retrieval, and interactive music analysis tools. As the authors suggest, future research directions include exploring ways to incorporate hierarchical music structure and developing efficient real-time transcription approaches to enable new interactive use cases.

Overall, this work contributes to the ongoing progress in AI-powered music understanding and the broader goal of bridging the gap between the physical and digital domains of music.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription

Antonio Ríos-Vila, Jorge Calvo-Zaragoza, Thierry Paquet

State-of-the-art end-to-end Optical Music Recognition (OMR) has, to date, primarily been carried out using monophonic transcription techniques to handle complex score layouts, such as polyphony, often by resorting to simplifications or specific adaptations. Despite their efficacy, these approaches imply challenges related to scalability and limitations. This paper presents the Sheet Music Transformer, the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies. Our model employs a Transformer-based image-to-sequence framework that predicts score transcriptions in a standard digital music encoding format from input images. Our model has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively. The experimental outcomes not only indicate the competence of the model, but also show that it is better than the state-of-the-art methods, thus contributing to advancements in end-to-end OMR transcription.

4/30/2024

Sheet Music Transformer ++: End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music

Antonio Ríos-Vila, Jorge Calvo-Zaragoza, David Rizo, Thierry Paquet

Optical Music Recognition is a field that has progressed significantly, bringing accurate systems that transcribe effectively music scores into digital formats. Despite this, there are still several limitations that hinder OMR from achieving its full potential. Specifically, state of the art OMR still depends on multi-stage pipelines for performing full-page transcription, as well as it has only been demonstrated in monophonic cases, leaving behind very relevant engravings. In this work, we present the Sheet Music Transformer++, an end-to-end model that is able to transcribe full-page polyphonic music scores without the need of a previous Layout Analysis step. This is done thanks to an extensive curriculum learning-based pretraining with synthetic data generation. We conduct several experiments on a full-page extension of a public polyphonic transcription dataset. The experimental outcomes confirm that the model is competent at transcribing full-page pianoform scores, marking a noteworthy milestone in end-to-end OMR transcription.

5/22/2024

Toward a More Complete OMR Solution

Guang Yang (Paul G. Allen School of Computer Science & Engineering, University of Washington, United States), Muru Zhang (Paul G. Allen School of Computer Science & Engineering, University of Washington, United States), Lin Qiu (Paul G. Allen School of Computer Science & Engineering, University of Washington, United States), Yanming Wan (Paul G. Allen School of Computer Science & Engineering, University of Washington, United States), Noah A. Smith (Paul G. Allen School of Computer Science & Engineering, University of Washington, United States, Allen Institute for Artificial Intelligence, United States)

Optical music recognition (OMR) aims to convert music notation into digital formats. One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image (object detection) and then assembles them into a music notation (notation assembly). Most previous work on notation assembly unrealistically assumes perfect object detection. In this study, we focus on the MUSCIMA++ v2.0 dataset, which represents musical notation as a graph with pairwise relationships among detected music objects, and we consider both stages together. First, we introduce a music object detector based on YOLOv8, which improves detection performance. Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output. We find that this model is able to outperform existing models trained on perfect detection output, showing the benefit of considering the detection and assembly stages in a more holistic way. These findings, together with our novel evaluation metric, are important steps toward a more complete OMR solution.

9/4/2024

YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Sungkyun Chang, Emmanouil Benetos, Holger Kirchhoff, Simon Dixon

Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties. This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription based on the recent language token decoding approach of MT3. We enhance its encoder by adopting a hierarchical attention transformer in the time-frequency domain and integrating a mixture of experts. To address data limitations, we introduce a new multi-channel decoding method for training with incomplete annotations and propose intra- and cross-stem augmentation for dataset mixing. Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors. Benchmarks across ten public datasets show our models' competitiveness with, or superiority to, existing transcription models. Further testing on pop music recordings highlights the limitations of current models. Fully reproducible code and datasets are available with demos at https://github.com/mimbres/YourMT3.

8/2/2024