Sheet Music Transformer ++: End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music

Read original: arXiv:2405.12105 - Published 5/22/2024 by Antonio Ríos-Vila, Jorge Calvo-Zaragoza, David Rizo, Thierry Paquet

Overview

This paper presents a novel deep learning model called "Sheet Music Transformer++" for end-to-end full-page optical music recognition (OMR) on pianoform sheet music.
The model relies on extensive curriculum learning-based pre-training with synthetic data generation, which lets it transcribe a full page without a separate layout analysis step.
The authors evaluate the model on a full-page extension of a public polyphonic transcription dataset.

Plain English Explanation

The paper describes a new AI model called "Sheet Music Transformer++" that can automatically transcribe entire pages of polyphonic piano sheet music into a digital format. This is a challenging task called optical music recognition (OMR), where the model needs to accurately identify all the musical notes, symbols, and other elements on the page.

The key innovation in this work is that the model reads a whole page in one pass, without first running a separate layout analysis step to locate the individual parts of the page. To make this possible, the researchers pre-train the model with curriculum learning on synthetically generated data, so that it meets easier examples before harder ones. This pre-training allows the model to recognize patterns and extract meaning from full-page sheet music images more effectively.

The researchers evaluate the model on a full-page extension of a public polyphonic transcription dataset. According to the abstract, earlier state-of-the-art OMR depended on multi-stage pipelines for full-page transcription and had only been demonstrated on monophonic music, so this setup gives a more comprehensive assessment of how well a model handles the complexity of a complete musical score.

Overall, this research represents an important step forward in automating the process of digitizing physical sheet music, which has many applications in music education, archiving, and computational creativity. By combining an end-to-end Transformer model with curriculum-based pre-training on synthetic data, the Sheet Music Transformer++ transcribes full pages without the multi-stage pipelines that prior OMR approaches depend on.

Technical Explanation

The core of the Sheet Music Transformer++ model is a Transformer-based image-to-sequence architecture that takes a full-page sheet music image as input and outputs a symbolic music representation. To enable this end-to-end OMR capability without a prior layout analysis step, the authors rely on extensive curriculum learning-based pre-training with synthetic data generation.
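The paper's abstract attributes the model's full-page capability to curriculum learning-based pre-training on synthetic data. One way to picture such a curriculum is a schedule that slowly raises how complex a synthetic page is allowed to be. The sketch below is only an illustration under stated assumptions: the function names, the use of "number of staff systems per page" as the difficulty measure, the linear schedule, and the range of 1 to 6 systems are all hypothetical and are not taken from the paper.

```python
import random


def max_systems_for_step(step, total_steps, start=1, end=6):
    """Linearly raise the difficulty cap from `start` to `end` systems.

    Hypothetical schedule: the paper's actual curriculum is not described
    in this summary.
    """
    if total_steps <= 1:
        return end
    # Fraction of training completed, clamped to [0, 1].
    fraction = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + round(fraction * (end - start))


def sample_page_size(step, total_steps, rng):
    """Pick how many staff systems the next synthetic page should contain."""
    cap = max_systems_for_step(step, total_steps)
    # Easier pages stay in the mix, so earlier skills keep being rehearsed.
    return rng.randint(1, cap)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 250, 500, 750, 999):
        cap = max_systems_for_step(step, 1000)
        print(step, cap, sample_page_size(step, 1000, rng))
```

Early in training the sampler can only produce single-system pages, and by the end it can produce pages at the full assumed size. A real pipeline would render each sampled page to an image and pair it with its ground-truth transcription.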

The model first encodes the input sheet music image using a convolutional neural network backbone. The resulting feature maps are then split into patches and fed into a transformer encoder, which learns contextual representations of the music symbols. A transformer decoder then generates the final symbolic music output, a sequence of tokens in a standard digital music encoding format, in an auto-regressive fashion.
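The auto-regressive generation step described above can be sketched as a greedy decoding loop: the decoder is asked for one token at a time, each time conditioned on everything it has produced so far, until it emits an end-of-sequence marker. This is a minimal sketch, not the authors' implementation. The `next_token` callable stands in for the trained decoder (which would also see the encoded image), and the token names and the `<bos>`/`<eos>` markers are hypothetical.

```python
from typing import Callable, List, Sequence

BOS, EOS = "<bos>", "<eos>"  # hypothetical start/end markers


def greedy_decode(
    next_token: Callable[[Sequence[str]], str],
    max_len: int = 512,
) -> List[str]:
    """Generate tokens one by one until EOS or `max_len` tokens."""
    tokens = [BOS]
    for _ in range(max_len):
        # The stand-in model sees the whole prefix generated so far.
        token = next_token(tokens)
        if token == EOS:
            break
        tokens.append(token)
    return tokens[1:]  # drop the start marker


def toy_next_token(prefix: Sequence[str]) -> str:
    """Stand-in for a trained decoder: replays a fixed, made-up sequence."""
    script = ["clef-G2", "note-C4", "note-E4", EOS]
    return script[len(prefix) - 1]


if __name__ == "__main__":
    print(greedy_decode(toy_next_token))
```

Greedy decoding is the simplest choice. Beam search or sampling would slot into the same loop by changing how the next token is picked.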

The authors evaluate their approach on a full-page extension of a public polyphonic transcription dataset, which pairs full-page pianoform sheet music images with ground-truth transcriptions. According to the abstract, the experimental outcomes confirm that the model is competent at transcribing full-page pianoform scores. Related transcription work such as Towards Efficient Real-Time Piano Transcription Using Deep Learning and Scoring Intervals Using Non-Hierarchical Transformer for Automatic Music Transcription tackles a different problem, transcribing audio recordings rather than score images.
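The summary does not say how transcription accuracy is measured. A common choice for sequence outputs like these is an edit-distance-based error rate: count the insertions, deletions, and substitutions needed to turn the predicted token sequence into the reference, then divide by the reference length. The sketch below assumes that metric for illustration only. The function names and the example tokens are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference symbol
                curr[j - 1] + 1,     # insert a predicted symbol
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def symbol_error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalised by the reference length (lower is better)."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = ["clef-G2", "note-C4", "note-E4", "barline"]
    predicted = ["clef-G2", "note-D4", "note-E4"]
    print(symbol_error_rate(reference, predicted))
```

In the usage example, one symbol is substituted and one is missing, so two edits over four reference symbols give an error rate of 0.5.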

Critical Analysis

The authors acknowledge several limitations of their approach. First, the Sheet Music Transformer++ model is still not fast enough for real-time OMR applications, which would require further optimizations. Second, the model's performance may degrade on sheet music with complex layouts, unusual notation conventions, or poor image quality.

Additionally, the evaluation only covers pianoform sheet music. Expanding the model to handle a broader range of musical scores, such as orchestral or choral pieces, would be an important next step.

It would also be interesting to see how the Sheet Music Transformer++ model could be further improved by incorporating techniques from the Large Language Models from Notes to Musical research, which explores using large language models for symbolic music generation and understanding.

Overall, the Sheet Music Transformer++ represents a promising advance in the field of optical music recognition, demonstrating the power of deep learning and pre-training techniques to tackle this complex task. However, there is still room for improvement, and further research could explore ways to make the model more robust, efficient, and versatile.

Conclusion

The Sheet Music Transformer++ model presented in this paper is a significant step forward in the field of optical music recognition (OMR). By relying on curriculum learning-based pre-training with synthetic data, the model is able to transcribe full-page polyphonic pianoform sheet music end to end, without a prior layout analysis step.

The experiments on a full-page extension of a public polyphonic dataset also offer a reference point for evaluating and comparing full-page OMR systems. While the model has some limitations in terms of speed and handling complex layouts, the authors have demonstrated the power of deep learning and pre-training techniques to advance the state of the art in this domain.

Overall, this research has important implications for the digitization and computational analysis of sheet music, with applications in music education, archiving, and computational creativity. As the field of OMR continues to evolve, the insights and approaches presented in this paper will likely inform and inspire future work in this exciting area of artificial intelligence and music technology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Sheet Music Transformer ++: End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music

Antonio Ríos-Vila, Jorge Calvo-Zaragoza, David Rizo, Thierry Paquet

Optical Music Recognition is a field that has progressed significantly, bringing accurate systems that transcribe effectively music scores into digital formats. Despite this, there are still several limitations that hinder OMR from achieving its full potential. Specifically, state of the art OMR still depends on multi-stage pipelines for performing full-page transcription, as well as it has only been demonstrated in monophonic cases, leaving behind very relevant engravings. In this work, we present the Sheet Music Transformer++, an end-to-end model that is able to transcribe full-page polyphonic music scores without the need of a previous Layout Analysis step. This is done thanks to an extensive curriculum learning-based pretraining with synthetic data generation. We conduct several experiments on a full-page extension of a public polyphonic transcription dataset. The experimental outcomes confirm that the model is competent at transcribing full-page pianoform scores, marking a noteworthy milestone in end-to-end OMR transcription.

5/22/2024

Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription

Antonio Ríos-Vila, Jorge Calvo-Zaragoza, Thierry Paquet

State-of-the-art end-to-end Optical Music Recognition (OMR) has, to date, primarily been carried out using monophonic transcription techniques to handle complex score layouts, such as polyphony, often by resorting to simplifications or specific adaptations. Despite their efficacy, these approaches imply challenges related to scalability and limitations. This paper presents the Sheet Music Transformer, the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies. Our model employs a Transformer-based image-to-sequence framework that predicts score transcriptions in a standard digital music encoding format from input images. Our model has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively. The experimental outcomes not only indicate the competence of the model, but also show that it is better than the state-of-the-art methods, thus contributing to advancements in end-to-end OMR transcription.

4/30/2024

Toward a More Complete OMR Solution

Guang Yang (Paul G. Allen School of Computer Science & Engineering, University of Washington, United States), Muru Zhang (Paul G. Allen School of Computer Science & Engineering, University of Washington, United States), Lin Qiu (Paul G. Allen School of Computer Science & Engineering, University of Washington, United States), Yanming Wan (Paul G. Allen School of Computer Science & Engineering, University of Washington, United States), Noah A. Smith (Paul G. Allen School of Computer Science & Engineering, University of Washington, United States, Allen Institute for Artificial Intelligence, United States)

Optical music recognition (OMR) aims to convert music notation into digital formats. One approach to tackle OMR is through a multi-stage pipeline, where the system first detects visual music notation elements in the image (object detection) and then assembles them into a music notation (notation assembly). Most previous work on notation assembly unrealistically assumes perfect object detection. In this study, we focus on the MUSCIMA++ v2.0 dataset, which represents musical notation as a graph with pairwise relationships among detected music objects, and we consider both stages together. First, we introduce a music object detector based on YOLOv8, which improves detection performance. Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output. We find that this model is able to outperform existing models trained on perfect detection output, showing the benefit of considering the detection and assembly stages in a more holistic way. These findings, together with our novel evaluation metric, are important steps toward a more complete OMR solution.

9/4/2024

YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Sungkyun Chang, Emmanouil Benetos, Holger Kirchhoff, Simon Dixon

Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties. This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription based on the recent language token decoding approach of MT3. We enhance its encoder by adopting a hierarchical attention transformer in the time-frequency domain and integrating a mixture of experts. To address data limitations, we introduce a new multi-channel decoding method for training with incomplete annotations and propose intra- and cross-stem augmentation for dataset mixing. Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors. Benchmarks across ten public datasets show our models' competitiveness with, or superiority to, existing transcription models. Further testing on pop music recordings highlights the limitations of current models. Fully reproducible code and datasets are available with demos at https://github.com/mimbres/YourMT3.

8/2/2024