End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding

Read original: arXiv:2405.13527 - Published 5/24/2024 by Wei Zeng, Xian He, Ye Wang


Overview

This paper presents a novel approach for end-to-end polyphonic piano audio-to-score transcription using a hierarchical decoding mechanism.
The proposed system can directly convert real-world piano audio recordings into symbolic music notation, overcoming the complexity of musical structures.
The authors demonstrate the effectiveness of their method on various datasets, showcasing its potential for practical applications in music education, music information retrieval, and music production.

Plain English Explanation

The paper describes a new way to automatically convert piano music recordings into musical scores. This is a challenging task because piano music can be very complex, with multiple notes played simultaneously (polyphonic) and rapid changes in pitch and timing.

The key innovation is a "hierarchical decoding" approach, where the system first identifies the overall musical structure and then fills in the details of each note. This allows it to handle the intricate nature of real-world piano performances more effectively than previous methods.

The system was tested on several datasets of piano recordings, and the results show it can accurately transcribe the music into a standard musical notation format. This could be very useful for applications like music education, music information retrieval, and music production.

Technical Explanation

The proposed system uses a hierarchical decoding approach to transcribe polyphonic piano audio into musical scores. It first predicts bar-level information, such as key and time signatures, and then uses that context to predict the individual notes and their timing.
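To make the two levels concrete, the following minimal Python sketch separates bar-level information from note-level tokens in a tiny score written in the Humdrum `**kern` format, which the paper's abstract names as its score representation. The excerpt, the function name `split_levels`, and the dictionary layout are illustrative assumptions; this is not the authors' pre-processing method.

```python
# A two-spine **kern excerpt (left hand, right hand), invented for illustration.
KERN_EXCERPT = "\n".join([
    "**kern\t**kern",
    "*clefF4\t*clefG2",
    "*k[f#]\t*k[f#]",      # key signature: one sharp
    "*M3/4\t*M3/4",        # time signature: 3/4
    "=1\t=1",              # barline opening bar 1
    "2.G\t4b 4dd",         # space-separated notes in one token form a chord
    ".\t4a 4cc",           # "." is a null token (nothing new in that spine)
    ".\t4g 4b",
    "=2\t=2",
    "2.D\t2.f# 2.a",
    "*-\t*-",              # spine terminators
])


def split_levels(kern_text):
    """Group a **kern score into bars with bar-level and note-level fields."""
    bars = []
    key_sig = None
    time_sig = None
    current = None
    for line in kern_text.splitlines():
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        tokens = line.split("\t")
        first = tokens[0]
        if first.startswith("*"):
            # Interpretation record: only key and time signatures are tracked.
            if first.startswith("*k["):
                key_sig = first[3:-1]
            elif first.startswith("*M") and not first.startswith("*MM"):
                time_sig = first[2:]
            # A signature written right after a barline applies to that bar.
            if current is not None and not current["notes"]:
                current["key"], current["time"] = key_sig, time_sig
            continue
        if first.startswith("="):
            current = {"label": first.lstrip("="), "key": key_sig,
                       "time": time_sig, "notes": []}
            bars.append(current)
            continue
        if current is None:
            # Data before the first barline is treated as a pickup bar.
            current = {"label": "pickup", "key": key_sig,
                       "time": time_sig, "notes": []}
            bars.append(current)
        for token in tokens:
            if token != ".":
                current["notes"].extend(token.split(" "))
    return bars


bars = split_levels(KERN_EXCERPT)
for bar in bars:
    print(bar["label"], bar["key"], bar["time"], bar["notes"])
```

Each dictionary holds what a bar-level prediction would have to supply (`key`, `time`) next to what a note-level prediction would have to supply (`notes`). The sketch reads only the first spine for interpretations and barlines, which is enough for this excerpt but not for scores whose spines split or merge.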

The architecture consists of a feature extraction module to encode the audio input, followed by a hierarchical decoder. The decoder has two main components: a structural decoder that predicts the musical structure, and a note decoder that generates the actual note transcription based on the structural information.
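The control flow of such a two-part decoder can be sketched in plain Python as an outer loop over bars and an inner loop over notes. Everything here is a hypothetical stand-in: the `Bar` class, the `<eob>` token, and the rule-based `toy_bar_step` and `toy_note_step` functions replace what would be learned neural components in the paper's model, and the "features" are hand-written lists rather than encoded audio.

```python
from dataclasses import dataclass, field
from typing import List

EOB = "<eob>"  # hypothetical end-of-bar token


@dataclass
class Bar:
    key: str
    time_signature: str
    notes: List[str] = field(default_factory=list)


def hierarchical_decode(bar_features, bar_step, note_step, max_notes_per_bar=64):
    """Decode bar by bar: predict bar attributes first, then the bar's notes."""
    score = []
    prev_bar = None
    for feats in bar_features:
        # Bar level: predict structural attributes for this bar.
        key, time_signature = bar_step(feats, prev_bar)
        bar = Bar(key, time_signature)
        # Note level: emit tokens conditioned on the bar decoded so far.
        while len(bar.notes) < max_notes_per_bar:
            token = note_step(feats, bar)
            if token == EOB:
                break
            bar.notes.append(token)
        score.append(bar)
        prev_bar = bar
    return score


def toy_bar_step(feats, prev_bar):
    """Stand-in for a structural decoder: keep the key, count quarter-note beats."""
    key = prev_bar.key if prev_bar else "C"
    beats = sum(duration for _, duration in feats)
    return key, f"{beats}/4"


def toy_note_step(feats, bar):
    """Stand-in for a note decoder: emit the next event, then end the bar."""
    index = len(bar.notes)
    if index == len(feats):
        return EOB
    pitch, duration = feats[index]
    return f"{pitch}:{duration}"


# Toy "encoder output": one list of (pitch, duration in beats) per bar.
features = [
    [("C4", 1), ("E4", 1), ("G4", 2)],
    [("G4", 3)],
]
score = hierarchical_decode(features, toy_bar_step, toy_note_step)
for bar in score:
    print(bar.key, bar.time_signature, bar.notes)
```

The point of the design is that the note-level step always receives the current `Bar`, so a learned version could condition note predictions on the key and time signature instead of having to infer them from the notes alone.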

The authors evaluated their method on synthetic audio and on recordings of human performance. According to the abstract, the results support the approach both in transcription performance on synthetic audio, compared with the current state of the art, and in what the authors describe as the first experiment on human recordings.

Critical Analysis

The paper presents a compelling approach to the challenging problem of polyphonic piano transcription. The hierarchical decoding mechanism is a novel and well-justified solution to the complexity of musical structures.

However, the authors acknowledge that their system still struggles with certain aspects, such as accurately transcribing ornaments and rapid passages. There is also room for improvement in terms of real-time performance, which is crucial for practical applications.

Additionally, the evaluation is limited to piano music, and it would be valuable to see how the approach generalizes to other instruments or ensembles. Further research could also explore ways to incorporate additional musical knowledge or contextual information to enhance the transcription quality.

Conclusion

This paper introduces a state-of-the-art approach for end-to-end polyphonic piano audio-to-score transcription using a hierarchical decoding mechanism. The key innovation is the ability to model the overall musical structure and use that to guide the generation of the note-level transcription, which allows the system to better handle the complexity of real-world piano performances.

The results demonstrate the effectiveness of this method and its potential for a range of applications in music technology. While there is still room for improvement, this work represents a significant step forward in the field of automatic music transcription and could inspire further advancements in this important area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding

Wei Zeng, Xian He, Ye Wang

Piano audio-to-score transcription (A2S) is an important yet underexplored task with extensive applications for music composition, practice, and analysis. However, existing end-to-end piano A2S systems faced difficulties in retrieving bar-level information such as key and time signatures, and have been trained and evaluated with only synthetic data. To address these limitations, we propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores, enabling the transcription of score information at both the bar and note levels by multi-task learning. To bridge the gap between synthetic data and recordings of human performance, we propose a two-stage training scheme, which involves pre-training the model using an expressive performance rendering (EPR) system on synthetic audio, followed by fine-tuning the model using recordings of human performance. To preserve the voicing structure for score reconstruction, we propose a pre-processing method for **Kern scores in scenarios with an unconstrained number of voices. Experimental results support the effectiveness of our proposed approaches, in terms of both transcription performance on synthetic audio data in comparison to the current state-of-the-art, and the first experiment on human recordings.

5/24/2024

Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models

Taegyun Kwon, Dasaem Jeong, Juhan Nam

In recent years, advancements in neural network designs and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work focused on high-performance offline transcription, neglecting deliberate consideration of model size. The goal of this work is to implement real-time inference for piano transcription while ensuring both high performance and lightweight. To this end, we propose novel architectures for convolutional recurrent neural networks, redesigning an existing autoregressive piano transcription model. First, we extend the acoustic module by adding a frequency-conditioned FiLM layer to the CNN module to adapt the convolutional filters on the frequency axis. Second, we improve note-state sequence modeling by using a pitchwise LSTM that focuses on note-state transitions within a note. In addition, we augment the autoregressive connection with an enhanced recursive context. Using these components, we propose two types of models; one for high performance and the other for high compactness. Through extensive experiments, we show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset. We also investigate the effective model size and real-time inference latency by gradually streamlining the architecture. Finally, we conduct cross-data evaluation on unseen piano datasets and in-depth analysis to elucidate the effect of the proposed components in the view of note length and pitch range.

4/11/2024

YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation

Sungkyun Chang, Emmanouil Benetos, Holger Kirchhoff, Simon Dixon

Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties. This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription based on the recent language token decoding approach of MT3. We enhance its encoder by adopting a hierarchical attention transformer in the time-frequency domain and integrating a mixture of experts. To address data limitations, we introduce a new multi-channel decoding method for training with incomplete annotations and propose intra- and cross-stem augmentation for dataset mixing. Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors. Benchmarks across ten public datasets show our models' competitiveness with, or superiority to, existing transcription models. Further testing on pop music recordings highlights the limitations of current models. Fully reproducible code and datasets are available with demos at https://github.com/mimbres/YourMT3.

8/2/2024

Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription

Antonio Ríos-Vila, Jorge Calvo-Zaragoza, Thierry Paquet

State-of-the-art end-to-end Optical Music Recognition (OMR) has, to date, primarily been carried out using monophonic transcription techniques to handle complex score layouts, such as polyphony, often by resorting to simplifications or specific adaptations. Despite their efficacy, these approaches imply challenges related to scalability and limitations. This paper presents the Sheet Music Transformer, the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies. Our model employs a Transformer-based image-to-sequence framework that predicts score transcriptions in a standard digital music encoding format from input images. Our model has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively. The experimental outcomes not only indicate the competence of the model, but also show that it is better than the state-of-the-art methods, thus contributing to advancements in end-to-end OMR transcription.

4/30/2024