From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano

Read original: arXiv:2407.04518 - Published 7/22/2024 by Huan Zhang, Jinhua Liang, Simon Dixon

Overview

The paper presents a benchmark for evaluating how well pre-trained audio encoders capture performance-level qualities of solo Western classical piano music.
It compares four encoders (Jukebox, Audio-MAE, MERT, and DAC) on three tasks: expertise ranking, difficulty estimation, and piano technique detection.
The goal is to advance performance-level music understanding, which has received less attention than composition-level attributes such as key or genre.

Plain English Explanation

The researchers wanted to see how well different AI models could understand and evaluate performances of solo piano music. They took several existing pre-trained audio models and tested them on tasks that depend on how a piece is played rather than on which piece is played.

For example, they had the models rank recordings by the player's level of expertise, estimate how difficult a piece is, and detect which piano techniques appear in it. The idea was to see which models best capture the qualities a human listener attends to when judging a performance.

By comparing the performance of these different models, the researchers hoped to identify the strengths and weaknesses of current approaches to machine understanding of music. This could help guide future research in areas like automatically transcribing piano music from audio, or building AI systems that can provide meaningful feedback on musical performances.

The paper provides a standardized set of benchmarks and datasets that other researchers can use to test and improve their own models for music understanding. This kind of benchmarking is an important step towards developing AI systems that can engage with and appreciate music in ways that are more similar to how humans do.

Technical Explanation

The paper benchmarks how well pre-trained audio encoders can understand solo piano performances. It compares four encoders (Jukebox, Audio-MAE, MERT, and DAC) on three performance-level tasks:

Expertise ranking
Difficulty estimation
Piano technique detection

The authors curated a dataset of piano performance recordings, the Pianism-Labelling Dataset (PLD), with labels for each task. They used it to benchmark the encoders and to explore whether domain-specific fine-tuning helps them capture performance nuances.
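
A common way to benchmark a frozen audio encoder is to pool its frame-level embeddings into one vector per recording and fit a lightweight classifier on top. The sketch below illustrates that idea only; it is not the authors' pipeline. The toy embeddings, the label names "scales" and "arpeggios", and the nearest-centroid classifier are all assumptions made for illustration, standing in for real encoder output and for whatever classification head the paper actually trains.

```python
from statistics import fmean


def mean_pool(frames):
    """Average a list of equal-length frame vectors into one clip-level vector."""
    return [fmean(dim) for dim in zip(*frames)]


def fit_centroids(clip_vectors, labels):
    """Compute one centroid per label from clip-level vectors."""
    grouped = {}
    for vec, label in zip(clip_vectors, labels):
        grouped.setdefault(label, []).append(vec)
    # Averaging the vectors of a group reuses the same column-wise mean.
    return {label: mean_pool(vecs) for label, vecs in grouped.items()}


def predict(centroids, vec):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


# Hypothetical stand-in for encoder output: 2 frames x 3 dimensions per clip.
train_clips = [
    [[0.9, 0.1, 0.0], [1.1, -0.1, 0.0]],
    [[1.0, 0.0, 0.2], [1.0, 0.0, -0.2]],
    [[0.0, 1.0, 0.1], [0.0, 1.0, -0.1]],
    [[0.1, 0.9, 0.0], [-0.1, 1.1, 0.0]],
]
# Hypothetical technique labels, not taken from the dataset.
train_labels = ["scales", "scales", "arpeggios", "arpeggios"]

centroids = fit_centroids([mean_pool(c) for c in train_clips], train_labels)
test_clip = [[0.8, 0.2, 0.0], [1.0, 0.1, 0.0]]
prediction = predict(centroids, mean_pool(test_clip))
print(prediction)
```

Because the encoder stays fixed in this setup, differences in downstream accuracy can be attributed to the representations rather than to the classifier, which is why probing of this kind is often used to compare encoders.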

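The results show that performance varies widely across tasks. The best approach achieved 93.6% accuracy in expertise ranking, but only 33.7% in difficulty estimation and 46.7% in technique detection, with Audio-MAE the most effective encoder overall. A case study on Chopin Piano Competition data, using the trained expertise-ranking models, highlights how hard it is to assess top-tier performances accurately. There is still significant room for improvement in machine understanding of the nuances of solo piano performance.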
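
The abstract lists expertise ranking as one of the tasks but this summary does not say how its accuracy is computed. One common choice for ranking tasks is pairwise accuracy: the fraction of pairs of performances with different true expertise levels that the model orders correctly. The sketch below shows that metric under this assumption; the function name, the integer expertise levels, and the example scores are hypothetical.

```python
from itertools import combinations


def pairwise_ranking_accuracy(true_levels, predicted_scores):
    """Fraction of pairs with different true levels that the scores order correctly.

    Pairs tied in the ground truth are skipped; pairs tied in the
    predictions count as incorrect.
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(true_levels)), 2):
        if true_levels[i] == true_levels[j]:
            continue  # no defined order for this pair
        total += 1
        if predicted_scores[i] == predicted_scores[j]:
            continue  # the model failed to separate the pair
        true_order = true_levels[i] > true_levels[j]
        predicted_order = predicted_scores[i] > predicted_scores[j]
        if true_order == predicted_order:
            correct += 1
    if total == 0:
        raise ValueError("need at least one pair with different true levels")
    return correct / total


# Hypothetical example: expertise levels 0 (beginner) to 2 (advanced).
levels = [0, 1, 2, 2]
scores = [0.1, 0.4, 0.3, 0.9]
accuracy = pairwise_ranking_accuracy(levels, scores)
print(accuracy)
```

In this toy example five pairs have different true levels and the scores order four of them correctly. A pairwise metric of this kind also helps explain why top-tier performances are hard to judge: when two players are close in skill, small score differences decide the outcome.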

Critical Analysis

The benchmark presented in this paper is a valuable contribution to the field of music intelligence research. By providing a standardized set of tasks and a labelled dataset, it enables systematic comparison and evaluation of different models' capabilities in understanding and judging musical performances.

One potential limitation is the narrow focus on solo Western classical piano. While this allows for more controlled and detailed analysis, it may limit how well the findings generalize to other instruments, genres, and ensemble settings. Expanding the benchmark to cover a wider range of instruments and ensemble types could make the findings more broadly applicable.

Additionally, the human ratings and annotations used as ground truth may themselves be subject to biases and inconsistencies. Exploring ways to collect more objective and reliable performance evaluations could strengthen the validity of the benchmark.

That said, the authors do acknowledge these challenges and suggest directions for future work to address them. Overall, this paper represents an important step forward in developing robust methods for machine understanding of music, with potential applications in areas like music education, performance evaluation, and automated music transcription.

Conclusion

This paper introduces a new benchmark for evaluating how well pre-trained audio encoders understand solo piano performances. By comparing Jukebox, Audio-MAE, MERT, and DAC on expertise ranking, difficulty estimation, and technique detection, the authors provide valuable insights into the current state of the art and opportunities for further progress in this area.

The benchmark and the accompanying Pianism-Labelling Dataset (PLD) offer a standardized framework for researchers to test and develop more musically informed AI systems. Advances in this direction could have significant implications for fields like music education, performance analysis, and the broader goal of building AI that can engage with and appreciate music in ways more akin to human cognition.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

From Audio Encoders to Piano Judges: Benchmarking Performance Understanding for Solo Piano

Huan Zhang, Jinhua Liang, Simon Dixon

Our study investigates an approach for understanding musical performances through the lens of audio encoding models, focusing on the domain of solo Western classical piano music. Compared to composition-level attribute understanding such as key or genre, we identify a knowledge gap in performance-level music understanding, and address three critical tasks: expertise ranking, difficulty estimation, and piano technique detection, introducing a comprehensive Pianism-Labelling Dataset (PLD) for this purpose. We leverage pre-trained audio encoders, specifically Jukebox, Audio-MAE, MERT, and DAC, demonstrating varied capabilities in tackling downstream tasks, to explore whether domain-specific fine-tuning enhances capability in capturing performance nuances. Our best approach achieved 93.6% accuracy in expertise ranking, 33.7% in difficulty estimation, and 46.7% in technique detection, with Audio-MAE as the overall most effective encoder. Finally, we conducted a case study on Chopin Piano Competition data using trained models for expertise ranking, which highlights the challenge of accurately assessing top-tier performances.

7/22/2024

Towards Musically Informed Evaluation of Piano Transcription Models

Patricia Hu, Lukáš Samuel Marták, Carlos Cancino-Chacón, Gerhard Widmer

Automatic piano transcription models are typically evaluated using simple frame- or note-wise information retrieval (IR) metrics. Such benchmark metrics do not provide insights into the transcription quality of specific musical aspects such as articulation, dynamics, or rhythmic precision of the output, which are essential in the context of expressive performance analysis. Furthermore, in recent years, MAESTRO has become the de-facto training and evaluation dataset for such models. However, inference performance has been observed to deteriorate substantially when applied on out-of-distribution data, thereby questioning the suitability and reliability of transcribed outputs from such models for specific MIR tasks. In this work, we investigate the performance of three state-of-the-art piano transcription models in two experiments. In the first one, we propose a variety of musically informed evaluation metrics which, in contrast to the IR metrics, offer more detailed insight into the musical quality of the transcriptions. In the second experiment, we compare inference performance on real-world and perturbed audio recordings, and highlight musical dimensions which our metrics can help explain. Our experimental results highlight the weaknesses of existing piano transcription metrics and contribute to a more musically sound error analysis of transcription outputs.

7/30/2024

BERT-like Pre-training for Symbolic Piano Music Classification Tasks

Yi-Hui Chou, I-Chun Chen, Chin-Jui Chang, Joann Ching, Yi-Hsuan Yang

This article presents a benchmark study of symbolic piano music classification using the masked language modelling approach of the Bidirectional Encoder Representations from Transformers (BERT). Specifically, we consider two types of MIDI data: MIDI scores, which are musical scores rendered directly into MIDI with no dynamics and precisely aligned with the metrical grid notated by their composers, and MIDI performances, which are MIDI encodings of human performances of musical score sheets. With five public-domain datasets of single-track piano MIDI files, we pre-train two 12-layer Transformer models using the BERT approach, one for MIDI scores and the other for MIDI performances, and fine-tune them for four downstream classification tasks. These include two note-level classification tasks (melody extraction and velocity prediction) and two sequence-level classification tasks (style classification and emotion classification). Our evaluation shows that the BERT approach leads to higher classification accuracy than recurrent neural network (RNN)-based baselines.

4/16/2024

LLaQo: Towards a Query-Based Coach in Expressive Music Performance Assessment

Huan Zhang, Vincent Cheung, Hayato Nishioka, Simon Dixon, Shinichi Furuya

Research in music understanding has extensively explored composition-level attributes such as key, genre, and instrumentation through advanced representations, leading to cross-modal applications using large language models. However, aspects of musical performance such as stylistic expression and technique remain underexplored, along with the potential of using large language models to enhance educational outcomes with customized feedback. To bridge this gap, we introduce LLaQo, a Large Language Query-based music coach that leverages audio language modeling to provide detailed and formative assessments of music performances. We also introduce instruction-tuned query-response datasets that cover a variety of performance dimensions from pitch accuracy to articulation, as well as contextual performance understanding (such as difficulty and performance techniques). Utilizing an AudioMAE encoder and a Vicuna-7b LLM backend, our model achieved state-of-the-art (SOTA) results in predicting teachers' performance ratings, as well as in identifying piece difficulty and playing techniques. Textual responses from LLaQo were moreover rated significantly higher than those of other baseline models in a user study using audio-text matching. Our proposed model can thus provide informative answers to open-ended questions related to musical performance from audio data.

9/17/2024