PiCoGen: Generate Piano Covers with a Two-stage Approach

Read original: arXiv:2407.20883 - Published 7/31/2024 by Chih-Pin Tan, Shuen-Huei Guan, Yi-Hsuan Yang


Overview

PiCoGen is a two-stage approach for generating piano covers of existing songs.
It first transcribes the melody line and chord progression of the input audio into a lead sheet, then generates a piano cover in the symbolic domain conditioned on that lead sheet.
The system aims to produce piano covers that capture the essence of the original song, and it does not need paired data of covers and their original songs for training.

Plain English Explanation

PiCoGen is a system that can take an existing song and generate a piano cover version of it. It works in two steps:

Transcription: The first step is to analyze the original song's audio and convert it into a lead sheet, a symbolic representation that records the melody line and the chord progression. This gives the system a compact description of the song's core musical content.
Generation: The second step is to take that lead sheet and use it as the condition for generating a piano cover in symbolic form. The generated arrangement is free to add pianistic detail beyond the melody and chords, while still preserving the essence of the original song.

The key idea is to use the lead sheet as an intermediate representation between the original audio and the piano cover, so the cover stays faithful to the song's melody and harmony while the arrangement itself is newly generated. This allows for more expressive cover song generation than simply transcribing the original song directly.

Technical Explanation

PiCoGen uses a two-stage approach to generate piano covers. The first stage is a transcription model that converts the input audio into a lead sheet, a symbolic representation consisting of the song's melody line and chord progression.
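To make the intermediate representation concrete, here is a minimal Python sketch of how a lead sheet (melody notes plus chord spans) might be stored. The class names, the fields, and the choice of beats as the time unit are all assumptions made for illustration; the summary does not describe the authors' actual data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class MelodyNote:
    onset_beats: float     # start time, in beats (assumed time unit)
    duration_beats: float  # length, in beats
    pitch: int             # MIDI pitch number, 60 = middle C


@dataclass(frozen=True)
class ChordSpan:
    onset_beats: float
    duration_beats: float
    root_pc: int           # root pitch class, 0..11 with 0 = C
    quality: str           # "maj" or "min" in this toy example


@dataclass
class LeadSheet:
    """Hypothetical container: a melody line plus a chord progression."""
    melody: List[MelodyNote] = field(default_factory=list)
    chords: List[ChordSpan] = field(default_factory=list)

    def chord_at(self, beat: float) -> Optional[ChordSpan]:
        """Return the chord sounding at `beat`, or None if there is none."""
        for chord in self.chords:
            if chord.onset_beats <= beat < chord.onset_beats + chord.duration_beats:
                return chord
        return None

    def length_beats(self) -> float:
        """End time of the last melody note or chord, in beats."""
        ends = [e.onset_beats + e.duration_beats for e in (*self.melody, *self.chords)]
        return max(ends, default=0.0)


# A two-bar example in 4/4: C major for bar 1, A minor for bar 2.
example = LeadSheet(
    melody=[MelodyNote(0.0, 1.0, 72), MelodyNote(1.0, 1.0, 74), MelodyNote(4.0, 2.0, 76)],
    chords=[ChordSpan(0.0, 4.0, 0, "maj"), ChordSpan(4.0, 4.0, 9, "min")],
)
```

Calling `example.chord_at(1.5)` looks up the harmony under a given melody position, which is the kind of melody-to-chord alignment a lead sheet makes explicit.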

The second stage is a generation model that takes the lead sheet as its condition and produces a piano cover in the symbolic domain. According to the authors, this design does not require paired data of covers and their original songs for training.
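The way the two stages compose can be sketched in a few lines of Python. Everything here is a stand-in: `toy_transcribe` returns a fixed lead sheet instead of running a transcription model on audio, and `toy_arrange` is a simple rule-based arranger rather than a learned generative model. The function names and the tuple layouts are hypothetical; the only point illustrated is that the second stage receives the lead sheet and nothing else.

```python
from typing import Callable, Dict, List, Tuple

Note = Tuple[float, float, int]        # (onset_beats, duration_beats, midi_pitch)
Chord = Tuple[float, float, int, str]  # (onset_beats, duration_beats, root_pitch_class, quality)
LeadSheet = Dict[str, list]            # {"melody": [Note, ...], "chords": [Chord, ...]}

TRIAD_INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7)}


def toy_transcribe(audio_path: str) -> LeadSheet:
    """Stage 1 stand-in. A real system would run a transcription model on
    the audio; this toy version ignores the path and returns a fixed result."""
    return {
        "melody": [(0.0, 1.0, 72), (1.0, 1.0, 74), (4.0, 2.0, 76)],
        "chords": [(0.0, 4.0, 0, "maj"), (4.0, 4.0, 9, "min")],
    }


def toy_arrange(lead_sheet: LeadSheet) -> List[Note]:
    """Stage 2 stand-in. Keeps the melody and adds a block triad under each
    chord span. A learned model would generate a far richer arrangement."""
    notes: List[Note] = list(lead_sheet["melody"])
    for onset, duration, root_pc, quality in lead_sheet["chords"]:
        base = 48 + root_pc  # place the chord root in the octave starting at C3
        for interval in TRIAD_INTERVALS[quality]:
            notes.append((onset, duration, base + interval))
    return sorted(notes)


def generate_cover(
    audio_path: str,
    transcribe: Callable[[str], LeadSheet] = toy_transcribe,
    arrange: Callable[[LeadSheet], List[Note]] = toy_arrange,
) -> List[Note]:
    """Two-stage pipeline: audio -> lead sheet -> symbolic piano notes."""
    return arrange(transcribe(audio_path))


cover = generate_cover("song.wav")  # hypothetical file name; never opened here
```

Because `generate_cover` takes both stages as parameters, either one can be swapped out independently, which mirrors the modularity that a two-stage design offers.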

The researchers experimented with different model architectures, including Transformer-based approaches, to enable fine-grained control over aspects like rhythm and chord progression during the generation process. This allows the system to produce piano covers that closely match the desired style and expression.

Critical Analysis

The PiCoGen approach addresses an important challenge in music generation: creating expressive covers that go beyond simple transcription. By separating transcription from cover generation, the system uses a lead sheet as the bridge between the original audio and the symbolic piano cover. The authors report that PiCoGen performs competitively with, or even better than, an existing approach that demands paired data, across songs of different musical genres.

However, the paper acknowledges some limitations. The transcription model may not always perfectly capture the nuances of the original performance, which could impact the quality of the final piano cover. Additionally, the style transfer model is dependent on having high-quality training data to learn the pianist's unique style, which may be difficult to obtain for less popular or obscure artists.

Further research could explore ways to improve the transcription accuracy, as well as techniques for learning style representations from limited training data. Incorporating user feedback or interaction during the generation process could also be a fruitful area of investigation, allowing users to fine-tune the piano covers to their preferences.

Conclusion

PiCoGen represents an interesting advance in the field of controllable music generation by combining lead sheet transcription with conditional generation to create expressive piano covers. The two-stage approach allows the system to preserve the essence of the original song while generating a new piano arrangement, potentially enabling more creative and nuanced cover song generation. While the current system has some limitations, the overall concept and techniques used in PiCoGen could have broader implications for AI-assisted music creation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

PiCoGen: Generate Piano Covers with a Two-stage Approach

Chih-Pin Tan, Shuen-Huei Guan, Yi-Hsuan Yang

Cover song generation stands out as a popular way of music making in the music-creative community. In this study, we introduce Piano Cover Generation (PiCoGen), a two-stage approach for automatic cover song generation that transcribes the melody line and chord progression of a song given its audio recording, and then uses the resulting lead sheet as the condition to generate a piano cover in the symbolic domain. This approach is advantageous in that it does not require paired data of covers and their original songs for training. Compared to an existing approach that demands such paired data, our evaluation shows that PiCoGen demonstrates competitive or even superior performance across songs of different musical genres.

7/31/2024

PiCoGen2: Piano cover generation with transfer learning approach and weakly aligned data

Chih-Pin Tan, Hsin Ai, Yi-Hsin Chang, Shuen-Huei Guan, Yi-Hsuan Yang

Piano cover generation aims to create a piano cover from a pop song. Existing approaches mainly employ supervised learning and the training demands strongly-aligned and paired song-to-piano data, which is built by remapping piano notes to song audio. This would, however, result in the loss of piano information and accordingly cause inconsistencies between the original and remapped piano versions. To overcome this limitation, we propose a transfer learning approach that pre-trains our model on piano-only data and fine-tunes it on weakly-aligned paired data constructed without note remapping. During pre-training, to guide the model to learn piano composition concepts instead of merely transcribing audio, we use an existing lead sheet transcription model as the encoder to extract high-level features from the piano recordings. The pre-trained model is then fine-tuned on the paired song-piano data to transfer the learned composition knowledge to the pop song domain. Our evaluation shows that this training strategy enables our model, named PiCoGen2, to attain high-quality results, outperforming baselines on both objective and subjective metrics across five pop genres.

8/6/2024

An End-to-End Approach for Chord-Conditioned Song Generation

Shuochen Gao, Shun Lei, Fan Zhuo, Hangyu Liu, Feng Liu, Boshi Tang, Qiaochu Huang, Shiyin Kang, Zhiyong Wu

The Song Generation task aims to synthesize music composed of vocals and accompaniment from given lyrics. While the existing method, Jukebox, has explored this task, its constrained control over the generations often leads to deficiency in music performance. To mitigate the issue, we introduce an important concept from music composition, namely chords, to song generation networks. Chords form the foundation of accompaniment and provide vocal melody with associated harmony. Given the inaccuracy of automatic chord extractors, we devise a robust cross-attention mechanism augmented with dynamic weight sequence to integrate extracted chord information into song generations and reduce frame-level flaws, and propose a novel model termed Chord-Conditioned Song Generator (CSG) based on it. Experimental evidence demonstrates our proposed method outperforms other approaches in terms of musical performance and control precision of generated songs.

9/11/2024

Emotion-driven Piano Music Generation via Two-stage Disentanglement and Functional Representation

Jingyue Huang, Ke Chen, Yi-Hsuan Yang

Managing the emotional aspect remains a challenge in automatic music generation. Prior works aim to learn various emotions at once, leading to inadequate modeling. This paper explores the disentanglement of emotions in piano performance generation through a two-stage framework. The first stage focuses on valence modeling of lead sheet, and the second stage addresses arousal modeling by introducing performance-level attributes. To further capture features that shape valence, an aspect less explored by previous approaches, we introduce a novel functional representation of symbolic music. This representation aims to capture the emotional impact of major-minor tonality, as well as the interactions among notes, chords, and key signatures. Objective and subjective experiments validate the effectiveness of our framework in both emotional valence and arousal modeling. We further leverage our framework in a novel application of emotional controls, showing a broad potential in emotion-driven music generation.

7/31/2024