An End-to-End Approach for Chord-Conditioned Song Generation

Read original: arXiv:2409.06307 - Published 9/11/2024 by Shuochen Gao, Shun Lei, Fan Zhuo, Hangyu Liu, Feng Liu, Boshi Tang, Qiaochu Huang, Shiyin Kang, Zhiyong Wu

Overview

This paper presents an end-to-end approach for generating songs conditioned on chord progressions.
According to the paper's abstract, the proposed Chord-Conditioned Song Generator (CSG) synthesizes vocals and accompaniment from given lyrics, with chords as an additional control signal.
The model is trained on a dataset of songs with aligned chord and musical information.
The reported experiments show that the method outperforms other approaches in musical performance and in how precisely the generated songs follow the input chords.

Plain English Explanation

The researchers have developed a machine learning system that can create new songs based on a sequence of chords. Chords are the different combinations of musical notes that provide the harmonic structure of a song.
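
To make the idea of a chord concrete, here is a minimal sketch that spells out the notes of a few major and minor triads. The `Root:quality` label format and the `chord_notes` helper are assumptions made for illustration and are not taken from the paper.

```python
# Pitch-class names; the index is the number of semitones above C.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Intervals (in semitones above the root) for two common triad qualities.
TRIAD_INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7)}


def chord_notes(label):
    """Return the note names for a label such as 'C:maj' (hypothetical format)."""
    root, quality = label.split(":")
    root_index = NOTE_NAMES.index(root)
    return [NOTE_NAMES[(root_index + step) % 12] for step in TRIAD_INTERVALS[quality]]


# A common four-chord progression, used here only as sample input.
progression = ["C:maj", "A:min", "F:maj", "G:maj"]
for label in progression:
    print(label, chord_notes(label))
```

A chord progression is then simply an ordered list of such labels, which is the kind of input the summary refers to.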

The system uses a neural network model, specifically a transformer, to generate a new song, with vocals and accompaniment, from given lyrics and a sequence of chords. The model is trained on existing songs whose chord progressions are aligned with the music.

By conditioning the song generation on the chord changes, the system is able to produce new songs that are harmonically coherent and flow well with the provided chord sequence. This allows users to essentially "fill in the blanks" and create new music based on a basic chord structure.

The researchers evaluate the quality of the generated songs and find that they are perceived as meaningful and consistent with the input chords. This end-to-end approach to generating songs from chords could be a useful tool for songwriters, musicians, and music producers.

Technical Explanation

The core of the system is a transformer-based neural network that generates a song, vocals and accompaniment together, from given lyrics and a sequence of chords. The transformer architecture allows the system to capture long-range dependencies in the musical structure.

The input to the model includes a sequence of chord labels, which are first embedded into vector representations. These chord embeddings serve as the conditioning input to the transformer, which generates the song autoregressively: each new token is predicted from the previous outputs and the chord context. The paper's abstract adds that, because automatic chord extractors are inaccurate, the chords are integrated through a cross-attention mechanism augmented with a dynamic weight sequence, which reduces frame-level flaws.
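
The paper's abstract describes integrating chords through a cross-attention mechanism augmented with a dynamic weight sequence. The sketch below illustrates one possible reading of that idea in plain Python: each frame state attends over chord embeddings, and a per-frame weight scales how much chord context is added. The additive form, the toy dimension, the chord vocabulary, and the name `weighted_cross_attention` are all assumptions for illustration, not the authors' formulation.

```python
import math
import random

random.seed(0)
DIM = 4  # toy embedding size (assumption)
CHORD_VOCAB = ["N", "C:maj", "A:min", "F:maj", "G:maj"]  # hypothetical label set

# One randomly initialised vector per chord label; a trained model would learn these.
chord_table = {c: [random.gauss(0, 1) for _ in range(DIM)] for c in CHORD_VOCAB}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_cross_attention(frame_states, chord_labels, frame_weights):
    """Add chord context to each frame state, scaled by a per-frame weight."""
    keys = [chord_table[c] for c in chord_labels]  # keys double as values here
    outputs = []
    for state, weight in zip(frame_states, frame_weights):
        attn = softmax([dot(state, k) / math.sqrt(DIM) for k in keys])
        context = [sum(a * k[i] for a, k in zip(attn, keys)) for i in range(DIM)]
        outputs.append([s + weight * c for s, c in zip(state, context)])
    return outputs


frames = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
chords = ["C:maj", "A:min", "F:maj"]
weights = [1.0, 0.5, 0.0]  # 0.0 means the chord signal is ignored for that frame
conditioned = weighted_cross_attention(frames, chords, weights)
```

In this toy version, a weight of zero leaves the frame state unchanged, which suggests how a low weight could shield the model from a chord label that the extractor got wrong.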

The model is trained on a dataset of songs where the chord progressions are aligned with the corresponding musical content. This allows the system to learn the relationship between chord changes and melodic, rhythmic, and lyrical patterns.
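
One simple way to picture such alignment is to expand time-stamped chord segments into one label per audio frame. The sketch below assumes `(start_sec, end_sec, label)` segments, a fixed frame rate, and `"N"` as the no-chord label; these conventions and the `chords_to_frames` helper are hypothetical, not the paper's data format.

```python
def chords_to_frames(segments, frame_rate, duration):
    """Expand (start_sec, end_sec, label) segments into one label per frame."""
    n_frames = int(round(duration * frame_rate))
    labels = ["N"] * n_frames  # "N" marks frames with no chord (assumed convention)
    for start, end, label in segments:
        first = int(round(start * frame_rate))
        last = min(int(round(end * frame_rate)), n_frames)
        for i in range(first, last):
            labels[i] = label
    return labels


# Two one-second chords followed by half a second without a chord.
segments = [(0.0, 1.0, "C:maj"), (1.0, 2.0, "G:maj")]
frame_labels = chords_to_frames(segments, frame_rate=4, duration=2.5)
print(frame_labels)
```

With four frames per second, the first four frames carry `C:maj`, the next four carry `G:maj`, and the last two stay `N`.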

During evaluation, the researchers find that the generated songs exhibit a high degree of coherence with the input chord sequences. Subjective listening tests also indicate that the songs are perceived as meaningful and natural by human judges.
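
As an illustration of how coherence with the input chords could be quantified, the sketch below computes the fraction of frames whose estimated chord matches the target chord. It assumes the estimated labels come from running a chord recogniser on the generated audio; the metric and the name `chord_match_rate` are illustrative and not necessarily what the paper reports.

```python
def chord_match_rate(target, estimated):
    """Fraction of frames whose estimated chord label equals the target label."""
    if len(target) != len(estimated):
        raise ValueError("label sequences must have the same length")
    if not target:
        return 0.0
    hits = sum(t == e for t, e in zip(target, estimated))
    return hits / len(target)


target = ["C:maj", "C:maj", "G:maj", "G:maj"]
estimated = ["C:maj", "A:min", "G:maj", "G:maj"]
print(chord_match_rate(target, estimated))  # 0.75
```

Exact label matching is strict; a more forgiving variant might count chords that share a root as partial matches.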

The end-to-end nature of the system, in which a single model generates the whole song under chord conditioning, is a key contribution of this work. This allows for flexible and controllable song creation based on harmonic structure.

Critical Analysis

A limitation of the approach is that it conditions only on chord progressions, without explicitly modeling other important musical aspects such as tonality or key changes (modulations). Incorporating these higher-level musical concepts could further improve the coherence and expressiveness of the generated songs.

Additionally, the training dataset used in the experiments is relatively small (around 5,000 songs), which may limit the model's ability to capture the full diversity of musical styles and structures. Scaling up the dataset size and diversity could lead to more versatile and representative song generation.

The paper does not provide a thorough analysis of the creativity or novelty of the generated songs. While the songs are coherent with the chord input, it is unclear how "original" or "unique" the generated content is compared to human-composed music. Evaluating the model's ability to produce truly novel and creative musical ideas is an important area for further research.

Overall, the end-to-end chord-conditioned song generation approach presented in this paper is a promising step towards more controllable and expressive musical creation using artificial intelligence. However, there are still opportunities to expand the musical understanding and creative capabilities of such systems.

Conclusion

This paper introduces an end-to-end system for generating new songs guided by a sequence of chord changes. By conditioning the generation of vocals and accompaniment on the harmonic structure, the system is able to produce musically coherent and meaningful songs.

The transformer-based model architecture and aligned dataset of songs with chord progressions are key technical contributions that enable this chord-conditioned song generation. While the current system has some limitations, the overall approach represents an important step towards more flexible and controllable AI-assisted music creation.

As the field of AI-generated music continues to evolve, systems like the one presented in this paper could significantly impact how music is composed and produced, empowering both professional and amateur creators.

Related Papers

An End-to-End Approach for Chord-Conditioned Song Generation

Shuochen Gao, Shun Lei, Fan Zhuo, Hangyu Liu, Feng Liu, Boshi Tang, Qiaochu Huang, Shiyin Kang, Zhiyong Wu

The Song Generation task aims to synthesize music composed of vocals and accompaniment from given lyrics. While the existing method, Jukebox, has explored this task, its constrained control over the generations often leads to deficiency in music performance. To mitigate the issue, we introduce an important concept from music composition, namely chords, to song generation networks. Chords form the foundation of accompaniment and provide vocal melody with associated harmony. Given the inaccuracy of automatic chord extractors, we devise a robust cross-attention mechanism augmented with dynamic weight sequence to integrate extracted chord information into song generations and reduce frame-level flaws, and propose a novel model termed Chord-Conditioned Song Generator (CSG) based on it. Experimental evidence demonstrates our proposed method outperforms other approaches in terms of musical performance and control precision of generated songs.

9/11/2024

SongCreator: Lyrics-based Universal Song Generation

Shun Lei, Yixuan Zhou, Boshi Tang, Max W. Y. Lam, Feng Liu, Hangyu Liu, Jingcheng Wu, Shiyin Kang, Zhiyong Wu, Helen Meng

Music is an integral part of human culture, embodying human intelligence and creativity, of which songs compose an essential part. While various aspects of song generation have been explored by previous works, such as singing voice, vocal composition and instrumental arrangement, etc., generating songs with both vocals and accompaniment given lyrics remains a significant challenge, hindering the application of music generation models in the real world. In this light, we propose SongCreator, a song-generation system designed to tackle this challenge. The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) to capture the information of vocals and accompaniment for song generation, and an additional attention mask strategy for DSLM, which allows our model to understand, generate and edit songs, making it suitable for various song-related generation tasks. Extensive experiments demonstrate the effectiveness of SongCreator by achieving state-of-the-art or competitive performances on all eight tasks. Notably, it surpasses previous works by a large margin in lyrics-to-song and lyrics-to-vocals. Additionally, it is able to independently control the acoustic conditions of the vocals and accompaniment in the generated song through different prompts, exhibiting its potential applicability. Our samples are available at https://songcreator.github.io/.

9/11/2024

MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Yun-Han Lan, Wen-Yi Hsiao, Hao-Chung Cheng, Yi-Hsuan Yang

Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally-conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically-extracted rhythm and chords as the condition signal. During inference, the condition can either be musical features extracted from a reference audio signal, or be user-defined symbolic chord sequence, BPM, and textual prompts. Our performance evaluation on two datasets -- one derived from extracted features and the other from user-created inputs -- demonstrates that MusiConGen can generate realistic backing track music that aligns well with the specified conditions. We open-source the code and model checkpoints, and provide audio examples online, https://musicongen.github.io/musicongen_demo/.

7/23/2024

Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment

Zhiqing Hong, Rongjie Huang, Xize Cheng, Yongqi Wang, Ruiqi Li, Fuming You, Zhou Zhao, Zhimeng Zhang

A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently. Little attention was paid to explore song synthesis. In this work, we propose a novel task called text-to-song synthesis which incorporating both vocals and accompaniments generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representation for controllable V2A synthesis. A Chinese song dataset mined from a music website is built up to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found in https://text2songMelodist.github.io/Sample/.

5/21/2024