COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations

Read original: arXiv:2404.16969 - Published 9/12/2024 by Ruben Ciranni, Giorgio Mariani, Michele Mancusi, Emilian Postolache, Giorgio Fabbro, Emanuele Rodolà, Luca Cosmo


Overview

The paper presents a new contrastive learning method called COCOLA (Coherence-Oriented Contrastive Learning for Audio) for learning musical audio representations that capture harmonic and rhythmic coherence between audio samples.
The authors also introduce a new baseline for compositional music generation called CompoNet, which is based on the ControlNet architecture and generalizes the tasks of MSDM (Multi-Source Diffusion Models).
The paper evaluates the CompoNet model using the COCOLA method and releases pre-trained models on public datasets containing separate audio stems.

Plain English Explanation

The paper describes a new way of learning representations, or "embeddings," for musical audio data. The goal is to capture coherence: whether different parts of a musical track fit together in terms of harmony (chords) and rhythm.

The COCOLA method works by comparing different combinations of the individual "stems" (or components) that make up a full musical piece. This allows the model to learn representations that reflect the compositional structure of the music, rather than just the individual sounds.
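To make the idea of "combinations of stems" concrete, here is a minimal sketch of how two sub-mixes could be drawn from one excerpt. It is an illustration under assumptions, not the authors' data pipeline: the function name `make_submix_pair` is hypothetical, the random arrays stand in for real stems, and splitting all stems into exactly two groups is a simplification.

```python
import numpy as np

def make_submix_pair(stems, rng):
    """Split one excerpt's stems into two disjoint, non-empty groups and sum each.

    stems: array of shape (num_stems, num_samples) holding time-aligned stems.
    Returns two mono sub-mixes that share the same musical context.
    """
    num_stems = stems.shape[0]
    if num_stems < 2:
        raise ValueError("need at least two stems to form a pair")
    order = rng.permutation(num_stems)
    # Split point in 1 .. num_stems-1, so neither group is empty.
    split = rng.integers(1, num_stems)
    group_a, group_b = order[:split], order[split:]
    return stems[group_a].sum(axis=0), stems[group_b].sum(axis=0)

# Stand-in data (assumption): 4 stems, e.g. drums, bass, guitar, vocals.
rng = np.random.default_rng(0)
stems = rng.standard_normal((4, 16000))
mix_a, mix_b = make_submix_pair(stems, rng)
```

Two sub-mixes cut from the same excerpt would serve as a coherent pair, while sub-mixes taken from different excerpts would serve as incoherent ones.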

The authors also introduce a new CompoNet model for generating new music, based on the ControlNet architecture. This model is designed to be better at producing coherent, compositionally-aware musical accompaniment.

The paper evaluates the CompoNet model using the COCOLA method, and the authors release pre-trained versions of both models that can be used by other researchers working on tasks like music generation or audio representation learning.

Technical Explanation

The COCOLA method operates at the level of individual "stems" (or components) that make up a music track, such as the drums, bass, melody, etc. It uses a contrastive learning approach to train the model to capture the harmonic and rhythmic coherence between different combinations of these stems.
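A contrastive objective of this kind can be sketched as an InfoNCE-style loss over a batch of paired embeddings. The snippet below is a generic illustration rather than the paper's exact loss: the cosine similarity, the `temperature` value, and the function name `info_nce_loss` are all assumptions, and the paper may use a different similarity function.

```python
import numpy as np

def info_nce_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss: row i of emb_a and row i of emb_b form the positive pair;
    every other row of emb_b acts as a negative for row i of emb_a."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                       # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # cross-entropy on the diagonal

# Stand-in embeddings (assumption): 8 pairs of 16-dimensional vectors.
rng = np.random.default_rng(0)
emb = rng.standard_normal((8, 16))
loss_aligned = info_nce_loss(emb, emb)                       # pairs match
loss_shuffled = info_nce_loss(emb, np.roll(emb, 1, axis=0))  # pairs mismatched
```

Minimizing this loss pulls embeddings of sub-mixes from the same excerpt together and pushes apart those from different excerpts, which is why the aligned batch yields the lower value.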

The CompoNet model is a new baseline for compositional music generation that is based on the ControlNet architecture. It is designed to generalize the tasks of MSDM, a previous model for generating musical accompaniment.

The authors evaluate the CompoNet model using the COCOLA method, which allows them to quantify how well the model captures the compositional structure of the music. They also release pre-trained versions of both COCOLA and CompoNet that were trained on public datasets containing separate audio stems, such as MUSDB18-HQ, MoisesDB, Slakh2100, and CocoChorales.
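As a rough sketch of how such an evaluation could look, one can embed each conditioning track and the accompaniment generated for it, then average their similarity. Everything here is assumed for illustration: `coherence_score` and `toy_encode` are hypothetical names, the toy encoder is a stand-in for a trained COCOLA encoder, and cosine similarity may not be the score the paper defines.

```python
import numpy as np

def coherence_score(encode, contexts, accompaniments):
    """Average cosine similarity between embeddings of paired clips."""
    scores = []
    for ctx, acc in zip(contexts, accompaniments):
        u, v = encode(ctx), encode(acc)
        scores.append(float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))
    return float(np.mean(scores))

def toy_encode(audio):
    # Hypothetical stand-in for a trained encoder:
    # magnitude spectrum averaged into 32 frequency bands.
    spectrum = np.abs(np.fft.rfft(audio))
    return np.array([band.mean() for band in np.array_split(spectrum, 32)])

# Stand-in audio (assumption): three random clips scored against themselves.
rng = np.random.default_rng(0)
clips = [rng.standard_normal(1024) for _ in range(3)]
self_score = coherence_score(toy_encode, clips, clips)
```

With a real encoder, a higher average would indicate that the generated accompaniments fit their conditioning tracks more closely.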

Critical Analysis

The paper presents a novel and promising approach for learning musical representations that capture the compositional structure of music. The COCOLA method's focus on coherence between different musical components is a valuable addition to the field of audio representation learning.

However, the paper does not provide a thorough examination of the limitations of the COCOLA and CompoNet models. For example, it is unclear how well the models perform on more complex or genre-diverse musical compositions, or how they compare to other state-of-the-art models for music generation and representation learning.

Additionally, the paper could have explored the potential applications of the COCOLA and CompoNet models beyond the task of accompaniment generation, such as music analysis, classification, or retrieval. Investigating these broader use cases could help readers better understand the significance and impact of the proposed methods.

Conclusion

This paper introduces a new contrastive learning method called COCOLA and a new baseline model called CompoNet for musical audio representation and compositional music generation. The COCOLA method's focus on capturing harmonic and rhythmic coherence between musical components is a novel and promising approach, and the release of pre-trained models on public datasets is a valuable contribution to the research community.

While the paper could have delved deeper into the limitations and broader applications of the proposed methods, it still represents an important step forward in the field of music representation learning and generative modeling. The COCOLA and CompoNet models have the potential to enable more coherent and compositionally-aware music generation, which could have significant implications for applications such as music composition, production, and education.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations

Ruben Ciranni, Giorgio Mariani, Michele Mancusi, Emilian Postolache, Giorgio Fabbro, Emanuele Rodolà, Luca Cosmo

We present COCOLA (Coherence-Oriented Contrastive Learning for Audio), a contrastive learning method for musical audio representations that captures the harmonic and rhythmic coherence between samples. Our method operates at the level of the stems composing music tracks and can input features obtained via Harmonic-Percussive Separation (HPS). COCOLA allows the objective evaluation of generative models for music accompaniment generation, which are difficult to benchmark with established metrics. In this regard, we evaluate recent music accompaniment generation models, demonstrating the effectiveness of the proposed method. We release the model checkpoints trained on public datasets containing separate stems (MUSDB18-HQ, MoisesDB, Slakh2100, and CocoChorales).

9/12/2024


Continual Contrastive Spoken Language Understanding

Umberto Cappellazzo, Enrico Fini, Muqiao Yang, Daniele Falavigna, Alessio Brutti, Bhiksha Raj

Recently, neural networks have shown impressive progress across diverse fields, with speech processing being no exception. However, recent breakthroughs in this area require extensive offline training using large datasets and tremendous computing resources. Unfortunately, these models struggle to retain their previously acquired knowledge when learning new tasks continually, and retraining from scratch is almost always impractical. In this paper, we investigate the problem of learning sequence-to-sequence models for spoken language understanding in a class-incremental learning (CIL) setting and we propose COCONUT, a CIL method that relies on the combination of experience replay and contrastive learning. Through a modified version of the standard supervised contrastive loss applied only to the rehearsal samples, COCONUT preserves the learned representations by pulling closer samples from the same class and pushing away the others. Moreover, we leverage a multimodal contrastive loss that helps the model learn more discriminative representations of the new data by aligning audio and text features. We also investigate different contrastive designs to combine the strengths of the contrastive loss with teacher-student architectures used for distillation. Experiments on two established SLU datasets reveal the effectiveness of our proposed approach and significant improvements over the baselines. We also show that COCONUT can be combined with methods that operate on the decoder side of the model, resulting in further metrics improvements.

6/5/2024

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.

8/1/2024

Sequential Contrastive Audio-Visual Learning

Ioannis Tsiamas, Santiago Pascual, Chunghsin Yeh, Joan Serrà

Contrastive learning has emerged as a powerful technique in audio-visual representation learning, leveraging the natural co-occurrence of audio and visual modalities in extensive web-scale video datasets to achieve significant advancements. However, conventional contrastive audio-visual learning methodologies often rely on aggregated representations derived through temporal aggregation, which neglects the intrinsic sequential nature of the data. This oversight raises concerns regarding the ability of standard approaches to capture and utilize fine-grained information within sequences, information that is vital for distinguishing between semantically similar yet distinct examples. In response to this limitation, we propose sequential contrastive audio-visual learning (SCAV), which contrasts examples based on their non-aggregated representation space using sequential distances. Retrieval experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV, showing 2-3x relative improvements against traditional aggregation-based contrastive learning and other methods from the literature. We also show that models trained with SCAV exhibit a high degree of flexibility regarding the metric employed for retrieval, allowing them to operate on a spectrum of efficiency-accuracy trade-offs, potentially making them applicable in multiple scenarios, from small- to large-scale retrieval.

7/9/2024