Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

Read original: arXiv:2404.06682 - Published 4/11/2024 by Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda

Overview

This paper proposes a way to measure music similarity while focusing on one instrumental sound at a time, so that users of a recommendation or retrieval system can choose which instrument to compare on.
The researchers train a single network that takes mixed music as input and produces one similarity embedding space with disentangled dimensions for each instrument, using Conditional Similarity Networks trained with a triplet loss and masks.
According to the abstract, the method yields more accurate representations than individual networks fed with separated sounds, and its instrument-focused selections agree with human listeners, especially for drums and guitar.

Plain English Explanation

In this research, the authors developed a machine learning system that listens to a full piece of music, with all its instruments mixed together, and describes it so that each instrument's contribution is kept in its own part of the description.

This is important because two pieces can be alike in one respect and different in another. For example, two songs may have very similar drum patterns but quite different guitar parts. By keeping a separate set of features for each instrument, the system lets a listener ask for songs that are similar in the drums alone, or in the guitar alone, rather than receiving a single overall similarity score.

The researchers motivate this with flexible music recommendation and retrieval, where users select the element they want to focus on. Earlier approaches needed either isolated instrument tracks, which are impractical to use as a search query, or automatically separated sounds, whose artifacts reduced accuracy. Working directly from the mixed audio avoids both problems.

Technical Explanation

The core of the researchers' approach is a single network that takes mixed audio as input and outputs one similarity embedding. This embedding is "disentangled", meaning its dimensions are divided into sub-embedding spaces, each intended to capture the characteristics of one instrument.

To train this model, the researchers use Conditional Similarity Networks with a triplet loss. In a triplet loss, an anchor example is pulled closer to a similar (positive) example and pushed away from a dissimilar (negative) one. Masks select which dimensions of the embedding are compared, so a triplet that expresses similarity for a given instrument shapes only that instrument's dimensions. This replaces the previous study's multiple individual networks, one per instrumental sound, with a single network.
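The paper's abstract describes a single embedding space with disentangled dimensions for each instrument, trained by a triplet loss using masks. The following is a minimal sketch of that idea, not the authors' implementation. The instrument list beyond drums and guitar, the block size, the Euclidean distance, the margin value, and all function names are assumptions chosen for illustration.

```python
import math

# Hypothetical layout: an 8-dimensional embedding split into 4 blocks of 2,
# one block per instrument. Only drums and guitar are named in the abstract;
# the other names and the block size are illustrative assumptions.
INSTRUMENTS = ["drums", "guitar", "bass", "piano"]
DIMS_PER_INSTRUMENT = 2


def instrument_mask(instrument):
    """Return a 0/1 mask that keeps only the block for one instrument."""
    index = INSTRUMENTS.index(instrument)
    mask = [0.0] * (len(INSTRUMENTS) * DIMS_PER_INSTRUMENT)
    start = index * DIMS_PER_INSTRUMENT
    for i in range(start, start + DIMS_PER_INSTRUMENT):
        mask[i] = 1.0
    return mask


def masked_distance(x, y, mask):
    """Euclidean distance computed only over the dimensions the mask keeps."""
    return math.sqrt(sum(m * (a - b) ** 2 for a, b, m in zip(x, y, mask)))


def masked_triplet_loss(anchor, positive, negative, mask, margin=0.2):
    """Hinge-style triplet loss restricted to one instrument's sub-space.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`, measured on the masked dimensions only.
    """
    d_pos = masked_distance(anchor, positive, mask)
    d_neg = masked_distance(anchor, negative, mask)
    return max(0.0, d_pos - d_neg + margin)
```

Because the mask zeroes out every other block, a triplet labeled for drums contributes nothing to the guitar dimensions, which is what lets one network hold several instrument-specific similarity spaces side by side.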

The resulting representations are evaluated on music similarity. The abstract reports three findings: (1) the proposed method obtains more accurate feature representations than individual networks that take separated sounds as input, (2) each sub-embedding space holds the characteristics of its corresponding instrument, and (3) when the method selects similar musical pieces with a focus on one instrumental sound, human listeners tend to agree with the selection, especially for drums and guitar.
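The abstract's retrieval scenario, in which a user picks the instrument to focus on, can be illustrated with a small ranking sketch. This is a hedged example rather than the paper's evaluation code: the block layout, the track names, the embedding values, and the function names are all made up for illustration, and plain Euclidean distance is an assumed similarity measure.

```python
import math

# Hypothetical layout: each track embedding is the concatenation of
# fixed-size blocks, one per instrument. Names and sizes are assumptions.
BLOCKS = {"drums": (0, 2), "guitar": (2, 4)}


def sub_embedding(embedding, instrument):
    """Slice out the block of dimensions assigned to one instrument."""
    start, end = BLOCKS[instrument]
    return embedding[start:end]


def rank_by_instrument(query, library, instrument):
    """Return track names sorted from most to least similar to the query,
    comparing only the chosen instrument's block of the embedding."""
    q = sub_embedding(query, instrument)

    def distance(item):
        _, emb = item
        return math.dist(q, sub_embedding(emb, instrument))

    return [name for name, _ in sorted(library.items(), key=distance)]


# Toy data: made-up embeddings, not results from the paper.
library = {
    "track_a": [0.1, 0.0, 0.9, 0.9],
    "track_b": [0.8, 0.7, 0.1, 0.0],
}
query = [0.0, 0.1, 0.0, 0.1]

print(rank_by_instrument(query, library, "drums"))
print(rank_by_instrument(query, library, "guitar"))
```

With this toy data, the same query ranks `track_a` first when the focus is drums and `track_b` first when the focus is guitar, which is the kind of user-selectable behavior the abstract describes.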

Critical Analysis

A key strength of this work is that the embedding is structured and interpretable rather than a black box: each block of dimensions corresponds to a named instrument, so users can control which aspect of a piece drives the similarity search. Working from mixed audio also removes the need for isolated instrument tracks or separated sounds as input.

That said, open questions remain. For example, it is unclear how well the approach would generalize to instruments or genres that are not represented in training, or to pieces in which a given instrument is absent. There may also be inherent limits to how cleanly instruments can be disentangled from a mixture, and the abstract's note that human agreement was strongest for drums and guitar suggests that results may be weaker for other instruments.

Additionally, while the researchers validate the method's selections against human judgments, more work may be needed to understand how it would behave in a deployed recommendation or retrieval system, for example with large catalogs and diverse recordings.

Overall, this is an interesting and technically solid piece of work that advances our understanding of how to build more interpretable and musically relevant audio representations. However, there is still room for further exploration of the limitations, robustness, and practical applications of this approach.

Conclusion

This paper presents a single-network method for extracting disentangled, instrument-specific similarity representations from mixed music. By giving each instrument its own dimensions in a shared embedding space, the researchers have developed a more flexible and interpretable way of assessing musical similarity.

The main application is music recommendation and retrieval, where users can choose which instrument a similarity search should focus on. While more research is needed to fully explore the limitations and practical implications of this approach, this work is a step toward similarity measures that reflect how listeners attend to the individual parts of a piece.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda

To achieve a flexible recommendation and retrieval system, it is desirable to calculate music similarity by focusing on multiple partial elements of musical pieces and allowing the users to select the element they want to focus on. A previous study proposed using multiple individual networks for calculating music similarity based on each instrumental sound, but it is impractical to use each signal as a query in search systems. Using separated instrumental sounds alternatively resulted in less accuracy due to artifacts. In this paper, we propose a method to compute similarities focusing on each instrumental sound with a single network that takes mixed sounds as input instead of individual instrumental sounds. Specifically, we design a single similarity embedding space with disentangled dimensions for each instrument, extracted by Conditional Similarity Networks, which is trained by the triplet loss using masks. Experimental results have shown that (1) the proposed method can obtain more accurate feature representation than using individual networks using separated sounds as input, (2) each sub-embedding space can hold the characteristics of the corresponding instrument, and (3) the selection of similar musical pieces focusing on each instrumental sound by the proposed method can obtain human consent, especially in drums and guitar.

4/11/2024

DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji

Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are presented. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of which forms a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent instruments. We can jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model using both a simple dataset of isolated chords and a realistic four-part chorales in the style of J.S. Bach, identify the key components for the success of disentanglement, and demonstrate the application of mixture transformation based on source-level attribute manipulation.

8/21/2024

I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition

Yannis Vasilakis, Rachel Bittner, Johan Pauwels

Music two-tower multimodal systems integrate audio and text modalities into a joint audio-text space, enabling direct comparison between songs and their corresponding labels. These systems enable new approaches for classification and retrieval, leveraging both modalities. Despite the promising results they have shown for zero-shot classification and retrieval tasks, closer inspection of the embeddings is needed. This paper evaluates the inherent zero-shot properties of joint audio-text spaces for the case-study of instrument recognition. We present an evaluation and analysis of two-tower systems for zero-shot instrument recognition and a detailed analysis of the properties of the pre-joint and joint embeddings spaces. Our findings suggest that audio encoders alone demonstrate good quality, while challenges remain within the text encoder or joint space projection. Specifically, two-tower systems exhibit sensitivity towards specific words, favoring generic prompts over musically informed ones. Despite the large size of textual encoders, they do not yet leverage additional textual context or infer instruments accurately from their descriptions. Lastly, a novel approach for quantifying the semantic meaningfulness of the textual space leveraging an instrument ontology is proposed. This method reveals deficiencies in the systems' understanding of instruments and provides evidence of the need for fine-tuning text encoders on musical data.

7/26/2024

Disentangled Acoustic Fields For Multimodal Physical Scene Understanding

Jie Yin, Andrew Luo, Yilun Du, Anoop Cherian, Tim K. Marks, Jonathan Le Roux, Chuang Gan

We study the problem of multimodal physical scene understanding, where an embodied agent needs to find fallen objects by inferring object properties, direction, and distance of an impact sound source. Previous works adopt feed-forward neural networks to directly regress the variables from sound, leading to poor generalization and domain adaptation issues. In this paper, we illustrate that learning a disentangled model of acoustic formation, referred to as disentangled acoustic field (DAF), to capture the sound generation and propagation process, enables the embodied agent to construct a spatial uncertainty map over where the objects may have fallen. We demonstrate that our analysis-by-synthesis framework can jointly infer sound properties by explicitly decomposing and factorizing the latent space of the disentangled model. We further show that the spatial uncertainty map can significantly improve the success rate for the localization of fallen objects by proposing multiple plausible exploration locations.

7/17/2024