DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

Read original: arXiv:2408.10807 - Published 8/21/2024 by Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, and Yuki Mitsufuji

Overview

DisMix is a generative framework that learns separate pitch and timbre representations for each instrument in a mixed audio signal.
By manipulating those representations, it can generate new mixtures in which the pitch or timbre of individual instruments has been changed.

Plain English Explanation

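DisMix is a system that can take a recording of multiple musical instruments playing together and describe each instrument by two separate factors: the notes it plays (pitch) and the character of its sound (timbre). Changing one of those factors for a single instrument produces a new version of the recording in which that instrument's pitch or tone is different while the others are unaffected.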

This allows musicians and audio engineers to be more creative and have more control when working with complex musical recordings. For example, they could isolate a guitar part and change its sound without impacting the drums or vocals. Or they could shift the pitch of a saxophone line while preserving the original timbre.

Technical Explanation

The key innovation in DisMix is that it disentangles pitch and timbre at the level of individual sources within a mixture. According to the paper's abstract, existing work on pitch and timbre disentanglement has mostly focused on single-instrument audio, which leaves multi-instrument mixtures unaddressed.

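DisMix uses a neural network architecture with separate encoders for pitch and timbre information. For each source, the pitch and timbre representations act as modular building blocks for its melody and instrument, and the set of these per-instrument representations describes the observed mixture. A latent diffusion transformer, learned jointly with the representations, reconstructs the mixture conditioned on that set. Editing or swapping a source's pitch or timbre representation before decoding therefore yields a mixture with a new combination of pitch and timbre.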
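The data flow described above can be illustrated with a toy sketch. This is not the authors' implementation: random linear maps stand in for the trained pitch and timbre encoders, a summed `tanh` projection stands in for the latent diffusion transformer, and every name and size (`encode_source`, `decode_mixture`, `swap_timbre`, the per-source `query` vector, `N_BINS`) is a hypothetical choice made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_BINS, D_PITCH, D_TIMBRE = 64, 8, 4

# Random, scaled linear maps stand in for trained networks (assumption).
W_PITCH = rng.normal(size=(D_PITCH, 2 * N_BINS)) / np.sqrt(2 * N_BINS)
W_TIMBRE = rng.normal(size=(D_TIMBRE, 2 * N_BINS)) / np.sqrt(2 * N_BINS)
W_DEC = rng.normal(size=(N_BINS, D_PITCH + D_TIMBRE)) / np.sqrt(D_PITCH + D_TIMBRE)


def encode_source(mixture, query):
    """Return separate (pitch, timbre) latents for the source selected by `query`."""
    features = np.concatenate([mixture, query])
    return W_PITCH @ features, W_TIMBRE @ features


def decode_mixture(latents):
    """Decode a set of (pitch, timbre) pairs into one mixture vector.

    Summing per-source outputs makes the result independent of source order.
    """
    return sum(np.tanh(W_DEC @ np.concatenate([p, t])) for p, t in latents)


def swap_timbre(latents, i, j):
    """Exchange the timbre latents of sources i and j, keeping their pitch latents."""
    edited = list(latents)
    edited[i] = (latents[i][0], latents[j][1])
    edited[j] = (latents[j][0], latents[i][1])
    return edited


mixture = rng.normal(size=N_BINS)                      # stand-in for mixture features
queries = [rng.normal(size=N_BINS) for _ in range(2)]  # one hypothetical query per source
latents = [encode_source(mixture, q) for q in queries]
edited = swap_timbre(latents, 0, 1)
new_mixture = decode_mixture(edited)
```

The point of the sketch is the interface rather than the models: because pitch and timbre are held in separate latents per source, a manipulation such as a timbre swap is a simple edit on the set of latents, and the decoder only ever sees the edited set.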

The model is evaluated on two datasets: a simple dataset of isolated chords and a more realistic set of four-part chorales in the style of J.S. Bach. The authors use these experiments to identify the components that are key to successful disentanglement and to demonstrate mixture transformation based on source-level attribute manipulation.

Critical Analysis

The authors acknowledge that DisMix has limitations in handling polyphonic audio with many overlapping instrument sounds. There may also be challenges in preserving subtle musical nuances when manipulating individual components of a complex mix.

Additionally, the model is evaluated on a simple dataset of isolated chords and on four-part chorales in the style of J.S. Bach, so its performance on real-world, professionally recorded music may differ. Further research is needed to assess the model's robustness and applicability in diverse music production scenarios.

Conclusion

DisMix extends pitch and timbre disentanglement from single-instrument audio to mixtures of multiple instruments, with the potential to empower musicians, producers, and audio engineers. By disentangling pitch and timbre for each source, the model enables fine-grained control over individual instrument sounds within a mixed recording. This could lead to new creative possibilities and enhanced audio editing workflows.

While the current system has some limitations, the core ideas behind DisMix suggest a promising direction for future research in audio signal processing and generative models for music creation.

Related Papers

DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji

Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are presented. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of which forms a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent instruments. We can jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model using both a simple dataset of isolated chords and a realistic four-part chorales in the style of J.S. Bach, identify the key components for the success of disentanglement, and demonstrate the application of mixture transformation based on source-level attribute manipulation.

8/21/2024

Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda

To achieve a flexible recommendation and retrieval system, it is desirable to calculate music similarity by focusing on multiple partial elements of musical pieces and allowing the users to select the element they want to focus on. A previous study proposed using multiple individual networks for calculating music similarity based on each instrumental sound, but it is impractical to use each signal as a query in search systems. Using separated instrumental sounds alternatively resulted in less accuracy due to artifacts. In this paper, we propose a method to compute similarities focusing on each instrumental sound with a single network that takes mixed sounds as input instead of individual instrumental sounds. Specifically, we design a single similarity embedding space with disentangled dimensions for each instrument, extracted by Conditional Similarity Networks, which is trained by the triplet loss using masks. Experimental results have shown that (1) the proposed method can obtain more accurate feature representation than using individual networks using separated sounds as input, (2) each sub-embedding space can hold the characteristics of the corresponding instrument, and (3) the selection of similar musical pieces focusing on each instrumental sound by the proposed method can obtain human consent, especially in drums and guitar.

4/11/2024

Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

Michele Mancusi, Yurii Halychanskyi, Kin Wai Cheuk, Chieh-Hsin Lai, Stefan Uhlich, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Yuhki Mitsufuji

Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which consists of unpaired monophonic single-instrument audio data. Each diffusion model is trained on a specific instrument with a Gaussian prior. During inference, a model is designated as the source model to map the input audio to its corresponding Gaussian prior, and another model is designated as the target model to reconstruct the target audio from this Gaussian prior, thereby facilitating timbre transfer. We compare our approach against existing unsupervised timbre transfer models such as VAEGAN and Gaussian Flow Bridges (GFB). Experimental results demonstrate that our method achieves both better Fréchet Audio Distance (FAD) and melody preservation, as reflected by lower pitch distances (DPD) compared to VAEGAN and GFB. Additionally, we discover that the noise level from the Gaussian prior, $\sigma$, can be adjusted to control the degree of melody preservation and amount of timbre transferred.

9/16/2024

Real-time Timbre Remapping with Differentiable DSP

Jordie Shier, Charalampos Saitis, Andrew Robertson, Andrew McPherson

Timbre is a primary mode of expression in diverse musical contexts. However, prevalent audio-driven synthesis methods predominantly rely on pitch and loudness envelopes, effectively flattening timbral expression from the input. Our approach draws on the concept of timbre analogies and investigates how timbral expression from an input signal can be mapped onto controls for a synthesizer. Leveraging differentiable digital signal processing, our method facilitates direct optimization of synthesizer parameters through a novel feature difference loss. This loss function, designed to learn relative timbral differences between musical events, prioritizes the subtleties of graded timbre modulations within phrases, allowing for meaningful translations in a timbre space. Using snare drum performances as a case study, where timbral expression is central, we demonstrate real-time timbre remapping from acoustic snare drums to a differentiable synthesizer modeled after the Roland TR-808.

7/8/2024