The Sound Demixing Challenge 2023 – Cinematic Demixing Track

Read original: arXiv:2308.06981 - Published 4/19/2024 by Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo and 7 others

Overview

The paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23).
It provides details about the challenge setup, including the structure of the competition and the datasets used.
The paper also discusses the most successful approaches employed by participants.

Plain English Explanation

The paper focuses on a competition called the Sound Demixing Challenge 2023, which had a specific track called Cinematic Demixing (CDX). In this track, participants had to separate different audio sources (like music, dialogue, and sound effects) from real movie audio recordings.

The paper explains how the competition was structured and what datasets were used, including a new dataset called CDXDB23 that was created from actual movie audio. It then describes the best-performing systems in the competition, and how they were able to improve on a baseline approach by making the simulated training data better match the characteristics of real cinematic audio.

Overall, the key takeaway is that the participants were able to significantly improve the separation of movie audio into its component parts by carefully designing their machine learning models and training data to better reflect the complexities of real-world cinematic audio.

Technical Explanation

The paper provides a detailed overview of the CDX track of the SDX'23 competition. It describes the structure of the challenge, including the use of a hidden CDXDB23 dataset constructed from real movies to rank the submissions.

The best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset improved the signal-to-distortion ratio (SDR) by 1.8 dB over the cocktail-fork baseline. On the open leaderboard, where any data could be used for training, the top-performing system improved on the same baseline by 5.7 dB.
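To make the dB figures above more concrete, here is a minimal sketch of an SDR computation. It assumes the common energy-ratio definition, SDR = 10·log10(‖s‖² / ‖s − ŝ‖²); the summary does not spell out the exact formula or averaging used in the challenge, and the function names and synthetic signals are hypothetical.

```python
import numpy as np


def sdr_db(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-12) -> float:
    """SDR in dB, assuming the energy-ratio definition (not confirmed by the summary)."""
    signal_energy = float(np.sum(reference ** 2))
    error_energy = float(np.sum((reference - estimate) ** 2))
    # eps guards against division by zero for silent or perfectly estimated signals.
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))


def db_gain_to_energy_ratio(delta_db: float) -> float:
    """Factor by which the error energy shrinks for a given SDR gain on one signal."""
    return 10.0 ** (delta_db / 10.0)


# Synthetic stand-ins: a "true" stem and two estimates with different error levels.
rng = np.random.default_rng(0)
reference = rng.standard_normal(44100)
noise = rng.standard_normal(44100)
baseline_estimate = reference + 0.5 * noise   # larger residual error
improved_estimate = reference + 0.25 * noise  # residual amplitude halved

gain_db = sdr_db(reference, improved_estimate) - sdr_db(reference, baseline_estimate)
print(f"SDR gain from halving the error amplitude: {gain_db:.2f} dB")

# Rough interpretation of the reported improvements for a single signal.
for delta in (1.8, 5.7):
    print(f"{delta} dB -> error energy reduced by about {db_gain_to_energy_ratio(delta):.2f}x")
```

Under this definition, and for a single signal, a 1.8 dB gain corresponds to roughly 1.5 times less error energy and a 5.7 dB gain to roughly 3.7 times less. Challenge scores are typically averaged over clips and stems, so this per-signal reading is only an approximation.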

The paper attributes a significant part of this improvement to making the simulated training data better match real cinematic audio, and it investigates this effect in detail.
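The summary does not say which characteristics of the simulated data were adjusted. As an illustration only, the sketch below assumes that overall level is one of them and rescales a simulated mixture to a target RMS level. The helper names and the -27 dB target are hypothetical, and plain RMS is used here as a simple stand-in for a perceptual loudness measure.

```python
import numpy as np


def rms_db(signal: np.ndarray, eps: float = 1e-12) -> float:
    """Mean-square level of a signal in dB (plain RMS, not a perceptual loudness)."""
    return 10.0 * np.log10(float(np.mean(signal ** 2)) + eps)


def match_level(signal: np.ndarray, target_db: float) -> np.ndarray:
    """Scale the signal so that its RMS level equals target_db."""
    gain_db = target_db - rms_db(signal)
    # Amplitude gain uses 20 in the exponent because level is defined on power.
    return signal * 10.0 ** (gain_db / 20.0)


# Hypothetical simulated mixture at roughly -20 dB RMS.
rng = np.random.default_rng(0)
simulated_mixture = 0.1 * rng.standard_normal(44100)

# Hypothetical target level standing in for "typical real cinematic audio".
TARGET_DB = -27.0
matched_mixture = match_level(simulated_mixture, TARGET_DB)

print(f"before: {rms_db(simulated_mixture):.2f} dB, after: {rms_db(matched_mixture):.2f} dB")
```

In a separation setting, the same gain would also have to be applied to each target stem so that the stems still sum to the rescaled mixture.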

Critical Analysis

The paper provides a thorough and well-structured analysis of the CDX track of the SDX'23 competition. However, it does not delve into potential limitations or caveats of the research.

For example, the paper does not discuss the generalizability of the findings to other types of audio separation tasks or the robustness of the techniques to real-world variations in cinematic audio. Additionally, the paper does not explore potential biases or ethical considerations in the development and deployment of such audio separation systems.

Further research could investigate these areas to provide a more comprehensive understanding of the implications and limitations of the proposed approaches.

Conclusion

The paper provides valuable insights into the state of the art in cinematic audio separation, as demonstrated by the participants of the SDX'23 competition. The key finding is that designing the training data to better reflect real-world cinematic audio can lead to significant improvements in separation performance.

These advancements have the potential to enhance various applications, such as music production, movie post-processing, and audio archiving. However, further research is needed to fully understand the limitations and ethical considerations of these techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The Sound Demixing Challenge 2023 – Cinematic Demixing Track

Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji

This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. Especially, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8 dB in SDR, whereas the top-performing system on the open leaderboard, where any data could be used for training, saw a significant improvement of 5.7 dB. A significant source of this improvement was making the simulated data better match real cinematic audio, which we further investigate in detail.

4/19/2024

The Sound Demixing Challenge 2023 – Music Demixing Track

Giorgio Fabbro, Stefan Uhlich, Chieh-Hsin Lai, Woosung Choi, Marco Martínez-Ramírez, Weihsiang Liao, Igor Gadelha, Geraldo Ramos, Eddie Hsu, Hugo Rodrigues, Fabian-Robert Stöter, Alexandre Défossez, Yi Luo, Jianwei Yu, Dipam Chakraborty, Sharada Mohanty, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Nabarun Goswami, Tatsuya Harada, Minseok Kim, Jun Hyung Lee, Yuanliang Dong, Xinran Zhang, Jiafeng Liu, Yuki Mitsufuji

This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce two new datasets that simulate such errors: SDXDB23_LabelNoise and SDXDB23_Bleeding. We describe the methods that achieved the highest scores in the competition. Moreover, we present a direct comparison with the previous edition of the challenge (the Music Demixing Challenge 2021): the best performing system achieved an improvement of over 1.6dB in signal-to-distortion ratio over the winner of the previous competition, when evaluated on MDXDB21. Besides relying on the signal-to-distortion ratio as objective metric, we also performed a listening test with renowned producers and musicians to study the perceptual quality of the systems and report here the results. Finally, we provide our insights into the organization of the competition and our prospects for future editions.

4/22/2024

Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support

Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife

Cinematic audio source separation (CASS), as a problem of extracting the dialogue, music, and effects stems from their mixture, is a relatively new subtask of audio source separation. To date, only one publicly available dataset exists for CASS, that is, the Divide and Remaster (DnR) dataset, which is currently at version 2. While DnR v2 has been an incredibly useful resource for CASS, several areas of improvement have been identified, particularly through its use in the 2023 Sound Demixing Challenge. In this work, we develop version 3 of the DnR dataset, addressing issues relating to vocal content in non-dialogue stems, loudness distributions, mastering process, and linguistic diversity. In particular, the dialogue stem of DnR v3 includes speech content from more than 30 languages from multiple families including but not limited to the Germanic, Romance, Indo-Aryan, Dravidian, Malayo-Polynesian, and Bantu families. Benchmark results using the Bandit model indicated that training on multilingual data yields significant generalizability to the model even in languages with low data availability. Even in languages with high data availability, the multilingual model often performs on par or better than dedicated models trained on monolingual CASS datasets. Dataset and model implementation will be made available at https://github.com/kwatcharasupat/source-separation-landing.

8/27/2024

Benchmarks and leaderboards for sound demixing tasks

Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva

Music demixing is the task of separating different tracks from the given single audio signal into components, such as drums, bass, and vocals from the rest of the accompaniment. Separation of sources is useful for a range of areas, including entertainment and hearing aids. In this paper, we introduce two new benchmarks for the sound source separation tasks and compare popular models for sound demixing, as well as their ensembles, on these benchmarks. For the models' assessments, we provide the leaderboard at https://mvsep.com/quality_checker/, giving a comparison for a range of models. The new benchmark datasets are available for download. We also develop a novel approach for audio separation, based on the ensembling of different models that are suited best for the particular stem. The proposed solution was evaluated in the context of the Music Demixing Challenge 2023 and achieved top results in different tracks of the challenge. The code and the approach are open-sourced on GitHub.

5/8/2024