Scaling up masked audio encoder learning for general audio classification

arXiv:2406.06992

Published 6/14/2024 by Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

Abstract

Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and dataset sizes for SSL-based general audio classification. We introduce Dasheng, a simple SSL audio encoder, based on the efficient masked autoencoder framework. Trained with 1.2 billion parameters on 272,356 hours of diverse audio, Dasheng obtains significant performance gains on the HEAR benchmark. It outperforms previous works on CREMA-D, LibriCount, Speech Commands, VoxLingua, and competes well in music and environment classification. Dasheng features inherently contain rich speech, music, and environmental information, as shown in nearest-neighbor classification experiments. Code is available at https://github.com/richermans/dasheng/.

Overview

This paper presents Dasheng, a self-supervised audio encoder that scales up masked audio encoder learning for general audio classification.
The researchers trained a model with 1.2 billion parameters on 272,356 hours of diverse audio and applied it to a range of audio classification problems.
On the HEAR benchmark, the model outperforms previous work on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and is competitive on music and environmental sound classification.

Plain English Explanation

The paper describes a way to train large-scale machine learning models that work across many different kinds of audio data. The key idea is a "masked audio encoder" approach, where the model is trained to predict missing parts of audio signals. This helps the model learn general features and patterns in audio that are useful for many different audio classification tasks, like identifying speech, music, or environmental sounds.

The researchers scaled up this masked audio encoder approach to create a very large model that can handle a wide variety of audio data. When they tested this model on a standard audio classification benchmark, it outperformed previous models on several tasks. This suggests that scaling up masked audio encoder learning is a promising direction for advancing audio understanding and for enabling applications that rely on automatically analyzing and classifying different types of audio.

The approach described in this paper builds on previous work in self-supervised audio representation learning and could complement other efforts to combine self-supervised and supervised learning for improved audio classification performance.

Technical Explanation

The core of the paper's approach is a masked audio encoder model trained on a large, diverse dataset of audio signals. The model is based on a transformer architecture and is trained to predict masked portions of the input audio, similar to how masked language models are trained on text data.
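The masking objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_mask` and `masked_mse`, the 75% mask ratio, the toy patch sizes, and the zero-predicting stand-in for the decoder are all assumptions chosen for illustration. It shows only the general masked autoencoder idea, in which the encoder sees a subset of patches and the reconstruction loss is computed on the hidden ones.

```python
import random

def split_mask(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (kept, masked) lists."""
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked

def masked_mse(target, prediction, masked):
    """Mean squared error computed only over the masked patches."""
    total, count = 0.0, 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0

# Toy "spectrogram": 8 patches, each a vector of 4 values (hypothetical sizes).
rng = random.Random(0)
patches = [[rng.random() for _ in range(4)] for _ in range(8)]

# Assumed mask ratio of 0.75; the summary does not state the value used.
kept, masked = split_mask(len(patches), mask_ratio=0.75, rng=rng)

# The encoder would see only the kept patches; a decoder would then
# predict the masked ones. A trivial stand-in predicts zeros here.
visible = [patches[i] for i in kept]
prediction = [[0.0] * 4 for _ in patches]
loss = masked_mse(patches, prediction, masked)
```

Because the encoder processes only the visible patches, a high mask ratio reduces the amount of input it handles per example, which is one reason this family of methods is often described as efficient.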

The researchers scaled up this masked audio encoder by training a model with 1.2 billion parameters on 272,356 hours of audio spanning a wide range of audio types, including speech, music, and environmental sounds. They also explored different model architectures and training strategies to optimize the model's performance.

When evaluated on the HEAR benchmark, the scaled-up masked audio encoder outperformed previous work on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and was competitive on music and environmental sound classification. Baselines in the comparison include models like MERT and EnCodecMAE. The researchers attribute this improved performance to the model's ability to learn robust, generalizable audio representations from the large-scale, diverse training data.
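The abstract also mentions nearest-neighbor classification experiments on the learned features. The sketch below shows what such a probe looks like in general: frozen embeddings are classified by majority vote among their closest training examples, with no further training. The cosine similarity metric, `k=3`, the three-dimensional toy embeddings, and the function names are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_predict(query, train_embeddings, train_labels, k=3):
    """Label a query by majority vote among its k most similar neighbors."""
    scored = sorted(
        zip(train_embeddings, train_labels),
        key=lambda pair: cosine_similarity(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical frozen embeddings, two per class.
train_embeddings = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.1],   # speech
    [0.1, 0.9, 0.1], [0.0, 0.8, 0.2],   # music
    [0.1, 0.1, 0.9], [0.2, 0.0, 0.8],   # environment
]
train_labels = ["speech", "speech", "music", "music",
                "environment", "environment"]

predicted = knn_predict([0.85, 0.15, 0.05], train_embeddings, train_labels)
```

A probe of this kind is useful because it has no trainable parameters, so good accuracy reflects structure already present in the embeddings.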

Critical Analysis

The paper provides a thorough evaluation of the scaled-up masked audio encoder model, but there are a few potential limitations and areas for further research:

The model was trained on a large but curated dataset, and its performance on real-world, noisy audio data may still need to be assessed.
The paper does not explore the model's performance on more specialized audio tasks, such as speaker verification, which may require additional fine-tuning or architectural changes.
The computational and memory requirements of the large-scale model may limit its deployment in some practical applications, and further work on model compression or efficient inference may be needed.
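To make the memory concern concrete, here is a rough back-of-the-envelope estimate of the storage needed for the weights alone, using the 1.2 billion parameter count from the abstract. The numeric precisions listed are assumptions, since the summary does not say how the weights are stored, and the figures exclude activations, optimizer state, and framework overhead.

```python
NUM_PARAMS = 1_200_000_000  # 1.2 billion parameters, as stated in the abstract

# Assumed storage precisions; which one applies is not stated in the summary.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

estimates = {
    dtype: weight_memory_gb(NUM_PARAMS, size)
    for dtype, size in BYTES_PER_PARAM.items()
}

for dtype, gigabytes in estimates.items():
    print(f"{dtype}: {gigabytes:.1f} GB")
```

Under these assumptions the weights occupy about 4.8 GB in 32-bit floats and 2.4 GB in 16-bit floats, which suggests why reduced precision or compression may matter for deployment on constrained devices.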

Overall, the researchers have made a significant contribution to advancing the state of the art in general audio classification, but continued research and development will be necessary to unlock the full potential of this approach for real-world audio understanding applications.

Conclusion

This paper presents an approach to scaling up masked audio encoder learning for general audio classification tasks. By training a large-scale masked audio encoder with 1.2 billion parameters on a diverse dataset of 272,356 hours of audio, the researchers achieved significant performance gains on the HEAR benchmark.

The key insights from this work are that large-scale, self-supervised learning of audio representations can lead to significant performance gains in downstream audio classification problems, and that the masked audio encoder approach is a promising direction for advancing the field of audio understanding. This work complements other efforts in self-supervised audio representation learning and combining self-supervised and supervised learning to push the boundaries of what is possible in audio-based AI applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

Yizhi Li, Ruibin Yuan, Ge Zhang, Yinghao Ma, Xingran Chen, Hanzhi Yin, Chenghao Xiao, Chenghua Lin, Anton Ragni, Emmanouil Benetos, Norbert Gyenge, Roger Dannenberg, Ruibo Liu, Wenhu Chen, Gus Xia, Yemin Shi, Wenhao Huang, Zili Wang, Yike Guo, Jie Fu

Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is partially due to the distinctive challenges associated with modelling musical knowledge, particularly tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. In our exploration, we identified an effective combination of teacher models, which outperforms conventional speech and audio approaches in terms of performance. This combination includes an acoustic teacher based on Residual Vector Quantisation - Variational AutoEncoder (RVQ-VAE) and a musical teacher based on the Constant-Q Transform (CQT). Furthermore, we explore a wide range of settings to overcome the instability in acoustic language model pre-training, which allows our designed paradigm to scale from 95M to 330M parameters. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.

4/24/2024

cs.SD cs.AI cs.CL cs.LG eess.AS

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED) is used to improve the effectivity of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to LLM and compress acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.

6/26/2024

cs.SD cs.CL eess.AS

Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited Annotations

Bulat Khaertdinov, Pedro Jeuris, Annanda Sousa, Enrique Hortal

Recent advancements in Deep and Self-Supervised Learning (SSL) have led to substantial improvements in Speech Emotion Recognition (SER) performance, reaching unprecedented levels. However, obtaining sufficient amounts of accurately labeled data for training or fine-tuning the models remains a costly and challenging task. In this paper, we propose a multi-view SSL pre-training technique that can be applied to various representations of speech, including the ones generated by large speech models, to improve SER performance in scenarios where annotations are limited. Our experiments, based on wav2vec 2.0, spectral and paralinguistic features, demonstrate that the proposed framework boosts the SER performance, by up to 10% in Unweighted Average Recall, in settings with extremely sparse data annotations.

6/13/2024

cs.CL cs.AI cs.SD eess.AS

EnCodecMAE: Leveraging neural codecs for universal audio representation learning

Leonardo Pepino, Pablo Riera, Luciana Ferrer

The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by works on self-supervised learning for NLP, like BERT, or computer vision, like masked autoencoders (MAE), are often adapted to the audio domain. In this work, we propose masking representations of the audio signal, and training a MAE to reconstruct the masked segments. The reconstruction is done by predicting the discrete units generated by EnCodec, a neural audio codec, from the unmasked inputs. We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds. Our best model outperforms various state-of-the-art audio representation models in terms of global performance. Additionally, we evaluate the resulting representations in the challenging task of automatic speech recognition (ASR), obtaining decent results and paving the way for a universal audio representation.

5/22/2024

cs.SD cs.LG eess.AS