Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

Read original: arXiv:2408.00791 - Published 8/6/2024 by Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

Overview

The paper focuses on improving audio spectrogram transformers for sound event detection through a multi-stage training approach.
Researchers explore different training techniques to enhance the performance of transformer-based models on sound event detection tasks.

Plain English Explanation

The researchers in this paper are looking for ways to improve the performance of AI models that can detect different sounds, like a car horn, a dog barking, or a person speaking. These AI models use a type of machine learning architecture called a "transformer," which is particularly good at understanding patterns in sequential data like audio.

The researchers tried out several different training techniques to see if they could make the transformer models better at detecting sound events. One approach they tested was "multi-stage training," which means they trained the models in multiple steps, using different types of data and learning objectives at each stage.

The key idea is that by breaking up the training process into stages, the models can learn more efficiently and capture the nuances of sound event detection more effectively. For example, the first stage might focus on learning general audio features, while later stages fine-tune the models for the specific task of identifying different sound events.

Overall, the researchers found that this multi-stage training approach led to significant improvements in the sound event detection performance of the transformer-based models. This could be useful for applications like smart home assistants, surveillance systems, or other technologies that need to accurately identify and respond to various sounds in the environment.

Technical Explanation

The paper investigates techniques to improve the performance of audio spectrogram transformers for the task of sound event detection. The researchers explore a multi-stage training approach, where the transformer model is trained in multiple steps using different datasets and learning objectives.

The first stage involves pretraining the transformer model on a large general audio dataset to learn effective audio representations. In the second stage, the model is fine-tuned on a specific sound event detection dataset, such as the FMSG-JLESS dataset or the DCASE 2024 Task 4 dataset. This multi-stage approach allows the model to first develop a strong understanding of general audio features, and then specialize its learning for the sound event detection task.

The researchers also experiment with various self-training and ensembling techniques to further boost the performance of the transformer-based models. These methods leverage unlabeled data and ensemble different model variants to improve the overall sound event detection accuracy.

Critical Analysis

The paper presents a thorough investigation of training techniques to enhance the capabilities of transformer-based models for sound event detection. The multi-stage training approach is a well-justified strategy, as it allows the model to first learn general audio representations before specializing for the target task.

However, the paper does not provide a detailed analysis of the computational and memory requirements of the multi-stage training approach. This information would be valuable for understanding the practical feasibility of deploying such models in real-world applications, where resource constraints may be a concern.

Additionally, the paper could have explored the robustness of the trained models to variations in sound data, such as background noise, reverberation, or different recording conditions. Evaluating the models' performance in more challenging and realistic scenarios would help assess their suitability for real-world deployments.

Conclusion

This paper presents an effective approach to improving the performance of audio spectrogram transformers for sound event detection. The key contribution is the multi-stage training strategy, which allows the models to learn general audio features before specializing for the target task. The researchers also demonstrate the benefits of self-training and ensemble techniques to further enhance the detection accuracy.

The findings in this paper could have significant implications for the development of advanced audio processing systems, such as those used in smart home assistants, autonomous vehicles, or surveillance applications. By improving the sound event detection capabilities of transformer-based models, the researchers have taken a step forward in creating more robust and reliable audio understanding systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

This technical report describes the CP-JKU team's submission for Task 4 Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels of the DCASE 24 Challenge. We fine-tune three large Audio Spectrogram Transformers, PaSST, BEATs, and ATST, on the joint DESED and MAESTRO datasets in a two-stage training procedure. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the large pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of all three fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, boosting single-model performance substantially. Additionally, we pre-train PaSST and ATST on the subset of AudioSet that comes with strong temporal labels, before fine-tuning them on the Task 4 datasets.

8/6/2024

Multi-Iteration Multi-Stage Fine-Tuning of Transformers for Sound Event Detection with Heterogeneous Datasets

Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datasets, including audio clips labeled with varying annotation granularity and with different sets of possible events. We propose a multi-iteration, multi-stage procedure for fine-tuning Audio Spectrogram Transformers on the joint DESED and MAESTRO Real datasets. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, achieving a new single-model, state-of-the-art performance on the public evaluation set of DESED with a PSDS1 of 0.692. A single model and an ensemble, both based on our proposed training procedure, ranked first in Task 4 of the DCASE Challenge 2024.

7/19/2024

🔎

New!Effective Pre-Training of Audio Transformers for Sound Event Detection

Florian Schmid, Tobias Morocutti, Francesco Foscarin, Jan Schluter, Paul Primus, Gerhard Widmer

We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. This includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain a substantial performance improvement over previously available checkpoints both on AudioSet frame-level predictions and on frame-level sound event detection downstream tasks, confirming our pipeline's effectiveness. We publish the resulting checkpoints that researchers can directly fine-tune to build high-performance models for sound event detection tasks.

9/17/2024

FMSG-JLESS Submission for DCASE 2024 Task4 on Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels

Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

This report presents the systems developed and submitted by Fortemedia Singapore (FMSG) and Joint Laboratory of Environmental Sound Sensing (JLESS) for DCASE 2024 Task 4. The task focuses on recognizing event classes and their time boundaries, given that multiple events can be present and may overlap in an audio recording. The novelty this year is a dataset with two sources, making it challenging to achieve good performance without knowing the source of the audio clips during evaluation. To address this, we propose a sound event detection method using domain generalization. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We focus on three main strategies to improve our method. First, we apply mixstyle to the frequency dimension to adapt the mel-spectrograms from different domains. Second, we consider training loss of our model specific to each datasets for their corresponding classes. This independent learning framework helps the model extract domain-specific features effectively. Lastly, we use the sound event bounding boxes method for post-processing. Our proposed method shows superior macro-average pAUC and polyphonic SED score performance on the DCASE 2024 Challenge Task 4 validation dataset and public evaluation dataset.

7/2/2024