Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor

Read original: arXiv:2406.08122 - Published 6/13/2024 by Yongjie Si, Yanxiong Li, Jialong Li, Jiaxin Tan, Qianhua He


Overview

This paper explores a new problem of fully few-shot class-incremental audio classification, where there are few training samples in all sessions.
The authors propose a method using an expandable dual-embedding extractor to solve this problem.
Experiments on three datasets show that the proposed method exceeds seven baseline methods in average accuracy with statistical significance.

Plain English Explanation

In many real-world scenarios, it can be difficult to collect a large number of samples to train audio classification models, especially for some less common classes. This paper looks at a new problem where the training data is limited not just in the initial session, but across all the sessions where the model is updated with new classes.

To address this challenge, the researchers developed a model with an expandable dual-embedding extractor. This model has two key components: an embedding extractor that learns useful feature representations from the audio data, and an expandable classifier that can adapt to recognize new classes as they are added.

The embedding extractor consists of two Audio Spectrogram Transformers (ASTs): a pretrained AST and a fine-tuned AST. The expandable classifier stores a "prototype" for each class, which acts as a representative example that the model can compare new inputs against.

By using this dual-embedding approach, the model is able to effectively learn from the limited training data and continuously expand its capabilities as new classes are introduced. The researchers tested this method on three different audio classification datasets and found that it outperformed seven other baseline techniques.

Technical Explanation

The paper proposes a method for fully few-shot class-incremental audio classification, where there are only a few training samples available in all sessions, not just the initial base session.

The core of the proposed model is an expandable dual-embedding extractor, which consists of two main components:

Embedding Extractor: This component consists of a pretrained Audio Spectrogram Transformer (AST) and a fine-tuned AST, which together extract features from the audio data.
Expandable Classifier: This component stores "prototypes" for each class, which act as representative examples that the model can compare new inputs against. As new classes are introduced, the classifier can expand to accommodate them.
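
The prototype idea in the two components above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name PrototypeClassifier, the use of the mean support embedding as the prototype, the cosine-similarity matching rule, and the toy 2-D vectors are all assumptions made for illustration. In the paper's setting the embeddings would come from the AST-based extractor rather than being hand-written.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class PrototypeClassifier:
    """Hypothetical expandable classifier: one prototype per class."""

    def __init__(self) -> None:
        self.prototypes: Dict[str, List[float]] = {}

    def add_classes(self, support: Dict[str, Sequence[Sequence[float]]]) -> None:
        """Add one prototype per new class (assumed here: the mean of its
        few support embeddings). Prototypes of other classes are untouched."""
        for label, embeddings in support.items():
            self.prototypes[label] = mean_vector(embeddings)

    def predict(self, embedding: Sequence[float]) -> str:
        """Return the label whose prototype is most similar to the embedding."""
        return max(self.prototypes,
                   key=lambda c: cosine(embedding, self.prototypes[c]))


clf = PrototypeClassifier()
# Session 0: two classes with two toy support embeddings each.
clf.add_classes({"dog": [[1.0, 0.1], [0.9, 0.0]],
                 "siren": [[0.0, 1.0], [0.1, 0.9]]})
# Session 1: a new class arrives; the classifier grows by one prototype.
clf.add_classes({"rain": [[-1.0, -1.0], [-0.9, -1.1]]})
```

In this sketch, adding a class only adds an entry to the prototype table, which is one way a classifier can expand without retraining the rest of the system.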

The key idea is to leverage the powerful feature extraction capabilities of the AST model, while also allowing the classifier to dynamically adapt to new classes without requiring a complete retraining of the entire system.

The authors evaluate their proposed method on three audio classification datasets: LS-100, NSynth-100, and FSC-89. They compare it to seven baseline methods and find that it achieves higher average accuracy, with statistical significance.
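
The summary does not define "average accuracy". A common convention in class-incremental evaluation, assumed here, is to measure accuracy after each session and then average over sessions. The sketch below shows that arithmetic on made-up toy predictions. The function names and the data are hypothetical and do not reproduce any result from the paper.

```python
from typing import Sequence, Tuple


def session_accuracy(pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (predicted, true) label pairs that match in one session."""
    correct = sum(1 for predicted, true in pairs if predicted == true)
    return correct / len(pairs)


def average_accuracy(per_session: Sequence[float]) -> float:
    """Mean of per-session accuracies (assumed definition of average accuracy)."""
    return sum(per_session) / len(per_session)


# Toy (predicted, true) pairs for two sessions; illustrative only.
sessions = [
    [("a", "a"), ("b", "b"), ("a", "b"), ("b", "b")],  # 3 of 4 correct
    [("a", "a"), ("c", "b"), ("c", "c"), ("b", "c")],  # 2 of 4 correct
]
accs = [session_accuracy(s) for s in sessions]
aa = average_accuracy(accs)
```

Under this convention, later sessions are evaluated on all classes seen so far, so a drop in accuracy from session to session pulls the average down.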

Critical Analysis

The paper addresses an important practical challenge in audio classification - the difficulty of collecting abundant training samples, especially for less common classes. By proposing a method that can effectively learn from limited data and continuously expand to handle new classes, the authors make a valuable contribution to the field.

However, the paper does not delve into potential limitations or caveats of the proposed approach. For example, it would be useful to understand how the model's performance scales as the number of classes grows, or how sensitive it is to the quality and diversity of the initial training data.

Additionally, the paper could have provided more insight into the inner workings of the dual-embedding extractor. For instance, how do the pretrained and fine-tuned components of the embedding extractor interact, and what is the rationale behind this specific architectural choice?

Overall, the research represents a promising step forward in addressing the challenges of few-shot and class-incremental audio classification. Further exploration of the method's robustness, scalability, and interpretability could help strengthen the findings and make the approach more applicable to real-world scenarios.

Conclusion

This paper introduces a new problem of fully few-shot class-incremental audio classification and presents a novel method to address it. By using an expandable dual-embedding extractor, the proposed model can effectively learn from limited training data and continuously expand its capabilities as new classes are introduced.

The experimental results show that this approach outperforms seven baseline methods in average accuracy, highlighting its potential to improve audio classification performance in practical scenarios where data collection is challenging. While the paper could have delved deeper into certain aspects of the method, it nonetheless represents an important step forward in the field of audio classification and serves as a foundation for future research in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor

Yongjie Si, Yanxiong Li, Jialong Li, Jiaxin Tan, Qianhua He

It's assumed that training data is sufficient in base session of few-shot class-incremental audio classification. However, it's difficult to collect abundant samples for model training in base session in some practical scenarios due to the data scarcity of some classes. This paper explores a new problem of fully few-shot class-incremental audio classification with few training samples in all sessions. Moreover, we propose a method using expandable dual-embedding extractor to solve it. The proposed model consists of an embedding extractor and an expandable classifier. The embedding extractor consists of a pretrained Audio Spectrogram Transformer (AST) and a finetuned AST. The expandable classifier consists of prototypes and each prototype represents a class. Experiments are conducted on three datasets (LS-100, NSynth-100 and FSC-89). Results show that our method exceeds seven baseline ones in average accuracy with statistical significance. Code is at: https://github.com/YongjieSi/EDE.

6/13/2024

Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation

Riyansha Singh, Parinita Nema, Vinod K Kurmi

In machine learning applications, gradual data ingress is common, especially in audio processing where incremental learning is vital for real-time analytics. Few-shot class-incremental learning addresses challenges arising from limited incoming data. Existing methods often integrate additional trainable components or rely on a fixed embedding extractor post-training on base sessions to mitigate concerns related to catastrophic forgetting and the dangers of model overfitting. However, using cross-entropy loss alone during base session training is suboptimal for audio data. To address this, we propose incorporating supervised contrastive learning to refine the representation space, enhancing discriminative power and leading to better generalization since it facilitates seamless integration of incremental classes, upon arrival. Experimental results on NSynth and LibriSpeech datasets with 100 classes, as well as ESC dataset with 50 and 10 classes, demonstrate state-of-the-art performance.

8/9/2024

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions

Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs. However, this leads to performance degradation for ASTs in the inference when input lengths vary from the training. This paper introduces an approach that enables the use of variable-length audio inputs with AST models during both training and inference. By employing sequence packing, our method ElasticAST, accommodates any audio length during training, thereby offering flexibility across all lengths and resolutions at the inference. This flexibility allows ElasticAST to maintain evaluation capabilities at various lengths or resolutions and achieve similar performance to standard ASTs trained at specific lengths or resolutions. Moreover, experiments demonstrate ElasticAST's better performance when trained and evaluated on native-length audio datasets.

7/12/2024

Few-Shot Bioacoustic Event Detection with Frame-Level Embedding Learning System

PengYuan Zhao, ChengWei Lu, Liang Zou

This technical report presents our frame-level embedding learning system for the DCASE2024 challenge for few-shot bioacoustic event detection (Task 5). In this work, we used log-mel and PCEN for feature extraction of the input audio, Netmamba Encoder as the information interaction network, and adopted data augmentation strategies to improve the generalizability of the trained model as well as multiple post-processing methods. Our final system achieved an F-measure score of 56.4%, securing the 2nd rank in the few-shot bioacoustic event detection category of the Detection and Classification of Acoustic Scenes and Events Challenge 2024.

7/16/2024