Deep Space Separable Distillation for Lightweight Acoustic Scene Classification

Read original: arXiv:2405.03567 - Published 5/7/2024 by ShuQi Ye, Yuan Tian

Overview

The paper proposes a deep space separable distillation network for acoustic scene classification (ASC), a crucial task in real-world applications.
The network performs high-low frequency decomposition on the log-mel spectrogram to reduce computational complexity while maintaining model performance.
It also introduces three lightweight operators - Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC) - for efficient feature extraction in ASC tasks.
The proposed method achieves a 9.8% performance gain over existing deep learning methods, while also reducing parameter count and computational complexity.

Plain English Explanation

Acoustic scene classification (ASC) is the task of identifying the environment or setting where a particular sound was recorded, such as a busy street, a quiet library, or a bustling cafe. This is an important capability for many real-world applications, like smart home devices, autonomous vehicles, and audio-based surveillance systems.

Recently, deep learning-based methods have become the go-to approach for ASC. However, these models can be computationally heavy and their performance is not always satisfactory. To address these issues, the researchers in this paper developed a new deep learning network called the "deep space separable distillation network."

The key ideas behind this network are:

Frequency Decomposition: The network takes the log-mel spectrogram of the audio input and splits it into high and low frequency components. This reduces the computational complexity of the model without sacrificing its performance.
Lightweight Operators: The network uses three specialized convolutional operators - Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC) - that extract relevant features for ASC more efficiently than standard convolutional layers.

The experiments show that this new network outperforms other popular deep learning methods for ASC by a significant margin (9.8%), while also being more compact and efficient in terms of the number of parameters and computational requirements.

Technical Explanation

The proposed "deep space separable distillation network" (DSSDN) for acoustic scene classification (ASC) begins by decomposing the input log-mel spectrogram into high and low frequency components. The authors report that this step significantly reduces the computational complexity of the model while maintaining its performance.
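
This summary does not say how the paper carries out the decomposition. As a minimal sketch, assuming it is a simple cut along the mel axis, the idea looks like this in Python. The function name `split_high_low`, the `split_bin` parameter, and the midpoint default are all hypothetical and are not taken from the paper.

```python
def split_high_low(log_mel, split_bin=None):
    """Split a log-mel spectrogram (rows = mel bins, columns = frames)
    into a low-frequency band and a high-frequency band.

    Assumption: the cut is a plain slice along the mel axis. The paper
    may use a different decomposition.
    """
    n_mels = len(log_mel)
    if split_bin is None:
        split_bin = n_mels // 2  # assumed default: cut at the midpoint
    if not 0 < split_bin < n_mels:
        raise ValueError("split_bin must fall strictly inside the mel axis")
    low = [row[:] for row in log_mel[:split_bin]]   # lower mel bins
    high = [row[:] for row in log_mel[split_bin:]]  # upper mel bins
    return low, high


# Toy input: 8 mel bins x 4 frames, with values that encode their position.
toy = [[float(10 * m + t) for t in range(4)] for m in range(8)]
low, high = split_high_low(toy)
```

Each band has half as many mel bins as the input, so any layer applied to a single band processes a smaller feature map than it would on the full spectrogram.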

The network then employs three specialized convolutional operators for feature extraction:

Separable Convolution (SC): This operator factorizes the standard 2D convolution into two 1D convolutions, reducing the number of parameters and computations.
Orthonormal Separable Convolution (OSC): Building on SC, the OSC operator further improves efficiency by enforcing orthonormality constraints on the convolution kernels.
Separable Partial Convolution (SPC): This operator combines the benefits of SC and partial convolution, which selectively applies convolution only to valid (non-padded) regions of the input.
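
To illustrate the factorization described for SC, the sketch below assumes a kernel that is exactly the outer product of a column vector and a row vector. In that case a k x 1 pass followed by a 1 x k pass reproduces the full k x k result while storing 2k weights instead of k². The helper names are hypothetical, and this is not the paper's implementation.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding ('valid' mode)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k_h) for b in range(k_w))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def separable_conv2d_valid(image, col, row):
    """Apply a k x 1 kernel, then a 1 x k kernel (two 1D passes)."""
    vertical = conv2d_valid(image, [[c] for c in col])  # k x 1 pass
    return conv2d_valid(vertical, [row])                # 1 x k pass


col = [1, 2, 1]    # illustrative vertical weights
row = [1, 0, -1]   # illustrative horizontal weights
full_kernel = [[c * r for r in row] for c in col]  # equivalent 3 x 3 kernel

# Small integer test image (5 rows x 6 columns) so the comparison is exact.
image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(5)]

full_out = conv2d_valid(image, full_kernel)
sep_out = separable_conv2d_valid(image, col, row)
```

Here `full_out == sep_out` holds, while the factorized form stores 6 weights rather than 9. A learned separable layer would train `col` and `row` directly instead of deriving them from a full kernel, which restricts the layer to kernels of this factorized form.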

These lightweight yet effective operators enable the DSSDN to achieve state-of-the-art performance on ASC benchmarks with a smaller parameter count and lower computational complexity than other deep learning-based methods.
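
A rough calculation shows where savings of this kind can come from. The layer sizes below are hypothetical, biases are ignored, and same-padding is assumed so that both variants produce the same output size. The numbers illustrate the general effect of factorizing a kernel and are not figures from the paper.

```python
def conv_cost(k_h, k_w, c_in, c_out, out_h, out_w):
    """Return (parameters, multiply-accumulates) for one conv layer.

    Assumptions: no bias terms, dense (non-grouped) convolution.
    """
    params = k_h * k_w * c_in * c_out
    macs = params * out_h * out_w  # each weight is used once per output position
    return params, macs


# Hypothetical layer: 32 -> 32 channels on a 64 x 64 feature map.
C_IN, C_OUT, H, W = 32, 32, 64, 64

# Standard 3 x 3 convolution.
std_params, std_macs = conv_cost(3, 3, C_IN, C_OUT, H, W)

# Factorized version: a 3 x 1 convolution followed by a 1 x 3 convolution.
p1, m1 = conv_cost(3, 1, C_IN, C_OUT, H, W)
p2, m2 = conv_cost(1, 3, C_OUT, C_OUT, H, W)
sep_params, sep_macs = p1 + p2, m1 + m2

print(f"standard : {std_params} params, {std_macs} MACs")
print(f"separable: {sep_params} params, {sep_macs} MACs")
```

With these sizes the factorized pair needs two thirds of the parameters and multiply-accumulates of the standard layer, and the ratio 2/k improves as the kernel size k grows.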

The experimental results demonstrate that the proposed DSSDN outperforms currently popular deep learning methods for ASC by 9.8%, while also having a smaller parameter count and lower computational complexity.

Critical Analysis

The paper presents a well-designed and effective solution to the problem of acoustic scene classification using deep learning. The key innovations, such as frequency decomposition and the use of specialized convolutional operators, are sound and well-justified.

However, the paper does not delve into the potential limitations or caveats of the proposed approach. For example, it would be interesting to understand how the DSSDN performs in more challenging or noisy acoustic environments, or how it compares to human-level performance on ASC tasks.

Additionally, the paper could have discussed the broader implications of this research, such as how it could enable the deployment of ASC models on resource-constrained edge devices or its potential applications in areas like audio-based surveillance or clinical settings.

Further research could also explore the generalizability of the DSSDN approach to other audio-related tasks, such as audio separation or acoustic event detection, to assess its broader applicability in the field of audio signal processing.

Conclusion

The proposed "deep space separable distillation network" offers a novel and effective solution for acoustic scene classification, a crucial task in real-world applications. By combining frequency decomposition and specialized convolutional operators, the network achieves state-of-the-art performance while being more computationally efficient than existing deep learning-based methods.

This research represents an important step forward in developing lightweight and high-performing ASC models, which could enable the deployment of such systems in a wide range of scenarios, from smart home devices to autonomous vehicles. The techniques presented in this paper, particularly the use of efficient feature extraction operators, may also have broader implications for other audio-related tasks and signal processing applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Deep Space Separable Distillation for Lightweight Acoustic Scene Classification

ShuQi Ye, Yuan Tian

Acoustic scene classification (ASC) is highly important in the real world. Recently, deep learning-based methods have been widely employed for acoustic scene classification. However, these methods are currently not lightweight enough as well as their performance is not satisfactory. To solve these problems, we propose a deep space separable distillation network. Firstly, the network performs high-low frequency decomposition on the log-mel spectrogram, significantly reducing computational complexity while maintaining model performance. Secondly, we specially design three lightweight operators for ASC, including Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC). These operators exhibit highly efficient feature extraction capabilities in acoustic scene classification tasks. The experimental results demonstrate that the proposed method achieves a performance gain of 9.8% compared to the currently popular deep learning methods, while also having smaller parameter count and computational complexity.

5/7/2024

Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution Network

Yanxiong Li, Jiaxin Tan, Guoqing Chen, Jialong Li, Yongjie Si, Qianhua He

This work is an improved system that we submitted to task 1 of DCASE2023 challenge. We propose a method of low-complexity acoustic scene classification by a parallel attention-convolution network which consists of four modules, including pre-processing, fusion, global and local contextual information extraction. The proposed network is computationally efficient to capture global and local contextual information from each audio clip. In addition, we integrate other techniques into our method, such as knowledge distillation, data augmentation, and adaptive residual normalization. When evaluated on the official dataset of DCASE2023 challenge, our method obtains the highest accuracy of 56.10% with parameter number of 5.21 kilo and multiply-accumulate operations of 1.44 million. It exceeds the top two systems of DCASE2023 challenge in accuracy and complexity, and obtains state-of-the-art result. Code is at: https://github.com/Jessytan/Low-complexity-ASC.

6/13/2024

Leveraging Self-supervised Audio Representations for Data-Efficient Acoustic Scene Classification

Yiqiang Cai, Shengchen Li, Xi Shao

Acoustic scene classification (ASC) predominantly relies on supervised approaches. However, acquiring labeled data for training ASC models is often costly and time-consuming. Recently, self-supervised learning (SSL) has emerged as a powerful method for extracting features from unlabeled audio data, benefiting many downstream audio tasks. This paper proposes a data-efficient and low-complexity ASC system by leveraging self-supervised audio representations extracted from general-purpose audio datasets. We introduce BEATs, an audio SSL pre-trained model, to extract the general representations from AudioSet. Through extensive experiments, it has been demonstrated that the self-supervised audio representations can help to achieve high ASC accuracy with limited labeled fine-tuning data. Furthermore, we find that ensembling the SSL models fine-tuned with different strategies contributes to a further performance improvement. To meet low-complexity requirements, we use knowledge distillation to transfer the self-supervised knowledge from large teacher models to an efficient student model. The experimental results suggest that the self-supervised teachers effectively improve the classification accuracy of the student model. Our best-performing system obtains an average accuracy of 56.7%.

8/28/2024

TF-SepNet: An Efficient 1D Kernel Design in CNNs for Low-Complexity Acoustic Scene Classification

Yiqiang Cai, Peihong Zhang, Shengchen Li

Recent studies focus on developing efficient systems for acoustic scene classification (ASC) using convolutional neural networks (CNNs), which typically consist of consecutive kernels. This paper highlights the benefits of using separate kernels as a more powerful and efficient design approach in ASC tasks. Inspired by the time-frequency nature of audio signals, we propose TF-SepNet, a CNN architecture that separates the feature processing along the time and frequency dimensions. Features resulted from the separate paths are then merged by channels and directly forwarded to the classifier. Instead of the conventional two dimensional (2D) kernel, TF-SepNet incorporates one dimensional (1D) kernels to reduce the computational costs. Experiments have been conducted using the TAU Urban Acoustic Scene 2022 Mobile development dataset. The results show that TF-SepNet outperforms similar state-of-the-arts that use consecutive kernels. A further investigation reveals that the separate kernels lead to a larger effective receptive field (ERF), which enables TF-SepNet to capture more time-frequency features.

5/30/2024