FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation

Read original: arXiv:2406.07676 - Published 6/13/2024 by Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani

Overview

This paper presents a new model called FastAST, which aims to accelerate the Audio Spectrogram Transformer (AST) by introducing token merging and cross-model knowledge distillation.
The proposed FastAST model reduces the computational complexity of the original AST model while maintaining its performance on audio classification tasks.
The authors report audio classification experiments in which FastAST increases throughput with minimal impact on accuracy.

Plain English Explanation

The paper introduces a new model called FastAST, which is designed to speed up the Audio Spectrogram Transformer (AST) model while maintaining its performance on audio classification tasks. The AST model is a powerful tool for processing audio data, but it can be computationally expensive, especially for large-scale applications.

The researchers behind FastAST have developed two key techniques to address this issue:

Token Merging: The original AST model processes audio by splitting its spectrogram into small patches, called "tokens". FastAST merges tokens that are similar to one another, reducing the number of tokens that need to be processed, which saves computational resources.
Cross-Model Knowledge Distillation: The researchers also trained FastAST using the knowledge gained from the original AST model. This approach, known as knowledge distillation, allows FastAST to learn from the expertise of the more complex AST model, while being more efficient in its own operations.

By implementing these techniques, the researchers were able to create a model (FastAST) that is faster and more efficient than the original AST, with little or no loss in performance on audio classification tasks. This could be particularly useful for applications that require real-time audio processing, where computational efficiency is crucial.

Technical Explanation

The paper introduces the FastAST model, which is designed to accelerate the Audio Spectrogram Transformer (AST) architecture. The key innovations in FastAST are:

Token Merging: The original AST model splits the input spectrogram into patches and treats each patch as a token. FastAST integrates Token Merging (ToMe), a module that groups similar tokens and merges them, so the later transformer layers process a shorter sequence. Because the cost of self-attention grows with the square of the sequence length, fewer tokens means less computation.
Cross-Model Knowledge Distillation: The researchers also trained FastAST using the knowledge gained from the original AST model. This approach, known as knowledge distillation, lets the faster model learn from the predictions of a more expensive one. The authors integrate Cross-Model Knowledge Distillation (CMKD) into the framework specifically to offset the accuracy lost to token merging.
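
To make the token merging idea concrete, here is a minimal NumPy sketch of one merging step. It is a simplified illustration and not the authors' implementation: the function name `merge_tokens`, the use of raw token features for similarity, and the plain averaging of merged tokens are all assumptions made for this example. The tokens are split into two alternating sets, each token in the first set is matched to its most similar token in the second set, and the `r` best-matched tokens are averaged into their partners.

```python
import numpy as np


def merge_tokens(x, r):
    """Merge r tokens of x (shape N x D) into similar tokens; returns (N - r) x D.

    Hypothetical helper for illustration only. Similarity is computed on the
    token features themselves, which is an assumption of this sketch.
    """
    x = np.asarray(x, dtype=float)
    a, b = x[0::2], x[1::2]          # alternating split into two token sets
    r = min(r, a.shape[0])           # cannot merge more tokens than set A holds
    if r <= 0 or b.shape[0] == 0:
        return x.copy()

    # Cosine similarity between every token in A and every token in B.
    eps = 1e-12
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    sim = a_n @ b_n.T

    best_b = sim.argmax(axis=1)      # most similar B token for each A token
    best_score = sim.max(axis=1)

    order = np.argsort(-best_score)  # A tokens sorted by match quality
    merged_idx = order[:r]           # the r best-matched A tokens get merged
    kept_idx = np.sort(order[r:])    # remaining A tokens stay, in original order

    # Average each merged A token into its matched B token.
    sums = b.copy()
    counts = np.ones(b.shape[0])
    np.add.at(sums, best_b[merged_idx], a[merged_idx])
    np.add.at(counts, best_b[merged_idx], 1.0)
    merged_b = sums / counts[:, None]

    return np.concatenate([a[kept_idx], merged_b], axis=0)
```

Applying a step like this inside each transformer block with a fixed `r` shrinks the sequence by `r` tokens per block, which is where the savings in attention cost come from. A real implementation would also need to decide how to treat special tokens such as a class token, which this sketch ignores.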

The paper evaluates FastAST on audio classification. According to the abstract, token merging on its own increases throughput with minimal impact on accuracy, and adding CMKD on top of token merging yields accuracy above the original AST while keeping inference faster.
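
The knowledge distillation objective discussed in this section can be illustrated with a generic loss function. The sketch below is the standard temperature-scaled formulation for single-label classification, not necessarily the exact CMKD loss used in the paper; the function names and the `temperature` and `alpha` parameters are assumptions chosen for this example. The student is trained on a weighted sum of the usual cross-entropy against the labels and a KL-divergence term that pulls its softened predictions toward the teacher's.

```python
import numpy as np


def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic distillation loss (illustrative; not the paper's exact objective).

    student_logits, teacher_logits: arrays of shape (batch, classes)
    labels: integer class indices of shape (batch,)
    alpha: weight on the teacher term; (1 - alpha) goes to the label term
    """
    # Hard-label term: cross-entropy between student predictions and labels.
    log_p = log_softmax(student_logits)
    hard = -log_p[np.arange(len(labels)), labels].mean()

    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    log_q_student = log_softmax(student_logits / temperature)
    log_q_teacher = log_softmax(teacher_logits / temperature)
    q_teacher = np.exp(log_q_teacher)
    soft = (q_teacher * (log_q_teacher - log_q_student)).sum(axis=-1).mean()

    # The temperature**2 factor keeps the soft term's gradient scale comparable.
    return (1.0 - alpha) * hard + alpha * temperature ** 2 * soft
```

When the student's logits already match the teacher's, the soft term is zero and only the label term remains. Multi-label audio tagging would typically use per-class sigmoid outputs instead of a softmax, which this sketch does not cover.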

Critical Analysis

The paper presents a well-designed and thorough approach to accelerating the Audio Spectrogram Transformer (AST) model. The token merging and cross-model knowledge distillation techniques used in FastAST are well-justified and demonstrate impressive results.

One potential caveat is that the effectiveness of the token merging approach may depend on the specific characteristics of the audio data being processed. The authors do not provide a detailed analysis of how the token merging performs across different types of audio data, which could be an area for further investigation.

Additionally, the authors mention that the FastAST model may be more suitable for deployment in resource-constrained environments, such as on-device applications. However, they do not provide a comprehensive analysis of the model's deployment feasibility or potential challenges that may arise in such scenarios.

It would also be interesting to see how FastAST compares to other approaches for accelerating transformer-based models, such as Streaming Audio Transformers for Online Audio Tagging or Non-Autoregressive Generation: A Framework for Efficient and Flexible End-to-End Audio Synthesis. This could provide a more comprehensive understanding of the trade-offs and relative strengths of different acceleration techniques.

Overall, the FastAST model presents a promising approach to improving the efficiency of audio classification tasks, and the techniques introduced in this paper could have broader implications for the optimization of transformer-based models in various domains.

Conclusion

The FastAST model proposed in this paper is a step toward a faster Audio Spectrogram Transformer (AST). By combining token merging with cross-model knowledge distillation, the researchers obtain a model that, by their account, matches or improves on the accuracy of the original AST while reducing the computation required at inference time.

The potential applications of FastAST are wide-ranging, from real-time audio processing such as online audio tagging to few-shot class-incremental audio classification. The techniques introduced in this paper could also inform the optimization of transformer-based models in other domains, contributing to ongoing efforts to make these models more practical to deploy.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FastAST: Accelerating Audio Spectrogram Transformer via Token Merging and Cross-Model Knowledge Distillation

Swarup Ranjan Behera, Abhishek Dhiman, Karthik Gowda, Aalekhya Satya Narayani

Audio classification models, particularly the Audio Spectrogram Transformer (AST), play a crucial role in efficient audio analysis. However, optimizing their efficiency without compromising accuracy remains a challenge. In this paper, we introduce FastAST, a framework that integrates Token Merging (ToMe) into the AST framework. FastAST enhances inference speed without requiring extensive retraining by merging similar tokens in audio spectrograms. Furthermore, during training, FastAST brings about significant speed improvements. The experiments indicate that FastAST can increase audio classification throughput with minimal impact on accuracy. To mitigate the accuracy impact, we integrate Cross-Model Knowledge Distillation (CMKD) into the FastAST framework. Integrating ToMe and CMKD into AST results in improved accuracy compared to AST while maintaining faster inference speeds. FastAST represents a step towards real-time, resource-efficient audio analysis.

6/13/2024

ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions

Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

Transformers have rapidly overtaken CNN-based architectures as the new standard in audio classification. Transformer-based models, such as the Audio Spectrogram Transformers (AST), also inherit the fixed-size input paradigm from CNNs. However, this leads to performance degradation for ASTs in the inference when input lengths vary from the training. This paper introduces an approach that enables the use of variable-length audio inputs with AST models during both training and inference. By employing sequence packing, our method ElasticAST, accommodates any audio length during training, thereby offering flexibility across all lengths and resolutions at the inference. This flexibility allows ElasticAST to maintain evaluation capabilities at various lengths or resolutions and achieve similar performance to standard ASTs trained at specific lengths or resolutions. Moreover, experiments demonstrate ElasticAST's better performance when trained and evaluated on native-length audio datasets.

7/12/2024

Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

This technical report describes the CP-JKU team's submission for Task 4 Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels of the DCASE 24 Challenge. We fine-tune three large Audio Spectrogram Transformers, PaSST, BEATs, and ATST, on the joint DESED and MAESTRO datasets in a two-stage training procedure. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the large pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of all three fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, boosting single-model performance substantially. Additionally, we pre-train PaSST and ATST on the subset of AudioSet that comes with strong temporal labels, before fine-tuning them on the Task 4 datasets.

8/6/2024

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

Umberto Cappellazzo, Daniele Falavigna, Alessio Brutti, Mirco Ravanelli

Parameter-efficient transfer learning (PETL) methods have emerged as a solid alternative to the standard full fine-tuning approach. They only train a few extra parameters for each downstream task, without sacrificing performance and dispensing with the issue of storing a copy of the pre-trained model for each task. For audio classification tasks, the Audio Spectrogram Transformer (AST) model shows impressive results. However, surprisingly, how to efficiently adapt it to several downstream tasks has not been tackled before. In this paper, we bridge this gap and present a detailed investigation of common PETL methods for the adaptation of the AST model to audio/speech tasks. Furthermore, we propose a new adapter design that exploits the convolution module of the Conformer model, leading to superior performance over the standard PETL approaches and surpassing or achieving performance parity with full fine-tuning by updating only 0.29% of the parameters. Finally, we provide ablation studies revealing that our proposed adapter: 1) proves to be effective in few-shot efficient transfer learning, 2) attains optimal results regardless of the amount of the allocated parameters, and 3) can be applied to other pre-trained models.

7/16/2024