EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

2404.19214

Published 5/1/2024 by Jianzong Wang, Ziqi Liang, Xulong Zhang, Ning Cheng, Jing Xiao

🗣️

Abstract

In recent years, Transformer networks have shown remarkable performance in speech recognition tasks. However, their deployment poses challenges due to high computational and storage resource requirements. To address this issue, a lightweight model called EfficientASR is proposed in this paper, aiming to enhance the versatility of Transformer models. EfficientASR employs two primary modules: Shared Residual Multi-Head Attention (SRMHA) and Chunk-Level Feedforward Networks (CFFN). The SRMHA module effectively reduces redundant computations in the network, while the CFFN module captures spatial knowledge and reduces the number of parameters. The effectiveness of the EfficientASR model is validated on two public datasets, namely Aishell-1 and HKUST. Experimental results demonstrate a 36% reduction in parameters compared to the baseline Transformer network, along with improvements of 0.3% and 0.2% in Character Error Rate (CER) on the Aishell-1 and HKUST datasets, respectively.

Create account to get full access

Overview

Transformer networks have shown impressive performance in speech recognition tasks, but their deployment can be challenging due to high computational and storage requirements.
This paper proposes a lightweight model called EfficientASR to enhance the versatility of Transformer models.
EfficientASR utilizes two key modules: Shared Residual Multi-Head Attention (SRMHA) and Chunk-Level Feedforward Networks (CFFN).
The model is evaluated on the Aishell-1 and HKUST datasets, demonstrating a 36% reduction in parameters and improvements in Character Error Rate (CER) compared to the baseline Transformer network.

Plain English Explanation

Transformer networks have proven to be highly effective for speech recognition tasks, but their complexity can make them challenging to deploy in real-world applications. The EfficientASR model aims to address this issue by streamlining the Transformer architecture while maintaining its performance.

The key innovations in EfficientASR are the Shared Residual Multi-Head Attention (SRMHA) module and the Chunk-Level Feedforward Networks (CFFN) module. The SRMHA module reduces redundant computations in the network, while the CFFN module captures spatial knowledge and reduces the number of parameters.

By incorporating these modules, the EfficientASR model is able to achieve a 36% reduction in parameters compared to the standard Transformer network, along with small improvements in Character Error Rate (CER) on two widely used speech recognition datasets. This suggests that the EfficientASR model can provide a more efficient and versatile alternative to the standard Transformer approach, potentially enabling its deployment in a wider range of speech recognition applications.

Technical Explanation

The paper proposes the EfficientASR model, which builds upon the success of Transformer networks in speech recognition tasks while addressing their high computational and storage requirements. EfficientASR employs two key modules to achieve this:

Shared Residual Multi-Head Attention (SRMHA): This module effectively reduces redundant computations in the Transformer network by sharing attention weights across multiple heads. This helps to decrease the overall computational load without significantly impacting performance.
Chunk-Level Feedforward Networks (CFFN): The CFFN module captures spatial knowledge and reduces the number of parameters in the model. By processing the input in smaller "chunks" rather than as a single sequence, the CFFN module can learn more efficient representations while maintaining the model's ability to capture long-range dependencies.

The effectiveness of the EfficientASR model is evaluated on two public speech recognition datasets: Aishell-1 and HKUST. The results demonstrate a 36% reduction in the number of model parameters compared to the baseline Transformer network, while also achieving improvements of 0.3% and 0.2% in Character Error Rate (CER) on the respective datasets.

Critical Analysis

The paper provides a compelling solution to the challenge of deploying Transformer networks in real-world speech recognition applications. By introducing the SRMHA and CFFN modules, the authors have successfully reduced the computational and storage requirements of the model without significantly compromising its performance.

However, the paper does not address the potential limitations of the EfficientASR model. For example, it would be valuable to understand how the model's performance scales with larger datasets or more complex speech recognition tasks, as well as its robustness to different types of speech data and environmental conditions.

Additionally, the paper could have provided more insights into the trade-offs between the model's efficiency and its ability to capture long-range dependencies in speech data. It would be interesting to see how the LoRAP and SSHR approaches, which explore differentiated transformer sub-layers and self-supervised hierarchical representations, could potentially be integrated with the EfficientASR model to further enhance its performance and versatility.

Conclusion

The EfficientASR model presented in this paper offers a promising solution to the challenge of deploying Transformer networks in speech recognition applications. By introducing the SRMHA and CFFN modules, the model achieves a significant reduction in parameters while maintaining or even improving upon the performance of the baseline Transformer network.

These findings suggest that the EfficientASR model could have important implications for the wider adoption of Transformer-based speech recognition systems, particularly in resource-constrained environments or applications that require low-latency processing. As the field of speech recognition continues to evolve, the insights and techniques presented in this paper may inspire further research into efficient and versatile neural network architectures for this domain.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Single-Step Non-Autoregressive Automatic Speech Recognition Architecture with High Accuracy and Inference Speed

Ziyang Zhuang, Chenfeng Miao, Kun Zou, Shuai Gong, Ming Fang, Tao Wei, Zijian Li, Wei Hu, Shaojun Wang, Jing Xiao

Non-autoregressive (NAR) automatic speech recognition (ASR) models predict tokens independently and simultaneously, bringing high inference speed. However, there is still a gap in the accuracy of the NAR models compared to the autoregressive (AR) models. To further narrow the gap between the NAR and AR models, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EfficientASR. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor to learn the alignments for inference. It can be trained end-to-end (E2E) with cross-entropy loss combined with alignment loss. The proposed EfficientASR achieves competitive results on the AISHELL-1 and AISHELL-2 benchmarks compared to the state-of-the-art (SOTA) models. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test dataset, which outperforms the SOTA AR Conformer with about 30x inference speedup.

6/14/2024

cs.SD eess.AS

Efficient infusion of self-supervised representations in Automatic Speech Recognition

Darshan Prabhu, Sai Ganesh Mirishkar, Pankaj Wasnik

Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks. Given the effectiveness of such models, it is advantageous to use them in conventional ASR systems. While some approaches suggest incorporating these models as a trainable encoder or a learnable frontend, training such systems is extremely slow and requires a lot of computation cycles. In this work, we propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model(s) into the ASR architecture, resulting in models that are comparable in size with standard encoder-decoder conformer systems while also avoiding the usage of SSL models during training. Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets compared to baselines. We further provide detailed analysis and ablation studies that demonstrate the effectiveness of our approach.

4/22/2024

cs.CL

One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

Zhaoqing Li, Haoning Xu, Tianzi Wang, Shoukang Hu, Zengrui Jin, Shujie Hu, Jiajun Deng, Mingyu Cui, Mengzhe Geng, Xunying Liu

We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision settings to be simultaneously constructed without the need to train and store individual target systems separately. Experiments consistently demonstrate the multiple ASR systems compressed in a single all-in-one model produced a word error rate (WER) comparable to, or lower by up to 1.01% absolute (6.98% relative) than individually trained systems of equal complexity. A 3.4x overall system compression and training time speed-up was achieved. Maximum model size compression ratios of 12.8x and 3.93x were obtained over the baseline Switchboard-300hr Conformer and LibriSpeech-100hr fine-tuned wav2vec2.0 models, respectively, incurring no statistically significant WER increase.

6/17/2024

cs.SD cs.AI eess.AS

🗣️

Conformer-Based Speech Recognition On Extreme Edge-Computing Devices

Mingbin Xu, Alex Jin, Sicheng Wang, Mu Su, Tim Ng, Henry Mason, Shiyi Han, Zhihong Lei, Yaqiao Deng, Zhen Huang, Mahesh Krishnamoorthy

With increasingly more powerful compute capabilities and resources in today's devices, traditionally compute-intensive automatic speech recognition (ASR) has been moving from the cloud to devices to better protect user privacy. However, it is still challenging to implement on-device ASR on resource-constrained devices, such as smartphones, smart wearables, and other smart home automation devices. In this paper, we propose a series of model architecture adaptions, neural network graph transformations, and numerical optimizations to fit an advanced Conformer based end-to-end streaming ASR system on resource-constrained devices without accuracy degradation. We achieve over 5.26 times faster than realtime (0.19 RTF) speech recognition on smart wearables while minimizing energy consumption and achieving state-of-the-art accuracy. The proposed methods are widely applicable to other transformer-based server-free AI applications. In addition, we provide a complete theory on optimal pre-normalizers that numerically stabilize layer normalization in any Lp-norm using any floating point precision.

5/15/2024

cs.LG cs.PF