Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper

Read original: arXiv:2409.13499 - Published 9/24/2024 by Iuliia Thorbecke, Juan Zuluaga-Gomez, Esa'u Villatoro-Tello, Shashi Kumar, Pradeep Rangappa, Sergio Burdisso, Petr Motlicek, Karthik Pandia, Aravind Ganapathiraju

Overview

This paper introduces a novel approach for fast streaming automatic speech recognition (ASR) prototyping by leveraging knowledge distillation from Whisper, a state-of-the-art ASR model.
The method enables rapid development of efficient streaming transducer-based ASR models, which are well-suited for real-time applications.
The authors validate their approach with experiments on 6 languages from CommonVoice.

Plain English Explanation

The paper presents a new technique for quickly building efficient speech recognition models that can work in real-time. This is an important capability for applications like voice assistants or closed captioning.

The key idea is to transfer knowledge from a powerful, but slow, speech recognition model called Whisper to a smaller, faster model that can operate in a streaming fashion. This means the new model can process the audio as it's coming in, without having to wait for the full recording.

The authors show that this knowledge distillation approach allows them to create streaming speech recognition models that perform very well, while being much faster and more lightweight than the original Whisper model. This makes them well-suited for deployment on resource-constrained devices like smartphones or embedded systems.

Overall, this work provides a practical solution for quickly prototyping and deploying high-performance streaming speech recognition systems, which has important applications in a variety of real-world scenarios.

Technical Explanation

The paper presents a knowledge distillation technique for rapidly prototyping efficient streaming transducer-based ASR models. Whisper, a foundational speech model, serves as the teacher: its transcripts are used as pseudo-labels to train a smaller, faster student model from scratch, so no manually transcribed data is required.
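As a rough illustration of this teacher-student setup, the teacher's transcripts can be treated as pseudo-labels and screened before the student is trained on them. The paper's abstract mentions heuristics for filtering hallucinated pseudo-labels, but this summary does not list them, so the two rules below (a speaking-rate limit and a repeated n-gram check) and their thresholds are assumptions chosen purely for illustration. `PseudoLabel`, `looks_hallucinated`, and `filter_pseudo_labels` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PseudoLabel:
    audio_id: str
    duration_s: float  # length of the audio clip in seconds
    text: str          # transcript produced by the teacher model


def max_ngram_repeats(words, n=3):
    """Return how often the most frequent n-gram occurs (0 if too short)."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return max(Counter(ngrams).values()) if ngrams else 0


def looks_hallucinated(pl, max_words_per_s=5.0, max_repeats=2):
    """Illustrative heuristics only; thresholds are assumptions, not the paper's."""
    words = pl.text.lower().split()
    if pl.duration_s <= 0 or not words:
        return True  # unusable clip or empty transcript
    if len(words) / pl.duration_s > max_words_per_s:
        return True  # more words than could plausibly be spoken in this time
    if max_ngram_repeats(words) > max_repeats:
        return True  # the same phrase loops, suggesting a decoding failure
    return False


def filter_pseudo_labels(labels):
    """Keep only the pseudo-labels that pass every heuristic."""
    return [pl for pl in labels if not looks_hallucinated(pl)]
```

In practice the thresholds would be tuned per language and dataset; the point is only that cheap text-level checks can discard obviously unreliable teacher output before training.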

The student is a streaming Transformer-Transducer (TT), an architecture well suited to real-time speech recognition. Because it learns from Whisper's pseudo-labels, it can be trained in a single stage on consumer GPUs, without separate pre-training and fine-tuning steps, and it is far cheaper to run than the teacher.
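One common way to make an attention-based encoder streamable is to let each frame attend only to its own chunk plus a bounded amount of left context. The sketch below builds such a mask in plain Python. The function name, the chunk size, and the left-context setting are illustrative assumptions and are not taken from the paper.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks):
    """Build a boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Each frame sees every frame in its own chunk and in up to
    `left_chunks` preceding chunks, and nothing to the right of its chunk.
    """
    if chunk_size <= 0 or left_chunks < 0:
        raise ValueError("chunk_size must be positive and left_chunks non-negative")
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        first = max(0, chunk - left_chunks) * chunk_size  # oldest visible frame
        last = min(num_frames, (chunk + 1) * chunk_size)  # end of own chunk (exclusive)
        mask.append([first <= j < last for j in range(num_frames)])
    return mask


if __name__ == "__main__":
    # Print the mask for 6 frames, chunks of 2 frames, 1 chunk of left context.
    for row in chunk_attention_mask(6, 2, 1):
        print("".join("x" if allowed else "." for allowed in row))
```

Smaller chunks reduce latency because the model waits for less audio before emitting output, at the cost of less context per decision.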

The authors evaluate their approach on 6 languages from CommonVoice, measuring recognition quality by word error rate (WER). Their ablations cover shallow fusion with n-gram language models, contextual biasing with named entities, chunk-wise decoding for low-latency streaming, and the effect of the teacher model's size on student performance. They report that the student can be trained from scratch without supervised data, even when the pseudo-labels are very noisy.
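For reference, WER is conventionally the word-level edit distance between the recognizer's hypothesis and the reference transcript (substitutions plus deletions plus insertions), divided by the number of reference words. The following is a minimal sketch of that computation; `word_error_rate` is a hypothetical helper, and it applies no text normalization beyond whitespace splitting, which a real evaluation would usually add.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" gives one substitution and one deletion over six reference words, a WER of 2/6. Because insertions count as errors, WER can exceed 1.0.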

Critical Analysis

The paper presents a practical and effective solution for rapidly prototyping streaming ASR models. The knowledge distillation approach is well-motivated and the experimental results are compelling.

However, the paper does not address potential limitations of the technique, such as the generalization of the student models to unseen domains or the robustness of the approach to noisy or accented speech.

Additionally, the paper could have explored the student models' architectural choices and hyperparameter tuning in more depth to further improve their performance and efficiency.

Overall, this work makes an important contribution to the field of real-time speech recognition, but there are opportunities for future research to address some of the unresolved challenges.

Conclusion

This paper introduces a novel knowledge distillation technique for fast prototyping of streaming transducer-based ASR models. By leveraging the powerful Whisper model as a teacher, the authors demonstrate how to rapidly develop efficient student models that can operate in real-time with high accuracy.

The work has significant practical applications in areas such as voice assistants, closed captioning, and embedded speech recognition systems, where low latency and computational efficiency are crucial. The insights and methodology presented in this paper can inspire further advancements in the field of streaming ASR, ultimately leading to more accessible and ubiquitous speech recognition technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper

Iuliia Thorbecke, Juan Zuluaga-Gomez, Esa'u Villatoro-Tello, Shashi Kumar, Pradeep Rangappa, Sergio Burdisso, Petr Motlicek, Karthik Pandia, Aravind Ganapathiraju

The training of automatic speech recognition (ASR) with little to no supervised data remains an open question. In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch in consumer and accessible GPUs in their entirety with pseudo-labeled (PL) speech from foundational speech models (FSM). This allows training a robust ASR model just in one stage and does not require large data and computational budget compared to the two-step scenario with pre-training and fine-tuning. We perform a comprehensive ablation on different aspects of PL-based streaming TT models such as the impact of (1) shallow fusion of n-gram LMs, (2) contextual biasing with named entities, (3) chunk-wise decoding for low-latency streaming applications, and (4) TT overall performance as the function of the FSM size. Our results demonstrate that TT can be trained from scratch without supervised data, even with very noisy PLs. We validate the proposed framework on 6 languages from CommonVoice and propose multiple heuristics to filter out hallucinated PLs.

9/24/2024

XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models

Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Esa'u Villatoro-Tello, Iuliia Nigmatulina, Petr Motlicek, Manjunath K E, Aravind Ganapathiraju

Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data for training. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as encoder in transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER.

7/8/2024

Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study

Peikun Chen, Sining Sun, Changhao Shan, Qing Yang, Lei Xie

Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire speech utterance is needed during decoding. Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling abilities. While achieving streaming speech recognition, experiments on the AISHELL-1 and -2 datasets demonstrate the competitive performance of our streaming approach with non-streaming decoder-only counterparts.

6/28/2024

Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection

Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, Jian Li

As a robust and large-scale multilingual speech recognition model, Whisper has demonstrated impressive results in many low-resource and out-of-distribution scenarios. However, its encoder-decoder structure hinders its application to streaming speech recognition. In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper's cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. Furthermore, we observe the negative effect of the truncated words at the chunk boundaries on the decoding results and propose an integrate-and-fire-based truncation detection model to address this issue. Experiments on multiple languages and Whisper architectures show that Simul-Whisper achieves an average absolute word error rate degradation of only 1.46% at a chunk size of 1 second, which significantly outperforms the current state-of-the-art baseline.

6/17/2024