A Multitask Training Approach to Enhance Whisper with Contextual Biasing and Open-Vocabulary Keyword Spotting

2309.09552

Published 6/7/2024 by Yuang Li, Min Zhang, Chang Su, Yinglu Li, Xiaosong Qiao, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Shimin Tao, Hao Yang

cs.AI cs.CL

Abstract

The recognition of rare named entities, such as personal names and terminologies, is challenging for automatic speech recognition (ASR) systems, especially when they are not frequently observed in the training data. In this paper, we introduce keyword spotting enhanced Whisper (KWS-Whisper), a novel ASR system that leverages the Whisper model and performs open-vocabulary keyword spotting (OV-KWS) on the hidden states of the Whisper encoder to recognize user-defined named entities. These entities serve as prompts for the Whisper decoder. To optimize the model, we propose a multitask training approach that learns OV-KWS and contextual-ASR tasks. We evaluate our approach on Chinese Aishell hot word subsets and two internal code-switching test sets and show that it significantly improves the entity recall compared to the original Whisper model. Moreover, we demonstrate that the OV-KWS can be a plug-and-play module to enhance the ASR error correction methods and frozen Whisper models.

Overview

Recognizing rare named entities like personal names and technical terms is challenging for automatic speech recognition (ASR) systems, especially when they are not common in the training data.
The paper introduces KWS-Whisper, a novel ASR system that leverages the Whisper model and performs open-vocabulary keyword spotting (OV-KWS) to recognize user-defined named entities, which then serve as prompts for the Whisper decoder.
The authors propose a multitask training approach to optimize the model, learning both OV-KWS and contextual-ASR tasks.
The system is evaluated on Chinese Aishell hot word subsets and two internal code-switching test sets, where it significantly improves entity recall over the original Whisper model.
The OV-KWS component can also be used as a plug-and-play module to enhance ASR error correction methods and frozen Whisper models.

Plain English Explanation

Automatic speech recognition (ASR) systems, like the ones used in voice assistants, often struggle to accurately recognize rare or specialized terms, such as people's names or technical jargon. This is because these systems are typically trained on more common speech patterns and vocabulary, so they don't perform as well when encountering less frequent words.

To address this issue, the researchers developed a new ASR system called KWS-Whisper. This system builds on the Whisper ASR model, but adds a keyword spotting component that can recognize user-defined named entities, such as people's names or specialized terms. These recognized entities then serve as cues to help the Whisper model better understand and transcribe the full speech.

To train the system, the researchers used a multitask learning approach, where the model was tasked with both identifying the keywords and transcribing the overall speech. This helped the system learn to better leverage the keyword information to improve the accuracy of the full speech transcription.

When tested on datasets with rare named entities, the KWS-Whisper system significantly outperformed the original Whisper model in correctly recognizing those specialized terms. The researchers also found that the keyword spotting component could be easily integrated with other ASR error correction methods or even used to enhance pre-trained Whisper models.

Overall, this research demonstrates a promising approach to improving the performance of ASR systems on rare and specialized vocabulary.

Technical Explanation

The paper introduces a novel automatic speech recognition (ASR) system called Keyword Spotting enhanced Whisper (KWS-Whisper), which leverages the Whisper model and performs open-vocabulary keyword spotting (OV-KWS) to recognize user-defined named entities.

The key components of the KWS-Whisper system are:

Whisper Encoder: The Whisper encoder is used to generate hidden representations of the input speech.
Open-Vocabulary Keyword Spotting (OV-KWS): The OV-KWS module performs keyword spotting on the Whisper encoder's hidden states to detect user-defined named entities.
Whisper Decoder: The recognized named entities from the OV-KWS module are used as prompts to guide the Whisper decoder in transcribing the full speech.
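
The flow through the three components above can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the summary does not say how OV-KWS matches keywords against encoder hidden states, so the sketch swaps in a simple cosine-similarity matcher. The names spot_keywords and build_prompt, the 0.8 threshold, and the three-dimensional vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def spot_keywords(frames, keyword_embeddings, threshold=0.8):
    """Return keywords whose embedding matches at least one encoder frame.

    Assumption: a stand-in matcher. The real OV-KWS module is a trained
    network whose design is not described in this summary.
    """
    detected = []
    for keyword, embedding in keyword_embeddings.items():
        best = max((cosine(frame, embedding) for frame in frames), default=0.0)
        if best >= threshold:
            detected.append(keyword)
    return detected

def build_prompt(detected):
    """Join detected entities into a text prompt for the decoder (hypothetical format)."""
    return ", ".join(detected)

# Toy stand-ins for encoder hidden states (one vector per audio frame)
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Toy stand-ins for embeddings of the user-defined entity list
keyword_embeddings = {
    "Aishell": [0.9, 0.1, 0.0],
    "Whisper": [0.0, 0.0, 1.0],
}

detected = spot_keywords(frames, keyword_embeddings)
prompt = build_prompt(detected)
print(prompt)
```

Only entities that pass the spotting step reach the prompt, so the decoder is biased toward a short, relevant list rather than the user's whole vocabulary.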

To optimize the KWS-Whisper model, the authors propose a multitask training approach that jointly learns the OV-KWS and contextual-ASR tasks. This helps the model better leverage the keyword information to improve the overall speech transcription accuracy.
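
One common way to train two tasks jointly is to minimize a weighted sum of their losses. The summary does not give the paper's loss function, so the sketch below is an assumption: it pairs a binary cross-entropy term for keyword detection with a token-level cross-entropy term for transcription, and the kws_weight value of 0.5 is hypothetical.

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy over keyword present/absent decisions."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def token_cross_entropy(token_probs):
    """Mean negative log-likelihood the model assigns to the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def multitask_loss(kws_probs, kws_labels, asr_token_probs, kws_weight=0.5):
    """Weighted sum of the two task losses (the weighting scheme is an assumption)."""
    l_kws = binary_cross_entropy(kws_probs, kws_labels)
    l_asr = token_cross_entropy(asr_token_probs)
    return l_asr + kws_weight * l_kws

# Toy example: one keyword decision and one reference token, both at p = 0.5
loss = multitask_loss(kws_probs=[0.5], kws_labels=[1], asr_token_probs=[0.5])
print(loss)
```

Because both terms are backpropagated together, the transcription task sees keyword prompts during training and can learn to use them.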

The system is evaluated on two groups of test sets:

Chinese Aishell hot word subsets: This dataset contains rare named entities, such as personal names and technical terminologies.
Two internal code-switching test sets: These datasets include mixed-language speech with code-switching between Chinese and English.

The results show that the KWS-Whisper system significantly outperforms the original Whisper model in terms of entity recall on these challenging datasets. Additionally, the authors demonstrate that the OV-KWS component can be used as a plug-and-play module to enhance ASR error correction methods and even to improve the performance of frozen Whisper models.
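
Entity recall can be read as the fraction of reference entities that appear in the system's transcripts. The summary does not describe the paper's exact scoring rules, so the following is a minimal sketch that assumes plain substring matching. The function name entity_recall and the sample sentences are hypothetical.

```python
def entity_recall(samples):
    """Fraction of reference entities found verbatim in the hypotheses.

    samples: list of (hypothesis_text, reference_entities) pairs.
    Assumption: substring matching; real scoring may normalize text first.
    """
    hits = 0
    total = 0
    for hypothesis, entities in samples:
        for entity in entities:
            total += 1
            if entity in hypothesis:
                hits += 1
    return hits / total if total else 0.0

samples = [
    ("please call Zhang Wei tomorrow", ["Zhang Wei"]),
    ("the meeting is with lee na", ["Li Na"]),  # entity misrecognized
    ("open KWS-Whisper and Aishell", ["KWS-Whisper", "Aishell"]),
]
recall = entity_recall(samples)
print(recall)
```

Unlike word or character error rate, this metric ignores errors on common words, which is why it isolates the effect of contextual biasing.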

Critical Analysis

The paper presents a compelling approach to improving automatic speech recognition (ASR) systems' ability to handle rare named entities, which is an important and practical challenge in real-world applications. The authors' use of a multitask learning strategy to jointly optimize the keyword spotting and speech transcription tasks is a well-designed solution that leverages the complementary strengths of these two components.

One potential limitation of the study is the relatively narrow evaluation scope, which focuses primarily on Chinese datasets. While the authors do mention testing on internal code-switching datasets, it would be valuable to see the system's performance on a more diverse range of languages and speech patterns to better understand its broader applicability.

Additionally, the paper does not provide much insight into the computational efficiency or inference latency of the KWS-Whisper system, which could be an important consideration for real-time ASR applications. A more in-depth analysis of the model's resource usage and scalability would help readers better assess its practical viability.

Overall, the research presented in this paper is a useful contribution to the field of ASR. The authors' use of keyword spotting to enhance a state-of-the-art model like Whisper is a compelling approach that warrants further exploration and validation across a wider range of scenarios.

Conclusion

The KWS-Whisper system introduced in this paper represents a significant advancement in automatic speech recognition (ASR) technology, addressing the critical challenge of accurately recognizing rare named entities and specialized terminology. By leveraging the Whisper model and integrating an open-vocabulary keyword spotting component, the researchers have developed a versatile ASR system that can be easily adapted to user-specific needs and integrated with other error correction methods.

The strong performance of KWS-Whisper on the evaluated datasets, particularly in terms of improved entity recall, demonstrates the potential of this approach to enhance the overall accuracy and robustness of ASR systems. As the use of voice interfaces continues to grow in a wide range of applications, from personal digital assistants to specialized industry tools, the ability to accurately transcribe rare and specialized vocabulary will become increasingly important.

Overall, this research is a step toward narrowing the gap between human and machine speech recognition on rare vocabulary. The authors' use of multitask learning and their modular design open up avenues for further development and integration of the KWS-Whisper system in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Keyword-Guided Adaptation of Automatic Speech Recognition

Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet

Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, some challenges remain, especially in noisy environments and specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition by contextual biasing Whisper-based models. We employ a keyword spotting model that leverages the Whisper encoder representation to dynamically generate prompts for guiding the decoder during the transcription process. We introduce two approaches to effectively steer the decoder towards these prompts: KG-Whisper, which is aimed at fine-tuning the Whisper decoder, and KG-Whisper-PT, which learns a prompt prefix. Our results show a significant improvement in the recognition accuracy of specified keywords and in reducing the overall word error rates. Specifically, in unseen language generalization, we demonstrate an average WER improvement of 5.1% over Whisper.

6/6/2024

eess.AS cs.LG cs.SD

Open vocabulary keyword spotting through transfer learning from speech synthesis

Kesavaraj V, Anil Kumar Vuppala

Identifying keywords in an open-vocabulary context is crucial for personalizing interactions with smart devices. Previous approaches to open vocabulary keyword spotting depend on a shared embedding space created by audio and text encoders. However, these approaches suffer from heterogeneous modality representations (i.e., audio-text mismatch). To address this issue, our proposed framework leverages knowledge acquired from a pre-trained text-to-speech (TTS) system. This knowledge transfer allows for the incorporation of awareness of audio projections into the text representations derived from the text encoder. The performance of the proposed approach is compared with various baseline methods across four different datasets. The robustness of our proposed model is evaluated by assessing its performance across different word lengths and in an Out-of-Vocabulary (OOV) scenario. Additionally, the effectiveness of transfer learning from the TTS system is investigated by analyzing its different intermediate representations. The experimental results indicate that, in the challenging LibriPhrase Hard dataset, the proposed approach outperformed the cross-modality correspondence detector (CMCD) method by a significant improvement of 8.22% in area under the curve (AUC) and 12.56% in equal error rate (EER).

4/19/2024

cs.HC cs.SD eess.AS

Efficient Compression of Multitask Multilingual Speech Models

Thomas Palmeira Ferraz

Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. Despite that, we show that only model-related biases are amplified by quantization, impacting more low-resource languages and smaller models. Searching for a better compression approach, we propose DistilWhisper, an approach that is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages for both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.

5/3/2024

cs.CL cs.AI cs.SD eess.AS

MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting

Zhiqi Ai, Zhiyong Chen, Shugong Xu

In this paper, we propose MM-KWS, a novel approach to user-defined keyword spotting leveraging multi-modal enrollments of text and speech templates. Unlike previous methods that focus solely on either text or speech features, MM-KWS extracts phoneme, text, and speech embeddings from both modalities. These embeddings are then compared with the query speech embedding to detect the target keywords. To ensure the applicability of MM-KWS across diverse languages, we utilize a feature extractor incorporating several multilingual pre-trained models. Subsequently, we validate its effectiveness on Mandarin and English tasks. In addition, we have integrated advanced data augmentation tools for hard case mining to enhance MM-KWS in distinguishing confusable words. Experimental results on the LibriPhrase and WenetPhrase datasets demonstrate that MM-KWS outperforms prior methods significantly.

6/12/2024

eess.AS cs.CL cs.SD