Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation

Read original: arXiv:2407.10303 - Published 7/16/2024 by Ruizhe Huang, Mahsa Yarmohammadi, Sanjeev Khudanpur, Daniel Povey

Overview

This paper proposes two techniques to improve contextual speech recognition models, which use additional context (e.g., contact lists or user-specified vocabulary) to better recognize rare words and named entities in automatic speech recognition (ASR).
The key ideas are "early context injection," which feeds context into the encoder at an early stage rather than only at its last layers, and "text perturbation," which replaces words in the training reference transcription with alternative spellings so that the model learns to rely on the context.
On LibriSpeech, the two techniques together reduce the rare word error rate by 60% relative to no biasing and by 25% relative to shallow fusion; the authors also report improvements on SPGISpeech and a real-world dataset, ConEC.

Plain English Explanation

Automatic speech recognition (ASR) systems, which convert spoken words into text, often struggle with rare words and names. They can do better when given "context": extra information such as a contact list or a user-specified vocabulary. For example, if your contact list contains an unusual name, the ASR system can use that list to transcribe the name correctly when you say it.

This paper explores two ways to enhance this "contextual" ASR. First, the authors propose "early context injection": feeding the contextual information into the model's encoder at an early stage, rather than only at its last layers. This allows the model to use that context throughout its processing of the audio.

Second, the researchers use "text perturbation": during training, they replace words in the reference transcription with alternative spellings. The aim is to make the model learn to rely on the supplied context to get the spelling right, instead of ignoring it.
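As a rough illustration of the perturbation idea, here is a minimal sketch. It assumes that a table of alternative spellings is already available and that the swapped spelling is also placed in the context list; the function name, its parameters, and the example words are hypothetical and not taken from the paper.

```python
import random


def perturb_transcript(words, bias_list, alt_spellings, prob=0.5, rng=None):
    """Swap some context words for an alternative spelling (hypothetical helper).

    words         -- reference transcript as a list of tokens
    bias_list     -- context words supplied to the model
    alt_spellings -- dict mapping a word to its alternative spellings (assumed given)
    Returns (perturbed transcript, perturbed context list).
    """
    if rng is None:
        rng = random.Random()
    swaps = {}
    for word in bias_list:
        candidates = alt_spellings.get(word)
        if candidates and rng.random() < prob:
            swaps[word] = rng.choice(candidates)
    # Apply the same swap to the transcript and to the context list, so the
    # new spelling can only be recovered by reading it from the context.
    new_words = [swaps.get(w, w) for w in words]
    new_bias = [swaps.get(w, w) for w in bias_list]
    return new_words, new_bias


transcript = "please call katelyn tomorrow".split()
new_words, new_bias = perturb_transcript(
    transcript, ["katelyn"], {"katelyn": ["caitlin"]}, prob=1.0
)
print(" ".join(new_words), new_bias)
```

With `prob=1.0` every context word that has an alternative is swapped; a lower value would leave part of the training data unperturbed.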

By combining these techniques, the paper reports a 60% relative reduction in rare word error rate on LibriSpeech compared with no biasing, and a 25% relative reduction compared with shallow fusion. This could lead to more robust and reliable speech recognition systems, with applications in areas like voice assistants, transcription, and human-computer interaction.

Technical Explanation

The paper focuses on improving contextual automatic speech recognition (ASR), where additional context such as contact lists or user-specified vocabulary is used to better recognize rare words and named entities. The authors propose two key techniques:

Early Context Injection: Rather than injecting context only at the last layers of the encoder, the authors inject it at an early stage, allowing it to influence the internal speech representations earlier in the model.
Text Perturbation: To force the model to use the context during training, the authors perturb the reference transcription with alternative spellings, so that the model learns to rely on the context to make correct predictions.
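The difference between late and early injection can be sketched with a toy encoder. This is an illustration under stated assumptions, not the authors' implementation: it assumes the context is applied through a cross-attention step with a residual connection, and every name here (`bias_with_context`, `encoder`, `inject_at`) is hypothetical. Plain Python lists stand in for tensors to keep the example dependency-free.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def bias_with_context(frames, context):
    """Assumed biasing step: each frame attends over the context embeddings."""
    scale = math.sqrt(len(frames[0]))
    scores = matmul(frames, [list(col) for col in zip(*context)])  # (T, N)
    attn = [softmax([s / scale for s in row]) for row in scores]
    mixed = matmul(attn, context)                                   # (T, d)
    # Residual connection: add the attended context to each frame.
    return [[f + m for f, m in zip(fr, mr)] for fr, mr in zip(frames, mixed)]


def encoder(frames, context, layer_weights, inject_at):
    """Toy encoder: linear + tanh layers, with biasing after the chosen layers."""
    h = frames
    for i, w in enumerate(layer_weights):
        h = [[math.tanh(v) for v in row] for row in matmul(h, w)]
        if i in inject_at:
            h = bias_with_context(h, context)
    return h


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


frames = rand_matrix(5, 4)    # 5 audio frames, 4-dim features
context = rand_matrix(3, 4)   # 3 context phrases, 4-dim embeddings
weights = [rand_matrix(4, 4) for _ in range(3)]

late = encoder(frames, context, weights, inject_at={2})         # last layer only
early = encoder(frames, context, weights, inject_at={0, 1, 2})  # from the first layer
print(len(early), len(early[0]))
```

Only the `inject_at` set changes between the two calls; in the early variant the context already shapes the representation that later layers operate on.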

The proposed methods are evaluated on LibriSpeech, SPGISpeech, and a real-world dataset, ConEC. On LibriSpeech, the combination of early context injection and text perturbation reduces the rare word error rate by 60% relative to no biasing and by 25% relative to shallow fusion, which the authors describe as a new state of the art. On SPGISpeech and ConEC, the techniques also improve over the baselines.
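Results of this kind are usually quoted as relative reductions in error rate. As a small illustration of the arithmetic, the snippet below computes one; the error rates in it are made-up numbers chosen for the example, not figures from the paper.

```python
def relative_reduction(baseline_error, new_error):
    """Fraction of the baseline error rate that the new system removes."""
    if baseline_error <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates (in percent), for illustration only.
baseline = 10.0
improved = 4.0
print(f"{relative_reduction(baseline, improved):.0%} relative reduction")
```

A relative reduction is measured against the baseline's own error rate, so the same absolute gain looks larger when the baseline error is already small.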

Critical Analysis

The paper presents a compelling approach to enhancing contextual ASR, with a clear technical implementation and thorough experimental evaluation. However, there are a few potential limitations and areas for further research:

The abstract does not say how the alternative spellings used for text perturbation are generated. Perturbation schemes that produce more realistic confusions could potentially lead to even greater performance gains.
The evaluation covers LibriSpeech, SPGISpeech, and ConEC. It would be interesting to see how the proposed techniques perform in other contextual ASR applications, such as meeting transcription or voice-based assistants.
While the paper demonstrates reductions in rare word error rate, a more detailed analysis of the model's behavior and of the specific types of errors that context helps correct could yield valuable insights for further improving contextual ASR systems.

Overall, the techniques proposed in this paper represent a promising step forward in enhancing the performance of contextual speech recognition models, with potential applications in a variety of real-world scenarios.

Conclusion

This paper presents two simple techniques, "early context injection" and "text perturbation," to improve context-aware automatic speech recognition (ASR) models. By injecting context into the encoder at an early stage and perturbing the reference transcription with alternative spellings during training, the authors reduce the rare word error rate on LibriSpeech by 60% relative to no biasing and by 25% relative to shallow fusion, with further improvements on SPGISpeech and ConEC.

These findings have important implications for the development of more robust and reliable speech recognition systems, with applications ranging from voice assistants and meeting transcription to human-computer interaction. The paper's contributions represent a valuable advancement in the field of contextual ASR, and the proposed methods could be further refined and extended to handle a broader range of real-world speech recognition scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation

Ruizhe Huang, Mahsa Yarmohammadi, Sanjeev Khudanpur, Daniel Povey

Existing research suggests that automatic speech recognition (ASR) models can benefit from additional contexts (e.g., contact lists, user specified vocabulary). Rare words and named entities can be better recognized with contexts. In this work, we propose two simple yet effective techniques to improve context-aware ASR models. First, we inject contexts into the encoders at an early stage instead of merely at their last layers. Second, to enforce the model to leverage the contexts during training, we perturb the reference transcription with alternative spellings so that the model learns to rely on the contexts to make correct predictions. On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relatively compared to no biasing and shallow fusion, making the new state-of-the-art performance. On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.

7/16/2024

Text Injection for Neural Contextual Biasing

Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran

Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and its biasing component. Unpaired text is converted into speech-like representations and used to guide the model's attention towards relevant bias phrases. Moreover, we introduce a contextual text-injected (CTI) minimum word error rate (MWER) training, which minimizes the expected WER caused by contextual biasing when unpaired text is injected into the model. Experiments show that CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model. CTI-MWER provides a further relative improvement of 23.5%.

6/12/2024

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediate layers may better align text tokens or audio frames with the desired objectives. Our proposed intermediate biasing loss brings more regularization and contextualization to the network. Our method outperforms a conventional contextual biasing baseline on the LibriSpeech corpus, achieving a relative improvement of 22.5% in biased word error rate (B-WER) and up to 44% compared to the non-contextual baseline with a biasing list size of 100. Moreover, employing RNN-transducer-driven joint decoding further reduces the unbiased word error rate (U-WER), resulting in a more robust network.

6/26/2024

Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR

Zelin Wu, Gan Song, Christopher Li, Pat Rondon, Zhong Meng, Xavier Velez, Weiran Wang, Diamantino Caseiro, Golan Pundak, Tsendsuren Munkhdalai, Angad Chandorkar, Rohit Prabhavalkar

Contextual biasing enables speech recognizers to transcribe important phrases in the speaker's context, such as contact names, even if they are rare in, or absent from, the training data. Attention-based biasing is a leading approach which allows for full end-to-end cotraining of the recognizer and biasing system and requires no separate inference-time components. Such biasers typically consist of a context encoder; followed by a context filter which narrows down the context to apply, improving per-step inference time; and, finally, context application via cross attention. Though much work has gone into optimizing per-frame performance, the context encoder is at least as important: recognition cannot begin before context encoding ends. Here, we show the lightweight phrase selection pass can be moved before context encoding, resulting in a speedup of up to 16.1 times and enabling biasing to scale to 20K phrases with a maximum pre-decoding delay under 33ms. With the addition of phrase- and wordpiece-level cross-entropy losses, our technique also achieves up to a 37.5% relative WER reduction over the baseline without the losses and lightweight phrase selection pass.

4/24/2024