Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR

Read original: arXiv:2404.10180 - Published 4/24/2024 by Zelin Wu, Gan Song, Christopher Li, Pat Rondon, Zhong Meng, Xavier Velez, Weiran Wang, Diamantino Caseiro, Golan Pundak, Tsendsuren Munkhdalai, Angad Chandorkar, Rohit Prabhavalkar

Overview

This paper introduces a novel approach called "Deferred NAM" for low-latency top-K context injection in non-streaming Automatic Speech Recognition (ASR) systems.
Deferred NAM addresses the challenge of efficiently incorporating contextual information into ASR models without incurring significant latency.
The key idea is to defer context encoding until after a lightweight phrase selection pass, so that only the top-K most relevant phrases are encoded and recognition can begin sooner.

Plain English Explanation

Automatic speech recognition (ASR) systems are designed to convert spoken language into text. These systems often rely on contextual information, such as a user's contact names, to correctly transcribe phrases that are rare in, or absent from, the training data. However, encoding this context can be computationally expensive and adds latency, because recognition cannot begin until context encoding has finished. That delay is undesirable for latency-sensitive applications.

The Deferred NAM approach presented in this paper aims to solve this problem by "deferring" the expensive encoding of the context until after a cheap selection step. A lightweight pass first picks the phrases most likely to be relevant, and only those phrases are then fully encoded and used to bias recognition. This allows the system to keep the delay before decoding low while still benefiting from the improved accuracy that comes with using contextual information.

The key advantage of Deferred NAM is that it fully encodes only the top-K most relevant phrases, rather than the entire context list. This makes the system more scalable and efficient, particularly for applications where the available context is large (the paper reports scaling to 20K phrases) or constantly changing.

Technical Explanation

Attention-based biasing systems of the kind this paper builds on typically consist of three components: a context encoder, a context filter that narrows down the context to apply, and context application via cross attention. Because recognition cannot begin before context encoding ends, the cost of the context encoder adds directly to the pre-decoding delay. Deferred NAM moves the lightweight phrase selection pass before context encoding, allowing for low-latency top-K context injection.

With this ordering, the selection pass keeps only the top-K most relevant phrases. Only those K phrases are passed through the context encoder and then applied during decoding via cross attention, so the full context list never has to go through the expensive encoder.
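As a rough illustration of the ordering described in the paper's abstract (cheap selection first, expensive encoding only for the survivors), the following is a minimal sketch in plain Python. It is not the authors' implementation: `cheap_embed`, `expensive_encode`, `attention_weights`, and `deferred_context` are hypothetical names, the hashed character-bigram embedding is a toy stand-in for a learned lightweight scorer, and the text cue stands in for what would be an audio-derived query in a real recognizer.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cheap_embed(text: str, dim: int = 16) -> Vector:
    """Toy lightweight embedding (assumption): normalized hashed bigram counts."""
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(len(padded) - 1):
        bigram = padded[i:i + 2]
        # Deterministic bucket index; Python's built-in hash() is salted per process.
        vec[sum(ord(c) for c in bigram) % dim] += 1.0
    norm = math.sqrt(dot(vec, vec)) or 1.0
    return [v / norm for v in vec]


def select_top_k(query: Vector, phrases: Sequence[str], k: int) -> List[str]:
    """Lightweight pass: score every phrase cheaply and keep the K best."""
    ranked = sorted(phrases, key=lambda p: dot(query, cheap_embed(p)), reverse=True)
    return ranked[:k]


def expensive_encode(phrase: str) -> Vector:
    """Stand-in for a heavy context encoder; counts how often it is called."""
    expensive_encode.calls += 1
    return cheap_embed(phrase, dim=32)


expensive_encode.calls = 0


def attention_weights(query: Vector, keys: List[Vector]) -> List[float]:
    """Softmax over dot-product scores, as in single-head cross attention."""
    scores = [dot(query, key) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deferred_context(cue: str, phrases: Sequence[str], k: int) -> Tuple[List[str], List[float]]:
    selected = select_top_k(cheap_embed(cue), phrases, k)    # 1. cheap selection over all phrases
    encoded = [expensive_encode(p) for p in selected]        # 2. deferred encoding of only K phrases
    weights = attention_weights(cheap_embed(cue, dim=32), encoded)  # 3. context application
    return selected, weights


if __name__ == "__main__":
    phrases = [f"contact {i}" for i in range(1000)] + ["John Smith"]
    selected, weights = deferred_context("call john smith", phrases, k=5)
    print(selected)
    print("expensive encoder calls:", expensive_encode.calls)
```

The point of the sketch is the call count: the expensive encoder runs K times regardless of how long the phrase list is, while only the cheap scorer touches every phrase.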

The authors report a speedup of up to 16.1 times, which enables biasing to scale to 20K phrases with a maximum pre-decoding delay under 33 ms. Adding phrase- and wordpiece-level cross-entropy losses also yields up to a 37.5% relative WER reduction over the baseline without the losses and the lightweight phrase selection pass.
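For readers unfamiliar with how accuracy gains are usually stated in ASR work, the abstract quotes a relative word error rate (WER) reduction rather than an absolute one. The helper below is a small sketch of that arithmetic. The function name `relative_wer_reduction` and the WER values passed to it are hypothetical and chosen only to make the calculation easy to follow; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical numbers: a baseline at 8.0% WER and a new system at 5.0% WER.
    print(f"{relative_wer_reduction(8.0, 5.0):.1%}")
```

With these illustrative inputs the absolute drop is 3 points of WER, while the relative reduction is 3 / 8 of the baseline, which is why relative figures look larger than absolute ones.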

Critical Analysis

One potential limitation of the Deferred NAM approach is that it relies on the accuracy of the lightweight phrase selection pass. If a relevant phrase is not among the top-K selected, it is never encoded and cannot bias the final output. How robust the selection pass remains as the phrase list grows is therefore worth examining closely.

Additionally, the paper does not provide a detailed analysis of the computational complexity and resource requirements of the Deferred NAM approach compared to other context-aware ASR models. While the authors claim improvements in latency, the trade-offs in terms of memory usage, inference time, and GPU/CPU utilization should be further investigated to fully understand the practical implications of this approach.

Conclusion

The Deferred NAM approach presented in this paper addresses the challenge of efficiently incorporating contextual information into non-streaming ASR. By deferring context encoding until after a lightweight phrase selection pass, the system keeps pre-decoding latency low while still benefiting from the improved accuracy that comes with using relevant contextual information.

The key contribution of this work is moving the lightweight phrase selection pass before context encoding, which enables selective top-K context injection and makes the system more scalable and efficient. The results reported in the paper suggest that Deferred NAM could benefit real-world ASR applications that require low-latency performance without sacrificing accuracy.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR

Zelin Wu, Gan Song, Christopher Li, Pat Rondon, Zhong Meng, Xavier Velez, Weiran Wang, Diamantino Caseiro, Golan Pundak, Tsendsuren Munkhdalai, Angad Chandorkar, Rohit Prabhavalkar

Contextual biasing enables speech recognizers to transcribe important phrases in the speaker's context, such as contact names, even if they are rare in, or absent from, the training data. Attention-based biasing is a leading approach which allows for full end-to-end cotraining of the recognizer and biasing system and requires no separate inference-time components. Such biasers typically consist of a context encoder; followed by a context filter which narrows down the context to apply, improving per-step inference time; and, finally, context application via cross attention. Though much work has gone into optimizing per-frame performance, the context encoder is at least as important: recognition cannot begin before context encoding ends. Here, we show the lightweight phrase selection pass can be moved before context encoding, resulting in a speedup of up to 16.1 times and enabling biasing to scale to 20K phrases with a maximum pre-decoding delay under 33ms. With the addition of phrase- and wordpiece-level cross-entropy losses, our technique also achieves up to a 37.5% relative WER reduction over the baseline without the losses and lightweight phrase selection pass.

4/24/2024

Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation

Ruizhe Huang, Mahsa Yarmohammadi, Sanjeev Khudanpur, Daniel Povey

Existing research suggests that automatic speech recognition (ASR) models can benefit from additional contexts (e.g., contact lists, user specified vocabulary). Rare words and named entities can be better recognized with contexts. In this work, we propose two simple yet effective techniques to improve context-aware ASR models. First, we inject contexts into the encoders at an early stage instead of merely at their last layers. Second, to enforce the model to leverage the contexts during training, we perturb the reference transcription with alternative spellings so that the model learns to rely on the contexts to make correct predictions. On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relatively compared to no biasing and shallow fusion, making the new state-of-the-art performance. On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.

7/16/2024

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediate layers may better align text tokens or audio frames with the desired objectives. Our proposed intermediate biasing loss brings more regularization and contextualization to the network. Our method outperforms a conventional contextual biasing baseline on the LibriSpeech corpus, achieving a relative improvement of 22.5% in biased word error rate (B-WER) and up to 44% compared to the non-contextual baseline with a biasing list size of 100. Moreover, employing RNN-transducer-driven joint decoding further reduces the unbiased word error rate (U-WER), resulting in a more robust network.

6/26/2024

Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg

Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and Transducer (RNN-T) ASR models. The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates. The valid candidates then replace their greedy recognition counterparts in corresponding frame intervals. A Hybrid Transducer-CTC model enables the CTC-WS application for the Transducer model. The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER compared to baseline methods. The proposed method is publicly available in the NVIDIA NeMo toolkit.

6/12/2024