Multi-stage Large Language Model Correction for Speech Recognition

2310.11532

Published 6/18/2024 by Jie Pu, Thai-Son Nguyen, Sebastian Stuker

Multi-stage Large Language Model Correction for Speech Recognition

Abstract

In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. Different from previous LLM-based ASR error correction methods, we propose a novel multi-stage approach that utilizes uncertainty estimation of ASR outputs and reasoning capability of LLMs. Specifically, the proposed approach has two stages: the first stage is about ASR uncertainty estimation and exploits N-best list hypotheses to identify less reliable transcriptions; The second stage works on these identified transcriptions and performs LLM-based corrections. This correction task is formulated as a multi-step rule-based LLM reasoning process, which uses explicitly written rules in prompts to decompose the task into concrete reasoning steps. Our experimental results demonstrate the effectiveness of the proposed method by showing 10% ~ 20% relative improvement in WER over competitive ASR systems -- across multiple test domains and in zero-shot settings.

Create account to get full access

Overview

This paper explores a multi-stage approach to improving speech recognition using large language models (LLMs)
The key idea is to leverage the power of LLMs to correct and refine the output of a base speech recognition system
The method involves two main stages: 1) rescoring the output of the base ASR system using an LLM, and 2) generating alternative recognition hypotheses and selecting the best one

Plain English Explanation

Speech recognition systems, which convert spoken language into text, have made impressive advances in recent years. However, they can still make mistakes, particularly with complex or ambiguous speech. This paper proposes a way to improve speech recognition by using powerful large language models (LLMs) - AI models trained on vast amounts of text data - to correct and refine the output of a base speech recognition system.

The method works in two stages. First, the base speech recognition system produces an initial transcript of the speech. Then, the LLM is used to "rescore" this transcript, assigning probabilities to each word based on how likely it is to occur in that context. This helps catch and correct any mistakes made by the base system.

In the second stage, the LLM is used to generate alternative recognition hypotheses - other possible ways the speech could have been transcribed. The system then selects the hypothesis that the LLM deems most likely to be correct. This allows the system to choose the best possible transcription, going beyond the initial output of the base speech recognizer.

This multi-stage approach leverages the strengths of both the base speech recognition system and the powerful language understanding capabilities of the LLM. By combining these elements, the researchers were able to achieve significant improvements in transcription accuracy, making speech recognition systems more reliable and useful in real-world applications.

Technical Explanation

The core of this approach is the use of a large language model (LLM) to refine and correct the output of a base automatic speech recognition (ASR) system. The method consists of two main stages:

Stage 1: Language Model Re-scoring - In this stage, the initial transcript produced by the base ASR system is re-scored using the LLM. The LLM assigns probabilities to each word in the transcript based on the context, allowing the system to catch and fix any mistakes made by the base ASR.
Stage 2: Alternative Hypothesis Generation and Selection - Here, the LLM is used to generate multiple alternative recognition hypotheses - other possible ways the speech could have been transcribed. The system then selects the hypothesis that the LLM deems most likely to be correct, going beyond the initial output of the base ASR.

By leveraging the powerful language understanding capabilities of the LLM, this multi-stage approach is able to achieve significant improvements in speech recognition accuracy compared to the base ASR system alone. The researchers evaluated their method on several benchmark datasets, including the LibriSpeech and Switchboard corpora, and found consistent gains in performance.

Critical Analysis

The researchers acknowledge several potential limitations and areas for further work. For example, the method relies on the availability of a high-quality LLM, which may not always be feasible, especially for low-resource languages. Additionally, the computational overhead of generating and evaluating multiple recognition hypotheses could be prohibitive in some real-time applications.

Another area for further research is the integration of multimodal information, such as visual cues or speaker diarization, to further improve the accuracy and robustness of the system. The researchers also note the potential to extend this approach to other languages, such as Chinese, where speech recognition remains a challenging problem.

Conclusion

This paper presents a promising multi-stage approach to improving speech recognition accuracy by leveraging the power of large language models. By combining the strengths of base ASR systems and advanced language modeling, the researchers were able to achieve significant performance gains, demonstrating the potential of this technique to make speech recognition more reliable and useful in a variety of real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LLM-based speaker diarization correction: A generalizable approach

Georgios Efstathiadis, Vijay Yadav, Anzar Abbas

Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. The ability of the models to improve diarization accuracy in a holdout dataset was measured. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced using the same ASR tool as the transcripts used for fine-tuning, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned using transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable and ASR-agnostic approach may be achievable. We hope to make these models accessible through public-facing APIs for use by third-party applications.

6/10/2024

eess.AS cs.CL

Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models

Yuchen Hu, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng, Ruizhe Li

Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content, 2) N-best hypotheses usually only vary in a few tokens, making it redundant to send all of them for GER, which could confuse LLM about which tokens to focus on and thus lead to increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) to receive source speech as extra input to improve the fidelity of correction output. Then, we reformat GER as a cloze test with logits calibration to remove the input information redundancy and simplify GER with clear instructions. Experiments show that ClozeGER achieves a new breakthrough over vanilla GER on 9 popular ASR datasets.

5/17/2024

cs.CL cs.AI cs.LG cs.SD eess.AS

Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly

Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors, however, they showed little improvement over traditional LMs mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a $textit{scaled}$ error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts meanwhile achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses, which are then paired with the original texts to train the DLM. DLM has several $textit{key ingredients}$: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on $textit{test-clean}$ and 3.3% WER on $textit{test-other}$ on Librispeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used and even match self-supervised methods which use external audio data. Furthermore, a single DLM is applicable to different ASRs, and greatly surpassing the performance of conventional LM based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.

5/27/2024

cs.LG cs.CL cs.SD eess.AS

Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024

Sai Koneru, Thai-Binh Nguyen, Ngoc-Quan Pham, Danni Liu, Zhaolin Li, Alexander Waibel, Jan Niehues

Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. Specifically, we integrate Mistral-7Bfootnote{mistralai/Mistral-7B-Instruct-v0.1} into our system to enhance it in two ways. Firstly, we refine the ASR outputs by utilizing the N-best lists generated by our system and fine-tuning the LLM to predict the transcript accurately. Secondly, we refine the MT outputs at the document level by fine-tuning the LLM, leveraging both ASR and MT predictions to improve translation quality. We find that integrating the LLM into the ASR and MT systems results in an absolute improvement of $0.3%$ in Word Error Rate and $0.65%$ in COMET for tst2019 test set. In challenging test sets with overlapping speakers and background noise, we find that integrating LLM is not beneficial due to poor ASR performance. Here, we use ASR with chunked long-form decoding to improve context usage that may be unavailable when transcribing with Voice Activity Detection segmentation alone.

6/26/2024

cs.CL cs.AI