Using Large Language Model for End-to-End Chinese ASR and NER

2401.11382

Published 6/7/2024 by Yuang Li, Jiawei Yu, Min Zhang, Mengxin Ren, Yanqing Zhao, Xiaofeng Zhao, Shimin Tao, Jinsong Su, Hao Yang

cs.CL cs.AI


Abstract

Mapping speech tokens to the same feature space as text tokens has become the paradigm for the integration of speech modality into decoder-only large language models (LLMs). An alternative approach is to use an encoder-decoder architecture that incorporates speech features through cross-attention. This approach, however, has received less attention in the literature. In this work, we connect the Whisper encoder with ChatGLM3 and provide in-depth comparisons of these two approaches using Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks. We evaluate them not only by conventional metrics like the F1 score but also by a novel fine-grained taxonomy of ASR-NER errors. Our experiments reveal that encoder-decoder architecture outperforms decoder-only architecture with a short context, while decoder-only architecture benefits from a long context as it fully exploits all layers of the LLM. By using an LLM, we significantly reduced the entity omission errors and improved the entity ASR accuracy compared to the Conformer baseline. Additionally, we obtained a state-of-the-art (SOTA) F1 score of 0.805 on the AISHELL-NER test set by using chain-of-thought (CoT) NER which first infers long-form ASR transcriptions and then predicts NER labels.


Overview

This paper explores the use of large language models (LLMs) for end-to-end Chinese automatic speech recognition (ASR) and named entity recognition (NER).
The researchers connect the Whisper encoder to ChatGLM3 and compare two ways of feeding speech into the LLM: a decoder-only architecture that maps speech tokens into the same feature space as text tokens, and an encoder-decoder architecture that incorporates speech features through cross-attention.
Both approaches are evaluated with conventional metrics such as the F1 score and with a new fine-grained taxonomy of ASR-NER errors.
The authors also report that chain-of-thought (CoT) NER, which first infers the ASR transcription and then predicts NER labels, reaches an F1 score of 0.805 on the AISHELL-NER test set.

Plain English Explanation

In this research, the authors investigate how large language models can be used for Chinese speech recognition and for identifying named entities in speech. They attach a speech encoder (Whisper) to an LLM (ChatGLM3) and compare two ways of letting the LLM "hear" the audio.

In the first, "decoder-only" approach, speech features are converted into tokens that live in the same feature space as text tokens, so the LLM processes speech as if it were part of its text input. In the second, encoder-decoder approach, the LLM consults the speech features through cross-attention while it generates text. According to the abstract, the encoder-decoder design works better when the context is short, while the decoder-only design benefits from a long context because it fully exploits all layers of the LLM.

The researchers also find that using an LLM reduces the number of entities that are left out of the output and improves how accurately entities are transcribed, compared with a Conformer baseline. Their best NER result comes from a chain-of-thought setup in which the model first writes out the transcription and then predicts the entity labels.

Technical Explanation

The researchers compare two architectures for end-to-end Chinese ASR and NER, both built by connecting the Whisper encoder to ChatGLM3. The decoder-only architecture maps speech tokens into the same feature space as text tokens so that the LLM consumes them as part of its input sequence. The encoder-decoder architecture instead incorporates speech features through cross-attention, an approach the authors note has received less attention in the literature.

In both cases the input is the audio signal and the output is text: the transcription for ASR, and the transcription together with named entity labels for NER. Results are reported on the AISHELL-NER test set, with a Conformer model as the baseline.
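
The two integration strategies named in the abstract can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation: the matrix sizes, the random weights, and the helper names (`decoder_only_input`, `cross_attention`) are all assumptions chosen for readability, and a real system would use the Whisper encoder's outputs and the ChatGLM3 layers instead.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights or features (illustrative only)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def decoder_only_input(speech, text_emb, proj):
    """Decoder-only style: project speech frames into the text embedding
    space and prepend them to the text embeddings as one input sequence."""
    speech_tokens = matmul(speech, proj)   # (T_speech, d_text)
    return speech_tokens + text_emb        # (T_speech + T_text, d_text)

def cross_attention(text_hidden, speech, w_q, w_k, w_v):
    """Encoder-decoder style: text states query the speech features through
    single-head cross-attention; the text sequence length is unchanged."""
    q = matmul(text_hidden, w_q)           # (T_text, d_attn)
    k = matmul(speech, w_k)                # (T_speech, d_attn)
    v = matmul(speech, w_v)                # (T_speech, d_text)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)       # (T_text, T_speech)
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)          # (T_text, d_text)
    # Residual connection: add the attended speech summary to each text state.
    return [[h + a for h, a in zip(h_row, a_row)]
            for h_row, a_row in zip(text_hidden, attended)]

rng = random.Random(0)
speech = rand_matrix(6, 4, rng)      # 6 speech frames, 4-dim features (toy sizes)
text_emb = rand_matrix(3, 8, rng)    # 3 text tokens, 8-dim embeddings (toy sizes)

sequence = decoder_only_input(speech, text_emb, rand_matrix(4, 8, rng))
fused = cross_attention(text_emb, speech,
                        rand_matrix(8, 5, rng), rand_matrix(4, 5, rng),
                        rand_matrix(4, 8, rng))
print(len(sequence), len(fused))     # sequence grows with speech length; fused does not
```

The sketch makes one structural difference visible: in the decoder-only path the LLM's input sequence grows with the number of speech frames, whereas in the cross-attention path the text sequence keeps its length and the speech is read from the side.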

The abstract highlights three main findings:

Context length: The encoder-decoder architecture outperforms the decoder-only architecture with a short context, while the decoder-only architecture benefits from a long context because it fully exploits all layers of the LLM.
Error analysis: Beyond the F1 score, the authors introduce a fine-grained taxonomy of ASR-NER errors. Under this analysis, using an LLM significantly reduces entity omission errors and improves entity ASR accuracy compared with the Conformer baseline.
Chain-of-thought NER: The best NER result comes from CoT NER, which first infers the long-form ASR transcription and then predicts the NER labels.
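
The abstract describes chain-of-thought (CoT) NER as first inferring the ASR transcription and then predicting NER labels. A minimal sketch of how such a two-step output could be parsed is shown below. The output layout (plain transcription on the first line, annotated transcription on the second) and the bracket convention (`[]` for person, `()` for location, `<>` for organization) are assumptions made for illustration, and `parse_annotated` and `parse_cot_output` are hypothetical helper names, not part of the paper.

```python
# Assumed inline annotation convention: opening marker -> (closing marker, label).
ENTITY_MARKERS = {"[": ("]", "PER"), "(": (")", "LOC"), "<": (">", "ORG")}

def parse_annotated(annotated):
    """Strip entity markers from an annotated string.

    Returns the plain text and a list of (label, entity_text, start_index),
    where start_index is the character offset in the plain text.
    Nested entities are not handled in this sketch.
    """
    plain = []
    entities = []
    closer = None
    label = None
    start = 0
    for ch in annotated:
        if closer is None and ch in ENTITY_MARKERS:
            closer, label = ENTITY_MARKERS[ch]
            start = len(plain)
        elif closer is not None and ch == closer:
            entities.append((label, "".join(plain[start:]), start))
            closer = None
        else:
            plain.append(ch)
    return "".join(plain), entities

def parse_cot_output(output, sep="\n"):
    """Split a two-step output into transcription and entities.

    Also reports whether the annotated step, once markers are removed,
    agrees with the transcription produced in the first step.
    """
    transcript, _, annotated = output.partition(sep)
    plain, entities = parse_annotated(annotated)
    return transcript, entities, plain == transcript

# Example sentence: "Zhang Wei works at Tsinghua University in Beijing."
output = "张伟在北京的清华大学工作\n[张伟]在(北京)的<清华大学>工作"
transcript, entities, consistent = parse_cot_output(output)
print(transcript)
print(entities)
print(consistent)
```

Checking that the second step reproduces the first step's transcription is one simple way to detect outputs where the model drifted between the two steps.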

The experimental results demonstrate the effectiveness of the LLM-based approach for both Chinese ASR and NER: compared with the Conformer baseline it reduces entity omission errors and improves entity ASR accuracy, and with CoT NER it reaches a state-of-the-art F1 score of 0.805 on the AISHELL-NER test set.
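
For readers unfamiliar with the F1 score mentioned in the abstract, the sketch below computes a micro-averaged, entity-level F1 over a small toy example. The exact matching criterion used in the paper is not stated in the source, so the choice here (an entity counts as correct only if both its type and its text match within the same utterance) is an assumption, and `entity_f1` is a hypothetical helper name.

```python
from collections import Counter

def entity_f1(references, hypotheses):
    """Micro-averaged precision, recall and F1 over (type, text) entities.

    Both arguments are lists with one entry per utterance; each entry is a
    list of (type, text) pairs. Entities are matched within an utterance.
    """
    matched = n_ref = n_hyp = 0
    for ref, hyp in zip(references, hypotheses):
        ref_counts, hyp_counts = Counter(ref), Counter(hyp)
        # Multiset intersection counts entities present in both lists.
        matched += sum((ref_counts & hyp_counts).values())
        n_ref += sum(ref_counts.values())
        n_hyp += sum(hyp_counts.values())
    precision = matched / n_hyp if n_hyp else 0.0
    recall = matched / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

references = [
    [("PER", "张伟"), ("LOC", "北京")],
    [("ORG", "清华大学")],
]
hypotheses = [
    [("PER", "张伟")],                      # the LOC entity is omitted
    [("ORG", "清华大学"), ("LOC", "北京")],  # one spurious entity
]
precision, recall, f1 = entity_f1(references, hypotheses)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy example the first hypothesis illustrates an entity omission, the kind of error the abstract says is reduced by using an LLM: the omitted entity lowers recall, while the spurious entity in the second hypothesis lowers precision.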

Critical Analysis

The paper presents a careful comparison of two ways to integrate speech into an LLM for end-to-end Chinese ASR and NER. Evaluating with a fine-grained taxonomy of ASR-NER errors, in addition to the F1 score, makes it clearer where each architecture succeeds and where it fails.

However, several potential limitations and areas for further research remain:

The performance of the LLM-based approach may be sensitive to the quality and quantity of the training data. Expanding the training data and improving data augmentation techniques could be valuable.
The comparison described in the abstract pairs one speech encoder (Whisper) with one LLM (ChatGLM3), so it is unclear how far the conclusions carry over to other encoder and LLM combinations.
The evaluation focuses on Chinese ASR and NER, and results may differ for other languages and other spoken language understanding tasks.
The relative strength of the two architectures depends on context length, and CoT NER infers a long-form transcription before predicting labels, which likely increases output length and inference cost. Both factors complicate the choice of architecture in practice.

Overall, the paper presents a promising direction for leveraging LLMs in speech and language processing tasks, but further research is needed to fully understand the capabilities and limitations of this approach.

Conclusion

This research paper explores the potential of using large language models (LLMs) for end-to-end Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks. The key contribution is an in-depth comparison of two architectures that connect the Whisper encoder to ChatGLM3: a decoder-only architecture that maps speech tokens into the text feature space, and an encoder-decoder architecture that incorporates speech features through cross-attention.

The experiments show that the encoder-decoder architecture outperforms the decoder-only architecture with a short context, while the decoder-only architecture benefits from a long context. Compared with the Conformer baseline, the LLM-based models reduce entity omission errors and improve entity ASR accuracy, and chain-of-thought NER achieves a state-of-the-art F1 score of 0.805 on the AISHELL-NER test set.

Overall, this research represents an exciting step forward in the application of LLMs to speech and language processing tasks, and the findings have important implications for the development of more efficient and flexible natural language processing systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie

Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve the SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs to promote reproducible research.

5/7/2024

cs.SD cs.CL eess.AS

Transferable speech-to-text large language model alignment module

Boyong Wu, Chao Yan, Haoran Pu

By leveraging the power of Large Language Models (LLMs) and speech foundation models, state of the art speech-text bimodal works can achieve challenging tasks like spoken translation (ST) and question answering (SQA) altogether with much simpler architectures. In this paper, we utilize the capability of Whisper encoder and pre-trained Yi-6B. Empirical results reveal that modal alignment can be achieved with one layer module and hundred hours of speech-text multitask corpus. We further swap the Yi-6B with human preferences aligned version of Yi-6B-Chat during inference, and discover that the alignment capability is applicable as well. In addition, the alignment subspace revealed by singular value decomposition (SVD) also implies linear alignment subspace is sparse, which leaves the possibility to concatenate other features like voice-print or video to expand modality.

6/21/2024

cs.CL cs.SD eess.AS

Decoder-only Architecture for Streaming End-to-end Speech Recognition

Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable characteristic for streaming applications of ASR. In this work, we propose to use a decoder-only architecture for blockwise streaming ASR. In our approach, speech features are compressed using CTC output and context embedding using blockwise speech subnetwork, and are sequentially provided as prompts to the decoder. The decoder estimates the output tokens promptly at each block. To this end, we also propose a novel training scheme using random-length prefix prompts to make the model robust to the truncated prompts caused by blockwise processing. An experimental comparison shows that our proposed decoder-only streaming ASR achieves 8% relative word error rate reduction in the LibriSpeech test-other set while being twice as fast as the baseline model.

6/26/2024

eess.AS cs.CL

Multi-stage Large Language Model Correction for Speech Recognition

Jie Pu, Thai-Son Nguyen, Sebastian Stuker

In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. Different from previous LLM-based ASR error correction methods, we propose a novel multi-stage approach that utilizes uncertainty estimation of ASR outputs and reasoning capability of LLMs. Specifically, the proposed approach has two stages: the first stage is about ASR uncertainty estimation and exploits N-best list hypotheses to identify less reliable transcriptions; the second stage works on these identified transcriptions and performs LLM-based corrections. This correction task is formulated as a multi-step rule-based LLM reasoning process, which uses explicitly written rules in prompts to decompose the task into concrete reasoning steps. Our experimental results demonstrate the effectiveness of the proposed method by showing 10% to 20% relative improvement in WER over competitive ASR systems, across multiple test domains and in zero-shot settings.

6/18/2024

cs.CL eess.AS