Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

2405.02132

Published 5/7/2024 by Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li and 2 others

cs.SD cs.CL eess.AS

Abstract

Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve the SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs to promote reproducible research.

Overview

The research explores the integration of large language models (LLMs) with automatic speech recognition (ASR) to improve performance on a large open-source Chinese dataset.
The study evaluates the impact of various configurations of speech encoders, LLMs, and projector modules in the speech foundation encoder-LLM ASR paradigm.
A three-stage training approach is introduced to enhance the model's ability to align auditory and textual information.
The implementation of this approach, along with strategic integration of ASR components, enabled the researchers to achieve state-of-the-art performance on several Chinese ASR benchmarks.

Plain English Explanation

Large language models (LLMs) have shown remarkable capabilities in various natural language processing tasks. Integrating LLMs with automatic speech recognition is becoming a common approach in the field. In this research, the team delved into a detailed examination of this paradigm using a large open-source Chinese dataset.

The researchers aimed to understand how different configurations of speech encoders, LLMs, and projector modules impact the performance of the speech foundation encoder-LLM ASR system. They also introduced a three-stage training approach to help the model better align audio and text information.

By implementing this training approach and strategically integrating ASR components, the team achieved state-of-the-art results on several Chinese ASR benchmarks, including AISHELL-1, Test_Net, and Test_Meeting. This research provides an empirical foundation for future work on LLM-based ASR systems, particularly in the context of Chinese language processing.

Technical Explanation

The researchers adopt the speech foundation encoder-LLM paradigm, in which a projector module maps the output of a speech encoder into the input space of an LLM, and evaluate how different choices of speech encoder, LLM, and projector module affect recognition performance. This paradigm has been explored in previous work on transforming LLMs into cross-modal, cross-lingual systems.
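To make the data flow of this paradigm concrete, here is a minimal sketch in plain Python of how speech encoder output can reach an LLM through a projector. Everything specific in it is an assumption made for illustration: the projector design (stacking consecutive frames, then one linear layer), the toy dimensions, the random weights, and the choice to place the prompt before the speech embeddings. It is not necessarily one of the projector modules evaluated in the paper.

```python
import random

def stack_frames(frames, k):
    """Downsample by concatenating every k consecutive frames (leftover tail dropped)."""
    usable = len(frames) - len(frames) % k
    return [
        [value for frame in frames[start:start + k] for value in frame]
        for start in range(0, usable, k)
    ]

def linear(vectors, weight):
    """Bias-free linear map; weight is a list of d_out rows, each of length d_in."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight] for vec in vectors]

def project(encoder_frames, weight, k):
    """Hypothetical projector: frame stacking followed by one linear layer."""
    return linear(stack_frames(encoder_frames, k), weight)

def build_llm_input(prompt_embeddings, speech_embeddings):
    """Place the projected speech embeddings after the text prompt embeddings (assumed order)."""
    return prompt_embeddings + speech_embeddings

# Toy dimensions, chosen only for illustration.
T, D_ENC, K, D_LLM, PROMPT_LEN = 10, 4, 4, 6, 3
rng = random.Random(0)

# Stand-ins for real speech encoder output and LLM prompt embeddings.
encoder_frames = [[rng.uniform(-1, 1) for _ in range(D_ENC)] for _ in range(T)]
weight = [[rng.uniform(-1, 1) for _ in range(D_ENC * K)] for _ in range(D_LLM)]
prompt_embeddings = [[rng.uniform(-1, 1) for _ in range(D_LLM)] for _ in range(PROMPT_LEN)]

speech_embeddings = project(encoder_frames, weight, K)
llm_input = build_llm_input(prompt_embeddings, speech_embeddings)

# Sequence length after projection, total LLM input length, embedding width.
print(len(speech_embeddings), len(llm_input), len(llm_input[0]))
```

The point of the sketch is the shape bookkeeping: the projector shortens the speech sequence and matches its width to the LLM embedding size, so that speech and text embeddings can be concatenated into one input sequence.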

To enhance the model's ability to align auditory and textual information, the team introduced a three-stage training approach. In the first stage, they trained the speech encoder and LLM separately. In the second stage, they fine-tuned the LLM with the speech encoder's output. Finally, in the third stage, they jointly trained the entire system, including the projector module.
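A staged schedule of this kind can be expressed as a small table that maps each stage to the components whose parameters are updated. The sketch below is illustrative only: the component names, the `Stage` class, and the `trainable_flags` helper are hypothetical, and the per-stage assignment simply mirrors the description in the preceding paragraph, so it should be checked against the paper before being relied on.

```python
from dataclasses import dataclass

# Hypothetical component names for the three parts of the system.
COMPONENTS = ("speech_encoder", "projector", "llm")

@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components updated during this stage

# Assumed schedule, mirroring the summary above rather than the paper's exact recipe.
SCHEDULE = [
    Stage("stage 1", frozenset({"speech_encoder", "llm"})),  # each trained separately
    Stage("stage 2", frozenset({"llm"})),                    # LLM tuned on encoder output
    Stage("stage 3", frozenset(COMPONENTS)),                 # joint training, projector included
]

def trainable_flags(stage):
    """Return {component: True if updated in this stage, False if frozen}."""
    unknown = stage.trainable - set(COMPONENTS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return {name: name in stage.trainable for name in COMPONENTS}

for stage in SCHEDULE:
    frozen = [name for name, on in trainable_flags(stage).items() if not on]
    print(stage.name, "frozen:", frozen)
```

In a PyTorch-style training loop, flags like these would typically be applied by setting `requires_grad` on each component's parameters at the start of a stage; keeping the schedule as data makes it easy to swap in a different assignment.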

The strategic integration of these ASR components, along with the three-stage training approach, enabled the researchers to achieve state-of-the-art performance on the AISHELL-1, Test_Net, and Test_Meeting test sets.

Critical Analysis

The paper provides a comprehensive evaluation of the speech foundation encoder-LLM ASR paradigm on a large Chinese dataset. The three-stage training approach seems to be a novel and effective strategy for aligning audio and textual information, which is a critical aspect of ASR systems.

However, the paper does not discuss the computational complexity or resource requirements of the proposed approach. It would be interesting to understand the trade-offs between model performance and resource usage, especially in the context of Chinese-centric large language model pretraining.

Additionally, the paper could have provided more insights into the specific configurations and architectural choices that led to the performance improvements. This would help researchers better understand the key factors driving the model's success and potentially guide future work in this area.

Conclusion

This research demonstrates the potential of integrating large language models with automatic speech recognition, particularly in the context of Chinese language processing. The proposed three-stage training approach and strategic integration of ASR components led to state-of-the-art results on several Chinese benchmarks.

The findings of this study provide a solid foundation for future research on LLM-based ASR systems, offering valuable insights into optimizing performance and aligning audio-textual information. As the field of speech recognition continues to evolve, this work highlights the importance of exploring innovative approaches that leverage the capabilities of large language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Using Large Language Model for End-to-End Chinese ASR and NER

Yuang Li, Jiawei Yu, Min Zhang, Mengxin Ren, Yanqing Zhao, Xiaofeng Zhao, Shimin Tao, Jinsong Su, Hao Yang

Mapping speech tokens to the same feature space as text tokens has become the paradigm for the integration of speech modality into decoder-only large language models (LLMs). An alternative approach is to use an encoder-decoder architecture that incorporates speech features through cross-attention. This approach, however, has received less attention in the literature. In this work, we connect the Whisper encoder with ChatGLM3 and provide in-depth comparisons of these two approaches using Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks. We evaluate them not only by conventional metrics like the F1 score but also by a novel fine-grained taxonomy of ASR-NER errors. Our experiments reveal that encoder-decoder architecture outperforms decoder-only architecture with a short context, while decoder-only architecture benefits from a long context as it fully exploits all layers of the LLM. By using LLM, we significantly reduced the entity omission errors and improved the entity ASR accuracy compared to the Conformer baseline. Additionally, we obtained a state-of-the-art (SOTA) F1 score of 0.805 on the AISHELL-NER test set by using chain-of-thought (CoT) NER which first infers long-form ASR transcriptions and then predicts NER labels.

6/7/2024

cs.CL cs.AI

MaLa-ASR: Multimedia-Assisted LLM-Based ASR

Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, Xie Chen

As more and more information-rich data like video become available, utilizing multi-modal auxiliary information to enhance audio tasks has sparked widespread research interest. The recent surge in research on LLM-based audio models provides fresh perspectives for tackling audio tasks. Given that LLM can flexibly ingest multiple inputs, we propose MaLa-ASR, an LLM-based ASR model that can integrate textual keywords extracted from presentation slides to improve recognition of conference content. MaLa-ASR yields average WERs of 9.4% and 11.7% on the L95 and S95 subsets of the SlideSpeech corpus, representing a significant relative WER drop of 27.9% and 44.7% over the baseline model reported in SlideSpeech. MaLa-ASR underscores LLM's strong performance in speech tasks and the capability to integrate auxiliary information conveniently. By adding keywords to the input prompt, the biased word error rate (B-WER) reduces relatively by 46.0% and 44.2%, establishing a new SOTA on this dataset.

6/14/2024

eess.AS cs.AI

Multi-stage Large Language Model Correction for Speech Recognition

Jie Pu, Thai-Son Nguyen, Sebastian Stuker

In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. Different from previous LLM-based ASR error correction methods, we propose a novel multi-stage approach that utilizes uncertainty estimation of ASR outputs and reasoning capability of LLMs. Specifically, the proposed approach has two stages: the first stage is about ASR uncertainty estimation and exploits N-best list hypotheses to identify less reliable transcriptions; the second stage works on these identified transcriptions and performs LLM-based corrections. This correction task is formulated as a multi-step rule-based LLM reasoning process, which uses explicitly written rules in prompts to decompose the task into concrete reasoning steps. Our experimental results demonstrate the effectiveness of the proposed method by showing 10% ~ 20% relative improvement in WER over competitive ASR systems -- across multiple test domains and in zero-shot settings.

6/18/2024

cs.CL eess.AS

Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach

Ara Yeroyan (Data Science Department, American University of Armenia), Nikolay Karpov (Nvidia, NeMo Conversational AI team)

In recent years, automatic speech recognition (ASR) systems have significantly improved, especially in languages with a vast amount of transcribed speech data. However, ASR systems tend to perform poorly for low-resource languages with fewer resources, such as minority and regional languages. This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks, which typically feature a single transcript associated with hours-long audios. The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments, whereas optimal ASR training requires segments ranging from 4 to 15 seconds. To address this, we propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training. Our approach simplifies data preparation for ASR systems in low-resource languages and demonstrates its application through a case study involving the Armenian language. Our method, which is portable to many low-resource languages, not only mitigates the issue of data scarcity but also enhances the performance of ASR models for underrepresented languages.

6/4/2024

cs.CL cs.LG eess.AS eess.SP