Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

arXiv:2310.14278

Published 4/30/2024 by Kun Wei, Bei Li, Hang Lv, Quan Lu, Ning Jiang, Lei Xie

Abstract

Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.

Overview

The paper addresses the unique challenges of Automatic Speech Recognition (ASR) in conversational settings, where extracting relevant contextual information from previous conversational turns is crucial.
Existing methods struggle to extract longer and more effective contexts due to issues like irrelevant content, error propagation, and redundancy.
The researchers introduce a novel conversational ASR system that extends the Conformer encoder-decoder model with cross-modal conversational representation.

Plain English Explanation

The paper focuses on improving Automatic Speech Recognition (ASR) in conversational settings, where the system needs to understand the context of the conversation to recognize speech accurately. Existing ASR systems often struggle with this because they have trouble extracting useful information from the previous parts of the conversation. This can happen due to irrelevant content, errors in the transcription, or redundant information.

To address this, the researchers developed a new ASR system that combines pre-trained speech and text models so that it can better use the conversation context. Its cross-modal extractor uses a specialized encoder and a modal-level input mask to extract richer context from the earlier speech without propagating errors from previous transcriptions.

The system also incorporates conditional latent variational modules to learn attributes of the conversation, like the roles of the speakers and the overall topic. By incorporating both the cross-modal and conversational representations, the model is able to maintain context over longer sentences without losing important information.

This approach led to relative accuracy improvements of 8.8% and 23% on the Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to a standard Conformer ASR model.

Technical Explanation

The researchers introduce a novel conversational ASR system that extends the Conformer encoder-decoder model by incorporating cross-modal and conversational representations.

The cross-modal extractor combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This allows the model to extract richer historical speech context without explicit error propagation from previous transcriptions.
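The summary does not spell out how the modal-level mask is applied, so the following is only a minimal sketch of one plausible reading: during training, an entire modality (all speech frames or all text tokens) is occasionally hidden before the two streams are passed to the shared encoder, so the encoder cannot rely on the text alone. The function name `modal_level_mask`, the masking probabilities, and the use of zeros as the mask value are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def modal_level_mask(speech_emb, text_emb, p_mask_speech=0.2, p_mask_text=0.2, rng=None):
    """Hide one whole modality at random, then concatenate both streams.

    speech_emb: (T_s, D) array of frame-level speech embeddings
    text_emb:   (T_t, D) array of token-level text embeddings
    Returns the (T_s + T_t, D) encoder input and the name of the masked
    modality ("speech", "text", or None). All defaults are hypothetical.
    """
    if rng is None:
        rng = np.random.default_rng()
    u = rng.random()  # uniform draw in [0, 1)
    masked = None
    if u < p_mask_speech:
        # Assumption: masking means replacing the modality with zeros.
        speech_emb = np.zeros_like(speech_emb)
        masked = "speech"
    elif u < p_mask_speech + p_mask_text:
        text_emb = np.zeros_like(text_emb)
        masked = "text"
    return np.concatenate([speech_emb, text_emb], axis=0), masked

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal((50, 8))  # 50 frames, 8-dim embeddings
    text = rng.standard_normal((12, 8))    # 12 tokens, 8-dim embeddings
    x, which = modal_level_mask(speech, text, rng=rng)
    print(x.shape, which)
```

A real system might use a learned mask embedding rather than zeros; the point of the sketch is only that the mask operates on a whole modality rather than on individual frames or tokens.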

Additionally, the system includes conditional latent variational modules to learn conversational-level attributes, such as role preference and topic coherence. By incorporating both cross-modal and conversational representations into the decoder, the model is able to retain context over longer sentences without information loss.
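Since the architecture of these modules is not described here, the sketch below shows only the generic mechanics of a conditional latent variable module in the style of a conditional VAE: a prior conditioned on the conversational context, a posterior that also sees the current utterance, a reparameterized sample, and a KL term that pulls the two together. The one-layer `linear_gaussian` network, the parameter names, and the dimensions are hypothetical and are not the authors' design.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def linear_gaussian(x, w_mu, w_logvar):
    """Hypothetical one-layer network mapping x to (mu, logvar)."""
    return x @ w_mu, x @ w_logvar

def conditional_latent(context, target, params, rng):
    """Return a latent sample z and the KL regularizer for one utterance.

    The prior sees only the conversational context; the posterior also
    sees the current utterance representation (training time only).
    """
    mu_p, logvar_p = linear_gaussian(context, params["prior_mu"], params["prior_logvar"])
    joint = np.concatenate([context, target])
    mu_q, logvar_q = linear_gaussian(joint, params["post_mu"], params["post_logvar"])
    z = reparameterize(mu_q, logvar_q, rng)
    return z, gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, z_dim = 16, 4  # hypothetical feature and latent sizes
    params = {
        "prior_mu": 0.1 * rng.standard_normal((d, z_dim)),
        "prior_logvar": 0.1 * rng.standard_normal((d, z_dim)),
        "post_mu": 0.1 * rng.standard_normal((2 * d, z_dim)),
        "post_logvar": 0.1 * rng.standard_normal((2 * d, z_dim)),
    }
    z, kl = conditional_latent(rng.standard_normal(d), rng.standard_normal(d), params, rng)
    print(z.shape, float(kl))
```

In such a setup, one latent variable could be dedicated to each conversational attribute (for example, one for role preference and one for topic coherence), with the sampled vectors passed to the decoder alongside the cross-modal representation; how the paper actually wires this up is not stated in this summary.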

The researchers evaluated their approach on two Mandarin conversation datasets, HKUST and MagicData-RAMC, and reported relative accuracy improvements of 8.8% and 23%, respectively, compared to the standard Conformer model.
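To make "relative improvement" concrete, the helper below computes a relative error-rate reduction, assuming the improvement is measured on an error metric such as character error rate (CER), as is common for Mandarin ASR. This summary does not name the metric or give the absolute scores, so the numbers in the usage example are invented purely to show the arithmetic and are not results from the paper.

```python
def relative_improvement(baseline_error, new_error):
    """Relative error-rate reduction in percent.

    Computed as 100 * (baseline - new) / baseline, so a drop from 20.0
    to 10.0 is a 50% relative improvement but only a 10-point absolute one.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error

if __name__ == "__main__":
    # Hypothetical CER values, chosen only to illustrate the formula.
    baseline_cer = 20.0
    new_cer = 18.24
    print(round(relative_improvement(baseline_cer, new_cer), 1))
```

The distinction matters when reading the headline figures: an 8.8% relative gain on a baseline with a 20% error rate would correspond to fewer than two absolute points.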

Critical Analysis

The paper presents a promising approach to ASR in conversational settings. Feeding cross-modal and conversational representations into the decoder gives the model access to context from earlier turns, which addresses a key limitation of ASR models that treat each utterance in isolation.

However, the paper does not provide much insight into the specific architectural details or training procedures of the conditional latent variational modules. More transparency in this area would be helpful for understanding the model's inner workings and potential limitations.

Additionally, the evaluation is limited to Mandarin conversational datasets, so it would be valuable to see how the approach performs on other languages and conversation styles. Expanding the evaluation to more diverse datasets could help assess the generalizability of the proposed system.

Further research could also explore the integration of self-supervised representations to potentially enhance the cross-modal and conversational modeling capabilities of the system.

Conclusion

The paper presents a novel conversational ASR system that extends the Conformer encoder-decoder model with cross-modal and conversational representations. This approach allows the model to extract richer historical speech context and maintain context over longer sentences, yielding relative accuracy improvements of 8.8% on HKUST and 23% on MagicData-RAMC over the standard Conformer model.

The incorporation of cross-modal and conversational modeling is a promising direction for advancing the state-of-the-art in ASR for conversational settings. While the paper demonstrates the effectiveness of this approach, further research is needed to fully understand the model's capabilities and limitations, as well as explore opportunities for integrating additional cross-modal and cross-lingual enhancements.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

He Wang, Pengcheng Guo, Pan Zhou, Lei Xie

While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of our proposed system, achieving a concatenated minimum permutation character error rate (cpCER) of 30.57% on the Eval set and yielding up to 3.17% relative improvement compared with our previous system which ranked the second place in the challenge. Following the fusion of multiple systems, our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.

4/9/2024

cs.SD cs.AI eess.AS

An efficient text augmentation approach for contextualized Mandarin speech recognition

Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Zhou Huan

Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA) technique, all while keeping computational costs minimal. In particular, to contextualize a pre-trained CIF-based ASR, we construct a codebook using limited speech-text data. By utilizing a simple codebook lookup process, we convert available text-only data into latent text embeddings. These embeddings then enhance the inputs for the contextualized ASR. Our experiments on diverse Mandarin test sets demonstrate that our TA approach significantly boosts recognition performance. The top-performing system shows relative CER improvements of up to 30% on rare words and 15% across all words in general.

6/17/2024

cs.SD cs.CL eess.AS

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Maxime Burchi, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg, Radu Timofte

Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow similar approach to achieve robust speech recognition in noisy conditions. In this work, we present a multilingual AVSR model incorporating several enhancements to improve performance and audio noise robustness. Notably, we adapt the recently proposed Fast Conformer model to process both audio and visual modalities using a novel hybrid CTC/RNN-T architecture. We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets (VoxCeleb2 and AVSpeech). Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%. On the recently introduced MuAViC benchmark, our model yields an absolute average-WER reduction of 11.9% in comparison to the original baseline. Finally, we demonstrate the ability of the proposed model to perform audio-only, visual-only, and audio-visual speech recognition at test time.

5/24/2024

eess.AS cs.AI cs.CV cs.MM cs.SD

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

Jinlong Xue, Yayue Deng, Yicheng Han, Yingming Gao, Ya Li

Recent advances in large language models (LLMs) and development of audio codecs greatly propel the zero-shot TTS. They can synthesize personalized speech with only a 3-second speech of an unseen speaker as acoustic prompt. However, they only support short speech prompts and cannot leverage longer context information, as required in audiobook and conversational TTS scenarios. In this paper, we introduce a novel audio codec-based TTS model to adapt context features with multiple enhancements. Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer (MMCE-Qformer) to utilize additional multi-modal context information. Besides, we adapt a pretrained LLM to leverage its understanding ability to predict semantic tokens, and use a SoundStorm to generate acoustic tokens thereby enhancing audio quality and speaker similarity. The extensive objective and subjective evaluations show that our proposed method outperforms baselines across various context TTS scenarios.

6/7/2024

cs.SD cs.CL eess.AS