NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge

Read original: arXiv:2409.05554 - Published 9/10/2024 by Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa and 8 others

Overview

The paper presents the NTT multi-speaker Automatic Speech Recognition (ASR) system developed for the Distant Automatic Speech Recognition (DASR) task of the CHiME-8 challenge.
The system focuses on tackling the challenges of distant multi-speaker speech recognition, such as overlapping speech, background noise, and variable microphone array configurations.
Per the abstract, the system is a diarization-first pipeline: speaker diarization and multi-channel speaker counting, followed by guided source separation (GSS) and then speech recognition.

Plain English Explanation

The researchers at NTT developed a speech recognition system that can handle recordings with multiple speakers talking at the same time from a distance. This is a challenging task because the speech can overlap, there is background noise, and the microphones used to record the audio may be set up differently for each recording.

To address these challenges, the NTT system has several key parts:

Diarization and speaker counting: This component identifies when each speaker is talking and how many speakers there are in the recording.

Multi-channel speech recognition: This part takes the audio from multiple microphones and combines it to improve the accuracy of the speech recognition.

The researchers evaluated their system on the CHiME-8 DASR data, which, according to the challenge description, covers both formal and informal conversations recorded with different device configurations, including the NOTSOFAR-1 office meeting scenario and Mixer 6. The paper reports a macro tcpWER of 21.3% on the development set, a 57% relative improvement over the challenge baseline.

Technical Explanation

Diarization and speaker counting

According to the abstract, the NTT system handles diarization and speaker counting as follows:

Speaker diarization: End-to-end diarization with vector clustering (EEND-VC) produces an initial estimate of who is speaking when, including overlapping speech. Target-speaker voice activity detection (TS-VAD) then refines that estimate.
Speaker counting: A new multi-channel speaker counting approach estimates the number of speakers, which varies from recording to recording.

Together these components determine who spoke when and how many speakers are present, which the later separation and recognition stages depend on.
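A diarization result is often represented as a per-speaker, per-frame activity grid. The sketch below is a minimal illustration of that representation, not the paper's EEND-VC or TS-VAD code: the function names, the 0.1 s frame length, and the toy activity values are all assumptions made for this example. It shows how segments, overlap regions, and a speaker count can be read off such a grid.

```python
from typing import Dict, List, Tuple


def segments_from_activity(
    activity: Dict[str, List[int]], frame_sec: float = 0.1
) -> List[Tuple[str, float, float]]:
    """Convert per-speaker 0/1 frame activity into (speaker, start, end) segments."""
    segments = []
    for speaker, frames in activity.items():
        start = None
        # A trailing 0 acts as a sentinel so a segment ending at the last frame is closed.
        for i, active in enumerate(frames + [0]):
            if active and start is None:
                start = i
            elif not active and start is not None:
                segments.append(
                    (speaker, round(start * frame_sec, 3), round(i * frame_sec, 3))
                )
                start = None
    return sorted(segments, key=lambda s: (s[1], s[0]))


def overlap_frames(activity: Dict[str, List[int]]) -> List[int]:
    """Return the indices of frames where two or more speakers are active."""
    n_frames = len(next(iter(activity.values())))
    return [
        i
        for i in range(n_frames)
        if sum(frames[i] for frames in activity.values()) >= 2
    ]


def count_speakers(activity: Dict[str, List[int]], min_active_frames: int = 1) -> int:
    """Count speakers that are active for at least min_active_frames frames."""
    return sum(1 for frames in activity.values() if sum(frames) >= min_active_frames)


# Hypothetical toy output: three candidate speakers, six frames of 0.1 s each.
toy_activity = {
    "spk1": [1, 1, 1, 0, 0, 0],
    "spk2": [0, 0, 1, 1, 1, 0],
    "spk3": [0, 0, 0, 0, 0, 0],  # never active, so it should not be counted
}

toy_segments = segments_from_activity(toy_activity)
toy_overlap = overlap_frames(toy_activity)
toy_count = count_speakers(toy_activity)
```

In this toy case the two active speakers overlap in a single frame, and the silent third candidate is excluded from the count. A real counting module would work from audio rather than from a clean grid like this one.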

Multi-channel speech recognition

The abstract describes the recognition back end in two stages:

Guided source separation (GSS): The multi-channel recordings are separated into per-speaker signals, with the diarization output guiding the separation. The authors report several improvements over the baseline GSS.
ASR: The separated signals are transcribed by a combination of recognition systems built from strong pre-trained models.

Because the separation stage is guided by the diarization output, the quality of the earlier stages directly affects recognition accuracy in multi-speaker recordings.
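As a rough illustration of why combining microphone channels can help, the sketch below uses delay-and-sum beamforming. This is a much simpler classical technique than the guided source separation named in the paper's abstract, and it is swapped in here only because it fits in a few lines. The function name, the integer sample delays, and the toy signals are assumptions for this example.

```python
from typing import List


def delay_and_sum(channels: List[List[float]], delays: List[int]) -> List[float]:
    """Align each channel by its integer sample delay, then average the channels.

    delays[k] is how many samples later the source arrives at microphone k.
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    # Only keep the time span that every aligned channel covers.
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[d + t] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(usable)
    ]


# Toy signals (hypothetical): the same source reaches mic_b one sample later,
# and the noise on the two microphones is deliberately chosen to cancel.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
mic_a = [0.2, 0.8, 0.2, -1.2, 0.2, 0.8]        # source + noise
mic_b = [0.0, -0.2, 1.2, -0.2, -0.8, -0.2, 1.2]  # one-sample delay, source - noise

enhanced = delay_and_sum([mic_a, mic_b], delays=[0, 1])
```

The noise here cancels exactly only because the toy data was built that way. With real recordings the noise is merely reduced, and the delays have to be estimated, which is one reason systems in this challenge rely on more capable front ends.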

Critical Analysis

The paper provides a thorough description of the NTT multi-speaker ASR system and its key components. The researchers have carefully addressed the challenges of distant, multi-speaker speech recognition, including overlapping speech and variable microphone setups.

One potential limitation is the system's dependence on accurate speaker diarization and counting: because the pipeline runs diarization first, errors in these components propagate to separation and recognition. In addition, the reported results are on the CHiME-8 data, which may not fully represent the diversity of real-world multi-speaker scenarios.
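One way to see why speaker attribution errors matter is to score transcripts per speaker. The sketch below computes a simplified concatenated minimum-permutation word error rate (cpWER). It is not the challenge's scoring tool: the tcpWER metric reported for CHiME-8 adds a time constraint that is omitted here, and the function names and example sentences are assumptions for illustration.

```python
from itertools import permutations
from typing import Dict, List


def word_errors(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def cp_wer(ref: Dict[str, str], hyp: Dict[str, str]) -> float:
    """Simplified cpWER: best speaker-label permutation, no time constraint."""
    ref_streams = [text.split() for text in ref.values()]
    hyp_streams = [text.split() for text in hyp.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(ref_streams), len(hyp_streams))
    ref_streams += [[]] * (n - len(ref_streams))
    hyp_streams += [[]] * (n - len(hyp_streams))
    total_ref_words = sum(len(r) for r in ref_streams)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    # Brute force over permutations; fine for a handful of speakers only.
    best = min(
        sum(word_errors(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(hyp_streams)
    )
    return best / total_ref_words


reference = {"A": "hello there", "B": "good morning everyone"}

# Same words, different speaker labels: the permutation search absorbs this.
relabelled = {"x": "good morning everyone", "y": "hello there"}

# Every word is recognized, but "good" is attributed to the wrong speaker.
misattributed = {"x": "hello there good", "y": "morning everyone"}

score_relabelled = cp_wer(reference, relabelled)
score_misattributed = cp_wer(reference, misattributed)
```

In the second hypothesis every word is transcribed correctly, yet the single misattributed word counts as one insertion and one deletion, giving 2 errors over 5 reference words. This is how a diarization mistake shows up in the final score even when the recognizer itself is right.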

Further research could explore ways to improve the robustness of the diarization and speaker counting modules, as well as extend the system's capabilities to handle a wider range of acoustic conditions and speaker configurations.

Conclusion

The NTT multi-speaker ASR system addresses the main difficulties of distant, multi-speaker speech recognition: overlapping speech, background noise, and variable microphone setups. The abstract reports a macro tcpWER of 21.3% on the development set, a 57% relative improvement over the challenge baseline.

The system's strong performance on the CHiME-8 dataset suggests that it could have valuable applications in areas such as smart home assistants, meeting transcription, and other scenarios involving distant, multi-speaker interactions. As the field of speech recognition continues to evolve, the insights and techniques presented in this paper could contribute to the development of even more advanced and versatile speech recognition systems.
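The paper's abstract reports a macro tcpWER of 21.3% on the dev set and a 57% relative improvement over the baseline. The short sketch below works through what those two figures imply. The helper names and the per-scenario values are hypothetical, and because the 57% figure is rounded, the implied baseline is only approximate.

```python
from typing import Dict


def relative_improvement(baseline: float, system: float) -> float:
    """Fraction of the baseline error rate removed by the system."""
    return (baseline - system) / baseline


def implied_baseline(system: float, rel_improvement: float) -> float:
    """Invert the relative-improvement formula to recover the baseline."""
    return system / (1.0 - rel_improvement)


def macro_average(per_scenario: Dict[str, float]) -> float:
    """Unweighted mean over scenarios, so each scenario counts equally."""
    return sum(per_scenario.values()) / len(per_scenario)


# Figures from the abstract: 21.3% macro tcpWER, 57% relative improvement.
baseline_estimate = implied_baseline(21.3, 0.57)  # roughly 49.5 (% tcpWER)

# Hypothetical per-scenario error rates, only to show how a macro average is formed.
example_scores = {"scenario_a": 10.0, "scenario_b": 20.0, "scenario_c": 30.0}
example_macro = macro_average(example_scores)
```

Put differently, the reported numbers are consistent with a baseline macro tcpWER of about 49.5%, so the system removes a little over half of the baseline's errors on the dev set.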

Related Papers

NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge

Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki

We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track. It consists of a diarization first pipeline. For diarization, we use end-to-end diarization with vector clustering (EEND-VC) followed by target speaker voice activity detection (TS-VAD) refinement. To deal with various numbers of speakers, we developed a new multi-channel speaker counting approach. We then apply guided source separation (GSS) with several improvements to the baseline system. Finally, we perform ASR using a combination of systems built from strong pre-trained models. Our proposed system achieves a macro tcpWER of 21.3 % on the dev set, which is a 57 % relative improvement over the baseline.

9/10/2024

The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization

Samuele Cornell, Taejin Park, Steve Huang, Christoph Boeddeker, Xuankai Chang, Matthew Maciejewski, Matthew Wiesner, Paola Garcia, Shinji Watanabe

This paper presents the CHiME-8 DASR challenge which carries on from the previous edition CHiME-7 DASR (C7DASR) and the past CHiME-6 challenge. It focuses on joint multi-channel distant speech recognition (DASR) and diarization with one or more, possibly heterogeneous, devices. The main goal is to spur research towards meeting transcription approaches that can generalize across arbitrary number of speakers, diverse settings (formal vs. informal conversations), meeting duration, wide-variety of acoustic scenarios and different recording configurations. Novelties with respect to C7DASR include: i) the addition of NOTSOFAR-1, an additional office/corporate meeting scenario, ii) a manually corrected Mixer 6 development set, iii) a new track in which we allow the use of large-language models (LLM) iv) a jury award mechanism to encourage participants to explore also more practical and innovative solutions. To lower the entry barrier for participants, we provide a standalone toolkit for downloading and preparing such datasets as well as performing text normalization and scoring their submissions. Furthermore, this year we also provide two baseline systems, one directly inherited from C7DASR and based on ESPnet and another one developed on NeMo and based on NeMo team submission in last year C7DASR. Baseline system results suggest that the addition of the NOTSOFAR-1 scenario significantly increases the task's difficulty due to its high number of speakers and very short duration.

7/24/2024

The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

Shutong Niu, Ruoyu Wang, Jun Du, Gaobin Yang, Yanhui Tu, Siyuan Wu, Shuangqing Qian, Huaxin Wu, Haitao Xu, Xueyang Zhang, Guolong Zhong, Xindi Yu, Jieru Chen, Mengzhi Wang, Di Cai, Tian Gao, Genshun Wan, Feng Ma, Jia Pan, Jianqing Gao

This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several aspects: For front-end speech signal processing, we introduced a data-driven joint training method for diarization and separation (JDS) to enhance audio quality. Additionally, we also integrated traditional guided source separation (GSS) for multi-channel track to provide complementary information for the JDS. For back-end speech recognition, we enhanced Whisper with WavLM, ConvNeXt, and Transformer innovations, applying multi-task training and Noise KLD augmentation, to significantly advance ASR robustness and accuracy. Our system attained a Time-Constrained minimum Permutation Word Error Rate (tcpWER) of 14.265% and 22.989% on the CHiME-8 NOTSOFAR-1 Dev-set-2 multi-channel and single-channel tracks, respectively.

9/4/2024

The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge

Jingguang Tian, Shuaishuai Ye, Shunfei Chen, Yang Xiang, Zhaohui Yin, Xinhui Hu, Xinkang Xu

This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. To address these challenges, we develop end-to-end speaker diarization models that notably decrease the diarization error rate (DER) by 49.58% compared to the official baseline on the development set. For speech recognition, we utilize self-supervised learning representations to train end-to-end ASR models. By integrating these models, we achieve a character error rate (CER) of 16.93% on the track 1 evaluation set, and a concatenated minimum permutation character error rate (cpCER) of 25.88% on the track 2 evaluation set.

5/10/2024