The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization

Read original: arXiv:2407.16447 - Published 7/24/2024 by Samuele Cornell, Taejin Park, Steve Huang, Christoph Boeddeker, Xuankai Chang, Matthew Maciejewski, Matthew Wiesner, Paola Garcia, Shinji Watanabe

Overview

The paper presents the CHiME-8 DASR Challenge, which targets joint multi-channel distant automatic speech recognition (DASR) and diarization with one or more, possibly heterogeneous, recording devices.
The challenge focuses on generalizable, array-agnostic systems that cope with varying numbers of speakers, meeting durations, acoustic conditions, and recording configurations.
It provides datasets, a toolkit for data preparation, text normalization, and scoring, and two baseline systems, so that progress in meeting transcription can be measured consistently.

Plain English Explanation

The paper describes the latest edition of a challenge in speech recognition and speaker diarization (working out who spoke when). The challenge, called CHiME-8 DASR, follows on from CHiME-7 DASR and CHiME-6. It focuses on systems that work well in a variety of real-world situations, even when the speakers are far away from the microphones.

Current speech recognition systems often struggle with distant or noisy audio recordings, such as those captured in a crowded room or a public space. The CHiME-8 DASR Challenge aims to push the boundaries of this technology by providing a dataset and evaluation tasks that mimic these challenging acoustic environments. The goal is to encourage the development of speech recognition and diarization algorithms that can work reliably in a wide range of settings, without relying on specific microphone arrays or other specialized hardware.

By addressing these challenges, the researchers hope to improve practical applications of speech recognition and diarization, such as smart home assistants, meeting transcription systems, and other real-world scenarios where an accurate record of who said what is essential.

Technical Explanation

The CHiME-8 DASR Challenge is designed to advance the state of the art in distant automatic speech recognition (DASR) and speaker diarization. The key objectives are to develop systems that are generalizable, meaning they can perform well in diverse acoustic environments, and array-agnostic, meaning they can work with different types of microphone arrays without requiring retraining or specialized hardware.

To achieve these goals, the challenge combines several recording scenarios that differ in setting (formal vs. informal conversations), number of speakers, meeting duration, acoustic conditions, and recording configuration. According to the abstract, this edition adds NOTSOFAR-1, an office/corporate meeting scenario, and a manually corrected Mixer 6 development set. The data are annotated for diarization as well as transcription, so researchers can develop and evaluate systems that not only recognize speech but also identify who is speaking at any given time.

The evaluation covers speech recognition and diarization jointly, with results reported as error rates such as the tcpWER quoted by the submissions listed under Related Papers. According to the abstract, this edition also adds a track that permits the use of large language models (LLMs) and a jury award intended to encourage more practical and innovative solutions.
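
To make joint recognition-and-diarization scoring more concrete, here is a minimal sketch of a concatenated minimum-permutation word error rate (cpWER). It is an illustration only. The submissions listed under Related Papers report tcpWER, which additionally applies a time constraint that this sketch omits. The function names `edit_distance` and `cp_wer` are assumptions made for this example, not part of the challenge's scoring toolkit.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Simplified cpWER: each input is a list with one string per speaker,
    holding that speaker's utterances already concatenated in time order."""
    refs = [r.split() for r in ref_by_speaker]
    hyps = [h.split() for h in hyp_by_speaker]

    # Pad with empty "speakers" so both sides have the same number of streams.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Try every speaker assignment and keep the one with the fewest errors.
    # This brute force is factorial in the number of speakers, so it only
    # suits small examples.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


if __name__ == "__main__":
    reference = ["hello there", "how are you"]
    hypothesis = ["how are you", "hello here"]  # speaker labels swapped
    print(cp_wer(reference, hypothesis))  # 0.2
```

Because the metric searches over speaker assignments, a system is not penalized for using different speaker labels than the reference, but it is penalized for attributing words to the wrong speaker.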

By pushing the boundaries of DASR and diarization, the CHiME-8 DASR Challenge aims to contribute to the development of more robust and practical speech recognition and diarization systems that can be deployed in a wide range of real-world applications.

Critical Analysis

The CHiME-8 DASR Challenge addresses an important problem in the field of speech recognition and diarization, namely the need for systems that can operate effectively in diverse and challenging acoustic environments. The focus on generalizable and array-agnostic approaches is particularly valuable, as it has the potential to make these technologies more accessible and usable in a wider range of real-world scenarios.

One potential limitation of the challenge is that a fixed set of recording scenarios cannot capture the full range of acoustic variability encountered in the real world. Additionally, the evaluation tasks may not reflect every constraint of deployed systems, such as hardware limitations.

Another area for further research is front-end processing that works across arbitrary microphone arrays, such as source separation. The submissions listed under Related Papers rely on guided source separation (GSS), and how best to combine such methods with diarization and recognition remains an open area of research and development.

Overall, the CHiME-8 DASR Challenge represents an important step forward in the quest to develop more robust and practical speech recognition and diarization systems. By fostering innovation and collaboration in this space, the challenge has the potential to drive significant progress in the field and unlock new applications for these technologies.

Conclusion

The CHiME-8 DASR Challenge is a timely and important initiative aimed at advancing the state of the art in distant automatic speech recognition and speaker diarization. By focusing on the development of generalizable and array-agnostic systems, the challenge seeks to address the limitations of current technologies and pave the way for more practical and deployable solutions in real-world applications.

The datasets, toolkit, baseline systems, and evaluation tracks give researchers a common framework for comparing approaches in this domain. While the challenge has some limitations, it could contribute to more robust and reliable speech recognition and diarization systems for a wide range of settings, from smart home assistants to meeting transcription tools.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization

Samuele Cornell, Taejin Park, Steve Huang, Christoph Boeddeker, Xuankai Chang, Matthew Maciejewski, Matthew Wiesner, Paola Garcia, Shinji Watanabe

This paper presents the CHiME-8 DASR challenge, which carries on from the previous edition, CHiME-7 DASR (C7DASR), and the past CHiME-6 challenge. It focuses on joint multi-channel distant speech recognition (DASR) and diarization with one or more, possibly heterogeneous, devices. The main goal is to spur research towards meeting transcription approaches that can generalize across an arbitrary number of speakers, diverse settings (formal vs. informal conversations), meeting durations, a wide variety of acoustic scenarios, and different recording configurations. Novelties with respect to C7DASR include: i) the addition of NOTSOFAR-1, an additional office/corporate meeting scenario; ii) a manually corrected Mixer 6 development set; iii) a new track in which we allow the use of large language models (LLMs); and iv) a jury award mechanism to encourage participants to also explore more practical and innovative solutions. To lower the entry barrier for participants, we provide a standalone toolkit for downloading and preparing such datasets as well as performing text normalization and scoring their submissions. Furthermore, this year we also provide two baseline systems, one directly inherited from C7DASR and based on ESPnet, and another developed on NeMo and based on the NeMo team's submission to last year's C7DASR. Baseline system results suggest that the addition of the NOTSOFAR-1 scenario significantly increases the task's difficulty due to its high number of speakers and very short duration.

7/24/2024

NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge

Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki

We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track. It consists of a diarization-first pipeline. For diarization, we use end-to-end diarization with vector clustering (EEND-VC) followed by target speaker voice activity detection (TS-VAD) refinement. To deal with various numbers of speakers, we developed a new multi-channel speaker counting approach. We then apply guided source separation (GSS) with several improvements to the baseline system. Finally, we perform ASR using a combination of systems built from strong pre-trained models. Our proposed system achieves a macro tcpWER of 21.3% on the dev set, which is a 57% relative improvement over the baseline.

9/10/2024

The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

Shutong Niu, Ruoyu Wang, Jun Du, Gaobin Yang, Yanhui Tu, Siyuan Wu, Shuangqing Qian, Huaxin Wu, Haitao Xu, Xueyang Zhang, Guolong Zhong, Xindi Yu, Jieru Chen, Mengzhi Wang, Di Cai, Tian Gao, Genshun Wan, Feng Ma, Jia Pan, Jianqing Gao

This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is the dataset recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noises, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several aspects: For front-end speech signal processing, we introduced a data-driven joint training method for diarization and separation (JDS) to enhance audio quality. Additionally, we also integrated traditional guided source separation (GSS) for multi-channel track to provide complementary information for the JDS. For back-end speech recognition, we enhanced Whisper with WavLM, ConvNeXt, and Transformer innovations, applying multi-task training and Noise KLD augmentation, to significantly advance ASR robustness and accuracy. Our system attained a Time-Constrained minimum Permutation Word Error Rate (tcpWER) of 14.265% and 22.989% on the CHiME-8 NOTSOFAR-1 Dev-set-2 multi-channel and single-channel tracks, respectively.

9/4/2024

The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge

Jingguang Tian, Shuaishuai Ye, Shunfei Chen, Yang Xiang, Zhaohui Yin, Xinhui Hu, Xinkang Xu

This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. To address these challenges, we develop end-to-end speaker diarization models that notably decrease the diarization error rate (DER) by 49.58% compared to the official baseline on the development set. For speech recognition, we utilize self-supervised learning representations to train end-to-end ASR models. By integrating these models, we achieve a character error rate (CER) of 16.93% on the track 1 evaluation set, and a concatenated minimum permutation character error rate (cpCER) of 25.88% on the track 2 evaluation set.

5/10/2024
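
Two of the abstracts above quote relative improvements over a baseline (57% for tcpWER in the NTT system, 49.58% for DER in the RoyalFlush system). As a quick worked example of what such figures mean, the sketch below converts between absolute and relative error rates. The helper names are hypothetical, and the implied baseline it prints is only an approximation derived from the rounded numbers in the NTT abstract, not a value reported by the authors.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers the error rate relative to `baseline`."""
    return (baseline - system) / baseline


def implied_baseline(system: float, reduction: float) -> float:
    """Baseline error rate implied by a system error rate and a relative reduction.

    Inverts relative_reduction: system = baseline * (1 - reduction).
    """
    return system / (1.0 - reduction)


if __name__ == "__main__":
    # Assumption: 21.3 (% macro tcpWER) and 0.57 are taken as quoted in the
    # NTT abstract; both are rounded, so the result is only approximate.
    baseline = implied_baseline(21.3, 0.57)
    print(f"implied baseline tcpWER: {baseline:.1f}%")  # roughly 49.5%
    print(f"check: {relative_reduction(baseline, 21.3):.2f}")  # 0.57
```

A relative figure on its own does not say how large the remaining error is, which is why the abstracts also quote the absolute rate.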