Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization

Read original: arXiv:2309.16482 - Published 5/7/2024 by Thilo von Neumann, Christoph Boeddeker, Tobias Cord-Landwehr, Marc Delcroix, Reinhold Haeb-Umbach

Overview

This paper proposes a modular pipeline for separating, recognizing, and diarizing speech in meeting-style recordings.
The pipeline uses a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer.
A d-vector-based diarization module is employed to extract speaker embeddings and assign the CSS outputs to the correct speaker.
The researchers propose a syntactically informed diarization approach that uses sentence- and word-level boundaries from the Automatic Speech Recognition (ASR) module to support speaker turn detection.
The pipeline achieves state-of-the-art performance on the Libri-CSS dataset in terms of Optimal Reference Combination Word Error Rate (ORC WER) and Concatenated minimum-Permutation Word Error Rate (cpWER).
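
The data flow described in this overview can be sketched in a few lines of Python. This is only an illustration of how the three stages hand data to each other, not the authors' implementation: the `separate`, `recognize`, and `diarize` callables, the `Word` and `AttributedWord` types, and the plain list-of-floats audio representation are all hypothetical placeholders introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Audio = Sequence[float]  # mono samples; a stand-in type for this sketch


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class AttributedWord:
    word: Word
    speaker: int


def run_pipeline(
    mixture: Audio,
    separate: Callable[[Audio], List[Audio]],   # hypothetical CSS front end
    recognize: Callable[[Audio], List[Word]],   # hypothetical speaker-agnostic ASR
    diarize: Callable[[List[Audio], List[List[Word]]], List[List[int]]],  # hypothetical
) -> List[AttributedWord]:
    """Chain separation, recognition, and diarization in that order."""
    streams = separate(mixture)
    transcripts = [recognize(stream) for stream in streams]
    # The diarizer sees all streams at once so speaker labels are
    # consistent across the separated output channels.
    labels = diarize(streams, transcripts)
    result: List[AttributedWord] = []
    for words, word_labels in zip(transcripts, labels):
        if len(words) != len(word_labels):
            raise ValueError("diarizer must return one label per word")
        result.extend(AttributedWord(w, s) for w, s in zip(words, word_labels))
    # Sort by start time so the transcript reads chronologically.
    return sorted(result, key=lambda aw: aw.word.start)
```

Keeping each stage behind its own callable mirrors the modular design: the recognizer never needs to know who is speaking, and the diarizer can be swapped without retraining the separator.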

Plain English Explanation

This research presents a system that can take audio recordings of group conversations, like meetings, and automatically separate the different speakers, transcribe what they're saying, and identify who said what. This is a challenging task because in group settings, people often talk over each other, making it hard to distinguish individual voices.

The key components of the system are:

A speech separation module that can isolate each speaker's voice from the mixed audio signal. This uses a neural network architecture called TF-GridNet.
A speech recognition module that transcribes the separated audio into text, without knowing who the speakers are.
A speaker diarization module that analyzes the transcripts and audio to determine which parts were said by which speaker.

The researchers found that using information about the sentence and word structure of the transcripts helped the diarization module do a better job of detecting where one speaker stops and another starts. This resulted in state-of-the-art performance on Libri-CSS, a benchmark of meeting-style recordings, meaning the system transcribed the content and attributed it to the correct speakers with fewer errors than earlier systems on that benchmark.

This type of technology could be very useful for applications like automated meeting transcription, which can help teams stay organized and productive. It could also have applications in areas like conversational speech recognition and speaker identification.

Technical Explanation

The researchers propose a modular pipeline for single-channel speech separation, recognition, and diarization. The core components are:

Continuous Speech Separation (CSS): A TF-GridNet-based CSS system is used to separate the individual speaker voices from the mixed audio input. This builds on previous work in cocktail party speech separation.
Speech Recognition: A speaker-agnostic speech recognizer is employed to transcribe the separated audio streams into text.
Speaker Diarization: A d-vector-based diarization module is used to extract speaker embeddings from the enhanced signals and assign the CSS outputs to the correct speaker. The researchers propose a syntactically informed diarization approach that leverages the sentence- and word-level boundaries from the ASR module to improve speaker turn detection.
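
To make the diarization step more concrete, here is a minimal sketch of the two ideas it combines: cutting a recognized word stream at sentence boundaries, and grouping per-segment speaker embeddings by similarity. Several things here are assumptions and not taken from the paper: that the recognizer emits sentence-final punctuation, the `(text, start, end)` word format, the greedy threshold clustering (a simple stand-in, the authors' clustering method may differ), and the `threshold` value. The embeddings are toy vectors; in a real system they would come from a d-vector extractor run on each segment.

```python
import math
from typing import List, Sequence, Tuple

WordT = Tuple[str, float, float]  # (text, start_sec, end_sec), assumed ASR output


def split_at_sentences(words: List[WordT]) -> List[List[WordT]]:
    """Group words into segments that end at sentence-final punctuation."""
    segments: List[List[WordT]] = []
    current: List[WordT] = []
    for word in words:
        current.append(word)
        if word[0].endswith((".", "?", "!")):
            segments.append(current)
            current = []
    if current:  # trailing words without closing punctuation
        segments.append(current)
    return segments


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_speakers(embeddings: List[Sequence[float]],
                    threshold: float = 0.7) -> List[int]:
    """Greedy clustering: join the most similar speaker, else open a new one."""
    centroids: List[List[float]] = []
    counts: List[int] = []
    labels: List[int] = []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            # Update the running mean of the matched speaker's centroid.
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] += 1
        else:
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        labels.append(best)
    return labels
```

Segmenting at sentence boundaries means each embedding is computed over a span that is likely to contain a single speaker, which is the intuition behind using ASR boundaries to support speaker turn detection.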

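The pipeline is evaluated on the Libri-CSS dataset, a benchmark for meeting-style recordings. The researchers report state-of-the-art performance in terms of Optimal Reference Combination Word Error Rate (ORC WER) and Concatenated minimum-Permutation Word Error Rate (cpWER). ORC WER scores the speaker-agnostic recognition output, so it reflects separation and recognition quality only. cpWER scores the full pipeline: it also counts words that were recognized correctly but attributed to the wrong speaker, so it reflects recognition and diarization errors together.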
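
As an illustration of what cpWER measures, the sketch below concatenates all utterances of each speaker, tries every mapping between reference and hypothesis speakers, and keeps the mapping with the fewest word errors. It is a brute-force toy version written for this summary, not the evaluation code used in the paper: the function names and the dictionary input format are assumptions, it expects at least one reference word, and trying all permutations only scales to a handful of speakers (practical tools solve the assignment more efficiently).

```python
from itertools import permutations
from typing import Dict, List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # match or substitution
        prev = curr
    return prev[-1]


def cpwer(ref: Dict[str, List[str]], hyp: Dict[str, List[str]]) -> float:
    """Concatenate each speaker's utterances, then pick the best speaker mapping."""
    ref_streams = [" ".join(utts).split() for utts in ref.values()]
    hyp_streams = [" ".join(utts).split() for utts in hyp.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(ref_streams), len(hyp_streams))
    ref_streams += [[]] * (n - len(ref_streams))
    hyp_streams += [[]] * (n - len(hyp_streams))
    total_ref_words = sum(len(r) for r in ref_streams)
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(hyp_streams)
    )
    return best_errors / total_ref_words


reference = {"A": ["hello world", "good morning"], "B": ["how are you"]}
# Every word is recognized correctly, but "good morning" is given to the wrong speaker.
hypothesis = {"s0": ["hello world"], "s1": ["good morning how are you"]}
print(cpwer(reference, hypothesis))  # 4 errors over 7 reference words
```

In the usage example every word is transcribed correctly, yet the score is 4/7 because two words are missing from one speaker and appear as insertions for the other. A speaker-agnostic metric would not penalize this, which is why the two metrics are reported side by side.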

Critical Analysis

The proposed pipeline advances multi-speaker meeting transcription by combining CSS, speaker-agnostic ASR, and syntactically informed diarization in one modular system that reaches state-of-the-art ORC WER and cpWER on Libri-CSS.

One potential limitation is the reliance on the Libri-CSS dataset, which may not fully capture the complexity and diversity of real-world meeting scenarios. Libri-CSS does include overlapping speech, but it is built from read LibriSpeech utterances replayed in a meeting room. It would be interesting to see how the pipeline performs on more varied and challenging datasets, such as those with spontaneous conversational speech, background noise, or accented speakers.

Additionally, the paper does not provide a detailed analysis of the computational and memory requirements of the pipeline. This information would be useful for assessing the feasibility of deploying such a system in practical applications, especially on resource-constrained devices.

Further research could also explore ways to make the pipeline more robust to errors or missing information in the ASR outputs, as the diarization module relies heavily on the accuracy of the transcripts. Approaches like joint optimization of speech separation and recognition could be investigated.

Overall, this research represents an important step forward in the quest for conversational speech recognition at industrial scale, and the authors have made a valuable contribution to the field.

Conclusion

This paper presents a modular pipeline for separating, recognizing, and diarizing speech in meeting-style recordings. By combining state-of-the-art techniques in speech separation, recognition, and speaker diarization, the researchers have achieved impressive results on the Libri-CSS benchmark dataset.

The proposed system could have significant practical applications, such as automated meeting transcription, which can help teams stay organized and productive. It also has the potential to contribute to the broader field of conversational speech recognition and speaker identification.

While the pipeline represents an important advancement, further research is needed to address potential limitations and explore ways to make the system more robust and widely applicable. Overall, this work demonstrates the power of combining multiple AI technologies to tackle complex real-world problems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization

Thilo von Neumann, Christoph Boeddeker, Tobias Cord-Landwehr, Marc Delcroix, Reinhold Haeb-Umbach

We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, we achieve state-of-the-art recognition performance in terms of Optimal Reference Combination Word Error Rate (ORC WER). Then, a d-vector-based diarization module is employed to extract speaker embeddings from the enhanced signals and to assign the CSS outputs to the correct speaker. Here, we propose a syntactically informed diarization using sentence- and word-level boundaries of the ASR module to support speaker turn detection. This results in a state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for the full meeting recognition pipeline.

5/7/2024

End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations

Giovanni Morrone, Samuele Cornell, Luca Serafini, Enrico Zovato, Alessio Brutti, Stefano Squartini

Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance both in online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets: CALLHOME and Fisher Corpus (Part 1 and 2) and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 s. Finally, we also show that the separated signals can also be readily used for automatic speech recognition, reaching performance close to using oracle sources in some configurations.

5/24/2024

Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

William Ravenscroft, George Close, Stefan Goetze, Thomas Hain, Mohammad Soleymanpour, Anurag Chowdhury, Mark C. Fuhs

One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domain audio where reference transcript information is not always available. This paper proposes a transcription-free method for joint training using only audio signals. The proposed method uses embedding differences of pre-trained ASR encoders as a loss with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) measures over a signal-level loss and also shows enhancement improvements in perceptual measures such as short-time objective intelligibility (STOI).

6/14/2024

LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

Zengrui Jin, Yifan Yang, Mohan Shi, Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu, Shi-Xiong Zhang, Daniel Povey

The evolving speech processing landscape is increasingly focused on complex scenarios like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions. Existing methodologies for addressing these challenges fall into two categories: multi-channel and single-channel solutions. Single-channel approaches, notable for their generality and convenience, do not require specific information about microphone arrays. This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. This dataset is a critical resource for decoding "Who said What and When" in multi-talker, reverberant environments, a daunting challenge in the field. Additionally, we introduce a pipeline system encompassing speech separation, recognition, and diarization as a foundational benchmark. Evaluations on the WHAMR! dataset validate the broad applicability of the proposed data.

9/4/2024