From Modular to End-to-End Speaker Diarization

Read original: arXiv:2407.08752 - Published 7/15/2024 by Federico Landini

Overview

Speaker diarization is the task of determining "who spoke when" in a recording.
Traditional modular approaches reached state-of-the-art performance but struggled with overlapped speech.
More recent end-to-end neural models have shown promise in handling overlapped speech.
This paper explores both modular and end-to-end approaches, highlighting their advantages and limitations.

Plain English Explanation

Speaker diarization means figuring out "who spoke when" in an audio recording, rather than what was said. In the past, the best systems used a modular approach, where different components handled different parts of the task. These systems worked well in many cases, but had trouble when people were talking over each other.

More recently, researchers have started using end-to-end neural networks, which can handle all aspects of speaker diarization with a single model. These models have shown they can do a better job with overlapping speech. This paper looks at both the traditional and newer approaches, explaining their strengths and weaknesses.

Technical Explanation

The paper first describes a modular system called VBx that uses a Bayesian hidden Markov model to cluster x-vectors (speaker embeddings from a neural network). VBx has demonstrated strong performance on various datasets.
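To give a feel for why a hidden Markov model helps when clustering per-segment embeddings, here is a minimal sketch. It is not VBx itself: VBx uses Bayesian (variational) inference over x-vectors, whereas this toy uses plain Viterbi decoding over 2-D points with fixed speaker means. The function name `viterbi_sticky`, the parameters `stay_prob` and `var`, and the data are all hypothetical and chosen only for illustration.

```python
import math

def viterbi_sticky(frames, means, stay_prob=0.9, var=1.0):
    """Label each embedding with a speaker state, favouring temporal continuity.

    Toy assumptions (not from the source): isotropic Gaussian emissions with
    known means, a uniform start distribution, and at least two speakers.
    """
    n_states = len(means)
    switch_prob = (1.0 - stay_prob) / (n_states - 1)

    def log_emit(x, mu):
        # Gaussian log-likelihood up to an additive constant.
        return -sum((a - b) ** 2 for a, b in zip(x, mu)) / (2.0 * var)

    score = [math.log(1.0 / n_states) + log_emit(frames[0], mu) for mu in means]
    back = []
    for x in frames[1:]:
        new_score, pointers = [], []
        for j, mu in enumerate(means):
            cands = [
                score[i] + math.log(stay_prob if i == j else switch_prob)
                for i in range(n_states)
            ]
            best = max(range(n_states), key=lambda i: cands[i])
            pointers.append(best)
            new_score.append(cands[best] + log_emit(x, mu))
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

means = [(0.0, 0.0), (4.0, 4.0)]
frames = [
    (0.1, -0.1), (-0.2, 0.1), (0.0, 0.2),
    (2.2, 2.2),  # noisy embedding, slightly closer to speaker 1
    (0.1, 0.0),
    (4.1, 3.9), (3.8, 4.2), (4.0, 4.0),
]
print(viterbi_sticky(frames, means))  # [0, 0, 0, 0, 0, 1, 1, 1]
```

The noisy fourth embedding is closer to the second speaker's mean, yet the high self-transition probability keeps it with the first speaker. Setting `stay_prob=0.5` removes that preference and reduces the decoder to nearest-mean assignment.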

The paper then focuses on end-to-end neural diarization (EEND) methods. EEND models require large amounts of training data, and manually annotated diarization data is not available in sufficient quantities, so the training data is generated artificially. The author describes a technique for generating "simulated conversations" that resemble real conversations in terms of speaker turns and overlaps, and shows that training the EEND-EDA model (EEND with encoder-decoder attractors) on this data gives better performance than training on the previously proposed "simulated mixtures".
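The idea of a synthetic conversation defined by speaker turns and overlaps can be sketched as follows. This is only an illustration of the concept, not the procedure from the paper: the function `simulate_conversation`, the exponential distributions, and the values of `mean_turn`, `mean_gap`, and `overlap_prob` are assumptions made up for this example, whereas a real recipe would draw such statistics from actual conversations.

```python
import random

def simulate_conversation(n_turns, speakers, rng,
                          mean_turn=3.0, mean_gap=0.5, overlap_prob=0.2):
    """Return a list of (speaker, start_sec, end_sec) turns.

    A positive gap is a pause between turns; a negative gap means the next
    speaker starts before the previous one has finished (an overlap).
    Requires at least two speakers.
    """
    turns = []
    prev_end = 0.0
    prev_start = 0.0
    prev_speaker = None
    for _ in range(n_turns):
        # Force a speaker change so every turn boundary is a real turn.
        speaker = rng.choice([s for s in speakers if s != prev_speaker])
        duration = rng.expovariate(1.0 / mean_turn)
        gap = rng.expovariate(1.0 / mean_gap)
        if turns and rng.random() < overlap_prob:
            gap = -gap
        # Never start before the previous turn started.
        start = max(prev_start, prev_end + gap)
        end = start + duration
        turns.append((speaker, start, end))
        prev_start = start
        prev_end = max(prev_end, end)
        prev_speaker = speaker
    return turns

rng = random.Random(0)  # fixed seed so the sketch is reproducible
for speaker, start, end in simulate_conversation(6, ["A", "B", "C"], rng):
    print(f"{speaker}: {start:6.2f} -> {end:6.2f}")
```

In a full pipeline, each turn would then be filled with an utterance from the corresponding speaker and the signals summed to form the training recording, with the turn list serving as the diarization labels.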

The paper also introduces a new EEND-based model called DiaPer, which the authors claim can outperform EEND-EDA, especially when dealing with many speakers and overlapping speech.
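EEND-style models typically output, for every frame, an independent activity probability per speaker, which is what lets them mark several speakers as active at once. The sketch below shows how such an output could be turned into segments. It does not reproduce DiaPer or EEND-EDA; the function `posteriors_to_segments`, the 0.5 threshold, and the frame shift are illustrative assumptions.

```python
def posteriors_to_segments(posteriors, threshold=0.5, frame_shift=0.1):
    """Convert posteriors[t][s] into (speaker_index, start_sec, end_sec) segments.

    Each speaker is thresholded independently, so overlapping segments from
    different speakers are allowed.
    """
    if not posteriors:
        return []
    n_speakers = len(posteriors[0])
    segments = []
    for s in range(n_speakers):
        start = None
        for t, frame in enumerate(posteriors):
            active = frame[s] > threshold
            if active and start is None:
                start = t  # speaker becomes active
            elif not active and start is not None:
                segments.append((s, start * frame_shift, t * frame_shift))
                start = None
        if start is not None:
            # Speaker is still active at the end of the recording.
            segments.append((s, start * frame_shift, len(posteriors) * frame_shift))
    return sorted(segments, key=lambda seg: (seg[1], seg[0]))

# Five frames, two speakers; both are active in the third frame.
posteriors = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.9],
    [0.2, 0.8],
    [0.1, 0.1],
]
print(posteriors_to_segments(posteriors, frame_shift=1.0))
# [(0, 0.0, 3.0), (1, 2.0, 4.0)]
```

The two resulting segments share the interval from 2.0 to 3.0, which is the overlapped speech that a clustering-based system assigning one speaker per segment cannot represent directly.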

Finally, the paper compares the performance of the VBx-based and DiaPer systems on a variety of datasets.
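Diarization systems are commonly compared using the diarization error rate (DER), which adds up missed speech, false-alarm speech, and speaker confusion and divides by the total reference speech. The following is a simplified frame-level sketch of that metric, not the scoring setup used in the paper: the function `frame_der` is hypothetical, it ignores forgiveness collars, and it finds the speaker mapping by brute force, which is only practical for a handful of speakers.

```python
from itertools import permutations

def frame_der(reference, hypothesis):
    """Frame-level DER for two equal-length lists of per-frame speaker-label sets."""
    total = sum(len(r) for r in reference)
    if total == 0:
        raise ValueError("reference contains no speech")

    ref_labels = sorted(set().union(*reference))
    hyp_labels = sorted(set().union(*hypothesis))
    # Pad with None so surplus hypothesis speakers can stay unmapped.
    targets = ref_labels + [None] * max(0, len(hyp_labels) - len(ref_labels))

    miss = sum(max(0, len(r) - len(h)) for r, h in zip(reference, hypothesis))
    false_alarm = sum(max(0, len(h) - len(r)) for r, h in zip(reference, hypothesis))

    # Choose the hypothesis-to-reference mapping that maximises correct frames.
    best_correct = 0
    for perm in permutations(targets, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        correct = sum(
            len(r & {mapping[label] for label in h})
            for r, h in zip(reference, hypothesis)
        )
        best_correct = max(best_correct, correct)

    matched = sum(min(len(r), len(h)) for r, h in zip(reference, hypothesis))
    confusion = matched - best_correct
    return (miss + false_alarm + confusion) / total

reference = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hypothesis = [{"x"}, {"x"}, {"x"}, {"y"}, {"y"}]
print(frame_der(reference, hypothesis))
```

In this example the hypothesis misses one speaker in the overlapped third frame and produces one false alarm in the final silent frame, so two of the five reference speaker-frames are in error.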

Critical Analysis

The paper acknowledges that end-to-end diarization models need large training sets and that manually annotated data is not available in sufficient quantities. Generating "simulated conversations" is an interesting compromise, but it remains to be seen how well models trained on these synthetic samples transfer to real-world recordings.

Additionally, while the DiaPer model appears to offer improved performance, especially for challenging cases, the paper does not provide a detailed analysis of its inner workings or a clear explanation of why it outperforms EEND-EDA. Further research may be needed to fully understand the model's strengths and limitations.

Conclusion

This paper presents a comprehensive exploration of both modular and end-to-end approaches to speaker diarization. The findings suggest that while traditional systems like VBx have their merits, the newer end-to-end neural models, such as DiaPer, hold promise for better handling of overlapping speech and more complex scenarios. As the field continues to evolve, the insights from this research can help guide the development of increasingly robust and versatile speaker diarization systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

From Modular to End-to-End Speaker Diarization

Federico Landini

Speaker diarization is usually referred to as the task that determines "who spoke when" in a recording. Until a few years ago, all competitive approaches were modular. Systems based on this framework reached state-of-the-art performance in most scenarios but had major difficulties dealing with overlapped speech. More recently, the advent of end-to-end models, capable of dealing with all aspects of speaker diarization with a single model and better performing regarding overlapped speech, has brought high levels of attention. This thesis is framed during a period of co-existence of these two trends. We describe a system based on a Bayesian hidden Markov model used to cluster x-vectors (speaker embeddings obtained with a neural network), known as VBx, which has shown remarkable performance on different datasets and challenges. We comment on its advantages and limitations and evaluate results on different relevant corpora. Then, we move towards end-to-end neural diarization (EEND) methods. Due to the need for large training sets for training these models and the lack of manually annotated diarization data in sufficient quantities, the compromise solution consists in generating training data artificially. We describe an approach for generating synthetic data which resembles real conversations in terms of speaker turns and overlaps. We show how this method generating "simulated conversations" allows for better performance than using a previously proposed method for creating "simulated mixtures" when training the popular EEND with encoder-decoder attractors (EEND-EDA). We also propose a new EEND-based model, which we call DiaPer, and show that it can perform better than EEND-EDA, especially when dealing with many speakers and handling overlapped speech. Finally, we compare both VBx-based and DiaPer systems on a wide variety of corpora and comment on the advantages of each technique.

7/15/2024

DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors

Federico Landini, Mireia Diez, Themos Stafylakis, Lukáš Burget

Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-based one and show its advantages over EEND-EDA; namely obtaining better performance on the largely studied Callhome dataset, finding the quantity of speakers in a conversation more accurately, and faster inference time. Furthermore, when exhaustively compared with other methods, our model, DiaPer, reaches remarkable performance with a very lightweight design. Besides, we perform comparisons with other works and a cascaded baseline across more than ten public wide-band datasets. Together with this publication, we release the code of DiaPer as well as models trained on public and free data.

6/4/2024

Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization

Xiang Li, Vivek Govindan, Rohit Paturi, Sundararajan Srinivasan

End-to-end neural diarization (EEND) models offer significant improvements over traditional embedding-based Speaker Diarization (SD) approaches but fall short on generalizing to long-form audio with a large number of speakers. The EEND-vector-clustering method mitigates this by combining local EEND with global clustering of speaker embeddings from local windows, but this requires an additional speaker embedding framework alongside the EEND module. In this paper, we propose a novel framework applying EEND both locally and globally for long-form audio without separate speaker embeddings. This approach achieves significant relative DER reductions of 13% and 10% over the conventional 1-pass EEND on the Callhome American English and RT03-CTS datasets respectively, and marginal improvements over EEND-vector-clustering without the need for additional speaker embeddings. Furthermore, we discuss the computational complexity of our proposed framework and explore strategies for reducing processing times.

6/28/2024

Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios

Juan Ignacio Alvarez-Trejos, Beltrán Labrador, Alicia Lozano-Diez

End-to-end neural speaker diarization systems are able to address the speaker diarization task while effectively handling speech overlap. This work explores the incorporation of speaker information embeddings into the end-to-end systems to enhance the speaker discriminative capabilities, while maintaining their overlap handling strengths. To achieve this, we propose several methods for incorporating these embeddings alongside the acoustic features. Furthermore, we delve into an analysis of the correct handling of silence frames, the window length for extracting speaker embeddings and the transformer encoder size. The effectiveness of our proposed approach is thoroughly evaluated on the CallHome dataset for the two-speaker diarization task, with results that demonstrate a significant reduction in diarization error rates, achieving a relative improvement of 10.78% compared to the baseline end-to-end model.

7/2/2024