TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024

Read original: arXiv:2407.12743 - Published 7/18/2024 by Joonas Kalda, Tanel Alumäe, Martin Lebourdais, Hervé Bredin, Séverin Baroudi, Ricard Marxer

Overview

• This paper presents the speaker and language diarization systems developed by the TalTech-IRIT-LIS team for the DISPLACE 2024 challenge.

• The DISPLACE challenge targets multilingual conversational speech recorded in real-world conditions. Its 2024 edition covers speaker diarization, language diarization, and automatic speech recognition; the team entered the speaker and language diarization tracks.

Plain English Explanation

• The researchers developed two systems for the DISPLACE challenge:

A speaker diarization system that determines who is speaking when in a conversation
A language diarization system that determines which language is being spoken when

• Speaker diarization matters for tasks like meeting transcription, where each part of the audio has to be attributed to the right speaker. Language diarization plays the same role for multilingual conversations, marking where speakers switch from one language to another.

• Both systems build on pre-trained neural models: the pyannote.audio pipeline for speakers and a Wav2Vec2-BERT embedding model for languages. They are meant to cope with the hard parts of real conversations, such as overlapping speech and rapid language switching.
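Overlapping speech is one reason the paper's abstract mentions powerset training. The idea, sketched below under simplifying assumptions, is to treat every allowed combination of active speakers as its own class, so that each audio frame gets exactly one label even when two people talk at once. The function names and the toy frames are illustrative only and are not taken from the paper or from pyannote.audio.

```python
from itertools import combinations


def powerset_classes(num_speakers, max_simultaneous):
    """List every speaker subset with at most `max_simultaneous` members.

    Subsets are ordered by size, so index 0 is always "nobody speaking".
    """
    classes = []
    for size in range(max_simultaneous + 1):
        for combo in combinations(range(num_speakers), size):
            classes.append(frozenset(combo))
    return classes


def encode_frames(multilabel_frames, classes):
    """Turn per-frame speaker activity flags into single class indices.

    Raises KeyError if a frame has more active speakers than the
    class list allows.
    """
    index = {subset: i for i, subset in enumerate(classes)}
    encoded = []
    for frame in multilabel_frames:
        active = frozenset(s for s, is_on in enumerate(frame) if is_on)
        encoded.append(index[active])
    return encoded


# Toy example (hypothetical values): 3 speakers, at most 2 talking at once.
classes = powerset_classes(num_speakers=3, max_simultaneous=2)

# One tuple per frame: silence, speaker 0, speakers 0+1 overlapping, speaker 2.
frames = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 0, 1)]
labels = encode_frames(frames, classes)

print(len(classes))  # 1 empty set + 3 single speakers + 3 pairs
print(labels)
```

With three speakers and at most two overlapping, there are seven classes, and the overlapped frame receives a single label of its own rather than two independent "speaker active" flags.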

Technical Explanation

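• For speaker diarization, the team's best submission was an ensemble of systems built on the pyannote.audio speaker diarization pipeline, combining powerset training with their recently proposed PixIT method, which performs joint diarization and speech separation. They extend PixIT by extracting speaker embeddings from the separated outputs. The ensemble reached a diarization error rate (DER) of 27.1% on the evaluation set.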

• For language diarization, the team fine-tuned a pre-trained Wav2Vec2-BERT language embedding model on in-domain data, then clustered short segments with agglomerative hierarchical clustering (AHC) and VBx, using similarity scores from LDA/PLDA. This gave a language diarization error rate of 27.6% on the evaluation data.

• Both submissions ranked first in their respective challenge tracks.
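The paper's abstract describes clustering short segments with AHC for language diarization. The sketch below illustrates the general technique of agglomerative clustering with average linkage on toy 2-D embeddings. It is not the authors' implementation: it uses plain cosine similarity where the paper uses LDA/PLDA scores, it omits the VBx step, and the threshold and embedding values are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ahc(embeddings, threshold):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per segment and repeatedly merges the most
    similar pair of clusters until no pair scores at least `threshold`.
    Returns one cluster label per segment.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average similarity over all cross-cluster segment pairs.
        sims = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_score, (a, b) = max(
            (linkage(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best_score < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


# Hypothetical segment embeddings: segments 0, 1, 4 point one way
# (say, language X) and segments 2, 3 point another (language Y).
segments = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (1.0, 0.0)]
labels = ahc(segments, threshold=0.8)
print(labels)
```

The stopping threshold decides how many languages are found, so in practice it would be tuned on development data rather than fixed by hand as it is here.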

Critical Analysis

• A key challenge noted by the authors is handling rapid language switching, where speakers may alternate between multiple languages within a single utterance. Their approaches aim to be robust to this, but further improvements could still be made.

• The systems were developed and evaluated on the DISPLACE dataset, which represents real-world conversational scenarios. However, their performance on other datasets or in different domains may vary and would require further testing.

• While the technical details are impressive, the ultimate usefulness of these systems will depend on how well they function in practical applications, such as live meeting transcription or multilingual customer service.
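Since the evaluation discussed above is reported as a diarization error rate, a small worked example may help in reading those numbers. The sketch below computes a simplified, frame-level DER as missed speech plus false alarms plus speaker confusion, divided by the total reference speech. It assumes one speaker per frame and that hypothesis labels have already been mapped to reference labels; official scoring tools also handle overlap and the optimal label mapping, which are left out here. The function name and the toy labels are illustrative.

```python
def frame_der(reference, hypothesis):
    """Simplified frame-level diarization error rate.

    Each list holds one label per frame, with None meaning no speech.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")

    missed = false_alarm = confusion = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None and hyp is None:
            missed += 1        # speech in reference, none detected
        elif ref is None and hyp is not None:
            false_alarm += 1   # speech detected where there is none
        elif ref is not None and ref != hyp:
            confusion += 1     # speech detected, wrong speaker

    total_speech = sum(ref is not None for ref in reference)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / total_speech


# Toy 10-frame recording (hypothetical values).
reference = ["A", "A", "A", "B", "B", None, "B", "B", "A", "A"]
hypothesis = ["A", "A", "B", "B", None, "B", "B", "B", "A", "A"]

der = frame_der(reference, hypothesis)
print(f"DER = {der:.1%}")
```

Here one frame is confused, one is missed, and one is a false alarm, against nine frames of reference speech, so the error rate is 3/9. Because false alarms are counted in the numerator but not the denominator, DER can exceed 100% in principle.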

Conclusion

• The TalTech-IRIT-LIS team's systems ranked first in both the speaker diarization and language diarization tracks of DISPLACE 2024. They combine pre-trained neural models with in-domain fine-tuning, ensembling, and clustering to handle these challenging speech processing tasks.

• These capabilities have important applications in areas like meeting transcription, multilingual customer service, and spoken language translation. As conversational AI systems become more prevalent, robust diarization and language ID will be crucial for understanding and mediating human-to-human and human-to-machine interactions.

Related Papers

TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024

Joonas Kalda, Tanel Alumäe, Martin Lebourdais, Hervé Bredin, Séverin Baroudi, Ricard Marxer

This paper describes the submissions of team TalTech-IRIT-LIS to the DISPLACE 2024 challenge. Our team participated in the speaker diarization and language diarization tracks of the challenge. In the speaker diarization track, our best submission was an ensemble of systems based on the pyannote.audio speaker diarization pipeline utilizing powerset training and our recently proposed PixIT method that performs joint diarization and speech separation. We improve upon PixIT by using the separation outputs for speaker embedding extraction. Our ensemble achieved a diarization error rate of 27.1% on the evaluation dataset. In the language diarization track, we fine-tuned a pre-trained Wav2Vec2-BERT language embedding model on in-domain data, and clustered short segments using AHC and VBx, based on similarity scores from LDA/PLDA. This led to a language diarization error rate of 27.6% on the evaluation data. Both results were ranked first in their respective challenge tracks.

7/18/2024

System Description for the Displace Speaker Diarization Challenge 2023

Ali Aliyev

This paper describes our solution for the Diarization of Speaker and Language in Conversational Environments Challenge (Displace 2023). We used a combination of VAD for finding segments with speech, a ResNet-based CNN for feature extraction from these segments, and spectral clustering of the extracted features. Even though it was not trained on Hindi, the described algorithm achieves DERs of 27.1% and 27.4% on the development and phase-1 evaluation parts of the dataset, respectively.

6/26/2024

The Second DISPLACE Challenge : DIarization of SPeaker and LAnguage in Conversational Environments

Shareef Babu Kalluri, Prachi Singh, Pratik Roy Chowdhuri, Apoorva Kulkarni, Shikha Baghel, Pradyoth Hegde, Swapnil Sontakke, Deepak K T, S. R. Mahadeva Prasanna, Deepu Vijayasenan, Sriram Ganapathy

The DIarization of SPeaker and LAnguage in Conversational Environments (DISPLACE) 2024 challenge is the second in the series of DISPLACE challenges, which involves tasks of speaker diarization (SD) and language diarization (LD) on a challenging multilingual conversational speech dataset. In the DISPLACE 2024 challenge, we also introduced the task of automatic speech recognition (ASR) on this dataset. The dataset containing 158 hours of speech, consisting of both supervised and unsupervised mono-channel far-field recordings, was released for LD and SD tracks. Further, 12 hours of close-field mono-channel recordings were provided for the ASR track conducted on 5 Indian languages. The details of the dataset, baseline systems and the leader board results are highlighted in this paper. We have also compared our baseline models and the team's performances on evaluation data of DISPLACE-2023 to emphasize the advancements made in this second version of the challenge.

6/17/2024

NAIST Simultaneous Speech Translation System for IWSLT 2024

Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Tomoya Yanagita, Kosuke Doi, Mana Makinae, Haotian Tan, Makoto Sakai, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura

This paper describes NAIST's submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-{German, Japanese, Chinese} speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.

7/2/2024