A framework of text-dependent speaker verification for Chinese numerical string corpus

Read original: arXiv:2405.07029 - Published 5/22/2024 by Litong Zheng, Feng Hong, Weijie Xu, Wan Zheng

Overview

The paper presents an end-to-end speaker verification system that enhances text-dependent speaker verification (TD-SV) by decoupling speaker and text information.
The researchers recorded a new publicly available Chinese numerical corpus called SHALCAS22A (SHAL) to address the scarcity of data for this task.
The proposed system reports equal error rate (EER) improvements of 49.2% on the Hi-Mia dataset and 75.0% on SHAL.

Plain English Explanation

The paper focuses on improving speaker verification, which is the process of confirming a person's identity based on their voice. This is particularly important for financial transactions, where it's crucial to ensure the speaker is who they claim to be.

Prior research indicates that for short utterances, text-dependent speaker verification (TD-SV) consistently outperforms text-independent speaker verification (TI-SV). However, TD-SV implicitly validates the spoken text as well as the speaker, and that text check can be thrown off by variations in reading rhythm and pauses.

To address this, the researchers developed an end-to-end speaker verification system that separates the speaker's identity from the text they are saying. The system has three main components:

A text embedding extractor, which uses an enhanced Transformer model to extract information about the text.
A speaker embedding extractor, which uses a multi-scale pooling method to capture information about the speaker's voice.
A fusion module that combines the text and speaker embeddings to verify the identity.

To help train this system, the researchers recorded a new Chinese numerical speech corpus called SHAL, which is publicly available. They also used data augmentation techniques to generate more training data.

The end result is a system that improves the equal error rate (EER), the operating point at which false acceptances and false rejections occur equally often, by 49.2% on the Hi-Mia dataset and 75.0% on SHAL.

Technical Explanation

The paper proposes an end-to-end speaker verification system that enhances text-dependent speaker verification (TD-SV) by decoupling speaker and text information. The system consists of three main components:

Text Embedding Extractor: This module employs an enhanced Transformer model to extract text embeddings. It is trained using a triple loss function that includes text classification loss, connectionist temporal classification (CTC) loss, and decoder loss.
Speaker Embedding Extractor: This module creates speaker embeddings using a multi-scale pooling method that combines sliding window attentive statistics pooling (SWASP) with attentive statistics pooling (ASP).
Fusion Module: This component combines the text and speaker embeddings to perform the final speaker verification task.
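
The pooling idea behind the speaker embedding extractor can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation. Attentive statistics pooling (ASP) weights each frame with a softmax attention score and then concatenates the weighted mean and standard deviation. The scoring function here is simplified to `w . tanh(h_t)`, and the window size, hop, averaging across windows, and concatenation of the two poolings are all assumptions, since the summary does not specify how SWASP and ASP are combined.

```python
import math


def attention_weights(frames, w):
    """Softmax over per-frame scores e_t = w . tanh(h_t) (simplified scorer)."""
    scores = [sum(wi * math.tanh(x) for wi, x in zip(w, h)) for h in frames]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def asp(frames, w):
    """Attentive statistics pooling: weighted mean and std, concatenated."""
    alphas = attention_weights(frames, w)
    dim = len(frames[0])
    mean = [sum(a * h[d] for a, h in zip(alphas, frames)) for d in range(dim)]
    var = [
        sum(a * h[d] ** 2 for a, h in zip(alphas, frames)) - mean[d] ** 2
        for d in range(dim)
    ]
    std = [math.sqrt(max(v, 1e-12)) for v in var]  # clamp tiny negatives
    return mean + std


def swasp(frames, w, window, hop):
    """ASP applied to sliding windows, then averaged (assumed aggregation)."""
    starts = range(0, max(len(frames) - window, 0) + 1, hop)
    pooled = [asp(frames[s:s + window], w) for s in starts]
    return [sum(p[d] for p in pooled) / len(pooled) for d in range(len(pooled[0]))]


def multi_scale_pool(frames, w, window=4, hop=2):
    """Concatenate utterance-level ASP with window-level SWASP (assumed fusion)."""
    return asp(frames, w) + swasp(frames, w, window, hop)


# Toy input: 10 frames of 3-dimensional features and a hypothetical weight vector.
frames = [[0.1 * t, 1.0 - 0.05 * t, 0.2 * (-1) ** t] for t in range(10)]
w = [0.5, -0.3, 0.8]
embedding = multi_scale_pool(frames, w)
print(len(embedding))  # 12: (mean + std) for ASP plus (mean + std) for SWASP
```

The intent of the multi-scale design, as described, is that the utterance-level statistics capture the overall voice characteristics while the window-level statistics retain shorter-term detail.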

To address the scarcity of data for this task, the researchers recorded a publicly available Chinese numerical corpus named SHALCAS22A (SHAL), which can be accessed on Open-SLR. They also employed data augmentation techniques using Tacotron2 and HiFi-GAN to generate additional training data.

The proposed method achieves significant performance improvements, reducing the equal error rate (EER) by 49.2% on the Hi-Mia dataset and 75.0% on the SHAL dataset, compared to previous approaches.
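
To make the metric concrete, the sketch below computes an EER from two lists of verification scores and then the relative reduction between two EER values. The scores and the baseline figures are made-up illustrations, not results from the paper, and reading the reported 49.2% and 75.0% as relative reductions is an interpretation of the abstract's wording.

```python
def compute_eer(target_scores, nontarget_scores):
    """Sweep thresholds over observed scores; return (eer, threshold).

    FRR is the fraction of genuine trials scoring below the threshold,
    FAR is the fraction of impostor trials scoring at or above it.
    The EER is taken where the two rates are closest.
    """
    best = None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction, e.g. 0.75 means the EER dropped by 75%."""
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical similarity scores for genuine and impostor trials.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, threshold = compute_eer(genuine, impostor)
print(eer, threshold)  # 0.25 0.6

# Hypothetical baseline: an EER falling from 4.0% to 1.0% is a 75% relative reduction.
print(relative_reduction(4.0, 1.0))  # 0.75
```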

Critical Analysis

The paper presents a well-designed end-to-end speaker verification system that effectively decouples speaker and text information to address the limitations of traditional text-dependent speaker verification (TD-SV) approaches.

One potential limitation of the research is the use of a relatively small, domain-specific dataset (SHAL) for evaluation. While the authors address the data scarcity issue through data augmentation, it would be valuable to see the system's performance on larger, more diverse datasets to assess its robustness and generalizability.

Additionally, the paper does not provide detailed insights into the individual contributions of the text embedding extractor and speaker embedding extractor components, nor does it explore the potential trade-offs between the decoupling approach and more traditional TD-SV methods. Further analysis in these areas could help researchers better understand the strengths and weaknesses of the proposed system.

Overall, the research presents an interesting and promising direction for enhancing speaker verification, particularly in short-speech scenarios. The public release of the SHAL dataset is a valuable contribution to the research community, and the reported EER improvements on Hi-Mia and SHAL suggest that the proposed system warrants further investigation and development.

Conclusion

The paper introduces an innovative end-to-end speaker verification system that decouples speaker and text information to address the limitations of traditional text-dependent speaker verification (TD-SV) approaches. By recording a new publicly available Chinese numerical corpus (SHAL) and employing data augmentation techniques, the researchers were able to train a system that outperforms previous methods on the Hi-Mia and SHAL datasets.

The proposed system's ability to decouple speaker and text information, together with its reported EER improvements, suggests that it could be a valuable tool for speaker verification, particularly in financial and other security-critical applications. The public release of the SHAL dataset also contributes to the research community's efforts to address the data scarcity challenges in this domain.

While the research presents a promising approach, further investigation into the system's generalizability and the individual contributions of its components could help researchers better understand its strengths and limitations. Overall, the paper makes a valuable contribution to the field of speaker verification and lays the groundwork for future advancements in this important area of research.


Related Papers

A framework of text-dependent speaker verification for Chinese numerical string corpus

Litong Zheng, Feng Hong, Weijie Xu, Wan Zheng

The Chinese numerical string corpus serves as a valuable resource for speaker verification, particularly in financial transactions. Research indicates that in short speech scenarios, text-dependent speaker verification (TD-SV) consistently outperforms text-independent speaker verification (TI-SV). However, TD-SV potentially includes the validation of text information, which can be negatively impacted by reading rhythms and pauses. To address this problem, we propose an end-to-end speaker verification system that enhances TD-SV by decoupling speaker and text information. Our system consists of a text embedding extractor, a speaker embedding extractor and a fusion module. In the text embedding extractor, we employ an enhanced Transformer and introduce a triple loss including text classification loss, connectionist temporal classification (CTC) loss and decoder loss; while in the speaker embedding extractor, we create a multi-scale pooling method by combining sliding window attentive statistics pooling (SWASP) with attentive statistics pooling (ASP). To mitigate the scarcity of data, we have recorded a publicly available Chinese numerical corpus named SHALCAS22A (hereinafter called SHAL), which can be accessed on Open-SLR. Moreover, we employ data augmentation techniques using Tacotron2 and HiFi-GAN. Our method achieves an equal error rate (EER) performance improvement of 49.2% on Hi-Mia and 75.0% on SHAL.

5/22/2024

Text-dependent Speaker Verification (TdSV) Challenge 2024: Challenge Evaluation Plan

Zeinali Hossein, Lee Kong Aik, Alam Jahangir, Burget Lukas

This document outlines the Text-dependent Speaker Verification (TdSV) Challenge 2024, which centers on analyzing and exploring novel approaches for text-dependent speaker verification. The primary goal of this challenge is to motivate participants to develop single yet competitive systems, conduct thorough analyses, and explore innovative concepts such as multi-task learning, self-supervised learning, few-shot learning, and others, for text-dependent speaker verification.

4/23/2024

An efficient text augmentation approach for contextualized Mandarin speech recognition

Naijun Zheng, Xucheng Wan, Kai Liu, Ziqing Du, Zhou Huan

Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA) technique, all while keeping computational costs minimal. In particular, to contextualize a pre-trained CIF-based ASR, we construct a codebook using limited speech-text data. By utilizing a simple codebook lookup process, we convert available text-only data into latent text embeddings. These embeddings then enhance the inputs for the contextualized ASR. Our experiments on diverse Mandarin test sets demonstrate that our TA approach significantly boosts recognition performance. The top-performing system shows relative CER improvements of up to 30% on rare words and 15% across all words in general.

6/17/2024

USTC-KXDIGIT System Description for ASVspoof5 Challenge

Yihao Chen, Haochen Wu, Nan Jiang, Xiang Xia, Qing Gu, Yunqi Hao, Pengfei Cai, Yu Guan, Jialong Wang, Weilin Xie, Lei Fang, Sian Fang, Yan Song, Wu Guo, Lin Liu, Minqiang Xu

This paper describes the USTC-KXDIGIT system submitted to the ASVspoof5 Challenge for Track 1 (speech deepfake detection) and Track 2 (spoofing-robust automatic speaker verification, SASV). Track 1 showcases a diverse range of technical qualities from potential processing algorithms and includes both open and closed conditions. For these conditions, our system consists of a cascade of a front-end feature extractor and a back-end classifier. We focus on extensive embedding engineering and enhancing the generalization of the back-end classifier model. Specifically, the embedding engineering is based on hand-crafted features and speech representations from a self-supervised model, used for closed and open conditions, respectively. To detect spoof attacks under various adversarial conditions, we trained multiple systems on an augmented training set. Additionally, we used voice conversion technology to synthesize fake audio from genuine audio in the training set to enrich the synthesis algorithms. To leverage the complementary information learned by different model architectures, we employed activation ensemble and fused scores from different systems to obtain the final decision score for spoof detection. During the evaluation phase, the proposed methods achieved 0.3948 minDCF and 14.33% EER in the closed condition, and 0.0750 minDCF and 2.59% EER in the open condition, demonstrating the robustness of our submitted systems under adversarial conditions. In Track 2, we continued using the CM system from Track 1 and fused it with a CNN-based ASV system. This approach achieved 0.2814 min-aDCF in the closed condition and 0.0756 min-aDCF in the open condition, showcasing superior performance in the SASV system.

9/4/2024