Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis

Read original: arXiv:2406.10836 - Published 6/18/2024 by Xin Wang, Tomi Kinnunen, Kong Aik Lee, Paul-Gauthier Noé, Junichi Yamagishi

Overview

This paper revisits and improves score-level fusion for spoofing-aware speaker verification (SASV), which aims to reject both zero-effort impostors and synthesized spoofing attacks.
The authors propose using compositional data analysis (CODA) to better interpret the fusion of spoofing detection and speaker verification scores, leading to improved performance.
Key contributions include a new scoring fusion approach based on CODA and an in-depth analysis of the SASV task and its evaluation metrics.

Plain English Explanation

The paper focuses on improving "spoofing-aware speaker verification" (SASV), a task in which a system must reject both impostors (a different person claiming to be the enrolled speaker) and spoofing attacks (artificial speech generated to sound like someone else).

The researchers found that the existing way of combining the scores from the spoofing detection and speaker verification models could be improved. They propose using a technique called "compositional data analysis" (CODA) to better interpret how these two scores should be fused together. This new fusion approach led to better performance in detecting both types of attacks.

The paper provides a detailed analysis of the SASV task and the metrics used to evaluate it. This helps researchers and engineers better understand the challenges involved and how to address them.

Technical Explanation

The paper revisits the task of spoofing-aware speaker verification (SASV), which aims to reject both zero-effort impostors and synthesized spoofing attacks. The authors propose using compositional data analysis (CODA) to better interpret and improve the scoring fusion process.

Traditionally, SASV systems combine scores from a spoofing detection model and a speaker verification model. However, the authors argue that the existing fusion methods do not fully capture the compositional nature of the problem. They introduce a new CODA-based fusion approach that treats the spoofing and speaker verification scores as compositional data, allowing for a more principled interpretation and combination of the two.
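
One way to picture the compositional view is sketched below in Python. It assumes each subsystem score is a log-odds, so that it corresponds to a two-part composition such as (P(target), P(non-target)); the function names and the example scores are hypothetical, and the paper's own formulation may differ. Under that assumption, summing the two scores is the same as applying the standard CoDA perturbation operation to the two compositions.

```python
import math

def closure(parts):
    """Rescale positive parts so they sum to 1 (a composition)."""
    total = sum(parts)
    return [p / total for p in parts]

def perturb(x, y):
    """Perturbation: the simplex counterpart of vector addition."""
    return closure([a * b for a, b in zip(x, y)])

def log_ratio(x):
    """Log-ratio coordinate of a two-part composition (its log-odds)."""
    return math.log(x[0] / x[1])

def from_log_ratio(z):
    """Inverse of log_ratio: map a real score to a two-part composition."""
    p = 1.0 / (1.0 + math.exp(-z))
    return [p, 1.0 - p]

# Hypothetical scores on a log-odds scale, not values from the paper.
asv_score, cm_score = 2.0, -1.0
asv_comp = from_log_ratio(asv_score)  # (P(target), P(non-target))
cm_comp = from_log_ratio(cm_score)    # (P(bona fide), P(spoof))

fused_comp = perturb(asv_comp, cm_comp)
fused_score = log_ratio(fused_comp)   # matches asv_score + cm_score up to rounding
```

The equivalence only holds when both scores live on the same log-odds scale, which is one way to see why uncalibrated scores should not simply be added.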

The paper also analyzes score-level fusion using tools from decision theory. According to its abstract, calibrating the ASV and spoofing countermeasure (CM) scores before fusion is essential, and the CODA interpretation leads to an improved fusion method that linearly combines the log-likelihood ratios of the two subsystems. This analysis sheds light on the considerations involved in building robust SASV systems.
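
The paper's abstract states that score calibration before fusion is essential. The summary does not say which calibration method the authors use, so the following is only a minimal sketch of one common choice: an affine map fitted by logistic regression. The function name, the hyperparameters, and the example scores are all made up for illustration.

```python
import math

def fit_affine_calibration(scores, labels, lr=0.1, steps=2000):
    """Fit an affine map a * score + b with logistic regression.

    labels: 1 for the positive class, 0 for the negative class.
    Returns (a, b) so that a * score + b approximates a log-likelihood ratio.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))  # predicted posterior
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    # Logistic regression models posterior log-odds; removing the training
    # prior log-odds leaves a log-likelihood ratio.
    n_pos = sum(labels)
    b -= math.log(n_pos / (n - n_pos))
    return a, b

# Made-up raw scores (e.g. cosine similarities), not data from the paper.
raw_scores = [0.62, 0.55, 0.71, 0.40, 0.66, 0.10, 0.25, 0.45, 0.05, 0.30]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

a, b = fit_affine_calibration(raw_scores, labels)
calibrated = [a * s + b for s in raw_scores]
```

In a fusion pipeline, a routine like this would be fitted separately for the ASV scores and the CM scores, so that both end up on a comparable log-likelihood-ratio scale before they are combined.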

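The paper further finds that a linear combination of log-likelihood ratios is inferior to a non-linear one for making optimal decisions. Score calibration before fusion, the improved linear fusion, and the non-linear fusion were all found to be effective in experiments on the SASV challenge database.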
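
To illustrate why a non-linear combination can make better decisions than a linear one, here is a hedged Python sketch. It is not confirmed to be the paper's exact formula. It assumes the negative class is a mixture of non-target and spoofed trials with a hypothetical mixing weight `spoof_prior`, and that the CM log-likelihood ratio can stand in for "target bona fide versus spoof". Under those assumptions, the likelihood ratio of the positive class against the mixture is -log((1 - spoof_prior) * exp(-llr_asv) + spoof_prior * exp(-llr_cm)).

```python
import math

def linear_fusion(llr_asv, llr_cm, w_asv=0.5, w_cm=0.5):
    """Weighted sum of the two log-likelihood ratios (weights are hypothetical)."""
    return w_asv * llr_asv + w_cm * llr_cm

def nonlinear_fusion(llr_asv, llr_cm, spoof_prior=0.5):
    """LLR of 'target bona fide' against a mixture of non-target and spoof.

    spoof_prior must lie strictly between 0 and 1.
    """
    a = math.log(1.0 - spoof_prior) - llr_asv
    b = math.log(spoof_prior) - llr_cm
    m = max(a, b)  # log-sum-exp trick for numerical stability
    return -(m + math.log(math.exp(a - m) + math.exp(b - m)))

# A trial that the ASV subsystem accepts strongly but the CM flags as spoofed.
lin = linear_fusion(10.0, -10.0)
non = nonlinear_fusion(10.0, -10.0)
```

With equal weights, the linear sum lets the confident ASV score cancel the CM's evidence and lands at zero, while the non-linear score stays close to the lower of the two inputs, so the spoofed trial is still rejected.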

Critical Analysis

The paper provides a thoughtful approach to improving SASV by rethinking the scoring fusion process. The use of CODA is a novel and promising direction, as it acknowledges the compositional nature of the problem and allows for a more principled combination of the spoofing detection and speaker verification scores.

However, the paper does not delve into the potential limitations or caveats of the CODA-based fusion approach. For example, it would be interesting to understand how the method performs when faced with more complex or adversarial attacks, or how it scales to larger datasets and more diverse attack scenarios.

Additionally, the paper focuses primarily on the technical aspects of the fusion process and does not discuss the broader implications or real-world applications of spoofing-aware speaker verification. Further discussion on the societal impacts and ethical considerations of this technology would be valuable.

Conclusion

This paper contributes to spoofing-aware speaker verification by introducing a scoring fusion approach based on compositional data analysis. The proposed method is reported to improve the rejection of both zero-effort impostors and synthesized spoofing attacks, highlighting the importance of properly interpreting the relationship between spoofing detection and speaker verification scores.

The in-depth analysis of the SASV task and its evaluation metrics provides valuable insights for researchers and practitioners working in this area. While the paper focuses on the technical aspects, the findings have broader implications for building more robust and secure speaker verification systems that can withstand a variety of attacks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis

Xin Wang, Tomi Kinnunen, Kong Aik Lee, Paul-Gauthier Noé, Junichi Yamagishi

Fusing outputs from automatic speaker verification (ASV) and spoofing countermeasure (CM) is expected to make an integrated system robust to zero-effort imposters and synthesized spoofing attacks. Many score-level fusion methods have been proposed, but many remain heuristic. This paper revisits score-level fusion using tools from decision theory and presents three main findings. First, fusion by summing the ASV and CM scores can be interpreted on the basis of compositional data analysis, and score calibration before fusion is essential. Second, the interpretation leads to an improved fusion method that linearly combines the log-likelihood ratios of ASV and CM. However, as the third finding reveals, this linear combination is inferior to a non-linear one in making optimal decisions. The outcomes of these findings, namely, the score calibration before fusion, improved linear fusion, and better non-linear fusion, were found to be effective on the SASV challenge database.

6/18/2024

Spoofing-Robust Speaker Verification Using Parallel Embedding Fusion: BTU Speech Group's Approach for ASVspoof5 Challenge

Oğuzhan Kurnaz, Selim Can Demirtaş, Aykut Buker, Jagabandhu Mishra, Cemal Hanilçi

This paper introduces the parallel network-based spoofing-aware speaker verification (SASV) system developed by BTU Speech Group for the ASVspoof5 Challenge. The SASV system integrates ASV and CM systems to enhance security against spoofing attacks. Our approach employs score and embedding fusion from ASV models (ECAPA-TDNN, WavLM) and CM models (AASIST). The fused embeddings are processed using a simple DNN structure, optimizing model performance with a combination of recently proposed a-DCF and BCE losses. We introduce a novel parallel network structure where two identical DNNs, fed with different inputs, independently process embeddings and produce SASV scores. The final SASV probability is derived by averaging these scores, enhancing robustness and accuracy. Experimental results demonstrate that the proposed parallel DNN structure outperforms traditional single DNN methods, offering a more reliable and secure speaker verification system against spoofing attacks.

8/29/2024

Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches

Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi

In real-world applications, it is challenging to build a speaker verification system that is simultaneously robust against common threats, including spoofing attacks, channel mismatch, and domain mismatch. Traditional automatic speaker verification (ASV) systems often tackle these issues separately, leading to suboptimal performance when faced with simultaneous challenges. In this paper, we propose an integrated framework that incorporates pair-wise learning and spoofing attack simulation into the meta-learning paradigm to enhance robustness against these multifaceted threats. This novel approach employs an asymmetric dual-path model and a multi-task learning strategy to handle ASV, anti-spoofing, and spoofing-aware ASV tasks concurrently. A new testing dataset, CNComplex, is introduced to evaluate system performance under these combined threats. Experimental results demonstrate that our integrated model significantly improves performance over traditional ASV systems across various scenarios, showcasing its potential for real-world deployment. Additionally, the proposed framework's ability to generalize across different conditions highlights its robustness and reliability, making it a promising solution for practical ASV applications.

9/11/2024

Toward Improving Synthetic Audio Spoofing Detection Robustness via Meta-Learning and Disentangled Training With Adversarial Examples

Zhenyu Wang, John H. L. Hansen

Advances in automatic speaker verification (ASV) promote research into the formulation of spoofing detection systems for real-world applications. The performance of ASV systems can be degraded severely by multiple types of spoofing attacks, namely, synthetic speech (SS), voice conversion (VC), replay, twins and impersonation, especially in the case of unseen synthetic spoofing attacks. A reliable and robust spoofing detection system can act as a security gate to filter out spoofing attacks instead of having them reach the ASV system. A weighted additive angular margin loss is proposed to address the data imbalance issue, and different margins have been assigned to improve generalization to unseen spoofing attacks in this study. Meanwhile, we incorporate a meta-learning loss function to optimize differences between the embeddings of support versus query set in order to learn a spoofing-category-independent embedding space for utterances. Furthermore, we craft adversarial examples by adding imperceptible perturbations to spoofing speech as a data augmentation strategy, then we use an auxiliary batch normalization (BN) to guarantee that corresponding normalization statistics are performed exclusively on the adversarial examples. Additionally, a simple attention module is integrated into the residual block to refine the feature extraction process. Evaluation results on the Logical Access (LA) track of the ASVspoof 2019 corpus provide confirmation of our proposed approaches' effectiveness in terms of a pooled EER of 0.87%, and a min t-DCF of 0.0277. These advancements offer effective options to reduce the impact of spoofing attacks on voice recognition/authentication systems.

8/27/2024