BUT Systems and Analyses for the ASVspoof 5 Challenge

Read original: arXiv:2408.11152 - Published 8/22/2024 by Johan Rohdin, Lin Zhang, Oldřich Plchot, Vojtěch Staněk, David Mihola, Junyi Peng, Themos Stafylakis, Dmitriy Beveraki, Anna Silnova, Jan Brukner, Lukáš Burget

Overview

Presents the systems and analyses of BUT for the ASVspoof 5 Challenge, which focused on detecting deepfake speech.
Covers both Track 1 (speech deepfake detection) and Track 2 (spoofing-robust automatic speaker verification, SASV).
Provides insights and performance analysis of the proposed systems.

Plain English Explanation

The paper describes the systems and analyses developed by the BUT (Brno University of Technology) team for the ASVspoof 5 Challenge. The challenge focused on detecting deepfake speech: synthetic voice recordings that are difficult to distinguish from real human speech.

The BUT team developed two main systems for this challenge. The first was for the Track 1 deepfake detection task, where the goal was to classify whether a given audio clip was real or fake. The second was for Track 2, spoofing-robust automatic speaker verification (SASV), where a speaker verification system must accept genuine target speakers while rejecting both impostors and spoofed speech.

The paper provides details on the architecture and performance of these systems, as well as insights gained from their analyses. It discusses aspects like the importance of data augmentation, the effectiveness of different feature representations, and the challenges of detecting high-quality deepfakes.

Overall, this research contributes to the ongoing efforts to develop robust techniques for detecting synthetic speech, which is an important problem as deepfake technology becomes more advanced and widespread.

Technical Explanation

For the Track 1 deepfake detection task, the BUT team used two kinds of deep learning models: a ResNet18 for the closed condition and self-supervised models for the open condition. They also analyzed and visualized how different label schemes, which combine speaker information with spoofing information, affect training.

To improve performance, the researchers used various data augmentation techniques, such as adding background noise, reverberation, and time-scale modifications to the training data. They also explored different input representations, including raw waveforms, log-Mel spectrograms, and constant-Q transforms.

For Track 2, spoofing-robust automatic speaker verification (SASV), the BUT system combines countermeasure (CM) scores with automatic speaker verification (ASV) scores. The team introduces effective priors and uses logistic regression to jointly train affine transformations of the two score types so that the SASV log-likelihood ratio (LLR) is optimized.
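The paper's abstract describes using logistic regression to jointly train affine transformations of the countermeasure (CM) and speaker verification (ASV) scores. The sketch below is a deliberately simplified stand-in for that idea, not the authors' objective: it fits a plain logistic regression over the two scores, treating bona fide target trials as the positive class and everything else (non-target and spoofed trials) as negative. The function names, the toy Gaussian score distributions, and the learning-rate and epoch settings are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_fusion(trials, labels, lr=0.2, epochs=500):
    """Fit (w_cm, w_asv, b) by batch gradient descent on cross-entropy.

    trials: list of (cm_score, asv_score) pairs.
    labels: 1 for a bona fide target trial, 0 for non-target or spoof.
    """
    w_cm = w_asv = b = 0.0
    n = len(trials)
    for _ in range(epochs):
        g_cm = g_asv = g_b = 0.0
        for (cm, asv), y in zip(trials, labels):
            err = sigmoid(w_cm * cm + w_asv * asv + b) - y
            g_cm += err * cm
            g_asv += err * asv
            g_b += err
        w_cm -= lr * g_cm / n
        w_asv -= lr * g_asv / n
        b -= lr * g_b / n
    return w_cm, w_asv, b


def fuse(params, cm, asv):
    """Fused score: a jointly trained affine combination of both scores."""
    w_cm, w_asv, b = params
    return w_cm * cm + w_asv * asv + b


def make_toy_trials(n_per_class, seed=0):
    """Hypothetical toy data: Gaussian CM/ASV scores for three trial types."""
    rng = random.Random(seed)
    # (mean CM score, mean ASV score, label) per trial type.
    classes = [
        (2.0, 2.0, 1),   # bona fide target: both scores high
        (2.0, -2.0, 0),  # bona fide non-target: low ASV score
        (-2.0, 2.0, 0),  # spoofed target: low CM score
    ]
    trials, labels = [], []
    for cm_mu, asv_mu, y in classes:
        for _ in range(n_per_class):
            trials.append((rng.gauss(cm_mu, 1.0), rng.gauss(asv_mu, 1.0)))
            labels.append(y)
    return trials, labels


if __name__ == "__main__":
    trials, labels = make_toy_trials(100)
    params = train_fusion(trials, labels)
    print("w_cm, w_asv, b =", params)
    print("fused score for a confident target trial:", fuse(params, 3.0, 3.0))
```

Training both weights and the bias together lets the fused score trade off the two subsystems, rather than calibrating each in isolation and summing the results. The priors the abstract mentions are not modeled in this sketch.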

The paper provides a comprehensive evaluation of the proposed systems on the ASVspoof 5 dataset. The results show that the data augmentation and input representation choices had a significant impact on the deepfake detection performance. Additionally, the anti-spoofing system demonstrated promising results in identifying attacks on speaker verification.
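Results in this area are commonly reported as an equal error rate (EER), as in the USTC-KXDIGIT abstract under Related Papers. As a rough illustration of what that number means, the sketch below computes an approximate EER from two lists of detection scores. It assumes higher scores mean "more likely bona fide"; the function name and the simple threshold sweep are illustrative choices, not the challenge's official scoring code.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) from a sweep over all observed scores.

    A trial is accepted as bona fide when its score is >= threshold.
    The EER is taken where the miss rate and the false-alarm rate are
    closest, reported as the average of the two rates at that point.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    best = None
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        # Miss: a bona fide trial rejected because its score is below t.
        p_miss = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False alarm: a spoofed trial accepted because its score is >= t.
        p_fa = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(p_miss - p_fa)
        if best is None or gap < best[0]:
            best = (gap, (p_miss + p_fa) / 2.0, t)
    return best[1], best[2]


if __name__ == "__main__":
    eer, threshold = compute_eer([0.9, 0.8, 0.4, 0.7], [0.1, 0.5, 0.3, 0.2])
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

A lower EER means the bona fide and spoofed score distributions overlap less. Unlike cost-based metrics such as minDCF, it does not depend on assumed priors or error costs.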

Critical Analysis

The research highlights the challenges of detecting high-quality deepfakes, which can closely mimic real human speech. The authors acknowledge that while their systems achieved strong performance, there is still room for improvement, especially in handling more advanced deepfake generation techniques.

One limitation of the study is that it focuses on a specific dataset and challenge scenario. It would be valuable to evaluate the generalizability of the proposed approaches on a wider range of deepfake and anti-spoofing datasets, including those that capture the evolving nature of this technology.

Additionally, the paper does not delve into the potential ethical implications of deepfake detection research, such as the risk of these techniques being used to enable surveillance or other forms of abuse. As this field continues to advance, it will be important for researchers to consider the broader societal impact of their work.

Conclusion

The BUT team's research presents deepfake detection and spoofing-robust speaker verification systems for the ASVspoof 5 Challenge. By leveraging techniques like data augmentation and robust feature representations, they achieved strong performance in identifying synthetic speech.

This work contributes to the ongoing efforts to address the growing threat of deepfakes, which have the potential to undermine trust in digital media and enable various forms of misinformation and fraud. As deepfake technology continues to evolve, further research in this area will be crucial for developing reliable countermeasures and safeguarding the integrity of audio-based communication.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

BUT Systems and Analyses for the ASVspoof 5 Challenge

Johan Rohdin, Lin Zhang, Oldřich Plchot, Vojtěch Staněk, David Mihola, Junyi Peng, Themos Stafylakis, Dmitriy Beveraki, Anna Silnova, Jan Brukner, Lukáš Burget

This paper describes the BUT submitted systems for the ASVspoof 5 challenge, along with analyses. For the conventional deepfake detection task, we use ResNet18 and self-supervised models for the closed and open conditions, respectively. In addition, we analyze and visualize different combinations of speaker information and spoofing information as label schemes for training. For spoofing-robust automatic speaker verification (SASV), we introduce effective priors and propose using logistic regression to jointly train affine transformations of the countermeasure scores and the automatic speaker verification scores in such a way that the SASV LLR is optimized.

8/22/2024

ASASVIcomtech: The Vicomtech-UGR Speech Deepfake Detection and SASV Systems for the ASVspoof5 Challenge

Juan M. Martín-Doñas, Eros Roselló, Angel M. Gomez, Aitor Álvarez, Iván López-Espejo, Antonio M. Peinado

This paper presents the work carried out by the ASASVIcomtech team, made up of researchers from Vicomtech and University of Granada, for the ASVspoof5 Challenge. The team has participated in both Track 1 (speech deepfake detection) and Track 2 (spoofing-aware speaker verification). This work started with an analysis of the challenge available data, which was regarded as an essential step to avoid later potential biases of the trained models, and whose main conclusions are presented here. With respect to the proposed approaches, a closed-condition system employing a deep complex convolutional recurrent architecture was developed for Track 1, although, unfortunately, no noteworthy results were achieved. On the other hand, different possibilities of open-condition systems, based on leveraging self-supervised models, augmented training data from previous challenges, and novel vocoders, were explored for both tracks, finally achieving very competitive results with an ensemble system.

8/21/2024

USTC-KXDIGIT System Description for ASVspoof5 Challenge

Yihao Chen, Haochen Wu, Nan Jiang, Xiang Xia, Qing Gu, Yunqi Hao, Pengfei Cai, Yu Guan, Jialong Wang, Weilin Xie, Lei Fang, Sian Fang, Yan Song, Wu Guo, Lin Liu, Minqiang Xu

This paper describes the USTC-KXDIGIT system submitted to the ASVspoof5 Challenge for Track 1 (speech deepfake detection) and Track 2 (spoofing-robust automatic speaker verification, SASV). Track 1 showcases a diverse range of technical qualities from potential processing algorithms and includes both open and closed conditions. For these conditions, our system consists of a cascade of a front-end feature extractor and a back-end classifier. We focus on extensive embedding engineering and enhancing the generalization of the back-end classifier model. Specifically, the embedding engineering is based on hand-crafted features and speech representations from a self-supervised model, used for closed and open conditions, respectively. To detect spoof attacks under various adversarial conditions, we trained multiple systems on an augmented training set. Additionally, we used voice conversion technology to synthesize fake audio from genuine audio in the training set to enrich the synthesis algorithms. To leverage the complementary information learned by different model architectures, we employed activation ensemble and fused scores from different systems to obtain the final decision score for spoof detection. During the evaluation phase, the proposed methods achieved 0.3948 minDCF and 14.33% EER in the closed condition, and 0.0750 minDCF and 2.59% EER in the open condition, demonstrating the robustness of our submitted systems under adversarial conditions. In Track 2, we continued using the CM system from Track 1 and fused it with a CNN-based ASV system. This approach achieved 0.2814 min-aDCF in the closed condition and 0.0756 min-aDCF in the open condition, showcasing superior performance in the SASV system.

9/4/2024

ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi

ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogate detection models, while adversarial attacks are incorporated for the first time. New metrics support the evaluation of spoofing-robust automatic speaker verification (SASV) as well as stand-alone detection solutions, i.e., countermeasures without ASV. We describe the two challenge tracks, the new database, the evaluation metrics, baselines, and the evaluation platform, and present a summary of the results. Attacks significantly compromise the baseline systems, while submissions bring substantial improvements.

8/19/2024