VoiceWukong: Benchmarking Deepfake Voice Detection

Read original: arXiv:2409.06348 - Published 9/11/2024 by Ziwei Yan, Yanjie Zhao, Haoyu Wang

Overview

The paper presents VoiceWukong, a benchmark for evaluating deepfake voice detectors.
Deepfake voice detection matters because synthetic voices can be used to spread misinformation and to impersonate people.
The benchmark combines a dataset of real and synthetic voices, covering English and Chinese, with evaluation metrics for assessing detector performance.

Plain English Explanation

The paper discusses VoiceWukong, a new tool for testing the ability of artificial intelligence (AI) systems to detect fake voices. Fake voices, also known as "deepfakes," are created using advanced technologies that can make it sound like someone is saying things they never actually said. This is a growing problem as deepfakes can be used to spread misinformation or impersonate people for malicious purposes.

The VoiceWukong benchmark includes a large dataset of real and synthetic voices that AI systems can be tested on to see how well they detect fakes. It also provides standard metrics to measure how well these systems perform at this task. With a common benchmark, researchers can more easily compare the effectiveness of different deepfake detection approaches.

The goal is to help develop better AI tools that can reliably identify deepfake voices, so that we can prevent them from being used to mislead people. This is an important step in combating the spread of false information online and protecting the integrity of audio communications.

Technical Explanation

The paper introduces VoiceWukong, a benchmark for evaluating deepfake voice detection models. The benchmark includes a diverse dataset of real and synthetic voices, as well as standardized evaluation metrics.

According to the abstract, the dataset is built from deepfake voices generated by 19 commercial tools and 15 open-source tools. The authors then created 38 data variants covering six types of manipulation, yielding 265,200 English and 148,200 Chinese deepfake voice samples. The variants are intended to reflect manipulations encountered in real-world production environments, which the authors say existing datasets largely lack.
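The abstract describes data variants created by applying several types of manipulation to the audio, but this summary does not list which ones. As a purely illustrative sketch, the following assumes additive Gaussian noise at a target signal-to-noise ratio (SNR) as one such manipulation. The function names `add_noise_at_snr` and `measured_snr_db` are hypothetical and do not come from the paper.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return a copy of `signal` with Gaussian noise added at `snr_db` dB.

    Assumption: additive noise stands in for one benchmark manipulation;
    the paper's actual manipulations are not specified in this summary.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR (dB) = 10 * log10(signal_power / noise_power), solved for noise power.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Measure the SNR of `noisy` relative to `clean`, in dB."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Usage: a one-second 440 Hz tone at 16 kHz, degraded to roughly 10 dB SNR.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
degraded = add_noise_at_snr(tone, snr_db=10.0)
print(round(measured_snr_db(tone, degraded), 3))
```

Running a detector on both the clean and the degraded copy of each sample is one simple way to see how much a manipulation of this kind affects its scores.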

Detector performance is assessed with standard metrics such as classification accuracy, false acceptance rate (FAR), and false rejection rate (FRR). The headline results in the abstract are reported as equal error rate (EER), the error rate at the decision threshold where FAR and FRR coincide. Together these metrics indicate how robust a detector is and how well it generalizes.
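To make these metrics concrete, here is a minimal sketch of how false acceptance rate (FAR), false rejection rate (FRR), and the equal error rate (EER, the point where the two rates meet) could be computed from detector scores. It assumes that a higher score means "more likely real" and that a sample is accepted as real when its score is at or above the threshold. The function names and toy scores are illustrative and are not taken from the paper.

```python
def far_frr(real_scores, fake_scores, threshold):
    """Return (FAR, FRR) at a given threshold.

    FAR: fraction of fake samples accepted as real (score >= threshold).
    FRR: fraction of real samples rejected as fake (score < threshold).
    """
    far = sum(s >= threshold for s in fake_scores) / len(fake_scores)
    frr = sum(s < threshold for s in real_scores) / len(real_scores)
    return far, frr


def equal_error_rate(real_scores, fake_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Returns (eer, threshold), taken at the threshold where |FAR - FRR|
    is smallest; the EER is reported as the mean of the two rates there.
    """
    best = None
    for threshold in sorted(set(real_scores) | set(fake_scores)):
        far, frr = far_frr(real_scores, fake_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Toy scores: one real sample scores low and one fake sample scores high.
real = [0.9, 0.8, 0.7, 0.4]
fake = [0.1, 0.2, 0.3, 0.6]
eer, threshold = equal_error_rate(real, fake)
print(f"EER = {eer:.2%} at threshold {threshold}")  # EER = 25.00% at threshold 0.6
```

A lower EER is better. With only a handful of scores the FAR and FRR curves are coarse step functions, so real evaluations typically interpolate between thresholds rather than picking the closest observed one.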

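The authors evaluate 12 state-of-the-art detectors on VoiceWukong. AASIST2 achieves the best equal error rate (EER) at 13.50%, while every other detector exceeds 20%, which the authors take as evidence that detector performance declines sharply under real-world conditions. The paper also reports a user study with more than 300 participants and a test of the multimodal large language model Qwen2-Audio: detectors and humans show varying ability to identify deepfake voices at different deception levels, while Qwen2-Audio shows no detection ability at all.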

Critical Analysis

The VoiceWukong benchmark addresses an important real-world problem of deepfake voice detection. By providing a standardized dataset and evaluation framework, the paper enables more rigorous and comparable performance assessments of deepfake detection models.

However, the paper acknowledges some limitations of the current VoiceWukong dataset. For example, the synthetic voice samples are generated using existing techniques and may not fully capture the capabilities of future deepfake technologies. Further research is needed to incorporate increasingly realistic and diverse deepfake samples.

Additionally, the paper focuses primarily on binary classification (real vs. fake) and does not explore more fine-grained detection of attributes like speaker identity or emotional state. Expanding the benchmark to include these additional detection tasks could further strengthen the utility of VoiceWukong.

Overall, the VoiceWukong benchmark represents an important step forward in the critical area of deepfake voice detection. Continued refinement and expansion of the dataset, as well as exploration of ensemble detection approaches, can help advance the state of the art in this rapidly evolving field.

Conclusion

The VoiceWukong paper presents a new benchmark for evaluating deepfake voice detection models. By providing a diverse dataset of real and synthetic voices, along with standardized evaluation metrics, the benchmark enables more rigorous and comparable performance assessments of different detection techniques.

As the threat of deepfake voices grows, tools like VoiceWukong will be crucial for developing reliable AI systems that can accurately identify manipulated audio. While the current benchmark has some limitations, the authors highlight important directions for future research to expand and refine the dataset and evaluation framework.

Overall, the VoiceWukong benchmark represents a valuable contribution to the ongoing efforts to combat the spread of misinformation and protect the integrity of audio communications in the digital age.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

VoiceWukong: Benchmarking Deepfake Voice Detection

Ziwei Yan, Yanjie Zhao, Haoyu Wang

With the rapid advancement of technologies like text-to-speech (TTS) and voice conversion (VC), detecting deepfake voices has become increasingly crucial. However, both academia and industry lack a comprehensive and intuitive benchmark for evaluating detectors. Existing datasets are limited in language diversity and lack many manipulations encountered in real-world production environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate the performance of deepfake voice detectors. To build the dataset, we first collected deepfake voices generated by 19 advanced and widely recognized commercial tools and 15 open-source tools. We then created 38 data variants covering six types of manipulations, constructing the evaluation dataset for deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200 Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12 state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of 13.50%, while all others exceeded 20%. Our findings reveal that these detectors face significant challenges in real-world applications, with dramatically declining performance. In addition, we conducted a user study with more than 300 participants. The results are compared with the performance of the 12 detectors and a multimodal large language model (MLLM), i.e., Qwen2-Audio, where different detectors and humans exhibit varying identification capabilities for deepfake voices at different deception levels, while the MLLM demonstrates no detection ability at all. Furthermore, we provide a leaderboard for deepfake voice detection, publicly available at https://voicewukong.github.io.

9/11/2024

Cross-Domain Audio Deepfake Detection: Dataset and Analysis

Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang

Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate our models' outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect the detection accuracy, necessitating further research.

4/9/2024

Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection

Theophile Stourbe, Victor Miara, Theo Lepage, Reda Dehak

This paper describes our submitted systems to the ASVspoof 5 Challenge Track 1: Speech Deepfake Detection - Open Condition, which consists of a stand-alone speech deepfake (bonafide vs spoof) detection task. Recently, large-scale self-supervised models have become a standard in Automatic Speech Recognition (ASR) and other speech processing tasks. Thus, we leverage a pre-trained WavLM as a front-end model and pool its representations with different back-end techniques. The complete framework is fine-tuned using only the training dataset of the challenge, similar to the closed condition. In addition, we apply data augmentation by adding noise and reverberation using the MUSAN noise and RIR datasets. We also experiment with codec augmentations to increase the performance of our method. Ultimately, we use the Bosaris toolkit for score calibration and system fusion to get better Cllr scores. Our fused system achieves 0.0937 minDCF, 3.42% EER, 0.1927 Cllr, and 0.1375 actDCF.

9/10/2024

WavLM model ensemble for audio deepfake detection

David Combei, Adriana Stan, Dan Oneata, Horia Cucu

Audio deepfake detection has become a pivotal task over the last couple of years, as many recent speech synthesis and voice cloning systems generate highly realistic speech samples, thus enabling their use in malicious activities. In this paper we address the issue of audio deepfake detection as it was set in the ASVspoof5 challenge. First, we benchmark ten types of pretrained representations and show that the self-supervised representations stemming from the wav2vec2 and wavLM families perform best. Of the two, wavLM is better when restricting the pretraining data to LibriSpeech, as required by the challenge rules. To further improve performance, we finetune the wavLM model for the deepfake detection task. We extend the ASVspoof5 dataset with samples from other deepfake detection datasets and apply data augmentation. Our final challenge submission consists of a late fusion combination of four models and achieves an equal error rate of 6.56% and 17.08% on the two evaluation sets.

8/15/2024