Towards generalizing deep-audio fake detection networks

2305.13033

Published 4/10/2024 by Konstantin Gasenzer (High Performance Computing and Analytics Lab, Universität Bonn, Germany), Moritz Wolter (High Performance Computing and Analytics Lab, Universität Bonn, Germany)

cs.SD cs.LG eess.AS

Abstract

Today's generative neural networks allow the creation of high-quality synthetic speech at scale. While we welcome the creative use of this new technology, we must also recognize the risks. As synthetic speech is abused for monetary and identity theft, we require a broad set of deepfake identification tools. Furthermore, previous work reported a limited ability of deep classifiers to generalize to unseen audio generators. We study the frequency domain fingerprints of current audio generators. Building on top of the discovered frequency footprints, we train excellent lightweight detectors that generalize. We report improved results on the WaveFake dataset and an extended version. To account for the rapid progress in the field, we extend the WaveFake dataset by additionally considering samples drawn from the novel Avocodo and BigVGAN networks. For illustration purposes, the supplementary material contains audio samples of generator artifacts.

Overview

Generative neural networks can now create high-quality synthetic speech at scale.
While this technology has creative potential, it also poses risks such as identity theft and financial fraud through deepfakes.
Previous research has found that deep classifiers have limited ability to generalize to unseen audio generators.
This paper studies the frequency-domain fingerprints of current audio generators and trains lightweight detectors that generalize better.

Plain English Explanation

Modern artificial intelligence (AI) models can now generate very realistic-sounding synthetic speech. While this opens up new creative possibilities, it also creates risks, as bad actors could use this technology to create deepfakes, that is, fake audio or video designed to deceive people. Previous attempts to detect audio deepfakes have had limited success, because the detection models generalized poorly to synthetic audio from generators they had not seen during training.

This paper takes a different approach. The researchers looked at the underlying frequency patterns of current audio generators and used that information to train new detection models that work more effectively, even when faced with new types of synthetic speech. The models they developed are lightweight and generalize well, with improved results reported on the WaveFake dataset and an extended version of it.

To keep up with the rapid progress in this field, the researchers also expanded an existing dataset of deepfake audio samples to include the latest generation of synthetic speech models, like Avocodo and BigVGAN. This ensures the models can be tested against the latest threats.

Technical Explanation

The key insight of this research is that current audio generation models leave behind distinctive frequency domain signatures that can be used to detect their outputs. The authors study these frequency-domain fingerprints and use them to train lightweight neural network detectors that are able to generalize to unseen synthetic audio generators.
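To make the idea of a frequency-domain fingerprint concrete, here is a minimal sketch. It assumes a plain short-time Fourier transform built with NumPy; this summary does not state which transform the authors actually use, so this should not be read as their method. The helper `mean_log_spectrum`, the toy signals, and the 6 kHz artifact are all hypothetical and chosen only for illustration.

```python
import numpy as np

def mean_log_spectrum(signal, frame_len=512, hop=256):
    """Average log-magnitude spectrum over all frames of a 1-D signal."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-8).mean(axis=0)

# Toy data (assumption): white noise stands in for real audio, and the
# "fake" is the same noise plus a narrow-band artifact at 6 kHz.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
real = rng.standard_normal(sample_rate)
fake = real + 0.5 * np.sin(2 * np.pi * 6000 * t)

# The fingerprint is the difference between the two average spectra.
difference = mean_log_spectrum(fake) - mean_log_spectrum(real)
freqs = np.fft.rfftfreq(512, d=1 / sample_rate)
peak_hz = freqs[np.argmax(difference)]
print(f"largest spectral difference near {peak_hz:.0f} Hz")
```

Averaging over many frames suppresses content that varies from clip to clip, so a systematic artifact stands out in the difference between the two mean spectra. A detector can then be trained on such frequency-domain representations rather than on raw waveforms.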

Experiments on the WaveFake dataset and an extended version that includes samples from the Avocodo and BigVGAN models show that the proposed detectors outperform previous approaches. The authors hypothesize that the frequency-domain features are more robust and generalizable than the time-domain features used in prior work.
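Generalization to unseen generators is typically measured by holding one generator out of training and testing on it. The sketch below illustrates that protocol on toy data, not on WaveFake. The feature vectors, the generators "A", "B", and "C", the shared artifact bin, and the mean-difference detector are all assumptions made for illustration. The toy data is deliberately built so that all generators share one artifact, which is what makes generalization possible here.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BINS = 32        # length of each toy frequency-domain feature vector
SHARED_BIN = 30    # assumed artifact common to all toy generators

def toy_clips(n, generator_bin=None):
    """Return n toy feature vectors; fakes get a shared and a generator-specific bump."""
    x = rng.standard_normal((n, N_BINS))
    if generator_bin is not None:
        x[:, SHARED_BIN] += 3.0
        x[:, generator_bin] += 1.0
    return x

# Hypothetical generators A and B are seen in training; C is held out.
train_real = toy_clips(600)
train_fake = np.concatenate([toy_clips(300, 5), toy_clips(300, 10)])
test_real = toy_clips(200)
test_fake = toy_clips(200, 20)

# Mean-difference linear detector: project onto the direction that
# separates the class means and threshold at the midpoint.
w = train_fake.mean(axis=0) - train_real.mean(axis=0)
threshold = w @ (train_fake.mean(axis=0) + train_real.mean(axis=0)) / 2

def is_fake(x):
    return x @ w > threshold

correct = np.sum(~is_fake(test_real)) + np.sum(is_fake(test_fake))
accuracy = correct / (len(test_real) + len(test_fake))
print(f"accuracy on held-out generator: {accuracy:.2f}")
```

The detector never sees generator C, yet it can still flag its outputs because it has learned the shared component. If the held-out generator had no artifact in common with the training generators, the same protocol would expose the failure to generalize.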

Critical Analysis

The researchers extend the WaveFake dataset with Avocodo and BigVGAN samples to account for the rapid pace of progress in audio synthesis, but any such extension may quickly become outdated. There is an ongoing arms race between deepfake creators and detection algorithms, and continued research will be needed to stay ahead of the latest threats.

Additionally, the paper does not explore the potential for adversarial attacks to fool the proposed detectors, or the ability of audio generators to adapt and evade detection over time. Real-world deployment would likely require continuous monitoring and model updates to maintain effectiveness.

Overall, this is a promising approach that leverages the unique properties of synthetic audio to improve detection. However, further research is needed to make these systems sufficiently robust and future-proof in the face of increasingly sophisticated deepfake technologies.

Conclusion

This paper presents an approach to detecting deepfake audio by analyzing the frequency-domain characteristics of current synthetic speech models. The lightweight detectors developed in this research show improved generalization compared to previous work, and the expanded dataset allows them to be evaluated against more recent audio synthesis models such as Avocodo and BigVGAN.

While this represents an important step forward, the deepfake detection challenge is an ongoing one that will require sustained research efforts. Continued vigilance and adaptation will be necessary to stay ahead of bad actors who seek to misuse these powerful generative technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FakeSound: Deepfake General Audio Detection

Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu

With the advancement of audio generation, generative models can produce highly realistic audios. However, the proliferation of deepfake general audio can pose negative consequences. Therefore, we propose a new task, deepfake general audio detection, which aims to identify whether audio content is manipulated and to locate deepfake regions. Leveraging an automated manipulation pipeline, a dataset named FakeSound for deepfake general audio detection is proposed, and samples can be viewed on website https://FakeSoundData.github.io. The average binary accuracy of humans on all test sets is consistently below 0.6, which indicates the difficulty humans face in discerning deepfake audio and affirms the efficacy of the FakeSound dataset. A deepfake detection model utilizing a general audio pre-trained model is proposed as a benchmark system. Experimental results demonstrate that the performance of the proposed model surpasses the state-of-the-art in deepfake speech detection and human testers.

6/13/2024

cs.SD eess.AS

Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models

Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva

Generalization is a main issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which more and more accurate synthesis methods are developed, it is very important to design techniques that work well also on data they were not trained for. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework and fake audios are exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root, and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake detection or speaker verification datasets. At detection time only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widespread in the community show that detectors based on pre-trained models achieve excellent performance and show strong generalization ability, rivaling supervised methods on in-distribution data and largely overcoming them on out-of-distribution data.

5/7/2024

cs.SD cs.CV eess.AS

Detecting music deepfakes is easy but actually hard

Darius Afchar, Gabriel Meseguer-Brocal, Romain Hennequin

In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. The ability to create credible minute-long music deepfakes in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and fake reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of a music deepfake detector, a tool that will help in the regulation of music forgery. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that a good test score is not the end of the story. We step back from the straightforward ML framework and expose many facets that could be problematic with such a deployed detector: calibration, robustness to audio manipulation, generalisation to unseen models, interpretability and possibility for recourse. This second part acts as a position for future research steps in the field and a caveat to a flourishing market of fake content checkers.

5/24/2024

cs.SD cs.LG eess.AS

Enhancing Generalization in Audio Deepfake Detection: A Neural Collapse based Sampling and Training Approach

Mohammed Yousif, Jonat John Mathew, Huzaifa Pallan, Agamjeet Singh Padda, Syed Daniyal Shah, Sara Adamski, Madhu Reddiboina, Arjun Pankajakshan

Generalization in audio deepfake detection presents a significant challenge, with models trained on specific datasets often struggling to detect deepfakes generated under varying conditions and unknown algorithms. While collectively training a model using diverse datasets can enhance its generalization ability, it comes with high computational costs. To address this, we propose a neural collapse-based sampling approach applied to pre-trained models trained on distinct datasets to create a new training database. Using ASVspoof 2019 dataset as a proof-of-concept, we implement pre-trained models with Resnet and ConvNext architectures. Our approach demonstrates comparable generalization on unseen data while being computationally efficient, requiring less training data. Evaluation is conducted using the In-the-wild dataset.

4/22/2024

cs.SD eess.AS