EmoFake: An Initial Dataset for Emotion Fake Audio Detection

Read original: arXiv:2211.05363 - Published 7/25/2024 by Yan Zhao, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Xiaohui Zhang, Yongfeng Dong

Overview

Several datasets have been developed to support fake audio detection, such as those from the ASVspoof and ADD challenges.
However, these datasets do not consider the case where the emotion of the audio has been changed, while other information (e.g., speaker identity and content) remains the same.
Changing the emotion of an audio clip can lead to semantic changes, which could pose threats to people's lives.
This paper reports the development of a new dataset called EmoFake, in which the emotional state of the original audio is changed.
The paper also proposes a method called Graph Attention networks using Deep Emotion embedding (GADE) for detecting emotion fake audio.

Plain English Explanation

The authors have developed a new dataset called EmoFake to help researchers and engineers build better systems for detecting fake audio. Fake audio is a serious problem because it can be used to spread misinformation or to deceive people in ways that harm them.

Existing fake audio detection datasets, such as those from the ASVspoof and ADD challenges, do not cover audio whose emotion has been changed. The EmoFake dataset is designed to test whether systems can detect such a change even when the speaker's identity and the spoken content remain the same.

Changing the emotion of an audio clip can alter the meaning or "semantics" of what is being said. This could be used to mislead people in harmful ways, such as making a serious statement sound like a joke. The researchers want to make sure detection systems can identify this type of manipulation.

To create the EmoFake dataset, the researchers used open-source emotion voice conversion models to alter the emotional state of existing audio recordings. They also developed a new detection method called GADE, which combines graph attention networks with a deep emotion embedding to identify emotion fake audio.

Technical Explanation

The researchers created the EmoFake dataset by using open-source emotion voice conversion models to change the emotional state of existing audio recordings, while keeping other attributes like the speaker's identity and content the same. This allows them to test whether fake audio detection systems can identify this type of semantic manipulation.

They propose a new detection method called Graph Attention networks using Deep Emotion embedding (GADE) for identifying emotion fake audio. GADE uses graph attention networks, which can model the relationships between different audio features, combined with a deep learning-based emotion embedding. This allows the system to learn patterns that distinguish genuine emotional expressions from manipulated ones.
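To make the idea of graph attention over audio features more concrete, here is a minimal pure-Python sketch of a generic single-head graph attention layer in the style of the standard GAT formulation. It is not GADE's actual architecture, which this summary does not specify. The `fuse_emotion` helper, which appends one utterance-level emotion embedding to every node, is a hypothetical fusion choice made only for illustration, and all names, weights, and dimensions are assumptions.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (a list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def fuse_emotion(nodes, emotion_embedding):
    """Hypothetical fusion step (an assumption, not taken from the paper):
    append the same emotion embedding to every node's feature vector."""
    return [list(h) + list(emotion_embedding) for h in nodes]


def graph_attention(nodes, adj, W, a):
    """Generic single-head graph attention.

    nodes: N feature vectors of length F_in
    adj:   N x N 0/1 matrix; adj[i][j] == 1 means node i attends to node j
           (include self-loops so every node has at least one neighbor)
    W:     F_out x F_in projection matrix
    a:     attention vector of length 2 * F_out
    Returns (updated node features, N x N attention weights).
    """
    proj = [matvec(W, h) for h in nodes]
    n = len(nodes)
    out, attn = [], []
    for i in range(n):
        neigh = [j for j in range(n) if adj[i][j]]
        if not neigh:
            raise ValueError(f"node {i} has no neighbors; add a self-loop")
        # Unnormalized score for each edge: LeakyReLU(a . [W h_i || W h_j])
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, proj[i] + proj[j])))
            for j in neigh
        ]
        # Softmax over the neighborhood (shifted by the max for stability)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        row = [0.0] * n
        for j, e in zip(neigh, exps):
            row[j] = e / z
        attn.append(row)
        # New feature of node i: attention-weighted sum of projected neighbors
        out.append(
            [sum(row[j] * proj[j][k] for j in neigh) for k in range(len(proj[i]))]
        )
    return out, attn


# Toy example: three audio feature nodes plus a 1-dimensional emotion embedding.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_emotion(nodes, [0.5])           # each node now has 3 features
adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]      # chain graph with self-loops
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]      # 3 -> 2 projection (arbitrary)
a = [0.1, -0.2, 0.3, 0.4]                    # arbitrary attention vector
out, attn = graph_attention(fused, adj, W, a)
```

Each row of `attn` sums to 1 and is zero wherever `adj` has no edge, so every node is updated only from the nodes it is connected to. A real detector would learn `W` and `a` by training and would typically stack several such layers before a classification head.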

The researchers conducted benchmark experiments on the EmoFake dataset. The results show that EmoFake poses a challenge to a fake audio detection model trained on the LA dataset of ASVspoof 2019. In contrast, the proposed GADE method shows good performance in detecting emotion fake audio.
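This summary does not state which metric the benchmarks report. Equal error rate (EER) is the usual metric in ASVspoof-style evaluations, so the sketch below assumes it purely for illustration. The function name, the score convention (higher means more likely genuine), and the toy scores are all assumptions rather than values from the paper.

```python
def equal_error_rate(genuine_scores, fake_scores):
    """Approximate EER by scanning every observed score as a threshold.

    Assumed convention: a higher score means "more likely genuine".
    Returns the average of the false acceptance and false rejection rates
    at the threshold where the two are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(fake_scores))
    thresholds.append(float("inf"))  # threshold that rejects everything
    best_gap, best_eer = None, None
    for t in thresholds:
        # False rejection: genuine audio scored below the threshold
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: fake audio scored at or above the threshold
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores (made up): one fake clip scores high, one genuine clip scores low.
genuine = [0.9, 0.8, 0.7, 0.3]
fake = [0.1, 0.2, 0.4, 0.75]
eer = equal_error_rate(genuine, fake)
```

Under this convention, a detector that generalizes poorly to emotion fake audio would give many fake clips high scores, which pushes the EER up toward 0.5, the value expected from random guessing.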

Critical Analysis

The EmoFake dataset and GADE detection method address a gap in existing fake audio datasets, which do not cover audio whose emotion has been changed while the speaker identity and content stay the same. By considering semantic changes introduced through emotion tampering, the researchers push fake audio detection toward identifying a wider range of manipulations.

However, the paper does not provide a thorough analysis of the potential limitations or failure modes of the GADE method. It would be helpful to understand the types of emotion-based manipulations that GADE may struggle to detect, or if there are certain audio characteristics or scenarios where it performs poorly.

Additionally, the researchers could have explored the potential for adversarial attacks against the GADE system, which is a common concern with deep learning-based detection methods. Understanding the system's vulnerabilities would help ensure it can be deployed safely in real-world applications.

Overall, the EmoFake dataset and GADE method represent a valuable contribution to the field of fake audio detection, but further research is needed to fully understand their capabilities and limitations.

Conclusion

This paper presents a new dataset called EmoFake and a detection method called GADE to address the challenge of identifying fake audio where the emotional state of the audio has been manipulated. The EmoFake dataset provides a way to test the robustness of fake audio detection systems to semantic changes, which is an important consideration for protecting against potential harms.

The proposed GADE method, which combines graph attention networks with a deep emotion embedding, shows good performance in detecting emotion fake audio on the EmoFake dataset. This work opens up new avenues for fake audio detection and for ensuring that detection systems can identify a wider range of audio manipulations.

As the use of synthetic media continues to grow, developing effective detection methods like GADE will be crucial for maintaining trust and preventing the spread of misinformation that could harm individuals or society.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

EmoFake: An Initial Dataset for Emotion Fake Audio Detection

Yan Zhao, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Xiaohui Zhang, Yongfeng Dong

Many datasets have been designed to further the development of fake audio detection, such as datasets of the ASVspoof and ADD challenges. However, these datasets do not consider the situation in which the emotion of the audio has been changed from one to another, while other information (e.g. speaker identity and content) remains the same. Changing the emotion of an audio clip can lead to semantic changes. Speech with tampered semantics may pose threats to people's lives. Therefore, this paper reports our progress in developing such an emotion fake audio detection dataset, named EmoFake, which involves changing the emotional state of the original audio. The fake audio in EmoFake is generated by open-source emotion voice conversion models. Furthermore, we propose a method named Graph Attention networks using Deep Emotion embedding (GADE) for the detection of emotion fake audio. Some benchmark experiments are conducted on this dataset. The results show that our designed dataset poses a challenge to the fake audio detection model trained with the LA dataset of ASVspoof 2019. The proposed GADE shows good performance in the face of emotion fake audio.

7/25/2024

SceneFake: An Initial Dataset and Benchmarks for Scene Fake Audio Detection

Jiangyan Yi, Chenglong Wang, Jianhua Tao, Chu Yuan Zhang, Cunhang Fan, Zhengkun Tian, Haoxin Ma, Ruibo Fu

Many datasets have been designed to further the development of fake audio detection. However, fake utterances in previous datasets are mostly generated by altering timbre, prosody, linguistic content or channel noise of original audio. These datasets leave out a scenario in which the acoustic scene of an original audio is manipulated with a forged one. It will pose a major threat to our society if some people misuse the manipulated audio for malicious purposes. Therefore, this motivates us to fill in the gap. This paper proposes such a dataset for scene fake audio detection named SceneFake, where a manipulated audio is generated by only tampering with the acoustic scene of a real utterance by using speech enhancement technologies. Some scene fake audio detection benchmark results on the SceneFake dataset are reported in this paper. In addition, an analysis of fake attacks with different speech enhancement technologies and signal-to-noise ratios is presented in this paper. The results indicate that scene fake utterances cannot be reliably detected by baseline models trained on the ASVspoof 2019 dataset. Although these models perform well on the SceneFake training set and seen testing set, their performance is poor on the unseen test set. The dataset (https://zenodo.org/record/7663324#.Y_XKMuPYuUk) and benchmark source codes (https://github.com/ADDchallenge/SceneFake) are publicly available.

4/5/2024

FakeSound: Deepfake General Audio Detection

Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu

With the advancement of audio generation, generative models can produce highly realistic audios. However, the proliferation of deepfake general audio can pose negative consequences. Therefore, we propose a new task, deepfake general audio detection, which aims to identify whether audio content is manipulated and to locate deepfake regions. Leveraging an automated manipulation pipeline, a dataset named FakeSound for deepfake general audio detection is proposed, and samples can be viewed on website https://FakeSoundData.github.io. The average binary accuracy of humans on all test sets is consistently below 0.6, which indicates the difficulty humans face in discerning deepfake audio and affirms the efficacy of the FakeSound dataset. A deepfake detection model utilizing a general audio pre-trained model is proposed as a benchmark system. Experimental results demonstrate that the performance of the proposed model surpasses the state-of-the-art in deepfake speech detection and human testers.

6/13/2024

EmoFace: Audio-driven Emotional 3D Face Animation

Chang Liu, Qunfen Lin, Zijiao Zeng, Ye Pan

Audio-driven emotional 3D face animation aims to generate emotionally expressive talking heads with synchronized lip movements. However, previous research has often overlooked the influence of diverse emotions on facial expressions or proved unsuitable for driving MetaHuman models. In response to this deficiency, we introduce EmoFace, a novel audio-driven methodology for creating facial animations with vivid emotional dynamics. Our approach can generate facial expressions with multiple emotions, and has the ability to generate random yet natural blinks and eye movements, while maintaining accurate lip synchronization. We propose independent speech encoders and emotion encoders to learn the relationship between audio, emotion and corresponding facial controller rigs, and finally map into the sequence of controller values. Additionally, we introduce two post-processing techniques dedicated to enhancing the authenticity of the animation, particularly in blinks and eye movements. Furthermore, recognizing the scarcity of emotional audio-visual data suitable for MetaHuman model manipulation, we contribute an emotional audio-visual dataset and derive control parameters for each frame. Our proposed methodology can be applied in producing dialogue animations of non-playable characters (NPCs) in video games, and driving avatars in virtual reality environments. Our further quantitative and qualitative experiments, as well as a user study comparing with existing research, show that our approach demonstrates superior results in driving 3D facial models. The code and sample data are available at https://github.com/SJTU-Lucy/EmoFace.

7/18/2024