FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News Detection

Read original: arXiv:2404.01336 - Published 4/30/2024 by Ziyi Zhou, Xiaoming Zhang, Litian Zhang, Jiacheng Liu, Xi Zhang, Chaozhuo Li

Overview

The paper describes the development of a new dataset called "FineFake" for fine-grained multi-domain fake news detection.
The dataset contains 16,909 samples spanning six semantic topics and eight platforms, with fine-grained annotations that go beyond binary real/fake labels.
Each news item is enriched with multi-modal content, social context, and semi-manually verified common knowledge to help train more accurate fake news detection models.

Plain English Explanation

The researchers created a new dataset called FineFake to help improve the ability of AI systems to detect fake news. Fake news is a big problem these days, as it can spread misinformation and sway people's opinions.

The FineFake dataset contains nearly 17,000 real and fake news items across six topics, such as politics, entertainment, and health, collected from eight platforms. By having articles from many domains, the dataset allows AI models to learn the patterns of fake news more broadly, rather than just for a single subject.

In addition to the articles themselves, the dataset also includes extra information or "knowledge-based features" about each article. This might include details about the publisher, author, claims made in the article, and how the article relates to real-world events and entities. Giving the AI models this additional context helps them better distinguish real news from fake.

The goal is that by training AI systems on the FineFake dataset, they will become more accurate at detecting fake news in the real world, across many different topics. This could be an important tool for combating the spread of misinformation online.

Technical Explanation

The FineFake dataset was developed to address limitations in existing fake news detection datasets, which often focus on a single domain or lack contextual information about the articles.

The dataset contains 16,909 data samples spanning six semantic topics, such as politics, business, and entertainment, and eight platforms. Each news item carries fine-grained annotations that go beyond conventional binary real/fake labels.

Beyond the article text, each news item is enriched with additional context: multi-modal content, potential social context, and semi-manually verified common knowledge about what the article references.
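To make the shape of such an enriched sample concrete, here is a minimal sketch in Python. Every field name (`text`, `topic`, `platform`, `label`, `image_path`, `knowledge`, `comments`) and every example value is hypothetical and chosen for illustration only; this is not the schema of the released FineFake files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NewsSample:
    """One knowledge-enriched news item (hypothetical schema, not FineFake's)."""
    text: str                         # article or post text
    topic: str                        # semantic topic, e.g. "health"
    platform: str                     # where the item was collected
    label: str                        # annotation; may be finer than real/fake
    image_path: Optional[str] = None  # multi-modal content, if any
    knowledge: Dict[str, str] = field(default_factory=dict)  # entity -> description
    comments: List[str] = field(default_factory=list)        # social context


def to_model_input(sample: NewsSample) -> str:
    """Join the text and its knowledge snippets into one string for a text model."""
    facts = [f"{entity}: {desc}" for entity, desc in sorted(sample.knowledge.items())]
    return " [SEP] ".join([sample.text] + facts)


# Made-up example values, used only to show how the pieces fit together.
sample = NewsSample(
    text="Example headline about a vaccine.",
    topic="health",
    platform="example-platform",
    label="fake",
    knowledge={"vaccine": "a preparation that stimulates immunity"},
)
print(to_model_input(sample))
```

Concatenating knowledge snippets onto the text is only one simple way to expose the extra context to a model; a real system might encode images, comments, and knowledge separately.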

The researchers formulate three challenging tasks based on FineFake and propose a knowledge-enhanced domain adaptation network. They report extensive experiments under various scenarios to evaluate how much the additional knowledge helps compared with using the article text alone.
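One common way to probe whether a detector generalizes across domains is to hold out one topic for testing and train on the rest. The sketch below illustrates that idea on toy data; the `(text, topic, label)` tuple layout, the function names, and the majority-class baseline are all assumptions made for illustration, and this is not necessarily the protocol used in the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple

Sample = Tuple[str, str, int]  # (text, topic, label); 1 = fake, 0 = real (assumed)


def leave_one_topic_out(samples: List[Sample], target_topic: str):
    """Train on every topic except `target_topic`; test only on `target_topic`."""
    train = [s for s in samples if s[1] != target_topic]
    test = [s for s in samples if s[1] == target_topic]
    return train, test


def fit_majority(train: List[Sample]) -> Callable[[str], int]:
    """Trivial baseline: always predict the most frequent training label."""
    majority = Counter(label for _, _, label in train).most_common(1)[0][0]
    return lambda text: majority


def accuracy(model: Callable[[str], int], test: List[Sample]) -> float:
    """Fraction of test samples whose predicted label matches the true label."""
    if not test:
        raise ValueError("test set is empty")
    correct = sum(model(text) == label for text, _, label in test)
    return correct / len(test)


# Toy data, invented for this example.
toy = [
    ("a", "politics", 1),
    ("b", "politics", 1),
    ("c", "business", 1),
    ("d", "business", 0),
    ("e", "health", 0),
    ("f", "health", 1),
]

train, test = leave_one_topic_out(toy, "health")
model = fit_majority(train)
print(len(train), len(test), accuracy(model, test))
```

Swapping `fit_majority` for a real classifier, with and without the knowledge features, would give a like-for-like comparison on the held-out topic.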

Critical Analysis

A key strength of the FineFake dataset is its breadth, covering a diverse range of news domains. This helps ensure the resulting AI models can generalize well to detect fake news across different topics, rather than just specialized areas.

That said, the researchers acknowledge that the dataset is still limited to English-language articles. Expanding to multi-lingual coverage could further broaden the applicability of the models.

Additionally, although the dataset's fine-grained annotations go beyond binary real/fake labels, further refining the taxonomy of fake news types could yield additional insights.

Overall, the FineFake dataset represents an important step forward in creating richer, more contextual resources for advancing fake news detection capabilities. Continued refinement and expansion of such datasets will be crucial as AI systems take on a greater role in combating online misinformation.

Conclusion

The FineFake dataset provides a new, knowledge-enriched resource to train more accurate and generalizable fake news detection models. By incorporating a diverse range of news domains and leveraging external contextual information, it aims to equip AI systems with a more holistic understanding of the patterns and characteristics of fake content.

As the challenge of misinformation continues to grow, tools like FineFake will be vital for developing robust solutions to identify and mitigate the spread of fake news across the internet. The dataset's focus on multi-domain coverage and contextual features represents an important advance in this critical area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News Detection

Ziyi Zhou, Xiaoming Zhang, Litian Zhang, Jiacheng Liu, Xi Zhang, Chaozhuo Li

Existing benchmarks for fake news detection have significantly contributed to the advancement of models in assessing the authenticity of news content. However, these benchmarks typically focus solely on news pertaining to a single semantic topic or originating from a single platform, thereby failing to capture the diversity of multi-domain news in real scenarios. In order to understand fake news across various domains, the external knowledge and fine-grained annotations are indispensable to provide precise evidence and uncover the diverse underlying strategies for fabrication, which are also ignored by existing benchmarks. To address this gap, we introduce a novel multi-domain knowledge-enhanced benchmark with fine-grained annotations, named FineFake. FineFake encompasses 16,909 data samples spanning six semantic topics and eight platforms. Each news item is enriched with multi-modal content, potential social context, semi-manually verified common knowledge, and fine-grained annotations that surpass conventional binary labels. Furthermore, we formulate three challenging tasks based on FineFake and propose a knowledge-enhanced domain adaptation network. Extensive experiments are conducted on FineFake under various scenarios, providing accurate and reliable benchmarks for future endeavors. The entire FineFake project is publicly accessible as an open-source repository at https://github.com/Accuser907/FineFake.

4/30/2024

MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection

Yupeng Li, Haorui He, Jin Bai, Dacheng Wen

The prevalence of fake news across various online sources has had a significant influence on the public. Existing Chinese fake news detection datasets are limited to news sourced solely from Weibo. However, fake news originating from multiple sources exhibits diversity in various aspects, including its content and social context. Methods trained on purely one single news source can hardly be applicable to real-world scenarios. Our pilot experiment demonstrates that the F1 score of the state-of-the-art method that learns from a large Chinese fake news detection dataset, Weibo-21, drops significantly from 0.943 to 0.470 when the test data is changed to multi-source news data, failing to identify more than one-third of the multi-source fake news. To address this limitation, we constructed the first multi-source benchmark dataset for Chinese fake news detection, termed MCFEND, which is composed of news we collected from diverse sources such as social platforms, messaging apps, and traditional online news outlets. Notably, such news has been fact-checked by 14 authoritative fact-checking agencies worldwide. In addition, various existing Chinese fake news detection methods are thoroughly evaluated on our proposed dataset in cross-source, multi-source, and unseen source ways. MCFEND, as a benchmark dataset, aims to advance Chinese fake news detection approaches in real-world scenarios.

7/25/2024

COOL: Comprehensive Knowledge Enhanced Prompt Learning for Domain Adaptive Few-shot Fake News Detection

Yi Ouyang, Peng Wu, Li Pan

Most Fake News Detection (FND) methods often struggle with data scarcity for emerging news domain. Recently, prompt learning based on Pre-trained Language Models (PLM) has emerged as a promising approach in domain adaptive few-shot learning, since it greatly reduces the need for labeled data by bridging the gap between pre-training and downstream task. Furthermore, external knowledge is also helpful in verifying emerging news, as emerging news often involves timely knowledge that may not be contained in the PLM's outdated prior knowledge. To this end, we propose COOL, a Comprehensive knOwledge enhanced prOmpt Learning method for domain adaptive few-shot FND. Specifically, we propose a comprehensive knowledge extraction module to extract both structured and unstructured knowledge that are positively or negatively correlated with news from external sources, and adopt an adversarial contrastive enhanced hybrid prompt learning strategy to model the domain-invariant news-knowledge interaction pattern for FND. Experimental results demonstrate the superiority of COOL over various state-of-the-arts.

6/18/2024

GM-DF: Generalized Multi-Scenario Deepfake Detection

Yingxin Lai, Zitong Yu, Jing Yang, Bin Li, Xiangui Kang, Linlin Shen

Existing face forgery detection usually follows the paradigm of training models in a single domain, which leads to limited generalization capacity when unseen scenarios and unknown attacks occur. In this paper, we elaborately investigate the generalization capacity of deepfake detection models when jointly trained on multiple face forgery detection datasets. We first find a rapid degradation of detection accuracy when models are directly trained on combined datasets due to the discrepancy across collection scenarios and generation methods. To address the above issue, a Generalized Multi-Scenario Deepfake Detection framework (GM-DF) is proposed to serve multiple real-world scenarios by a unified model. First, we propose a hybrid expert modeling approach for domain-specific real/forgery feature extraction. Besides, as for the commonality representation, we use CLIP to extract the common features for better aligning visual and textual features across domains. Meanwhile, we introduce a masked image reconstruction mechanism to force models to capture rich forged details. Finally, we supervise the models via a domain-aware meta-learning strategy to further enhance their generalization capacities. Specifically, we design a novel domain alignment loss to strongly align the distributions of the meta-test domains and meta-train domains. Thus, the updated models are able to represent both specific and common real/forgery features across multiple datasets. In consideration of the lack of study of multi-dataset training, we establish a new benchmark leveraging multi-source data to fairly evaluate the models' generalization capacity on unseen scenarios. Both qualitative and quantitative experiments on five datasets conducted on traditional protocols as well as the proposed benchmark demonstrate the effectiveness of our approach.

7/1/2024