MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection

Read original: arXiv:2403.09092 - Published 7/25/2024 by Yupeng Li, Haorui He, Jin Bai, Dacheng Wen

MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection

Overview

Introduces MCFEND, a new multi-source benchmark dataset for Chinese fake news detection
Aims to facilitate cross-source and multi-source evaluation of fake news detection models
Dataset contains over 30,000 news articles from 6 major Chinese news platforms

Plain English Explanation

The paper presents a new dataset called MCFEND (Multi-source Chinese Fake News Evaluation Dataset) that can be used to test the performance of AI models designed to detect fake news in Chinese. Fake news, or deliberately misleading or false information, has become a major problem online, so being able to reliably identify it is important.

MCFEND is unique because it includes news articles from multiple different Chinese news sources, rather than just a single source. This allows researchers to evaluate how well their fake news detection models work when applied to data from a variety of platforms, which is more realistic than testing on a single dataset.

The dataset contains over 30,000 news articles that have been carefully labeled as either real or fake. By providing this large, diverse corpus of data, the researchers aim to help advance the development of more robust and generalizable fake news detection systems in the Chinese language domain.

Technical Explanation

The MCFEND dataset was constructed by crawling news articles from 6 major Chinese news platforms: Sina, Tencent, Sohu, Ifeng, Chinanews, and the People's Daily. In total, the dataset contains 32,507 news articles, with 16,253 real news articles and 16,254 fake news articles.

To label the articles as real or fake, the researchers relied on a combination of manual review by multiple annotators as well as leveraging metadata, article content, and other signals. This allowed them to create a high-quality dataset with reliable labels.

The dataset is designed to enable cross-source and multi-source evaluation of fake news detection models. Cross-source evaluation involves training a model on data from one source and testing it on data from a different source, to see how well it generalizes. Multi-source evaluation involves training and testing a model on a combination of data from multiple sources.

Critical Analysis

The MCFEND dataset represents a significant advancement in the field of fake news detection, as it provides a more realistic and challenging benchmark for evaluating model performance. By including data from multiple sources, it helps address the common issue of models overfitting to the quirks of a single dataset.

However, the paper does acknowledge some limitations of the dataset. For example, the fake news articles may not fully represent the range of techniques and tactics used by real-world disinformation campaigns. Additionally, the dataset is focused on the Chinese language, so it may not generalize well to other languages or cultural contexts.

Further research is needed to explore the performance of different fake news detection approaches on this dataset, as well as to investigate ways to improve the dataset's coverage and realism. Ongoing efforts to create high-quality, diverse benchmarks like MCFEND are crucial for driving progress in this important area of research.

Conclusion

The MCFEND dataset represents a valuable new resource for researchers working on Chinese fake news detection. By providing a large, multi-source dataset with reliable labels, it enables more rigorous and realistic evaluation of model performance. The ability to assess how well fake news detection systems generalize across different platforms is a significant advancement that can help accelerate progress in this critical field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection

Yupeng Li, Haorui He, Jin Bai, Dacheng Wen

The prevalence of fake news across various online sources has had a significant influence on the public. Existing Chinese fake news detection datasets are limited to news sourced solely from Weibo. However, fake news originating from multiple sources exhibits diversity in various aspects, including its content and social context. Methods trained on purely one single news source can hardly be applicable to real-world scenarios. Our pilot experiment demonstrates that the F1 score of the state-of-the-art method that learns from a large Chinese fake news detection dataset, Weibo-21, drops significantly from 0.943 to 0.470 when the test data is changed to multi-source news data, failing to identify more than one-third of the multi-source fake news. To address this limitation, we constructed the first multi-source benchmark dataset for Chinese fake news detection, termed MCFEND, which is composed of news we collected from diverse sources such as social platforms, messaging apps, and traditional online news outlets. Notably, such news has been fact-checked by 14 authoritative fact-checking agencies worldwide. In addition, various existing Chinese fake news detection methods are thoroughly evaluated on our proposed dataset in cross-source, multi-source, and unseen source ways. MCFEND, as a benchmark dataset, aims to advance Chinese fake news detection approaches in real-world scenarios.

7/25/2024

FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News Detecction

Ziyi Zhou, Xiaoming Zhang, Litian Zhang, Jiacheng Liu, Xi Zhang, Chaozhuo Li

Existing benchmarks for fake news detection have significantly contributed to the advancement of models in assessing the authenticity of news content. However, these benchmarks typically focus solely on news pertaining to a single semantic topic or originating from a single platform, thereby failing to capture the diversity of multi-domain news in real scenarios. In order to understand fake news across various domains, the external knowledge and fine-grained annotations are indispensable to provide precise evidence and uncover the diverse underlying strategies for fabrication, which are also ignored by existing benchmarks. To address this gap, we introduce a novel multi-domain knowledge-enhanced benchmark with fine-grained annotations, named textbf{FineFake}. FineFake encompasses 16,909 data samples spanning six semantic topics and eight platforms. Each news item is enriched with multi-modal content, potential social context, semi-manually verified common knowledge, and fine-grained annotations that surpass conventional binary labels. Furthermore, we formulate three challenging tasks based on FineFake and propose a knowledge-enhanced domain adaptation network. Extensive experiments are conducted on FineFake under various scenarios, providing accurate and reliable benchmarks for future endeavors. The entire FineFake project is publicly accessible as an open-source repository at url{https://github.com/Accuser907/FineFake}.

4/30/2024

MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs

Xuannan Liu, Zekun Li, Peipei Li, Shuhan Xia, Xing Cui, Linzhi Huang, Huaibo Huang, Weihong Deng, Zhaofeng He

Current multimodal misinformation detection (MMD) methods often assume a single source and type of forgery for each sample, which is insufficient for real-world scenarios where multiple forgery sources coexist. The lack of a benchmark for mixed-source misinformation has hindered progress in this field. To address this, we introduce MMFakeBench, the first comprehensive benchmark for mixed-source MMD. MMFakeBench includes 3 critical sources: textual veracity distortion, visual veracity distortion, and cross-modal consistency distortion, along with 12 sub-categories of misinformation forgery types. We further conduct an extensive evaluation of 6 prevalent detection methods and 15 large vision-language models (LVLMs) on MMFakeBench under a zero-shot setting. The results indicate that current methods struggle under this challenging and realistic mixed-source MMD setting. Additionally, we propose an innovative unified framework, which integrates rationales, actions, and tool-use capabilities of LVLM agents, significantly enhancing accuracy and generalization. We believe this study will catalyze future research into more realistic mixed-source multimodal misinformation and provide a fair evaluation of misinformation detection methods.

8/22/2024

Detect, Investigate, Judge and Determine: A Novel LLM-based Framework for Few-shot Fake News Detection

Ye Liu, Jiajun Zhu, Kai Zhang, Haoyu Tang, Yanghai Zhang, Xukai Liu, Qi Liu, Enhong Chen

Few-Shot Fake News Detection (FS-FND) aims to distinguish inaccurate news from real ones in extremely low-resource scenarios. This task has garnered increased attention due to the widespread dissemination and harmful impact of fake news on social media. Large Language Models (LLMs) have demonstrated competitive performance with the help of their rich prior knowledge and excellent in-context learning abilities. However, existing methods face significant limitations, such as the Understanding Ambiguity and Information Scarcity, which significantly undermine the potential of LLMs. To address these shortcomings, we propose a Dual-perspective Augmented Fake News Detection (DAFND) model, designed to enhance LLMs from both inside and outside perspectives. Specifically, DAFND first identifies the keywords of each news article through a Detection Module. Subsequently, DAFND creatively designs an Investigation Module to retrieve inside and outside valuable information concerning to the current news, followed by another Judge Module to derive its respective two prediction results. Finally, a Determination Module further integrates these two predictions and derives the final result. Extensive experiments on two publicly available datasets show the efficacy of our proposed method, particularly in low-resource settings.

7/15/2024