A Multi-Label Dataset of French Fake News: Human and Machine Insights

Read original: arXiv:2403.16099 - Published 4/12/2024 by Benjamin Icard, François Maine, Morgane Casanova, Géraud Faye, Julien Chanson, Guillaume Gadek, Ghislain Atemezing, François Bancilhon, Paul Égré

Overview

• This paper presents OBSINFOX, a multi-label corpus of French fake news, and compares the features that human annotators consider characteristic of fake news with the predictions of automated classifiers.
• The corpus contains 100 documents selected from 17 French press sources that expert agencies consider unreliable. Eight annotators labeled the documents using 11 labels.
• The paper reports a topic and genre analysis performed with Gate Cloud, and uses the subjectivity analyzer VAGO, along with a neural version of it, to examine how the labels Subjective and Fake News relate.

Plain English Explanation

• The researchers built a small but densely annotated collection of French articles from outlets that expert agencies rate as unreliable. Eight people labeled the documents using 11 labels, which is more labels and more annotators than such datasets typically use.
• The collection, called OBSINFOX, contains 100 documents. Collecting this many judgments helps show which features people associate with fake news, and those judgments can be compared with what automated tools predict.
• The annotated dataset is available online, so other researchers studying French-language misinformation can reuse it to develop and test tools for identifying false information.

Technical Explanation

• The researchers built the OBSINFOX corpus: 100 documents selected from 17 sources of French press that expert agencies consider unreliable.
• Eight human annotators labeled the documents using a set of 11 labels, including Subjective and Fake News.
• A topic and genre analysis run with Gate Cloud indicates that satire-like text is prevalent in the corpus.
• The researchers then used the subjectivity analyzer VAGO, and a neural version of it, to clarify the link between ascriptions of the label Subjective and ascriptions of the label Fake News.
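When several annotators label the same document, their label sets have to be combined before labels such as Subjective and Fake News can be compared. The sketch below shows one common way to do this, a majority vote per label, followed by a simple conditional rate between two labels. It is an illustration only: the toy annotations, the function names, and the majority-vote rule are assumptions, not the procedure used in the paper.

```python
from collections import Counter

# Hypothetical toy annotations: document id -> one label set per annotator.
# The label names are illustrative and the data is invented.
annotations = {
    "doc1": [{"Fake News", "Subjective"}, {"Subjective"}, {"Fake News", "Subjective"}],
    "doc2": [{"Subjective"}, set(), {"Subjective", "Exaggeration"}],
    "doc3": [set(), set(), {"Fake News"}],
}


def majority_labels(label_sets):
    """Keep each label chosen by more than half of the annotators."""
    counts = Counter(label for labels in label_sets for label in labels)
    threshold = len(label_sets) / 2
    return {label for label, n in counts.items() if n > threshold}


def conditional_rate(aggregated, given, target):
    """Share of documents carrying `given` that also carry `target`."""
    with_given = [labels for labels in aggregated.values() if given in labels]
    if not with_given:
        return None  # no document carries the conditioning label
    return sum(target in labels for labels in with_given) / len(with_given)


aggregated = {doc: majority_labels(sets) for doc, sets in annotations.items()}
rate = conditional_rate(aggregated, "Subjective", "Fake News")
print(aggregated)
print(rate)
```

A majority vote discards minority judgments, so a study interested in disagreement itself might keep the raw per-annotator counts instead.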

Critical Analysis

• With 100 documents drawn from 17 sources, the corpus is small and may not be representative of all French fake news.
• Manual labeling, even with 8 annotators, can still reflect human bias or inconsistency, which affects how reliable the labels are.
• The comparison between human labels and automated classifiers rests on a small sample, so the reported links, such as the one between Subjective and Fake News, would benefit from testing on more data.
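The concern about inconsistent human labeling can be quantified with an agreement statistic. The sketch below computes Cohen's kappa for two annotators on a single binary label, which corrects raw agreement for the agreement expected by chance. The decisions are invented, and the choice of Cohen's kappa is an assumption made for illustration: it covers only one pair of annotators at a time, and the source does not say which agreement measure, if any, the authors report.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' binary decisions on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty sequences")
    n = len(a)
    # Observed agreement: share of items where both annotators decided alike.
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share of items annotator A marked positive
    p_b = sum(b) / n  # share of items annotator B marked positive
    # Agreement expected if both annotators labeled independently at these rates.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        return 1.0  # both annotators gave the same constant answer
    return (observed - expected) / (1 - expected)


# Hypothetical decisions (1 = label applied) for one label on eight documents.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

Here the two annotators agree on 6 of 8 documents (0.75), chance agreement is 0.5, and kappa is therefore 0.5. With more than two annotators, a multi-rater statistic such as Fleiss' kappa is the usual alternative.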

Conclusion

• This study presents OBSINFOX, a multi-label corpus of French fake news, together with a topic and genre analysis and a comparison of human annotations with automated subjectivity analysis.
• The dataset and the insights derived from it can contribute to a better understanding of misinformation in the French-language press and help develop more effective methods for identifying it.
• Future research could extend the annotation scheme to larger corpora and investigate the impact of misinformation on French-speaking populations.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Multi-Label Dataset of French Fake News: Human and Machine Insights

Benjamin Icard, François Maine, Morgane Casanova, Géraud Faye, Julien Chanson, Guillaume Gadek, Ghislain Atemezing, François Bancilhon, Paul Égré

We present a corpus of 100 documents, OBSINFOX, selected from 17 sources of French press considered unreliable by expert agencies, annotated using 11 labels by 8 annotators. By collecting more labels than usual, by more annotators than is typically done, we can identify features that humans consider as characteristic of fake news, and compare them to the predictions of automated classifiers. We present a topic and genre analysis using Gate Cloud, indicative of the prevalence of satire-like text in the corpus. We then use the subjectivity analyzer VAGO, and a neural version of it, to clarify the link between ascriptions of the label Subjective and ascriptions of the label Fake News. The annotated dataset is available online at the following URL: https://github.com/obs-info/obsinfox

Keywords: Fake News, Multi-Labels, Subjectivity, Vagueness, Detail, Opinion, Exaggeration, French Press

4/12/2024

Predicting Sentence-Level Factuality of News and Bias of Media Outlets

Francielle Vargas, Kokil Jaidka, Thiago A. S. Pardo, Fabrício Benevenuto

Automated news credibility and fact-checking at scale require accurately predicting news factuality and media bias. This paper introduces a large sentence-level dataset, titled FactNews, composed of 6,191 sentences expertly annotated according to factuality and media bias definitions proposed by AllSides. We use FactNews to assess the overall reliability of news sources, by formulating two text classification problems for predicting sentence-level factuality of news reporting and bias of media outlets. Our experiments demonstrate that biased sentences present a higher number of words compared to factual sentences, besides having a predominance of emotions. Hence, the fine-grained analysis of subjectivity and impartiality of news articles provided promising results for predicting the reliability of media outlets. Finally, due to the severity of fake news and political polarization in Brazil, and the lack of research for Portuguese, both dataset and baseline were proposed for Brazilian Portuguese.

9/16/2024

FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News Detection

Ziyi Zhou, Xiaoming Zhang, Litian Zhang, Jiacheng Liu, Xi Zhang, Chaozhuo Li

Existing benchmarks for fake news detection have significantly contributed to the advancement of models in assessing the authenticity of news content. However, these benchmarks typically focus solely on news pertaining to a single semantic topic or originating from a single platform, thereby failing to capture the diversity of multi-domain news in real scenarios. In order to understand fake news across various domains, the external knowledge and fine-grained annotations are indispensable to provide precise evidence and uncover the diverse underlying strategies for fabrication, which are also ignored by existing benchmarks. To address this gap, we introduce a novel multi-domain knowledge-enhanced benchmark with fine-grained annotations, named FineFake. FineFake encompasses 16,909 data samples spanning six semantic topics and eight platforms. Each news item is enriched with multi-modal content, potential social context, semi-manually verified common knowledge, and fine-grained annotations that surpass conventional binary labels. Furthermore, we formulate three challenging tasks based on FineFake and propose a knowledge-enhanced domain adaptation network. Extensive experiments are conducted on FineFake under various scenarios, providing accurate and reliable benchmarks for future endeavors. The entire FineFake project is publicly accessible as an open-source repository at https://github.com/Accuser907/FineFake.

4/30/2024

Exposing and Explaining Fake News On-the-Fly

Francisco de Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, Juan Carlos Burguillo

Social media platforms enable the rapid dissemination and consumption of information. However, users instantly consume such content regardless of the reliability of the shared data. Consequently, the latter crowdsourcing model is exposed to manipulation. This work contributes with an explainable and online classification method to recognize fake news in real-time. The proposed method combines both unsupervised and supervised Machine Learning approaches with online created lexica. The profiling is built using creator-, content- and context-based features using Natural Language Processing techniques. The explainable classification mechanism displays in a dashboard the features selected for classification and the prediction confidence. The performance of the proposed solution has been validated with real data sets from Twitter and the results attain 80 % accuracy and macro F-measure. This proposal is the first to jointly provide data stream processing, profiling, classification and explainability. Ultimately, the proposed early detection, isolation and explanation of fake news contribute to increase the quality and trustworthiness of social media contents.

9/6/2024