AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild

Read original: arXiv:2405.11697 - Published 5/22/2024 by Nicholas Dufour, Arkanath Pathak, Pouya Samangouei, Nikki Hariri, Shashi Deshetti, Andrew Dudfield, Christopher Guess, Pablo Hernández Escayola, Bobby Tran, Mevan Babakar and Christoph Bregler

Summary of Findings

This paper presents a large-scale survey and dataset called AMMeBa, which explores media-based misinformation in the wild. The key findings include:

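AMMeBa is the result of a two-year study in which human raters annotated media-based misinformation, mostly images, drawn from claims assessed in a large sample of publicly accessible fact checks carrying ClaimReview markup.
The dataset covers a wide range of misinformation topics, from politics and health to entertainment and science.
The researchers designed an image typology that captures the aspects of each image and its manipulation relevant to the image's role in the misinformation claim.
Simple methods, particularly context manipulations, dominated historically and still held a majority when data collection ended in November 2023, while generative AI-based content became common only recently.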
AMMeBa provides a valuable resource for researchers and practitioners working on misinformation detection, understanding, and mitigation.

Introduction

The paper describes the growing challenge of media-based misinformation, which can spread rapidly online and have significant societal impacts. To better understand and address this issue, the researchers created the AMMeBa dataset, a large-scale collection of media content annotated for misinformation. AMMeBa aims to provide a comprehensive resource for researchers and developers working on misinformation-related problems.

Dataset and Methodology

The researchers drew on a large sample of publicly accessible fact checks that carry ClaimReview markup. Over two years, human raters annotated the media-based misinformation claims assessed in those fact checks, focusing mostly on images.

The raters applied an image typology designed to capture the aspects of the image and its manipulation that matter for the image's role in the misinformation claim. The authors also chart how these types are distributed over time, with data collection ending in November 2023.

The resulting AMMeBa dataset includes a wealth of information, including the original media content, metadata, and detailed annotations. This comprehensive dataset can be used to develop and evaluate a wide range of misinformation-related technologies and research.
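
The paper's abstract notes that the annotated claims come from fact checks carrying ClaimReview markup, a schema.org type that publishers embed in their pages as structured data. As a rough illustration of what such a record looks like, the sketch below parses one JSON-LD ClaimReview object in Python. The sample claim, URL, publisher, and rating are invented for illustration, and the selection of fields is an assumption; it is not the authors' collection pipeline.

```python
import json

# Hypothetical ClaimReview JSON-LD object. The property names follow
# schema.org; every value here is invented for illustration.
SAMPLE = """
{
  "@context": "https://schema.org",
  "@type": "ClaimReview",
  "url": "https://factchecker.example/reviews/123",
  "datePublished": "2023-06-01",
  "claimReviewed": "A photo shows a shark swimming on a flooded highway.",
  "author": {"@type": "Organization", "name": "Example Fact Checker"},
  "reviewRating": {"@type": "Rating", "alternateName": "False"}
}
"""


def parse_claim_review(raw: str) -> dict:
    """Extract a few commonly used fields from one ClaimReview object."""
    data = json.loads(raw)
    if data.get("@type") != "ClaimReview":
        raise ValueError("not a ClaimReview object")
    return {
        "claim": data.get("claimReviewed"),
        "verdict": data.get("reviewRating", {}).get("alternateName"),
        "publisher": data.get("author", {}).get("name"),
        "published": data.get("datePublished"),
        "url": data.get("url"),
    }


record = parse_claim_review(SAMPLE)
print(record["claim"], "->", record["verdict"])
```

Real fact-check pages vary in which optional properties they fill in, so the lookups use `.get` and return `None` for missing fields rather than failing.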

Applications and Potential Impact

The AMMeBa dataset can be used to advance research and development in several key areas, including:

Misinformation detection and classification: Researchers can use the dataset to train and test machine learning models for automatically identifying and categorizing different types of misinformation.
Misinformation correction and debunking: The dataset can inform the development of systems that can effectively counter the spread of misinformation, such as large language model-powered agents.
Multimodal misinformation analysis: The inclusion of diverse media types in AMMeBa allows for research on understanding and detecting misinformation across different modalities.

By providing a comprehensive and well-annotated dataset, the AMMeBa project aims to catalyze advancements in the field of misinformation research and lead to more effective strategies for addressing this growing challenge.
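
As one illustration of working with annotations of this kind, the Python sketch below computes the share of each label per year, which is the sort of tally behind a distribution-over-time view. The label names, records, and counts are hypothetical and do not reflect AMMeBa's actual schema or results.

```python
from collections import Counter, defaultdict

# Hypothetical annotation records as (year, label) pairs.
# Labels and counts are invented for illustration only.
RECORDS = [
    (2021, "context_manipulation"),
    (2021, "context_manipulation"),
    (2021, "content_manipulation"),
    (2023, "context_manipulation"),
    (2023, "context_manipulation"),
    (2023, "ai_generated"),
    (2023, "content_manipulation"),
]


def label_shares_by_year(records):
    """Return {year: {label: fraction of that year's records}}."""
    counts = defaultdict(Counter)
    for year, label in records:
        counts[year][label] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {label: n / total for label, n in counter.items()}
    return shares


shares = label_shares_by_year(RECORDS)
for year in sorted(shares):
    print(year, shares[year])
```

Normalizing within each year, rather than reporting raw counts, keeps years with different numbers of annotated items comparable.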

Limitations and Future Work

The paper acknowledges several limitations of the AMMeBa dataset and areas for future research:

The dataset may not be fully representative of all media-based misinformation, as it relies on content that was identified and captured by the researchers.
The annotation process, while rigorous, may still include some degree of human bias or error.
Continual updates and expansion of the dataset will be necessary to keep pace with the rapidly evolving landscape of online misinformation.

Future research directions could include exploring more advanced techniques for automated misinformation detection, investigating the dynamics of misinformation spread, and developing novel interventions for countering the harmful effects of media-based misinformation.

Conclusion

The AMMeBa dataset represents a significant contribution to the field of misinformation research, providing a large-scale and well-annotated resource for exploring media-based misinformation in the wild. By enabling a wide range of applications and catalyzing further advancements, the AMMeBa project has the potential to help address the growing challenge of online misinformation and its societal impacts.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

AMMeBa: A Large-Scale Survey and Dataset of Media-Based Misinformation In-The-Wild

Nicholas Dufour, Arkanath Pathak, Pouya Samangouei, Nikki Hariri, Shashi Deshetti, Andrew Dudfield, Christopher Guess, Pablo Hernández Escayola, Bobby Tran, Mevan Babakar, Christoph Bregler

The prevalence and harms of online misinformation are a perennial concern for internet platforms, institutions and society at large. Over time, information shared online has become more media-heavy and misinformation has readily adapted to these new modalities. The rise of generative AI-based tools, which provide widely-accessible methods for synthesizing realistic audio, images, video and human-like text, has amplified these concerns. Despite intense public interest and significant press coverage, quantitative information on the prevalence and modality of media-based misinformation remains scarce. Here, we present the results of a two-year study using human raters to annotate online media-based misinformation, mostly focusing on images, based on claims assessed in a large sample of publicly-accessible fact checks with the ClaimReview markup. We present an image typology, designed to capture aspects of the image and manipulation relevant to the image's role in the misinformation claim. We visualize the distribution of these types over time. We show the rise of generative AI-based content in misinformation claims, and that its commonality is a relatively recent phenomenon, occurring significantly after heavy press coverage. We also show that simple methods dominated historically, particularly context manipulations, and continued to hold a majority as of the end of data collection in November 2023. The dataset, Annotated Misinformation, Media-Based (AMMeBa), is publicly-available, and we hope that these data will serve as both a means of evaluating mitigation methods in a realistic setting and as a first-of-its-kind census of the types and modalities of online misinformation.

5/22/2024

ArMeme: Propagandistic Content in Arabic Memes

Firoj Alam, Abul Hasnat, Fatema Ahmed, Md Arid Hasan, Maram Hasanain

With the rise of digital communication, memes have become a significant medium for cultural and political expression that is often used to mislead audiences. Identifying such misleading and persuasive multimodal content has become more important to various stakeholders, including social media platforms, policymakers, and the broader society, because it often causes harm to individuals, organizations, and/or society. While there has been effort to develop AI-based automatic systems for resource-rich languages (e.g., English), there has been relatively little to none for medium- to low-resource languages. In this study, we focused on developing an Arabic memes dataset with manual annotations of propagandistic content. We annotated ~6K Arabic memes collected from various social media platforms, which is a first resource for Arabic multimodal research. We provide a comprehensive analysis aiming to develop computational tools for their detection. We will make them publicly available for the community.

6/7/2024

How Do Social Bots Participate in Misinformation Spread? A Comprehensive Dataset and Analysis

Herun Wan, Minnan Luo, Zihan Ma, Guang Dai, Xiang Zhao

Information spreads faster through social media platforms than through traditional media, making them an ideal medium for spreading misinformation. Meanwhile, automated accounts, known as social bots, contribute more to misinformation dissemination. In this paper, we explore the interplay between social bots and misinformation on the Sina Weibo platform. We propose a comprehensive and large-scale misinformation dataset, containing 11,393 pieces of misinformation and 16,416 pieces of unbiased real information with multiple modality information, with 952,955 related users. We propose a scalable weakly supervised method to annotate social bots, obtaining 68,040 social bots and 411,635 genuine accounts. To the best of our knowledge, this dataset is the largest dataset containing misinformation and social bots. We conduct comprehensive experiments and analysis on this dataset. Results show that social bots play a central role in misinformation dissemination, participating in news discussions to amplify echo chambers, manipulate public sentiment, and reverse public stances.

8/20/2024

AMIR: Automated MisInformation Rebuttal -- A COVID-19 Vaccination Datasets based Recommendation System

Shakshi Sharma, Anwitaman Datta, Rajesh Sharma

Misinformation has emerged as a major societal threat in recent years; in the context of the COVID-19 pandemic specifically, it has wreaked havoc, for instance by fuelling vaccine hesitancy. Cost-effective, scalable solutions for combating misinformation are the need of the hour. This work explored how existing information obtained from social media and augmented with more curated fact checked data repositories can be harnessed to facilitate automated rebuttal of misinformation at scale. While the ideas herein can be generalized and reapplied in the broader context of misinformation mitigation using a multitude of information sources and catering to the spectrum of social media platforms, this work serves as a proof of concept, and as such, it is confined in its scope to only rebuttal of tweets, and in the specific context of misinformation regarding COVID-19. It leverages two publicly available datasets, viz. FaCov (fact-checked articles) and misleading (social media Twitter) data on COVID-19 Vaccination.

7/29/2024