A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles

Read original: arXiv:2406.07693 - Published 7/19/2024 by Nirmalya Thakur, Vanessa Su, Mingchen Shao, Kesha A. Patel, Hongseok Jeong, Victoria Knieling, Andrew Bian

Overview

This paper presents a dataset of 4,011 online videos about the 2024 measles outbreak, with per-video metadata and sentiment analysis results.
The dataset covers videos from popular platforms like YouTube and TikTok, as well as various news websites.
Sentiment, subjectivity, and fine-grained emotion analysis were performed on the video titles and descriptions.
The dataset is intended to support the training and testing of machine learning models for sentiment analysis and subjectivity analysis.

Plain English Explanation

This research paper presents a new dataset that contains information about 4,011 online videos related to the measles outbreak that occurred in 2024. The videos were published on 264 different websites, including popular platforms like YouTube and TikTok, as well as the websites of various news organizations.

For each video, the dataset includes the URL, title, description, and publication date. The researchers then analyzed the sentiment, subjectivity, and emotions expressed in the video titles and descriptions using different natural language processing (NLP) techniques. Each title and description was classified as positive, negative, or neutral in sentiment; as highly opinionated, neutral opinionated, or least opinionated in subjectivity; and as expressing one of seven fine-grained classes (fear, surprise, joy, sadness, anger, disgust, or neutral).

The goal of this dataset is to provide a valuable resource for researchers working on sentiment analysis and emotion detection in the context of online discussions about health-related events like disease outbreaks. By having access to this curated dataset, researchers can train and test new machine learning models to better understand how people are reacting to and discussing these important topics on social media and the web.

Technical Explanation

The researchers collected data on 4,011 videos about the 2024 measles outbreak that were published on 264 different websites between January 1 and May 31, 2024. YouTube accounted for the largest share of these videos (48.6%), followed by TikTok (15.2%), with the remaining videos coming from Instagram, Facebook, and the websites of various global and local news organizations.
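As a rough illustration of how per-platform shares like these could be derived from the URL attribute, the sketch below groups URLs by host and converts the counts to percentages. The function names and the sample URLs are made up for illustration; this is not the authors' procedure, which the summary does not describe.

```python
from collections import Counter
from urllib.parse import urlparse


def platform_of(url: str) -> str:
    """Return the host of a video URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def platform_shares(urls: list[str]) -> dict[str, float]:
    """Map each host to its percentage share of the given URLs."""
    counts = Counter(platform_of(u) for u in urls)
    total = sum(counts.values())
    return {host: round(100 * n / total, 1) for host, n in counts.most_common()}


# Made-up URLs for illustration only; they are not taken from the dataset.
sample_urls = [
    "https://www.youtube.com/watch?v=example1",
    "https://www.youtube.com/watch?v=example2",
    "https://www.tiktok.com/@someone/video/123",
    "https://news.example.org/measles-video",
]
shares = platform_shares(sample_urls)
```

Note that this simple grouping treats alternative hosts of the same platform (for example a mobile subdomain or a short-link domain) as separate websites, so a real analysis would likely need a mapping from hosts to platforms.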

For each video, the dataset includes the following attributes:

URL of the video
Title of the video post
Description of the video post
Date the video was published
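One way to picture a single row of the dataset is as a small record type holding these four attributes. The sketch below is only an assumption about how a researcher might model a row in Python: the class name, field names, and example values are hypothetical, and the actual column names in the published dataset may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VideoRecord:
    """One row of the dataset; the field names here are hypothetical."""
    url: str
    title: str
    description: str
    published: date

    def in_collection_window(self) -> bool:
        """True if the date falls in the stated window, Jan 1 to May 31, 2024."""
        return date(2024, 1, 1) <= self.published <= date(2024, 5, 31)


# Invented example values, not an actual dataset row.
record = VideoRecord(
    url="https://www.youtube.com/watch?v=example",
    title="Measles cases rise",
    description="A short news segment.",
    published=date(2024, 3, 15),
)
```

A check such as `record.in_collection_window()` can serve as a quick sanity test when loading the data, since every video is stated to fall between January 1 and May 31, 2024.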

The researchers then performed several types of sentiment and emotion analysis on the video titles and descriptions:

Sentiment analysis using VADER to classify each text as positive, negative, or neutral
Subjectivity analysis using TextBlob to classify each text as highly opinionated, neutral opinionated, or least opinionated
Fine-grained sentiment analysis using DistilRoBERTa-base to categorize the emotion expressed as fear, surprise, joy, sadness, anger, disgust, or neutral
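The first two analyses both turn a numeric score into a class label. The sketch below shows one way that bucketing could look, assuming the scores come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in the vaderSentiment package and from `TextBlob(text).sentiment.subjectivity` in TextBlob. The ±0.05 cut-off is the value commonly suggested in the VADER documentation; the subjectivity cut-offs are placeholders. The summary does not state which thresholds the authors used, so all of them are assumptions.

```python
def sentiment_class(compound: float, threshold: float = 0.05) -> str:
    """Bucket a VADER compound score (range -1 to 1) into three classes."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def subjectivity_class(subjectivity: float, low: float = 0.4, high: float = 0.6) -> str:
    """Bucket a TextBlob subjectivity score (range 0 to 1) into three classes.

    The `low` and `high` cut-offs are placeholders chosen for illustration;
    they are not taken from the paper.
    """
    if subjectivity > high:
        return "highly opinionated"
    if subjectivity < low:
        return "least opinionated"
    return "neutral opinionated"
```

Keeping the thresholds as parameters makes it easy to test how sensitive the resulting class counts are to the chosen cut-offs.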

The results of these analyses are also included as additional attributes in the dataset, so that they can be used to train and test machine learning models for sentiment analysis or subjectivity analysis of health-related content.
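Before using such label attributes for training, it can help to reduce per-class emotion scores to a single label and to check how balanced the labels are. The sketch below assumes the classifier output has already been collected into a dictionary of scores per emotion; the function names and example scores are invented, and nothing here reflects actual model output or the real class distribution of the dataset.

```python
from collections import Counter

EMOTIONS = ("fear", "surprise", "joy", "sadness", "anger", "disgust", "neutral")


def top_emotion(scores: dict[str, float]) -> str:
    """Return the emotion with the highest score among the seven classes."""
    unknown = set(scores) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return max(scores, key=scores.get)


def class_distribution(labels: list[str]) -> dict[str, float]:
    """Percentage of items per label, useful for spotting class imbalance."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * counts[label] / total, 2) for label in counts}


# Invented scores for one title; not real model output.
example_scores = {"fear": 0.62, "neutral": 0.30, "sadness": 0.08}
label = top_emotion(example_scores)
```

A strongly skewed distribution would suggest using stratified splits or class weights when training on these labels.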

Critical Analysis

The dataset and analyses presented in this paper offer a valuable resource for researchers working on sentiment and emotion detection in online text. By focusing on the 2024 measles outbreak, the researchers have created a dataset that is directly relevant to an important public health issue.

However, the dataset is limited to a specific time period (January to May 2024) and may not be representative of broader trends in how people discuss measles or other health topics online. Additionally, the sentiment and emotion analyses rely on existing NLP models, which may have biases or limitations that could affect the reliability of the results.

Further research could explore expanding the dataset to cover a longer time period or a wider range of health-related topics, as well as multi-label annotation for texts that express more than one emotion. Researchers may also want to conduct their own manual annotations of the video content to validate the automated analyses.
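If manual annotations were collected for a sample of titles, one common way to compare them with the automated labels is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal stdlib implementation offered as one possible validation approach; the paper summary does not say that the authors used it.

```python
from collections import Counter


def cohen_kappa(manual: list[str], automated: list[str]) -> float:
    """Cohen's kappa between two label sequences of equal length."""
    if len(manual) != len(automated) or not manual:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(manual)
    # Fraction of items on which the two label sources agree.
    observed = sum(m == a for m, a in zip(manual, automated)) / n
    manual_counts = Counter(manual)
    auto_counts = Counter(automated)
    # Agreement expected by chance, from each source's label frequencies.
    expected = sum(
        manual_counts[label] * auto_counts[label] for label in manual_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate strong agreement beyond chance, while values near 0 indicate agreement no better than chance.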

Conclusion

This paper presents a dataset of online videos related to the 2024 measles outbreak, along with sentiment, subjectivity, and fine-grained emotion analysis of the video titles and descriptions. The dataset is intended to support the training and testing of machine learning models for sentiment analysis and subjectivity analysis, with the ultimate goal of better understanding how people discuss important health-related events on the web and social media.

While the dataset has some limitations, it represents an important contribution to the field and could lead to the development of more robust and accurate models for analyzing online discussions around public health issues. Researchers are encouraged to explore this dataset and build upon the insights presented in the paper to further advance the state of the art in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles

Nirmalya Thakur, Vanessa Su, Mingchen Shao, Kesha A. Patel, Hongseok Jeong, Victoria Knieling, Andrew Bian

The work of this paper presents a dataset that contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. The dataset is available at https://dx.doi.org/10.21227/40s8-xf63. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. Finally, this paper also presents a list of open research questions that may be investigated using this dataset.

7/19/2024

Constructing the CORD-19 Vaccine Dataset

Manisha Singh, Divy Sharma, Alonso Ma, Bridget Tyree, Margaret Mitchell

We introduce a new dataset 'CORD-19-Vaccination' to cater to scientists specifically looking into COVID-19 vaccine-related research. This dataset is extracted from the CORD-19 dataset [Wang et al., 2020] and augmented with new columns for language detail, author demography, keywords, and topic per paper. Facebook's fastText model is used to identify languages [Joulin et al., 2016]. To establish author demography (author affiliation, lab/institution location, and lab/institution country columns) we processed the JSON file for each paper and then further enhanced using Google's search API to determine country values. 'Yake' was used to extract keywords from the title, abstract, and body of each paper and the LDA (Latent Dirichlet Allocation) algorithm was used to add topic information [Campos et al., 2020, 2018a,b]. To evaluate the dataset, we demonstrate a question-answering task like the one used in the CORD-19 Kaggle challenge [Goldbloom et al., 2022]. For further evaluation, sequential sentence classification was performed on each paper's abstract using the model from Dernoncourt et al. [2016]. We partially hand-annotated the training dataset and used a pre-trained BERT-PubMed layer. 'CORD-19-Vaccination' contains 30k research papers and can be immensely valuable for NLP research such as text mining, information extraction, and question answering, specific to the domain of COVID-19 vaccine research.

7/29/2024

Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis

Nirmalya Thakur

The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. No prior work related to social media mining has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper aims to address this research gap and makes two scientific contributions to this field. First, it presents a multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. The dataset, available at https://dx.doi.org/10.21227/7fvc-y093, contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were performed. This process included classifying each post into (i) one of the sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral, (ii) hate or not hate, and (iii) anxiety/stress detected or no anxiety/stress detected. These results are presented as separate attributes in the dataset. Second, this paper presents the results of performing sentiment analysis, hate speech analysis, and anxiety or stress analysis. The variation of the sentiment classes - fear, surprise, joy, sadness, anger, disgust, and neutral were observed to be 27.95%, 2.57%, 8.69%, 5.94%, 2.69%, 1.53%, and 50.64%, respectively. In terms of hate speech detection, 95.75% of the posts did not contain hate and the remaining 4.25% of the posts contained hate. Finally, 72.05% of the posts did not indicate any anxiety/stress, and the remaining 27.95% of the posts represented some form of anxiety/stress.

9/20/2024

Word frequency and sentiment analysis of twitter messages during Coronavirus pandemic

Nikhil Kumar Rajput, Bhavya Ahuja Grover, Vipin Kumar Rathi, Riya Bansal

The COVID-19 epidemic has had a great impact on social media conversation, especially on sites like Twitter, which has emerged as a hub for public reaction and information sharing. This paper analyzes a vast dataset of Twitter messages related to this disease, starting from January 2020. Two approaches were used: a statistical analysis of word frequencies and a sentiment analysis to gauge user attitudes. Word frequencies are modeled using unigrams, bigrams, and trigrams, with power law distribution as the fitting model. The validity of the model is confirmed through metrics like Sum of Squared Errors (SSE), R-squared ($R^2$), and Root Mean Squared Error (RMSE). High $R^2$ and low SSE/RMSE values indicate a good fit for the model. Sentiment analysis is conducted to understand the general emotional tone of Twitter users' messages. The results reveal that a majority of tweets exhibit neutral sentiment polarity, with only 2.57% expressing negative polarity.

6/4/2024