Sentiment Analysis Dataset in Moroccan Dialect: Bridging the Gap Between Arabic and Latin Scripted dialect

Read original: arXiv:2303.15987 - Published 9/16/2024 by Mouad Jbel, Mourad Jabrane, Imad Hafidi, Abdulmutallib Metrane

🗣️

Overview

Sentiment analysis is the process of determining emotions or opinions expressed in text.
This research focuses on sentiment analysis of the Moroccan dialect, which features a unique linguistic landscape with multiple scripts.
Previous work in sentiment analysis has primarily targeted dialects using Arabic script, but the Moroccan web content blends Arabic and Latin script.
The study emphasizes the importance of extending sentiment analysis to encompass the full spectrum of Moroccan linguistic diversity.

Plain English Explanation

The paper explores sentiment analysis, which is the process of automatically detecting emotions or opinions in written text. This is an important task in the field of natural language processing.

One aspect of sentiment analysis that has not received much attention is the Moroccan dialect. The Moroccan language has a unique mix of Arabic and Latin scripts, which makes it different from other dialects that have primarily used Arabic script. Previous sentiment analysis research has focused on those Arabic script dialects, but may not capture the full complexity of Moroccan web content.

To address this gap, the researchers created the largest publicly available dataset for sentiment analysis of the Moroccan dialect. This dataset includes text written in both Arabic and Latin scripts. Having a diverse dataset is important for training machine learning models to accurately analyze Moroccan sentiment.

The researchers then tested different machine learning models on this dataset to see how well they could perform sentiment analysis on Moroccan text. They found that their best model achieved 92% accuracy, and it also performed well on smaller existing Moroccan dialect datasets.

Technical Explanation

The core of this research is the creation of the largest publicly available dataset for sentiment analysis of the Moroccan dialect. This dataset incorporates Moroccan dialect text written in both Arabic and Latin scripts, addressing a gap in previous work that focused on dialects using Arabic script alone.

To construct the dataset, the researchers assembled a diverse range of Moroccan text data, including 20,000 manually labeled samples. They also made publicly available lists of stop words in the Moroccan dialect, which are common words that can be removed to improve text processing.

The researchers then conducted a comparative study, testing multiple machine learning models on the dataset to assess their performance on Moroccan sentiment analysis. They experimented with both raw and preprocessed data to demonstrate the importance of the preprocessing step.

Their best-performing model achieved an impressive 92% accuracy. To further validate the model's reliability, they also tested it on smaller publicly available Moroccan dialect datasets, and the results were favorable.

Critical Analysis

The paper makes a valuable contribution by addressing the underrepresented area of sentiment analysis for the Moroccan dialect, which features a unique linguistic landscape.

One potential limitation is that the dataset, while the largest publicly available, may still not be comprehensive enough to capture the full diversity of the Moroccan dialect. The researchers acknowledge this and suggest that further expansion of the dataset could be an area for future research.

Additionally, the paper focuses on machine learning models for sentiment analysis, but does not explore other approaches, such as rule-based or hybrid methods, which could also be valuable for this task.

Overall, the research provides a solid foundation for sentiment analysis of the Moroccan dialect and highlights the importance of considering linguistic diversity in natural language processing tasks.

Conclusion

This study emphasizes the significance of extending sentiment analysis to encompass the full spectrum of Moroccan linguistic diversity, which includes both Arabic and Latin scripts. By creating the largest publicly available dataset for Moroccan dialect sentiment analysis, the researchers have made an important contribution to the field.

The high accuracy achieved by the best-performing machine learning model, along with its favorable performance on smaller datasets, suggests that the research has successfully bridged the gap in sentiment analysis for the Moroccan dialect. This work lays the groundwork for further advancements in understanding and processing Moroccan web content, with potential applications in areas such as social media monitoring, customer feedback analysis, and more.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

🗣️

Sentiment Analysis Dataset in Moroccan Dialect: Bridging the Gap Between Arabic and Latin Scripted dialect

Mouad Jbel, Mourad Jabrane, Imad Hafidi, Abdulmutallib Metrane

Sentiment analysis, the automated process of determining emotions or opinions expressed in text, has seen extensive exploration in the field of natural language processing. However, one aspect that has remained underrepresented is the sentiment analysis of the Moroccan dialect, which boasts a unique linguistic landscape and the coexistence of multiple scripts. Previous works in sentiment analysis primarily targeted dialects employing Arabic script. While these efforts provided valuable insights, they may not fully capture the complexity of Moroccan web content, which features a blend of Arabic and Latin script. As a result, our study emphasizes the importance of extending sentiment analysis to encompass the entire spectrum of Moroccan linguistic diversity. Central to our research is the creation of the largest public dataset for Moroccan dialect sentiment analysis that incorporates not only Moroccan dialect written in Arabic script but also in Latin letters. By assembling a diverse range of textual data, we were able to construct a dataset with a range of 20 000 manually labeled text in Moroccan dialect and also publicly available lists of stop words in Moroccan dialect. To dive into sentiment analysis, we conducted a comparative study on multiple Machine learning models to assess their compatibility with our dataset. Experiments were performed using both raw and preprocessed data to show the importance of the preprocessing step. We were able to achieve 92% accuracy in our model and to further prove its liability we tested our model on smaller publicly available datasets of Moroccan dialect and the results were favorable.

9/16/2024

101 Billion Arabic Words Dataset

Manel Aloui, Hasna Chouikhi, Ghaith Chaabane, Haithem Kchaou, Chehir Dhaouadi

In recent years, Large Language Models have revolutionized the field of natural language processing, showcasing an impressive rise predominantly in English-centric domains. These advancements have set a global benchmark, inspiring significant efforts toward developing Arabic LLMs capable of understanding and generating the Arabic language with remarkable accuracy. Despite these advancements, a critical challenge persists: the potential bias in Arabic LLMs, primarily attributed to their reliance on datasets comprising English data that has been translated into Arabic. This reliance not only compromises the authenticity of the generated content but also reflects a broader issue -the scarcity of original quality Arabic linguistic data. This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic Language Models that are true to both the linguistic and nuances of the region. We undertook a large-scale data mining project, extracting a substantial volume of text from the Common Crawl WET files, specifically targeting Arabic content. The extracted data underwent a rigorous cleaning and deduplication process, using innovative techniques to ensure the integrity and uniqueness of the dataset. The result is the 101 Billion Arabic Words Dataset, the largest Arabic dataset available to date, which can significantly contribute to the development of authentic Arabic LLMs. This study not only highlights the potential for creating linguistically and culturally accurate Arabic LLMs but also sets a precedent for future research in enhancing the authenticity of Arabic language models.

5/6/2024

💬

Personality Analysis for Social Media Users using Arabic language and its Effect on Sentiment Analysis

Mokhaiber Dandash, Masoud Asadpour

Social media is heading towards more and more personalization, where individuals reveal their beliefs, interests, habits, and activities, simply offering glimpses into their personality traits. This study, explores the correlation between the use of Arabic language on twitter, personality traits and its impact on sentiment analysis. We indicated the personality traits of users based on the information extracted from their profile activities, and the content of their tweets. Our analysis incorporated linguistic features, profile statistics (including gender, age, bio, etc.), as well as additional features like emoticons. To obtain personality data, we crawled the timelines and profiles of users who took the 16personalities test in Arabic on 16personalities.com. Our dataset, AraPers, comprised 3,250 users who shared their personality results on twitter. We implemented various machine learning techniques, to reveal personality traits and developed a dedicated model for this purpose, achieving a 74.86% accuracy rate with BERT, analysis of this dataset proved that linguistic features, profile features and derived model can be used to differentiate between different personality traits. Furthermore, our findings demonstrated that personality affect sentiment in social media. This research contributes to the ongoing efforts in developing robust understanding of the relation between human behaviour on social media and personality features for real-world applications, such as political discourse analysis, and public opinion tracking.

7/24/2024

M2SA: Multimodal and Multilingual Model for Sentiment Analysis of Tweets

Gaurish Thakkar, Sherzod Hakimov, Marko Tadi'c

In recent years, multimodal natural language processing, aimed at learning from diverse data types, has garnered significant attention. However, there needs to be more clarity when it comes to analysing multimodal tasks in multi-lingual contexts. While prior studies on sentiment analysis of tweets have predominantly focused on the English language, this paper addresses this gap by transforming an existing textual Twitter sentiment dataset into a multimodal format through a straightforward curation process. Our work opens up new avenues for sentiment-related research within the research community. Additionally, we conduct baseline experiments utilising this augmented dataset and report the findings. Notably, our evaluations reveal that when comparing unimodal and multimodal configurations, using a sentiment-tuned large language model as a text encoder performs exceptionally well.

6/13/2024