Persian Slang Text Conversion to Formal and Deep Learning of Persian Short Texts on Social Media for Sentiment Classification

Read original: arXiv:2403.06023 - Published 9/5/2024 by Mohsen Khazeni, Mohammad Heydari, Amir Albadvi


Overview

Lack of a tool for analyzing conversational Persian text has made various analyses, including sentiment analysis, difficult.
Researchers created "PSC" (Persian Slang Converter), a tool to convert conversational texts into formal ones.
They used deep learning methods with PSC to improve sentiment analysis of short Persian texts.
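Unsupervised models were trained on more than 10 million unlabeled conversational texts and about 10 million formal news texts.
The sentiment classifier used 60,000 labeled Instagram comments as supervised data and reached 81.91% accuracy on the test data.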

Plain English Explanation

The Persian language has many informal, slang-heavy styles of writing that are commonly used in social media and other conversations. This makes it challenging for machines to analyze the sentiment of these texts.

To address this, the researchers created a tool called PSC (Persian Slang Converter) that can take informal, conversational Persian text and convert it into a more formal, standard style. By using this tool along with advanced deep learning methods, they were able to improve the ability of machines to understand the sentiment (positive, negative, or neutral) expressed in short Persian language texts.

The researchers trained their unsupervised models on more than 10 million unlabeled texts from social networks and movie subtitles, together with about 10 million news texts representing more formal Persian writing. For the sentiment classifier they used 60,000 labeled Instagram comments as supervised data and report an accuracy of 81.91% on the test data.

Technical Explanation

The researchers tackled the challenge of sentiment analysis for conversational Persian text by developing a tool called PSC (Persian Slang Converter) and integrating it with state-of-the-art deep learning techniques.

First, they collected a large corpus of more than 10 million unlabeled conversational texts from social networks and movie subtitles, as well as about 10 million formal news texts. This data was used to train unsupervised models and to implement the PSC formalization tool, which converted 57% of the words in the conversational corpus into their formal equivalents.
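To make the conversion step and the "share of words converted" figure concrete, here is a minimal sketch in Python. It assumes a plain lookup table from slang tokens to formal tokens. The table, the function names, and the two-sentence corpus are all hypothetical, and this summary does not describe how PSC works internally, so this should not be read as the authors' method.

```python
# Hypothetical slang-to-formal lookup table (illustration only, not PSC's data).
# "\u200c" is the zero-width non-joiner used in formal Persian spelling.
SLANG_TO_FORMAL = {
    "میخوام": "می\u200cخواهم",   # "I want"
    "نمیدونم": "نمی\u200cدانم",  # "I don't know"
    "خونه": "خانه",              # "house"
    "میشه": "می\u200cشود",       # "it becomes"
}

# Tiny stand-in for a conversational corpus.
CORPUS = [
    "نمیدونم خونه کجاست",
    "میخوام برم خونه",
]


def convert(text, table):
    """Replace each whitespace-separated token found in the table.

    Returns the converted text, the number of changed tokens,
    and the total number of tokens.
    """
    tokens = text.split()
    converted = [table.get(tok, tok) for tok in tokens]
    changed = sum(1 for old, new in zip(tokens, converted) if old != new)
    return " ".join(converted), changed, len(tokens)


def conversion_rate(texts, table):
    """Fraction of all tokens in the corpus that the table converted."""
    changed = total = 0
    for text in texts:
        _, c, n = convert(text, table)
        changed += c
        total += n
    return changed / total if total else 0.0


if __name__ == "__main__":
    for sentence in CORPUS:
        print(convert(sentence, SLANG_TO_FORMAL)[0])
    print(f"converted share: {conversion_rate(CORPUS, SLANG_TO_FORMAL):.2f}")
```

A token-by-token lookup like this cannot handle idioms or context-dependent forms, which is one reason a real converter would need more than a dictionary.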

For supervised training of the sentiment classification model, the researchers used 60,000 labeled Instagram comments with positive, negative, and neutral sentiment. They then leveraged the PSC tool, a FastText model, and a deep LSTM network to achieve 81.91% accuracy on the test set.
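The last stage of that pipeline can be sketched as a forward pass through an LSTM followed by a three-way softmax. The sketch below uses NumPy with random, untrained weights and random vectors standing in for FastText word embeddings. The dimensions, the single-layer design, the label order, and the name `lstm_classify` are all assumptions for illustration, since the paper's actual architecture is not given in this summary.

```python
import numpy as np

LABELS = ["negative", "neutral", "positive"]  # assumed label order


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_classify(embeddings, params):
    """Run one LSTM layer over a sequence of word vectors,
    then apply a softmax over the three sentiment classes."""
    W, U, b, W_out, b_out = params
    hidden = U.shape[1]
    h = np.zeros(hidden)  # hidden state
    c = np.zeros(hidden)  # cell state
    for x in embeddings:
        z = W @ x + U @ h + b                   # all four gates at once
        i = sigmoid(z[:hidden])                 # input gate
        f = sigmoid(z[hidden:2 * hidden])       # forget gate
        o = sigmoid(z[2 * hidden:3 * hidden])   # output gate
        g = np.tanh(z[3 * hidden:])             # candidate cell values
        c = f * c + i * g
        h = o * np.tanh(c)
    logits = W_out @ h + b_out
    exp = np.exp(logits - logits.max())         # numerically stable softmax
    return exp / exp.sum()


# Hypothetical sizes: 8-dimensional embeddings, 16 hidden units.
rng = np.random.default_rng(0)
dim, hidden = 8, 16
params = (
    rng.normal(scale=0.1, size=(4 * hidden, dim)),     # W: input weights
    rng.normal(scale=0.1, size=(4 * hidden, hidden)),  # U: recurrent weights
    np.zeros(4 * hidden),                              # b: gate biases
    rng.normal(scale=0.1, size=(3, hidden)),           # W_out: classifier
    np.zeros(3),                                       # b_out
)

# Five random vectors stand in for the embeddings of a five-word comment.
comment = rng.normal(size=(5, dim))
probs = lstm_classify(comment, params)
print(dict(zip(LABELS, probs.round(3))))
```

In a trained system the weights would be learned from the labeled comments and the input vectors would come from the embedding model, applied to text that the converter has already formalized.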

Critical Analysis

The researchers have made a valuable contribution by addressing the challenge of sentiment analysis for conversational Persian text, which has important applications in areas like social media monitoring and customer service. By developing the PSC tool and integrating it with state-of-the-art deep learning techniques, they have demonstrated a promising approach to improving the performance of these models.

However, the paper does not provide much detail on the specific architecture or training procedure of the deep learning models used. It would be helpful to have a more thorough technical explanation of these components. Additionally, the researchers relied on a single labeled dataset of 60,000 Instagram comments, so further validation on larger and more diverse datasets would be valuable.

Another potential limitation is that the PSC tool may not be able to capture all of the nuance and complexity of conversational Persian language, which can involve idiomatic expressions, context-dependent meanings, and other linguistic phenomena that are difficult to formalize. Exploring ways to better handle these challenges could be an interesting direction for future research.

Conclusion

Overall, this research represents an important step forward in addressing the challenges of sentiment analysis for conversational Persian text. By developing the PSC tool and integrating it with advanced deep learning techniques, the researchers have demonstrated a promising approach that can achieve strong performance on this task. While there are still some areas for potential improvement, this work lays the groundwork for further advancements in understanding the sentiment and meaning behind Persian language conversations.



Related Papers


Persian Slang Text Conversion to Formal and Deep Learning of Persian Short Texts on Social Media for Sentiment Classification

Mohsen Khazeni, Mohammad Heydari, Amir Albadvi

The lack of a suitable tool for the analysis of conversational texts in the Persian language has made various analyses of these texts, including Sentiment Analysis, difficult. In this research, we tried to make these texts easier for the machine to understand by providing PSC (Persian Slang Converter), a tool for converting conversational texts into formal ones, and, by using the most up-to-date deep learning methods along with PSC, to improve the machine's sentiment learning of short Persian texts. More than 10 million unlabeled texts from various social networks and movie subtitles (as conversational texts) and about 10 million news texts (as formal texts) have been used for training unsupervised models and for implementing the formalization tool. 60,000 texts from the comments of Instagram users with positive, negative, and neutral labels are used as supervised data for training the sentiment classification model for short texts. Using the formalization tool, 57% of the words of the conversational corpus were converted. Finally, by using the formalizer, a FastText model, and a deep LSTM network, an accuracy of 81.91% was obtained on the test data.

9/5/2024


FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts

Seyed Mojtaba Sadjadi, Zeinab Rajabi, Leila Rabiei, Mohammad-Shahram Moin

One fundamental task for NLP is to determine the similarity between two texts and evaluate the extent of their likeness. The previous methods for the Persian language have low accuracy and are unable to comprehend the structure and meaning of texts effectively. Additionally, these methods primarily focus on formal texts, but in real-world applications of text processing, there is a need for robust methods that can handle colloquial texts. This requires algorithms that consider the structure and significance of words based on context, rather than just the frequency of words. The lack of a proper dataset for this task in the Persian language makes it important to develop such algorithms and construct a dataset for Persian text. This paper introduces a new transformer-based model to measure semantic similarity between Persian informal short texts from social networks. In addition, a Persian dataset named FarSSiM has been constructed for this purpose, using real data from social networks and manually annotated and verified by a linguistic expert team. The proposed model involves training a large language model using the BERT architecture from scratch. This model, called FarSSiBERT, is pre-trained on approximately 104 million Persian informal short texts from social networks, making it one of a kind in the Persian language. Moreover, a novel specialized informal language tokenizer is provided that not only performs tokenization on formal texts well but also accurately identifies tokens that other Persian tokenizers are unable to recognize. It has been demonstrated that our proposed model outperforms ParsBERT, LaBSE, and multilingual BERT in the Pearson and Spearman's coefficient criteria. Additionally, the pre-trained large language model has great potential for use in other NLP tasks on colloquial text and as a tokenizer for less-known informal words.

7/30/2024


Formality Style Transfer in Persian

Parastoo Falakaflaki, Mehrnoush Shamsfard

This study explores the formality style transfer in Persian, particularly relevant in the face of the increasing prevalence of informal language on digital platforms, which poses challenges for existing Natural Language Processing (NLP) tools. The aim is to transform informal text into formal while retaining the original meaning, addressing both lexical and syntactic differences. We introduce a novel model, Fa-BERT2BERT, based on the Fa-BERT architecture, incorporating consistency learning and gradient-based dynamic weighting. This approach improves the model's understanding of syntactic variations, balancing loss components effectively during training. Our evaluation of Fa-BERT2BERT against existing methods employs new metrics designed to accurately measure syntactic and stylistic changes. Results demonstrate our model's superior performance over traditional techniques across various metrics, including BLEU, BERTScore, ROUGE-L, and the proposed metrics, underscoring its ability to adeptly navigate the complexities of Persian language style transfer. This study significantly contributes to Persian language processing by enhancing the accuracy and functionality of NLP models and thereby supports the development of more efficient and reliable NLP applications, capable of handling language style transformation effectively, thereby streamlining content moderation, enhancing data mining results, and facilitating cross-cultural communication.

6/4/2024


Investigating Persuasion Techniques in Arabic: An Empirical Study Leveraging Large Language Models

Abdurahmman Alzahrani, Eyad Babkier, Faisal Yanbaawi, Firas Yanbaawi, Hassan Alhuzali

In the current era of digital communication and widespread use of social media, it is crucial to develop an understanding of persuasive techniques employed in written text. This knowledge is essential for effectively discerning accurate information and making informed decisions. To address this need, this paper presents a comprehensive empirical study focused on identifying persuasive techniques in Arabic social media content. To achieve this objective, we utilize Pre-trained Language Models (PLMs) and leverage the ArAIEval dataset, which encompasses two tasks: binary classification to determine the presence or absence of persuasion techniques, and multi-label classification to identify the specific types of techniques employed in the text. Our study explores three different learning approaches by harnessing the power of PLMs: feature extraction, fine-tuning, and prompt engineering techniques. Through extensive experimentation, we find that the fine-tuning approach yields the highest results on the aforementioned dataset, achieving an f1-micro score of 0.865 and an f1-weighted score of 0.861. Furthermore, our analysis sheds light on an interesting finding. While the performance of the GPT model is relatively lower compared to the other approaches, we have observed that by employing few-shot learning techniques, we can enhance its results by up to 20%. This offers promising directions for future research and exploration in this topic. (Upon acceptance, the source code will be released on GitHub.)

5/22/2024