FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts

Read original: arXiv:2407.19173 - Published 7/30/2024 by Seyed Mojtaba Sadjadi, Zeinab Rajabi, Leila Rabiei, Mohammad-Shahram Moin

Overview

Determining the similarity between two texts is a fundamental task in natural language processing (NLP).
Previous methods for the Persian language have had low accuracy and struggled to understand the structure and meaning of texts effectively.
These methods have focused on formal texts, but there is a need for robust methods that can handle colloquial texts.
This requires algorithms that consider the structure and significance of words based on context, rather than just the frequency of words.
The lack of a proper dataset for this task in the Persian language makes it important to develop such algorithms and construct a dataset.

Plain English Explanation

One of the core tasks in natural language processing (NLP) is figuring out how similar two pieces of text are. Previous methods for doing this in Persian haven't been very accurate and have had trouble capturing the structure and meaning of the texts.

These old methods have mostly focused on formal, proper texts, but in the real world, we need ways to handle more casual, conversational language that people use on social media and in everyday communications. To do this, we need algorithms that look at not just how often certain words appear, but also how the words are used in context and what they really mean.
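To make the limits of word-frequency methods concrete, here is a minimal sketch of a bag-of-words cosine similarity. It is a toy illustration in English, not a method from the paper; the function name `bow_cosine` and the example sentences are made up for this purpose.

```python
import math
from collections import Counter


def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words that appear in both texts.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Same meaning, no shared words: the frequency-based score is zero.
paraphrase_score = bow_cosine("that film was great", "awesome movie")

# Opposite meaning, mostly shared words: the frequency-based score is high.
contrast_score = bow_cosine("i like this phone", "i hate this phone")

print(paraphrase_score, contrast_score)
```

The paraphrase pair scores lower than the contradictory pair, which is the kind of failure a context-aware model is meant to avoid. Informal text makes the problem worse, because slang and nonstandard spellings reduce word overlap even further.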

Since there hasn't been a good dataset available for this task in Persian, it's important to develop new methods and also build up a dataset that can be used to train and test them.

Technical Explanation

This paper introduces a new transformer-based model to measure the semantic similarity between Persian informal short texts from social networks. The researchers also constructed a new Persian dataset called FarSSiM, using real data from social networks and manually annotated by linguistic experts.

The proposed model involves training a large language model from scratch using the BERT architecture. This model, called FarSSiBERT, is pre-trained on around 104 million Persian informal short texts from social media, making it unique for the Persian language. The researchers also developed a specialized informal language tokenizer that can accurately identify tokens that other Persian tokenizers miss.
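This summary does not say how the encoder's outputs are turned into a similarity score. One common approach with BERT-style models, shown here purely as an assumption, is to average the per-token vectors into a sentence vector and compare two sentences by cosine similarity. The vectors below are made-up three-dimensional stand-ins for real encoder outputs, and `mean_pool` and `cosine` are illustrative helper names.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector]) -> Vector:
    """Average per-token vectors into one fixed-size sentence vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "token embeddings" (hypothetical values, not real model outputs).
sentence_a = [[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]]  # two tokens
sentence_b = [[2.0, 2.0, 2.0]]                   # one token
sentence_c = [[1.0, -1.0, 0.0]]                  # one token

vec_a = mean_pool(sentence_a)
vec_b = mean_pool(sentence_b)
vec_c = mean_pool(sentence_c)

sim_ab = cosine(vec_a, vec_b)  # vectors point the same way
sim_ac = cosine(vec_a, vec_c)  # vectors are orthogonal

print(sim_ab, sim_ac)
```

Because cosine similarity ignores vector length, sentences of different lengths can still be compared on the same scale.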

The paper shows that the FarSSiBERT model outperforms other models such as ParsBERT, LaBSE, and multilingual BERT at measuring semantic similarity, as judged by the Pearson and Spearman correlation coefficients. The pre-trained language model also has great potential for use in other NLP tasks on casual, colloquial Persian text, as well as for tokenizing informal words that other tools struggle with.
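The two evaluation measures named above can be sketched in plain Python. Pearson's coefficient measures linear agreement between model scores and human scores, while Spearman's coefficient is Pearson's coefficient applied to ranks, so it only checks whether the ordering agrees. The `gold` and `predicted` values are hypothetical and are not results from the paper.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human similarity labels and model similarity scores.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [0.1, 0.2, 0.3, 0.4, 0.9]

pearson_r = pearson(gold, predicted)
spearman_rho = spearman(gold, predicted)

print(pearson_r, spearman_rho)
```

In this toy data the model ranks every pair in the same order as the annotators, so the Spearman value is at its maximum, while the Pearson value is lower because the scores are not spaced linearly. Reporting both, as the paper does, separates ranking quality from calibration of the raw scores.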

Critical Analysis

The researchers acknowledge that their dataset, while a valuable contribution, may have some limitations in terms of the diversity of the social media sources and text types included. There could be opportunities to expand the dataset further in the future.

Additionally, while the FarSSiBERT model has shown promising results, the researchers don't provide a detailed error analysis or discussion of failure cases. It would be helpful to understand the specific challenges the model still faces in order to identify areas for improvement.

Overall, this research represents an important step forward in developing robust NLP capabilities for the Persian language, particularly for handling informal, conversational text. The dataset and model can serve as a valuable foundation for future work in this area.

Conclusion

This paper presents a new transformer-based model and dataset for measuring semantic similarity in Persian informal short texts, which is a critical task for natural language processing. The model, called FarSSiBERT, is pre-trained on a large corpus of Persian social media data and outperforms existing models.

The researchers have made a valuable contribution by addressing the need for techniques that can handle colloquial language, rather than just formal text. The pre-trained model and associated tokenizer also have potential applications beyond just semantic similarity, such as other NLP tasks on casual Persian text. Overall, this work represents an important step forward for Persian NLP.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts

Seyed Mojtaba Sadjadi, Zeinab Rajabi, Leila Rabiei, Mohammad-Shahram Moin

One fundamental task for NLP is to determine the similarity between two texts and evaluate the extent of their likeness. The previous methods for the Persian language have low accuracy and are unable to comprehend the structure and meaning of texts effectively. Additionally, these methods primarily focus on formal texts, but in real-world applications of text processing, there is a need for robust methods that can handle colloquial texts. This requires algorithms that consider the structure and significance of words based on context, rather than just the frequency of words. The lack of a proper dataset for this task in the Persian language makes it important to develop such algorithms and construct a dataset for Persian text. This paper introduces a new transformer-based model to measure semantic similarity between Persian informal short texts from social networks. In addition, a Persian dataset named FarSSiM has been constructed for this purpose, using real data from social networks and manually annotated and verified by a linguistic expert team. The proposed model involves training a large language model using the BERT architecture from scratch. This model, called FarSSiBERT, is pre-trained on approximately 104 million Persian informal short texts from social networks, making it one of a kind in the Persian language. Moreover, a novel specialized informal language tokenizer is provided that not only performs tokenization on formal texts well but also accurately identifies tokens that other Persian tokenizers are unable to recognize. It has been demonstrated that our proposed model outperforms ParsBERT, LaBSE, and multilingual BERT in the Pearson and Spearman's coefficient criteria. Additionally, the pre-trained large language model has great potential for use in other NLP tasks on colloquial text and as a tokenizer for less-known informal words.

7/30/2024

Formality Style Transfer in Persian

Parastoo Falakaflaki, Mehrnoush Shamsfard

This study explores formality style transfer in Persian, particularly relevant in the face of the increasing prevalence of informal language on digital platforms, which poses challenges for existing Natural Language Processing (NLP) tools. The aim is to transform informal text into formal text while retaining the original meaning, addressing both lexical and syntactic differences. We introduce a novel model, Fa-BERT2BERT, based on the Fa-BERT architecture, incorporating consistency learning and gradient-based dynamic weighting. This approach improves the model's understanding of syntactic variations, balancing loss components effectively during training. Our evaluation of Fa-BERT2BERT against existing methods employs new metrics designed to accurately measure syntactic and stylistic changes. Results demonstrate our model's superior performance over traditional techniques across various metrics, including BLEU, BERTScore, ROUGE-L, and the proposed metrics, underscoring its ability to adeptly navigate the complexities of Persian language style transfer. This study significantly contributes to Persian language processing by enhancing the accuracy and functionality of NLP models. It thereby supports the development of more efficient and reliable NLP applications that handle language style transformation effectively, streamlining content moderation, enhancing data mining results, and facilitating cross-cultural communication.

6/4/2024

PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity

Mohammad Abdous, Poorya Piroozfar, Behrouz Minaei Bidgoli

One of the components of natural language processing that has received a lot of investigation recently is semantic textual similarity. In computational linguistics and natural language processing, assessing the semantic similarity of words, phrases, paragraphs, and texts is crucial. Semantic similarity is the task of calculating the degree of semantic resemblance between two texts, paragraphs, or phrases, in either a monolingual or a cross-lingual setting. Cross-lingual semantic similarity requires corpora in which there are sentence pairs in both the source and target languages with a degree of semantic similarity between them. Many existing cross-lingual semantic similarity models rely on machine translation because no cross-lingual semantic similarity dataset is available, and the propagation of machine translation errors reduces the accuracy of the model. On the other hand, when we want to use semantic similarity features for machine translation, the same machine translations should not be used for semantic similarity. For Persian, which is one of the low-resource languages, no effort has been made in this regard, and the need for a model that can understand the context of two languages is felt more than ever. In this article, a corpus of semantic textual similarity between sentences in Persian and English has been produced for the first time with the help of linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). This corpus contains 5375 sentence pairs. Also, different transformer-based models have been fine-tuned using this dataset. The results show that using the PESTS dataset, the Pearson correlation of the XLM-RoBERTa model increases from 85.87% to 95.62%.

9/6/2024

Persian Slang Text Conversion to Formal and Deep Learning of Persian Short Texts on Social Media for Sentiment Classification

Mohsen Khazeni, Mohammad Heydari, Amir Albadvi

The lack of a suitable tool for the analysis of conversational texts in the Persian language has made various analyses of these texts, including sentiment analysis, difficult. In this research, we tried to make these texts easier for the machine to understand by providing PSC (Persian Slang Converter), a tool for converting conversational texts into formal ones, and to improve sentiment learning on short Persian texts by combining the most up-to-date deep learning methods with PSC. More than 10 million unlabeled texts from various social networks and movie subtitles (as conversational texts) and about 10 million news texts (as formal texts) have been used for training unsupervised models and formal implementation of the tool. 60,000 texts from the comments of Instagram users with positive, negative, and neutral labels are used as supervised data for training the emotion classification model for short texts. Using the formal tool, 57% of the words of the conversational corpus were converted. Finally, by using the formalizer, the FastText model, and a deep LSTM network, an accuracy of 81.91 was obtained on the test data.

9/5/2024