PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity

Read original: arXiv:2305.07893 - Published 9/6/2024 by Mohammad Abdous, Poorya Piroozfar, Behrouz Minaei Bidgoli

Overview

Semantic textual similarity is an important task in natural language processing.
Assessing the semantic similarity between words, phrases, and texts is crucial.
Cross-lingual semantic similarity requires parallel corpora with semantically similar sentence pairs.
Existing models often rely on machine translation, which can introduce errors.
There is a need for new approaches, especially for low-resource languages like Persian.

Plain English Explanation

Semantic textual similarity is the task of determining how similar the meaning of two pieces of text is, whether they are single words, short phrases, or entire paragraphs. This is an important capability for natural language processing systems, as it allows them to understand the conceptual relationships between different textual elements.

One particularly challenging aspect of semantic similarity is cross-lingual similarity, where the texts being compared are in different languages. To do this effectively, you need a dataset of sentence pairs, with one sentence in each language, whose semantic similarity has been rated. However, such datasets are scarce, especially for lower-resource languages like Persian.
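To make the idea of comparing texts across languages concrete, here is a minimal sketch of one common scoring approach: a multilingual encoder maps both sentences to vectors in a shared space, and the cosine of the angle between those vectors serves as the similarity score. The summary does not say that PESTS models score pairs this way, so treat this as an illustration only. The three short vectors are made-up stand-ins for real sentence embeddings, and all names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings", invented for illustration.
# Real sentence encoders produce vectors with hundreds of dimensions.
persian_vec = [0.20, 0.70, 0.10]     # stand-in for a Persian sentence
english_vec = [0.25, 0.65, 0.05]     # stand-in for its English paraphrase
unrelated_vec = [0.90, -0.10, 0.40]  # stand-in for an unrelated sentence

close_score = cosine_similarity(persian_vec, english_vec)
far_score = cosine_similarity(persian_vec, unrelated_vec)
print(f"paraphrase pair: {close_score:.3f}, unrelated pair: {far_score:.3f}")
```

A pair with similar meaning should score closer to 1 than an unrelated pair, regardless of which language each sentence is written in, provided the encoder places both languages in the same vector space.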

Many existing cross-lingual models rely on machine translation to bridge the language gap, but this can introduce errors that reduce the accuracy of the similarity assessment. The authors of this paper sought to address this issue by creating a new dataset specifically for Persian-English semantic similarity, called PESTS (Persian English Semantic Textual Similarity).

Technical Explanation

The researchers created a corpus of 5,375 sentence pairs in Persian and English, with expert linguists evaluating the semantic similarity of each pair. This provides a high-quality dataset for training and evaluating cross-lingual semantic similarity models.

The researchers then fine-tuned several transformer-based language models on the PESTS dataset. They report that fine-tuning XLM-RoBERTa on PESTS raised its Pearson correlation on the semantic similarity task from 85.87% to 95.62%.
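The Pearson correlation reported here measures how closely a model's predicted similarity scores track the scores assigned by human annotators. The sketch below computes it from scratch for a handful of pairs. The gold and predicted scores are invented for illustration, and the 0 to 5 scale is an assumption borrowed from common STS benchmarks rather than something stated in this summary. A figure such as 95.62% presumably corresponds to a coefficient of about 0.9562.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient between two score lists."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean (unnormalised covariance
    # and variances; the 1/n factors cancel in the final ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: annotator scores and model predictions for five pairs.
gold_scores = [0.0, 1.5, 3.0, 4.5, 5.0]
model_scores = [0.4, 1.2, 2.8, 4.1, 4.9]

r = pearson_correlation(gold_scores, model_scores)
print(f"Pearson r = {r:.4f} ({100 * r:.2f}%)")
```

Because Pearson correlation is invariant to linear rescaling, a model does not need to reproduce the annotators' exact scale to score well; it only needs its predictions to rise and fall linearly with the gold scores.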

Critical Analysis

The creation of the PESTS dataset is a valuable contribution to the field of cross-lingual natural language processing. By providing a high-quality resource specifically for Persian-English semantic similarity, the researchers have addressed a gap in the available tools and resources for working with low-resource languages.

However, the paper does not discuss the potential limitations of the dataset or the fine-tuned models. It would be helpful to know more about the diversity of the sentence pairs in the corpus, the backgrounds and expertise of the linguists who annotated the data, and any biases or skew in the data that could affect the model's performance.

Additionally, the researchers only evaluated the performance of the fine-tuned models on the PESTS dataset itself. It would be informative to see how the models perform on other cross-lingual semantic similarity tasks or datasets to better understand their generalization capabilities.

Conclusion

This paper presents an important step forward in the field of cross-lingual natural language processing by introducing the PESTS dataset and demonstrating the effectiveness of fine-tuning transformer-based models on this resource. The significant improvement in the Pearson correlation for the XLM-RoBERTa model highlights the value of high-quality, language-specific datasets for training accurate semantic similarity systems.

The PESTS dataset and the fine-tuned models have the potential to enable more robust and reliable cross-lingual applications, such as cross-lingual paraphrase identification and semantic role labeling, which are crucial for tasks like machine translation, information retrieval, and text understanding.

Related Papers

PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity

Mohammad Abdous, Poorya Piroozfar, Behrouz Minaei Bidgoli

One of the components of natural language processing that has received a lot of investigation recently is semantic textual similarity. In computational linguistics and natural language processing, assessing the semantic similarity of words, phrases, paragraphs, and texts is crucial. Calculating the degree of semantic resemblance between two textual pieces, paragraphs, or phrases provided in both monolingual and cross-lingual versions is known as semantic similarity. Cross-lingual semantic similarity requires corpora in which there are sentence pairs in both the source and target languages with a degree of semantic similarity between them. Many existing cross-lingual semantic similarity models use machine translation because no cross-lingual semantic similarity dataset is available, and the propagation of machine translation errors reduces the accuracy of the model. On the other hand, when we want to use semantic similarity features for machine translation, the same machine translations should not be used for semantic similarity. For Persian, which is one of the low-resource languages, no effort has been made in this regard and the need for a model that can understand the context of two languages is felt more than ever. In this article, the corpus of semantic textual similarity between sentences in Persian and English languages has been produced for the first time by using linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). This corpus contains 5375 sentence pairs. Also, different models based on transformers have been fine-tuned using this dataset. The results show that using the PESTS dataset, the Pearson correlation of the XLM-RoBERTa model increases from 85.87% to 95.62%.

9/6/2024

FarSSiBERT: A Novel Transformer-based Model for Semantic Similarity Measurement of Persian Social Networks Informal Texts

Seyed Mojtaba Sadjadi, Zeinab Rajabi, Leila Rabiei, Mohammad-Shahram Moin

One fundamental task for NLP is to determine the similarity between two texts and evaluate the extent of their likeness. The previous methods for the Persian language have low accuracy and are unable to comprehend the structure and meaning of texts effectively. Additionally, these methods primarily focus on formal texts, but in real-world applications of text processing, there is a need for robust methods that can handle colloquial texts. This requires algorithms that consider the structure and significance of words based on context, rather than just the frequency of words. The lack of a proper dataset for this task in the Persian language makes it important to develop such algorithms and construct a dataset for Persian text. This paper introduces a new transformer-based model to measure semantic similarity between Persian informal short texts from social networks. In addition, a Persian dataset named FarSSiM has been constructed for this purpose, using real data from social networks and manually annotated and verified by a linguistic expert team. The proposed model involves training a large language model using the BERT architecture from scratch. This model, called FarSSiBERT, is pre-trained on approximately 104 million Persian informal short texts from social networks, making it one of a kind in the Persian language. Moreover, a novel specialized informal language tokenizer is provided that not only performs tokenization on formal texts well but also accurately identifies tokens that other Persian tokenizers are unable to recognize. It has been demonstrated that our proposed model outperforms ParsBERT, LaBSE, and multilingual BERT in the Pearson and Spearman's coefficient criteria. Additionally, the pre-trained large language model has great potential for use in other NLP tasks on colloquial text and as a tokenizer for less-known informal words.

7/30/2024

Linear Cross-Lingual Mapping of Sentence Embeddings

Oleg Vasilyev, Fumika Isono, John Bohannon

Semantics of a sentence is defined with much less ambiguity than semantics of a single word, and we assume that it should be better preserved by translation to another language. If multilingual sentence embeddings intend to represent sentence semantics, then the similarity between embeddings of any two sentences must be invariant with respect to translation. Based on this suggestion, we consider a simple linear cross-lingual mapping as a possible improvement of the multilingual embeddings. We also consider deviation from orthogonality conditions as a measure of deficiency of the embeddings.

6/28/2024

A New Method for Cross-Lingual-based Semantic Role Labeling

Mohammad Ebrahimi, Behrouz Minaei Bidgoli, Nasim Khozouei

Semantic role labeling is a crucial task in natural language processing, enabling better comprehension of natural language. However, the lack of annotated data in multiple languages has posed a challenge for researchers. To address this, a deep learning algorithm based on model transfer has been proposed. The algorithm utilizes a dataset consisting of the English portion of CoNLL2009 and a corpus of semantic roles in Persian. To optimize the efficiency of training, only ten percent of the training data from each language is used. The results of the proposed model demonstrate significant improvements compared to Niksirt et al.'s model. In monolingual mode, the proposed model achieved a 2.05 percent improvement on F1-score, while in cross-lingual mode, the improvement was even more substantial, reaching 6.23 percent. Worth noting is that the compared model only trained two of the four stages of semantic role labeling and employed golden data for the remaining two stages. This suggests that the actual superiority of the proposed model surpasses the reported numbers by a significant margin. The development of cross-lingual methods for semantic role labeling holds promise, particularly in addressing the scarcity of annotated data for various languages. These advancements pave the way for further research in understanding and processing natural language across different linguistic contexts.

8/29/2024