SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages

Read original: arXiv:2402.08638 - Published 6/3/2024 by Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Abinew Ali Ayele, Pavan Baswani and 17 others

Overview

Explores and quantifies semantic relatedness, a key aspect of representing language and its applications across NLP tasks
Presents a new semantic relatedness dataset, SemRel, annotated by native speakers across 13 languages from diverse language families
Focuses on languages predominantly spoken in Africa and Asia, regions with limited NLP resources
Each instance is a sentence pair with a score representing the degree of semantic textual relatedness

Plain English Explanation

Semantic relatedness, the degree to which words or sentences are connected in meaning, is crucial for language processing tasks like translation, summarization, and question answering. While previous research mostly studied semantic similarity, often only in English, this study examines the broader concept of relatedness across a diverse set of 13 languages, including Afrikaans, Arabic, Amharic, Hindi, and Telugu.

The researchers created a new dataset called SemRel, in which native speakers judged how related the meanings of pairs of sentences were. Each pair's score was obtained using a comparative annotation framework. The dataset covers languages from five different language families, focusing on languages predominantly spoken in Africa and Asia, regions that have fewer existing language resources for AI. The researchers used this dataset to benchmark how well AI models can capture semantic relatedness in these languages, which is an important step for building more inclusive and multilingual AI systems.

Technical Explanation

The paper presents a new dataset, SemRel, for evaluating semantic relatedness across 13 diverse languages. Semantic relatedness is a broader concept than semantic similarity, as it encompasses not just how alike two words or sentences are, but how conceptually connected they are.

To create SemRel, the researchers recruited native speakers to annotate sentence pairs, so that each pair carries a score representing its degree of semantic relatedness. The dataset covers 13 languages from five distinct language families: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages are predominantly spoken in Africa and Asia, regions with relatively limited NLP resources.
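To make the instance format concrete, here is a minimal sketch of one sentence pair and a crude lexical-overlap predictor. The field names, the language code, the assumed score range of 0 to 1, the example sentences, and the example score are all invented for illustration. The Dice overlap function is a stand-in baseline, not necessarily one used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One hypothetical instance: two sentences plus a relatedness score."""
    language: str     # e.g. "eng"; the code scheme is an assumption
    sentence_a: str
    sentence_b: str
    score: float      # assumed to lie in [0, 1]; higher means more related


def dice_overlap(a: str, b: str) -> float:
    """Dice coefficient over lowercased whitespace tokens (a crude baseline)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty sentences: define the overlap as zero
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


# Invented example pair; the gold score 0.8 is illustrative only.
pair = SentencePair("eng", "A dog runs in the park", "A dog sleeps in the park", 0.8)
predicted = dice_overlap(pair.sentence_a, pair.sentence_b)
```

Whitespace tokenization is a simplification. A real system would need language-appropriate tokenization, and word overlap alone cannot capture pairs that are related without sharing vocabulary.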

The researchers describe their data collection and annotation processes, highlighting challenges faced when building such a multilingual dataset. They also provide baseline experiments, demonstrating the utility of the SemRel datasets for evaluating semantic relatedness models in a multilingual setting.
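One way to score a model against such a dataset is to check whether it ranks pairs in the same order as the human annotators. The sketch below assumes Spearman rank correlation as the metric, which is what one of the related shared-task papers listed further down reports. The function names and the sample scores are invented for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


gold = [0.9, 0.1, 0.5, 0.7]        # invented human relatedness scores
predicted = [0.8, 0.2, 0.6, 0.4]   # invented model scores for the same pairs
rho = spearman(gold, predicted)
```

Because only the ordering matters, a model does not need to reproduce the annotators' numeric scale to score well. It only needs to place more related pairs above less related ones.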

Critical Analysis

The SemRel dataset represents an important contribution to the field of multilingual NLP by providing a benchmark for evaluating semantic relatedness models across a diverse set of languages. However, the authors acknowledge that the dataset is limited to a relatively small number of sentence pairs per language, which may impact the robustness of the evaluations.

Additionally, the paper does not provide a detailed analysis of the differences in semantic relatedness patterns across the 13 languages. Further research could explore how cultural, grammatical, and other linguistic factors influence the way speakers perceive semantic relatedness, and how this affects the performance of NLP models.

Overall, the SemRel dataset is a valuable resource for advancing multilingual NLP research, but continued efforts are needed to expand and deepen our understanding of semantic relatedness in a global context.

Conclusion

This paper presents a new semantic relatedness dataset, SemRel, annotated by native speakers across 13 languages from diverse language families. The dataset focuses on languages predominantly spoken in Africa and Asia, regions with limited NLP resources, making it a valuable contribution to the field of multilingual language processing.

The SemRel dataset can be used to benchmark the performance of NLP models in understanding semantic relatedness, which is crucial for tasks like translation, summarization, and question answering. By focusing on a diverse set of languages, this research helps advance the development of more inclusive and multilingual AI systems that can better represent and process human language in all its complexity.

Related Papers

SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages

Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Abinew Ali Ayele, Pavan Baswani, Meriem Beloucif, Chris Biemann, Sofia Bourhim, Christine De Kock, Genet Shanko Dekebo, Oumaima Hourrane, Gopichand Kanumolu, Lokesh Madasu, Samuel Rutunda, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Hailegnaw Getaneh Tilaye, Krishnapriya Vishnubhotla, Genta Winata, Seid Muhie Yimam, Saif M. Mohammad

Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia -- regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.

6/3/2024

SemEval Task 1: Semantic Textual Relatedness for African and Asian Languages

Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Meriem Beloucif, Christine De Kock, Oumaima Hourrane, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Krishnapriya Vishnubhotla, Seid Muhie Yimam, Saif M. Mohammad

We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia -- regions characterised by the relatively limited availability of NLP resources. Each instance in the datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. Participating systems were asked to rank sentence pairs by their closeness in meaning (i.e., their degree of semantic relatedness) in the 14 languages in three main tracks: (a) supervised, (b) unsupervised, and (c) crosslingual. The task attracted 163 participants. We received 70 submissions in total (across all tasks) from 51 different teams, and 38 system description papers. We report on the best-performing systems as well as the most common and the most effective approaches for the three different tracks.

4/19/2024

NLU-STR at SemEval-2024 Task 1: Generative-based Augmentation and Encoder-based Scoring for Semantic Textual Relatedness

Sanad Malaysha, Mustafa Jarrar, Mohammed Khalilia

Semantic textual relatedness is a broader concept of semantic similarity. It measures the extent to which two chunks of text convey similar meaning or topics, or share related concepts or contexts. This notion of relatedness can be applied in various applications, such as document clustering and summarizing. SemRel-2024, a shared task in SemEval-2024, aims at reducing the gap in the semantic relatedness task by providing datasets for fourteen languages and dialects including Arabic. This paper reports on our participation in Track A (Algerian and Moroccan dialects) and Track B (Modern Standard Arabic). A BERT-based model is augmented and fine-tuned for regression scoring in supervised track (A), while BERT-based cosine similarity is employed for unsupervised track (B). Our system ranked 1st in SemRel-2024 for MSA with a Spearman correlation score of 0.49. We ranked 5th for Moroccan and 12th for Algerian with scores of 0.83 and 0.53, respectively.

5/2/2024

Multilingual Evaluation of Semantic Textual Relatedness

Sharvi Endait, Srushti Sonavane, Ridhima Sinare, Pritika Rohera, Advait Naik, Dipali Kadam

The explosive growth of online content demands robust Natural Language Processing (NLP) techniques that can capture nuanced meanings and cultural context across diverse languages. Semantic Textual Relatedness (STR) goes beyond superficial word overlap, considering linguistic elements and non-linguistic factors like topic, sentiment, and perspective. Despite its pivotal role, prior NLP research has predominantly focused on English, limiting its applicability across languages. Addressing this gap, our paper dives into capturing deeper connections between sentences beyond simple word overlap. Going beyond English-centric NLP research, we explore STR in Marathi, Hindi, Spanish, and English, unlocking the potential for information retrieval, machine translation, and more. Leveraging the SemEval-2024 shared task, we explore various language models across three learning paradigms: supervised, unsupervised, and cross-lingual. Our comprehensive methodology gains promising results, demonstrating the effectiveness of our approach. This work aims to not only showcase our achievements but also inspire further research in multilingual STR, particularly for low-resourced languages.

4/16/2024