MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness

2404.02570

Published 4/4/2024 by Shijia Zhou, Huangyan Shan, Barbara Plank, Robert Litschko


Abstract

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness (STR), on Track C: Cross-lingual. The task aims to detect semantic relatedness of two sentences in a given target language without access to direct supervision (i.e. zero-shot cross-lingual transfer). To this end, we focus on different source language selection strategies on two different pre-trained language models: XLM-R and Furina. We experiment with 1) single-source transfer and select source languages based on typological similarity, 2) augmenting English training data with the two nearest-neighbor source languages, and 3) multi-source transfer where we compare selecting on all training languages against languages from the same family. We further study machine translation-based data augmentation and the impact of script differences. Our submission achieved first place on the C8 (Kinyarwanda) test set.


Overview

The paper presents the MaiNLP system for SemEval-2024 Task 1 (Semantic Textual Relatedness), Track C: Cross-lingual.
The key focus is how the choice of source (training) languages affects zero-shot transfer to a target language that has no labeled training data.
The paper covers source language selection strategies, two pre-trained multilingual models (XLM-R and Furina), machine translation-based data augmentation, and the impact of script differences.

Plain English Explanation

The paper looks at the challenge of judging how closely related two sentences are in a language for which no labeled training data is available. For example, a model might be trained on scored sentence pairs in English or Hindi and then asked to score sentence pairs in Kinyarwanda. Doing this well can be useful for applications like multilingual search and machine translation, especially for languages with limited NLP resources.

The researchers developed a system to tackle this cross-lingual transfer problem as part of a shared task, SemEval-2024 Task 1. Rather than just evaluating the overall performance of their system, they wanted to dig deeper and analyze how the choice of source language (the language of the training data) impacts the results on the target language.

They ran experiments using different source languages and measured how well the resulting models scored sentence relatedness in target languages they had not been trained on. The insights they gained can help guide the design of more effective cross-lingual language understanding systems in the future.

Technical Explanation

The paper describes the setup for SemEval-2024 Task 1, Track C (cross-lingual). The task involves predicting the semantic relatedness of two sentences written in the same target language, without any labeled training data in that language (zero-shot cross-lingual transfer). The researchers therefore train a model on labeled sentence pairs from one or more source languages and apply it to sentence pairs in the target language.

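To investigate the impact of source language selection, the researchers compared three strategies: single-source transfer, with source languages chosen by typological similarity to the target; augmenting English training data with the two nearest-neighbor source languages; and multi-source transfer, comparing training on all available languages against training only on languages from the same family. They also studied machine translation-based data augmentation and the impact of script differences. Performance is measured by how well predicted scores correlate with human-annotated relatedness scores; the shared task's official metric is Spearman rank correlation.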
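The idea of picking source languages by typological similarity can be illustrated with a small nearest-neighbor lookup. This is a minimal sketch, not the authors' implementation: the feature vectors, the language codes, and the `nearest_sources` helper are all hypothetical, and cosine similarity is assumed as the similarity measure purely for illustration.

```python
import math

# Hypothetical toy typological feature vectors (illustrative values only).
# In practice such vectors would come from a typological resource.
TYPOLOGY = {
    "eng": [1.0, 0.0, 1.0, 0.0],
    "afr": [1.0, 0.0, 1.0, 1.0],
    "hin": [0.0, 1.0, 0.0, 1.0],
    "mar": [0.0, 1.0, 0.0, 0.0],
    "ind": [1.0, 0.0, 0.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_sources(target, candidates, k=2):
    """Return the k candidate source languages most similar to the target.

    The target itself is excluded, since zero-shot transfer assumes no
    labeled training data in the target language.
    """
    ranked = sorted(
        (c for c in candidates if c != target),
        key=lambda c: cosine(TYPOLOGY[target], TYPOLOGY[c]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Pick the two nearest-neighbor source languages for a target language.
    print(nearest_sources("afr", list(TYPOLOGY), k=2))
```

With `k=1` this corresponds to single-source transfer; with `k=2` the selected languages could be added to English training data, mirroring the second strategy in the abstract.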

The models are built on two pre-trained multilingual Transformer encoders, XLM-R and Furina. The encoder processes the input sentence pair, and a prediction layer on top outputs the relatedness score. The model is fine-tuned on annotated sentence pairs in the source languages and then applied to the target language without further supervision.
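Once a model like this produces relatedness scores for target-language sentence pairs, they can be compared against human annotations. Because the shared task asks systems to rank sentence pairs by relatedness, a rank correlation such as Spearman's is a natural choice. The sketch below is a dependency-free illustration of that comparison across source languages; the scores and the `preds_by_source` dictionary are made-up example data, not results from the paper.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up gold scores and predictions from models trained on two
    # hypothetical source languages.
    gold = [0.1, 0.4, 0.35, 0.9]
    preds_by_source = {
        "eng": [0.2, 0.5, 0.3, 0.8],
        "hin": [0.9, 0.1, 0.4, 0.2],
    }
    for source, preds in preds_by_source.items():
        print(source, round(spearman(gold, preds), 3))
```

Running the same comparison for each candidate source language gives one score per source, which is the kind of table a source-selection analysis would compare.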

Critical Analysis

The paper provides a thorough exploration of how the choice of source language can influence the performance of cross-lingual textual relatedness models, considering typological similarity, language family, and script differences. However, further analysis of other factors, such as training data quality or biases in the pre-trained models, could offer additional insights into why certain source languages work better than others.

Additionally, the paper focuses solely on the SemEval-2024 dataset and task. While this offers a standardized benchmark, expanding the evaluation to other cross-lingual datasets could strengthen the generalizability of the findings.

Finally, the paper does not discuss potential real-world applications or limitations of the cross-lingual relatedness task itself. Considering the practical implications and challenges of deploying such models in production environments could enhance the overall contribution of the work.

Conclusion

This paper makes an important contribution to understanding the nuances of cross-lingual textual relatedness. By systematically investigating the impact of source language selection, the researchers provide valuable insights that can guide the development of more robust and effective multilingual language understanding systems. The findings could have applications in areas like cross-lingual search, machine translation, and multilingual content recommendation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation

Shubhashis Roy Dipta, Sai Vallurupalli

The aim of SemEval-2024 Task 1, Semantic Textual Relatedness for African and Asian Languages, is to develop models for identifying semantic textual relatedness (STR) between two sentences using multiple languages (14 African and Asian languages) and settings (supervised, unsupervised, and cross-lingual). Large language models (LLMs) have shown impressive performance on several natural language understanding tasks such as multilingual machine translation (MMT), semantic similarity (STS), and encoding sentence embeddings. Using a combination of LLMs that perform well on these tasks, we developed two STR models, TranSem and FineSem, for the supervised and cross-lingual settings. We explore the effectiveness of several training methods and the usefulness of machine translation. We find that direct fine-tuning on the task is comparable to using sentence embeddings and translating to English leads to better performance for some languages. In the supervised setting, our model performance is better than the official baseline for 3 languages with the remaining 4 performing on par. In the cross-lingual setting, our model performance is better than the baseline for 3 languages (leading to 1st place for Afrikaans and 2nd place for Indonesian), is on par for 2 languages and performs poorly on the remaining 7 languages. Our code is publicly available at https://github.com/dipta007/SemEval24-Task8.

4/15/2024

cs.CL cs.AI cs.LG


Multilingual Evaluation of Semantic Textual Relatedness

Sharvi Endait, Srushti Sonavane, Ridhima Sinare, Pritika Rohera, Advait Naik, Dipali Kadam

The explosive growth of online content demands robust Natural Language Processing (NLP) techniques that can capture nuanced meanings and cultural context across diverse languages. Semantic Textual Relatedness (STR) goes beyond superficial word overlap, considering linguistic elements and non-linguistic factors like topic, sentiment, and perspective. Despite its pivotal role, prior NLP research has predominantly focused on English, limiting its applicability across languages. Addressing this gap, our paper dives into capturing deeper connections between sentences beyond simple word overlap. Going beyond English-centric NLP research, we explore STR in Marathi, Hindi, Spanish, and English, unlocking the potential for information retrieval, machine translation, and more. Leveraging the SemEval-2024 shared task, we explore various language models across three learning paradigms: supervised, unsupervised, and cross-lingual. Our comprehensive methodology gains promising results, demonstrating the effectiveness of our approach. This work aims to not only showcase our achievements but also inspire further research in multilingual STR, particularly for low-resourced languages.

4/16/2024

cs.CL


SemEval Task 1: Semantic Textual Relatedness for African and Asian Languages

Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Meriem Beloucif, Christine De Kock, Oumaima Hourrane, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Krishnapriya Vishnubhotla, Seid Muhie Yimam, Saif M. Mohammad

We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia -- regions characterised by the relatively limited availability of NLP resources. Each instance in the datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. Participating systems were asked to rank sentence pairs by their closeness in meaning (i.e., their degree of semantic relatedness) in the 14 languages in three main tracks: (a) supervised, (b) unsupervised, and (c) crosslingual. The task attracted 163 participants. We received 70 submissions in total (across all tasks) from 51 different teams, and 38 system description papers. We report on the best-performing systems as well as the most common and the most effective approaches for the three different tracks.

4/19/2024

cs.CL

AAdaM at SemEval-2024 Task 1: Augmentation and Adaptation for Multilingual Semantic Textual Relatedness

Miaoran Zhang, Mingyang Wang, Jesujoba O. Alabi, Dietrich Klakow

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages. The shared task aims at measuring the semantic textual relatedness between pairs of sentences, with a focus on a range of under-represented languages. In this work, we propose using machine translation for data augmentation to address the low-resource challenge of limited training data. Moreover, we apply task-adaptive pre-training on unlabeled task data to bridge the gap between pre-training and task adaptation. For model training, we investigate both full fine-tuning and adapter-based tuning, and adopt the adapter framework for effective zero-shot cross-lingual transfer. We achieve competitive results in the shared task: our system performs the best among all ranked teams in both subtask A (supervised learning) and subtask C (cross-lingual transfer).

6/10/2024

cs.CL