UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation

2402.12730

Published 4/15/2024 by Shubhashis Roy Dipta, Sai Vallurupalli

UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation

Abstract

The aim of SemEval-2024 Task 1, Semantic Textual Relatedness for African and Asian Languages is to develop models for identifying semantic textual relatedness (STR) between two sentences using multiple languages (14 African and Asian languages) and settings (supervised, unsupervised, and cross-lingual). Large language models (LLMs) have shown impressive performance on several natural language understanding tasks such as multilingual machine translation (MMT), semantic similarity (STS), and encoding sentence embeddings. Using a combination of LLMs that perform well on these tasks, we developed two STR models, $textit{TranSem}$ and $textit{FineSem}$, for the supervised and cross-lingual settings. We explore the effectiveness of several training methods and the usefulness of machine translation. We find that direct fine-tuning on the task is comparable to using sentence embeddings and translating to English leads to better performance for some languages. In the supervised setting, our model performance is better than the official baseline for 3 languages with the remaining 4 performing on par. In the cross-lingual setting, our model performance is better than the baseline for 3 languages (leading to $1^{st}$ place for Africaans and $2^{nd}$ place for Indonesian), is on par for 2 languages and performs poorly on the remaining 7 languages. Our code is publicly available at https://github.com/dipta007/SemEval24-Task8.

Create account to get full access

Overview

This paper describes the system developed by UMBCLU for SemEval-2024 Task 1A and 1C, which focused on semantic textual relatedness with and without machine translation.
The researchers developed two models: one for Subtask A, which required scoring the relatedness of text pairs with no machine translation, and one for Subtask C, which involved scoring relatedness for text pairs that had been machine translated.

Plain English Explanation

The researchers at UMBCLU created two artificial intelligence (AI) models to participate in a competition called SemEval-2024. The competition had two parts, or "subtasks":

Subtask A: Semantic Textual Relatedness without Machine Translation - In this part, the models had to evaluate how related or similar the meaning of two pieces of text were, without any machine translation involved.
Subtask C: Semantic Textual Relatedness with Machine Translation - In this part, the models had to evaluate the relatedness of text pairs that had been automatically translated from one language to another using machine translation technology.

The researchers developed separate AI models to handle each of these subtasks. This allowed them to tailor the approaches and techniques used to the specific requirements of each part of the competition.

Technical Explanation

For Subtask A, the researchers used a BERT-based model with additional fine-tuning and data augmentation techniques to improve its performance on scoring the relatedness of text pairs without machine translation.

For Subtask C, the researchers developed a model that combined BERT with a machine translation component. This allowed the model to better handle the additional complexity introduced by the machine-translated text pairs.

The key insights from this work include the importance of tailoring the AI model architecture and training approaches to the specific requirements of each subtask, as well as the benefits of using data augmentation and other techniques to improve model performance.

Critical Analysis

The paper does not delve into significant limitations or caveats of the proposed approaches. It would be helpful to understand any challenges the researchers faced in developing these models, such as issues with data quality, language coverage, or model robustness.

Additionally, the paper does not provide much insight into how the UMBCLU models performed compared to other competing systems in the SemEval-2024 competition. Some other papers offer more detailed comparative analyses that could provide useful context.

Conclusion

The UMBCLU team developed two separate AI models to tackle the distinct challenges of SemEval-2024 Task 1A (semantic textual relatedness without machine translation) and Task 1C (semantic textual relatedness with machine translation). Their approach of tailoring the model architectures and training techniques to the specific requirements of each subtask appears to have been effective, though more analysis of the results and comparisons to other systems would be helpful to fully evaluate the significance of their work.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🛸

SemEval Task 1: Semantic Textual Relatedness for African and Asian Languages

Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Meriem Beloucif, Christine De Kock, Oumaima Hourrane, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Krishnapriya Vishnubhotla, Seid Muhie Yimam, Saif M. Mohammad

We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia -- regions characterised by the relatively limited availability of NLP resources. Each instance in the datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. Participating systems were asked to rank sentence pairs by their closeness in meaning (i.e., their degree of semantic relatedness) in the 14 languages in three main tracks: (a) supervised, (b) unsupervised, and (c) crosslingual. The task attracted 163 participants. We received 70 submissions in total (across all tasks) from 51 different teams, and 38 system description papers. We report on the best-performing systems as well as the most common and the most effective approaches for the three different tracks.

4/19/2024

cs.CL

💬

MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness

Shijia Zhou, Huangyan Shan, Barbara Plank, Robert Litschko

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness (STR), on Track C: Cross-lingual. The task aims to detect semantic relatedness of two sentences in a given target language without access to direct supervision (i.e. zero-shot cross-lingual transfer). To this end, we focus on different source language selection strategies on two different pre-trained languages models: XLM-R and Furina. We experiment with 1) single-source transfer and select source languages based on typological similarity, 2) augmenting English training data with the two nearest-neighbor source languages, and 3) multi-source transfer where we compare selecting on all training languages against languages from the same family. We further study machine translation-based data augmentation and the impact of script differences. Our submission achieved the first place in the C8 (Kinyarwanda) test set.

4/4/2024

cs.CL

🎯

Multilingual Evaluation of Semantic Textual Relatedness

Sharvi Endait, Srushti Sonavane, Ridhima Sinare, Pritika Rohera, Advait Naik, Dipali Kadam

The explosive growth of online content demands robust Natural Language Processing (NLP) techniques that can capture nuanced meanings and cultural context across diverse languages. Semantic Textual Relatedness (STR) goes beyond superficial word overlap, considering linguistic elements and non-linguistic factors like topic, sentiment, and perspective. Despite its pivotal role, prior NLP research has predominantly focused on English, limiting its applicability across languages. Addressing this gap, our paper dives into capturing deeper connections between sentences beyond simple word overlap. Going beyond English-centric NLP research, we explore STR in Marathi, Hindi, Spanish, and English, unlocking the potential for information retrieval, machine translation, and more. Leveraging the SemEval-2024 shared task, we explore various language models across three learning paradigms: supervised, unsupervised, and cross-lingual. Our comprehensive methodology gains promising results, demonstrating the effectiveness of our approach. This work aims to not only showcase our achievements but also inspire further research in multilingual STR, particularly for low-resourced languages.

4/16/2024

cs.CL

AAdaM at SemEval-2024 Task 1: Augmentation and Adaptation for Multilingual Semantic Textual Relatedness

Miaoran Zhang, Mingyang Wang, Jesujoba O. Alabi, Dietrich Klakow

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages. The shared task aims at measuring the semantic textual relatedness between pairs of sentences, with a focus on a range of under-represented languages. In this work, we propose using machine translation for data augmentation to address the low-resource challenge of limited training data. Moreover, we apply task-adaptive pre-training on unlabeled task data to bridge the gap between pre-training and task adaptation. For model training, we investigate both full fine-tuning and adapter-based tuning, and adopt the adapter framework for effective zero-shot cross-lingual transfer. We achieve competitive results in the shared task: our system performs the best among all ranked teams in both subtask A (supervised learning) and subtask C (cross-lingual transfer).

6/10/2024

cs.CL