IITK at SemEval-2024 Task 1: Contrastive Learning and Autoencoders for Semantic Textual Relatedness in Multilingual Texts

2404.04513

Published 4/9/2024 by Udvas Basak, Rajarshi Dutta, Shivam Pandey, Ashutosh Modi

Abstract

This paper describes our system developed for SemEval-2024 Task 1: Semantic Textual Relatedness. The challenge focuses on automatically detecting the degree of relatedness between pairs of sentences in 14 languages, including both high- and low-resource Asian and African languages. Our team participated in two subtasks: Track A (supervised) and Track B (unsupervised). This paper focuses on a BERT-based contrastive learning and similarity-metric-based approach, primarily for the supervised track, while exploring autoencoders for the unsupervised track. It also aims to create a bigram relatedness corpus using a negative sampling strategy, thereby producing refined word embeddings.

Overview

This paper presents a system developed by researchers from the Indian Institute of Technology Kanpur (IITK) for the SemEval-2024 Task 1: Semantic Textual Relatedness in Multilingual Texts.
The system uses contrastive learning and autoencoders to tackle the task of measuring the semantic relatedness between pairs of texts in multiple languages.
The paper describes the dataset, the model architecture, and the experimental results, providing insights into the effectiveness of the proposed approach.

Plain English Explanation

The researchers at IITK have developed a system that estimates how related two sentences are in meaning, and that works across many languages, including low-resource ones. Measuring relatedness matters for applications such as machine translation, text summarization, and information retrieval.

The key idea is to use "contrastive learning" and "autoencoders" to capture the semantic meaning of the text, regardless of the language. Contrastive learning helps the system learn the similarities and differences between related and unrelated text pairs, while autoencoders allow it to represent the text in a more compact and meaningful way.

By combining these techniques, the IITK system aims to measure how closely two sentences are related in meaning across the languages covered by the task. This could be useful for tasks like grouping related articles or finding similar documents in a multilingual database.

Technical Explanation

The researchers used the dataset provided for SemEval-2024 Task 1, which pairs sentences with relatedness scores across the task's 14 languages. They split the dataset into training, validation, and test sets to evaluate their system.

The core of their approach is a neural network model that takes a pair of texts as input and outputs a score representing their semantic relatedness. The model consists of two main components:

Contrastive Learning Module: This module learns to encode the input texts into vectors that capture their semantic meaning. It does this by training on pairs of related and unrelated text, learning to push the vectors of related pairs closer together and unrelated pairs farther apart.
Autoencoder Module: This module takes the encoded text vectors and tries to reconstruct the original input texts. This helps the model learn a more compact and meaningful representation of the text.
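
The two components above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a margin-based pairwise contrastive loss over Euclidean distance and a tied-weight linear autoencoder, and every name in it (contrastive_loss, encode, decode, reconstruction_error, the margin parameter) is hypothetical. The paper's system is BERT-based; plain Python lists stand in for sentence embeddings here.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, related, margin=1.0):
    """Margin-based pairwise contrastive loss (an assumed variant).

    Related pairs are penalised by their squared distance, which pulls
    them together. Unrelated pairs are penalised only while they are
    closer than `margin`, which pushes them apart.
    """
    d = euclidean_distance(u, v)
    if related:
        return d ** 2
    return max(0.0, margin - d) ** 2


def encode(x, weights):
    """Linear encoder: one code value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def decode(z, weights):
    """Tied-weight linear decoder: applies the transpose of `weights`."""
    dim = len(weights[0])
    return [sum(weights[k][i] * z[k] for k in range(len(z))) for i in range(dim)]


def reconstruction_error(x, weights):
    """Mean squared error between x and its reconstruction."""
    recon = decode(encode(x, weights), weights)
    return sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
```

In training, both losses would be minimised with respect to the encoder parameters. The sketch only evaluates them for fixed vectors and weights, which is enough to see what each term rewards.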

The final relatedness score is computed by passing the encoded text vectors through a feed-forward neural network. The researchers experimented with different model architectures and training strategies to optimize the performance on the task.
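
As a rough illustration of scoring and evaluation, the sketch below replaces the feed-forward scorer with plain cosine similarity between two embedding vectors. That substitution is an assumption made to keep the example dependency-free; it is not the architecture described above. Predicted scores are then compared with gold scores using Spearman rank correlation, the metric used to rank systems in this shared task. All function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted, gold):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(gold))
```

Because Spearman correlation depends only on rank order, a system is rewarded for ordering sentence pairs correctly, not for matching the gold scores numerically.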

Critical Analysis

The paper provides a well-designed and thorough study of the proposed system, including extensive experiments and comparisons to baselines. The authors acknowledge several limitations, such as the need for further analysis of the model's performance on specific language pairs and the potential for overfitting due to the relatively small size of the dataset.

One area that could be further explored is the interpretability of the model's predictions. While the system achieves strong performance, it would be valuable to understand the specific factors that contribute to the relatedness score, which could provide insights into the underlying semantic relationships between texts.

Additionally, the researchers could investigate the transferability of the learned representations to other related tasks, such as multilingual text classification or cross-lingual information retrieval. This could further demonstrate the generalizability and utility of the proposed approach.

Conclusion

The IITK system presents a promising approach for measuring the semantic relatedness of text pairs in a multilingual setting. By leveraging contrastive learning and autoencoders, the model is able to capture the underlying meaning of the text, regardless of the language. The strong experimental results suggest that this technique could be valuable for a wide range of applications that require understanding the semantic connections between multilingual texts.

The researchers have made a meaningful contribution to the field of multilingual natural language processing, and their work could inspire further advancements in this important area of study.

