XNLIeu: a dataset for cross-lingual NLI in Basque

2404.06996

Published 4/11/2024 by Maite Heredia, Julen Etxaniz, Muitze Zulaika, Xabier Saralegi, Jeremy Barnes, Aitor Soroa

Abstract

XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step. We have conducted a series of experiments using mono- and multilingual LLMs to assess a) the effect of professional post-edition on the MT system; b) the best cross-lingual strategy for NLI in Basque; and c) whether the choice of the best cross-lingual strategy is influenced by the fact that the dataset is built by translation. The results show that post-edition is necessary and that the translate-train cross-lingual strategy obtains better results overall, although the gain is lower when tested in a dataset that has been built natively from scratch. Our code and datasets are publicly available under open licenses.

Overview

This paper introduces XNLIeu, a dataset that extends the XNLI benchmark for cross-lingual natural language inference (NLI) to Basque.
NLI is the task of deciding whether a hypothesis sentence is entailed by, contradicts, or is neutral with respect to a premise sentence. In the cross-lingual setting, a model typically learns the task from data in one language (usually English) and is evaluated in another.
XNLIeu was built by machine-translating the English XNLI corpus into Basque and then manually post-editing the translations.
The dataset is intended to support the development and evaluation of NLI models for Basque, a low-resource language.

Plain English Explanation

The paper presents a new dataset called XNLIeu that can be used to train and test natural language processing models for a specific language task in Basque. The task is called "cross-lingual natural language inference" (cross-lingual NLI).

NLI is about figuring out whether one sentence (the hypothesis) logically follows from another sentence (the premise). For example, if the premise is "The cat is sleeping on the bed," can we infer that the hypothesis "An animal is on the furniture" is true? The "cross-lingual" part refers to how models are built: they learn the task from data in a well-resourced language such as English and are then applied to another language, here Basque.

The XNLIeu dataset provides many such sentence pairs in Basque, and for each pair the dataset tells you whether the hypothesis is entailed by (can be inferred from) the premise, contradicts the premise, or is neutral (neither follows from it nor contradicts it).
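As a rough illustration of what one labeled item in an NLI dataset looks like, the Python sketch below models a sentence pair with its three-way label. The class and field names (`NLIExample`, `language`, `label_counts`) are assumptions made for illustration and are not taken from the XNLIeu release; the sentences are the English glosses of the cat example above.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The three NLI relationships named in the text."""
    ENTAILMENT = "entailment"
    CONTRADICTION = "contradiction"
    NEUTRAL = "neutral"


@dataclass(frozen=True)
class NLIExample:
    premise: str
    hypothesis: str
    label: Label
    language: str  # assumed field: ISO 639-1 code, e.g. "eu" for Basque


# English glosses only; a real Basque item would hold Basque
# sentences and language="eu".
example = NLIExample(
    premise="The cat is sleeping on the bed.",
    hypothesis="An animal is on the furniture.",
    label=Label.ENTAILMENT,
    language="en",
)


def label_counts(examples):
    """Count how many examples carry each of the three labels."""
    counts = {label: 0 for label in Label}
    for ex in examples:
        counts[ex.label] += 1
    return counts
```

A helper like `label_counts` is one simple way to check whether the three labels are balanced in a given split.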

This dataset can be used to train and test natural language processing models that can do cross-lingual NLI for Basque. Having this dataset available is important because it allows researchers to develop and evaluate these types of models for the Basque language, which is a less-resourced language compared to others like English.

Technical Explanation

The paper introduces XNLIeu, an extension of the XNLI benchmark to Basque. NLI involves determining whether a hypothesis sentence is entailed by, contradicts, or is neutral with respect to a premise sentence, and XNLI is widely used to evaluate how well this ability transfers across languages.

The XNLIeu dataset was created by first machine-translating the English XNLI corpus into Basque and then manually post-editing the translations. Each sentence pair carries one of three labels: entailment, contradiction, or neutral.
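To get an informal sense of how much a manual post-edition step changes machine-translated text, one could compare each raw translation with its edited version. The sketch below does this with Python's standard `difflib`; it is not the measure used in the paper, and the function names and the whitespace tokenization are assumptions chosen for brevity.

```python
from difflib import SequenceMatcher


def post_edit_similarity(mt_output: str, post_edited: str) -> float:
    """Token-level similarity in [0, 1]; 1.0 means the editor changed nothing."""
    mt_tokens = mt_output.split()
    pe_tokens = post_edited.split()
    return SequenceMatcher(None, mt_tokens, pe_tokens).ratio()


def fraction_edited(pairs) -> float:
    """Share of (mt_output, post_edited) pairs that differ at all."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    changed = sum(1 for mt, pe in pairs if mt != pe)
    return changed / len(pairs)


# Placeholder strings stand in for real MT output and post-edited text.
sample_pairs = [
    ("w1 w2 w3 w4", "w1 w2 w3 w4"),  # left untouched by the editor
    ("w1 w2 w3 w4", "w1 w2 x3 w4"),  # one token replaced
]
for mt, pe in sample_pairs:
    print(f"{post_edit_similarity(mt, pe):.2f}")
print(f"fraction edited: {fraction_edited(sample_pairs):.2f}")
```

A low average similarity or a high edited fraction would suggest that the raw machine translation needed substantial correction.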

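The authors run a series of experiments with monolingual and multilingual language models to answer three questions: (a) how professional post-edition affects results compared with raw machine translation; (b) which cross-lingual strategy works best for NLI in Basque; and (c) whether that choice depends on the dataset having been built by translation. The results indicate that post-edition is necessary and that the translate-train strategy performs best overall, although its advantage is smaller on a test set that was written natively rather than translated.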
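The abstract names translate-train as the strongest cross-lingual strategy. The sketch below lays out how that strategy differs from two alternatives commonly compared with it in the cross-lingual literature, zero-shot transfer and translate-test, in terms of which language the training and test data are in. The `Strategy` class and its fields are hypothetical names used for illustration, and scoring by accuracy is an assumption based on common XNLI practice rather than something stated in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    train_language: str  # language of the data the classifier is fine-tuned on
    test_language: str   # language of the text the classifier sees at test time
    translated_split: str  # which split passes through MT: "none", "train" or "test"


SOURCE, TARGET = "en", "eu"  # English as source, Basque as target

STRATEGIES = [
    # Fine-tune on English, apply directly to Basque test data.
    Strategy("zero-shot", SOURCE, TARGET, "none"),
    # Machine-translate the English training data into Basque, fine-tune on it.
    Strategy("translate-train", TARGET, TARGET, "train"),
    # Machine-translate the Basque test data into English, use an English model.
    Strategy("translate-test", SOURCE, SOURCE, "test"),
]


def accuracy(gold, predicted) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


for s in STRATEGIES:
    print(f"{s.name}: train={s.train_language}, test={s.test_language}, "
          f"MT on {s.translated_split}")
```

Comparing the strategies then amounts to computing `accuracy` for each one on the same gold labels, once on the translated test set and once on a natively written one.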

The authors also provide an analysis of the dataset, including the distribution of label types, sentence lengths, and lexical overlap between the Basque and English sentences. This analysis sheds light on the characteristics and potential challenges of the cross-lingual NLI task for the Basque language.

Critical Analysis

The XNLIeu dataset is a valuable contribution to the field of cross-lingual natural language processing for low-resource languages like Basque. By providing a high-quality, annotated dataset for the cross-lingual NLI task, the authors enable the development and evaluation of more advanced models that can handle Basque language understanding.

One potential limitation of the dataset is that it is built by translation rather than written natively in Basque. The abstract reports that the advantage of the translate-train strategy shrinks when models are tested on a dataset built natively from scratch, which suggests that results on translated benchmarks may not fully carry over to naturally occurring Basque text.

Additionally, the paper focuses solely on evaluating existing cross-lingual NLI models on the XNLIeu dataset. While this provides a useful benchmark, it would be interesting to see the authors explore novel model architectures or training approaches that are specifically tailored to the Basque language and the cross-lingual NLI task.

Overall, the XNLIeu dataset and the associated analysis presented in this paper represent an important step forward in advancing the state of cross-lingual natural language processing for under-resourced languages. The dataset can serve as a valuable resource for researchers and practitioners working to develop more robust and inclusive natural language understanding systems.

Conclusion

This paper introduces XNLIeu, a dataset that extends the XNLI natural language inference (NLI) benchmark to Basque. The dataset provides Basque sentence pairs, obtained by machine translation of the English XNLI corpus followed by manual post-edition and annotated for entailment, contradiction, and neutral relationships, enabling the development and evaluation of cross-lingual NLI models for Basque.

The availability of this dataset is a significant contribution to the field of cross-lingual natural language processing, as it helps address the lack of resources for under-resourced languages like Basque. The authors' evaluation of state-of-the-art models on the XNLIeu dataset provides a useful benchmark and highlights areas for further research and improvement.

Overall, the XNLIeu dataset and the insights presented in this paper have the potential to drive advancements in cross-lingual natural language understanding, ultimately contributing to more inclusive and accessible language technologies for a diverse range of languages.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation

Elisa Sanchez-Bayona, Rodrigo Agerri

Metaphors, although occasionally unperceived, are ubiquitous in our everyday language. Thus, it is crucial for Language Models to be able to grasp the underlying meaning of this kind of figurative language. In this work, we present Meta4XNLI, a novel parallel dataset for the tasks of metaphor detection and interpretation that contains metaphor annotations in both Spanish and English. We investigate language models' metaphor identification and understanding abilities through a series of monolingual and cross-lingual experiments by leveraging our proposed corpus. In order to comprehend how these non-literal expressions affect models' performance, we look over the results and perform an error analysis. Additionally, parallel data offers many potential opportunities to investigate metaphor transferability between these languages and the impact of translation on the development of multilingual annotated resources.

4/11/2024

cs.CL cs.AI cs.LG

Event Extraction in Basque: Typologically motivated Cross-Lingual Transfer-Learning Analysis

Mikel Zubillaga, Oscar Sainz, Ainara Estarrona, Oier Lopez de Lacalle, Eneko Agirre

Cross-lingual transfer-learning is widely used in Event Extraction for low-resource languages and involves a Multilingual Language Model that is trained in a source language and applied to the target language. This paper studies whether the typological similarity between source and target languages impacts the performance of cross-lingual transfer, an under-explored topic. We first focus on Basque as the target language, which is an ideal target language because it is typologically different from surrounding languages. Our experiments on three Event Extraction tasks show that the shared linguistic characteristic between source and target languages does have an impact on transfer quality. Further analysis of 72 language pairs reveals that for tasks that involve token classification such as entity and event trigger identification, common writing script and morphological features produce higher quality cross-lingual transfer. In contrast, for tasks involving structural prediction like argument extraction, common word order is the most relevant feature. In addition, we show that when increasing the training size, not all the languages scale in the same way in the cross-lingual setting. To perform the experiments we introduce EusIE, an event extraction dataset for Basque, which follows the Multilingual Event Extraction dataset (MEE). The dataset and code are publicly available.

4/10/2024

cs.CL cs.AI

EuSQuAD: Automatically Translated and Aligned SQuAD2.0 for Basque

Aitor García-Pablos, Naiara Perez, Montse Cuadros, Jaione Bengoetxea

The widespread availability of Question Answering (QA) datasets in English has greatly facilitated the advancement of the Natural Language Processing (NLP) field. However, the scarcity of such resources for minority languages, such as Basque, poses a substantial challenge for these communities. In this context, the translation and alignment of existing QA datasets plays a crucial role in narrowing this technological gap. This work presents EuSQuAD, the first initiative dedicated to automatically translating and aligning SQuAD2.0 into Basque, resulting in more than 142k QA examples. We demonstrate EuSQuAD's value through extensive qualitative analysis and QA experiments supported with EuSQuAD as training data. These experiments are evaluated with a new human-annotated dataset.

6/5/2024

cs.CL

Latxa: An Open Language Model and Evaluation Suite for Basque

Julen Etxaniz, Oscar Sainz, Naiara Perez, Itziar Aldabe, German Rigau, Eneko Agirre, Aitor Ormazabal, Mikel Artetxe, Aitor Soroa

We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple choice evaluation datasets: EusProficiency, comprising 5,169 questions from official language proficiency exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, comprising 1,715 trivia questions from 5 knowledge areas; and EusExams, comprising 16,774 questions from public examinations. In our extensive evaluation, Latxa outperforms all previous open models we compare to by a large margin. In addition, it is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. Both the Latxa family of models, as well as our new pretraining corpora and evaluation datasets, are publicly available under open licenses at https://github.com/hitz-zentroa/latxa. Our suite enables reproducible research on methods to build LLMs for low-resource languages.

4/1/2024

cs.CL cs.AI cs.LG