Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian

2404.08617

Published 4/15/2024 by Aleksa Cvetanović, Predrag Tadić

Abstract

In this paper, we focus on generating a synthetic question answering (QA) dataset using an adapted Translate-Align-Retrieve method. Using this method, we created the largest Serbian QA dataset of more than 87K samples, which we name SQuAD-sr. To acknowledge the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset. We investigate the dataset quality and use it to fine-tune several pre-trained QA models. Best results were obtained by fine-tuning the BERTić model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score on the benchmark XQuAD dataset, which we translated into Serbian for the purpose of evaluation. The results show that our model exceeds zero-shot baselines, but fails to go beyond human performance. We note the advantage of using a monolingual pre-trained model over multilingual, as well as the performance increase gained by using Latin over Cyrillic. By performing additional analysis, we show that questions about numeric values or dates are more likely to be answered correctly than other types of questions. Finally, we conclude that SQuAD-sr is of sufficient quality for fine-tuning a Serbian QA model, in the absence of a manually crafted and annotated dataset.

Overview

This research paper explores the creation of a synthetic dataset and fine-tuning of transformer models for question answering in the Serbian language.
The study aims to address the lack of large-scale datasets for Serbian language question answering tasks.
The researchers build the dataset with an adapted Translate-Align-Retrieve method and use it to fine-tune several pre-trained transformer QA models; the monolingual BERTić model gives the best results.

Plain English Explanation

The paper focuses on improving question answering capabilities in the Serbian language. Question answering is a natural language processing task where a system tries to answer questions based on given text. However, there is a shortage of large, high-quality datasets in Serbian to train and evaluate such question answering models.

To address this, the researchers created a synthetic dataset for Serbian question answering. They used an adapted version of an existing method, Translate-Align-Retrieve, to produce the question-answer samples automatically. The resulting dataset, named SQuAD-sr, contains more than 87K samples in both Cyrillic and Latin scripts, far more than would be feasible with manual curation.
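The abstract notes that the dataset exists in both Cyrillic and Latin versions. Serbian Cyrillic maps to Latin letter by letter, so one script can in principle be derived from the other. The sketch below illustrates that mapping; whether the authors produced the second version this way is an assumption, and the function name `cyrillic_to_latin` is hypothetical.

```python
# Serbian Cyrillic -> Latin, lowercase letters only.
# Three Cyrillic letters (љ, њ, џ) map to Latin digraphs.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic text to Latin (illustrative helper).

    Simplification: an uppercase digraph letter becomes title case ("Lj"),
    which is not ideal for words written entirely in capitals.
    """
    out = []
    for ch in text:
        lower = ch.lower()
        if lower not in CYR_TO_LAT:
            out.append(ch)  # digits, punctuation, Latin text pass through
            continue
        latin = CYR_TO_LAT[lower]
        if ch != lower:  # the original character was uppercase
            latin = latin.capitalize()
        out.append(latin)
    return "".join(out)


print(cyrillic_to_latin("Београд"))
print(cyrillic_to_latin("питање"))
```

Because digraphs make the Latin text longer than the Cyrillic original, any character offsets stored alongside the text (such as an answer start position) would need to be recomputed after transliteration.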

The researchers then took several pre-trained transformer models, including the monolingual BERTić model, and fine-tuned them on this synthetic Serbian dataset. Fine-tuning means further training a model on a specific task and dataset, which usually improves its performance on that task compared to using the pre-trained model alone.

The goal was to see whether these fine-tuned models could perform well on Serbian question answering. According to the abstract, the best model exceeds zero-shot baselines but does not reach human performance, a monolingual pre-trained model outperforms multilingual ones, and training on the Latin version of the dataset gives better results than training on the Cyrillic one.

Technical Explanation

The paper first reviews related work on cross-lingual named entity corpora for Slavic languages and open-domain question answering datasets for Kazakh, highlighting the lack of such resources for Serbian.

The authors then introduce their approach for synthetic dataset creation, an adapted version of the Translate-Align-Retrieve method. The approach produces question-answer samples automatically rather than through manual annotation, which allows them to build SQuAD-sr, a dataset of more than 87K samples, in both Cyrillic and Latin scripts. They also investigate the quality of the resulting dataset.
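The abstract names the Translate-Align-Retrieve method without spelling out its steps. As a rough illustration of the general idea behind alignment-based answer retrieval, the sketch below projects an answer span from a source sentence onto its translation using word alignments. The function `project_span`, the toy sentences, and the alignment pairs are all hypothetical; this is not the authors' adapted procedure.

```python
from typing import Iterable, Optional, Tuple


def project_span(
    src_span: Tuple[int, int],
    alignment: Iterable[Tuple[int, int]],
) -> Optional[Tuple[int, int]]:
    """Map an inclusive source-token span onto target tokens.

    `alignment` holds (source_index, target_index) pairs. The projected span
    covers every target token aligned to a token inside the source span.
    Returns None when no token in the span is aligned.
    """
    start, end = src_span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None
    return min(targets), max(targets)


# Toy example with a hand-written alignment (an assumption, not real data).
src_tokens = ["The", "tower", "was", "built", "in", "1889", "."]
tgt_tokens = ["Toranj", "je", "izgrađen", "1889", "godine", "."]
alignment = [(1, 0), (2, 1), (3, 2), (5, 3), (6, 5)]

span = project_span((5, 5), alignment)  # source answer: "1889"
if span is not None:
    print(" ".join(tgt_tokens[span[0] : span[1] + 1]))
```

Samples for which the projection fails (the function returns `None`) would have to be repaired or discarded, which is one way noise can enter a dataset built this way.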

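For model fine-tuning, the researchers experiment with several pre-trained transformer models, both monolingual and multilingual. They fine-tune these models on SQuAD-sr, building on the general language understanding acquired during pre-training. The best result, 73.91% Exact Match and 82.97% F1, comes from BERTić fine-tuned on the Latin version of the dataset and evaluated on the XQuAD benchmark, which the authors translated into Serbian. This cost-efficient approach allows them to adapt the models to the Serbian question answering task.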
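The abstract reports results as Exact Match and F1. The sketch below shows how these two metrics are commonly computed for extractive QA, in the style of the SQuAD evaluation: answers are normalized, then compared as whole strings (Exact Match) or as bags of tokens (F1). The normalization here is an assumption; it lowercases and strips ASCII punctuation, and it omits the English article removal used for SQuAD. The paper's exact evaluation script may differ.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("1889.", "1889"))
print(round(f1_score("u Beogradu", "Beogradu"), 4))
```

Dataset-level scores are the averages of these per-question values; when a question has several gold answers, the usual convention is to take the maximum over them.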

Critical Analysis

The paper acknowledges that the synthetic dataset, while useful for training, may not fully capture the nuances and complexities of real-world Serbian language use. There could be biases or artifacts introduced by the automatic generation process that limit the model's performance on authentic Serbian text.

The abstract reports one additional analysis, showing that questions about numeric values or dates are answered correctly more often than other question types. A more detailed error analysis or qualitative assessment of the model outputs would still help clarify the types of mistakes the models make and where their strengths and weaknesses lie.

Further research could explore ways to augment the synthetic dataset with a smaller amount of human-curated data, or to combine the synthetic data with transfer learning from related languages to achieve even better performance.

Conclusion

This research paper presents a novel approach to addressing the lack of large-scale datasets for Serbian language question answering. By leveraging synthetic data generation and fine-tuning of transformer models, the authors demonstrate a path to improving question answering capabilities in Serbian, a language with limited NLP resources.

The key takeaway is that transformer-based models can be effectively adapted to low-resource languages through a combination of synthetic data and targeted fine-tuning. This could have broader implications for advancing natural language processing in other under-resourced languages, and for developing practical question answering systems that can serve diverse linguistic communities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages

Gayane Ghazaryan, Erik Arakelyan, Pasquale Minervini, Isabelle Augenstein

Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose SynDARin, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain human-curated paragraphs between English and the target language. We use the English data as context to generate synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English human-curated paragraphs form the final QA dataset. The method allows to maintain the content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with 1.2K samples for the Armenian language. The human evaluation shows that 98% of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out ~70% of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource language.

6/21/2024

cs.CL cs.AI cs.LG

UQA: Corpus for Urdu Question Answering

Samee Arif, Sualeha Farid, Awais Athar, Agha Ali Raza

This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. For XLM-RoBERTa-XL, we have an F1 score of 85.99 and 74.56 EM. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at www.github.com/sameearif/UQA.

5/3/2024

cs.CL cs.AI cs.IR cs.LG

LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models

Zihan Zhao, Yiyang Jiang, Heyang Liu, Yanfeng Wang, Yu Wang

While Large Language Models (LLMs) have demonstrated commendable performance across a myriad of domains and tasks, existing LLMs still exhibit a palpable deficit in handling multimodal functionalities, especially for the Spoken Question Answering (SQA) task which necessitates precise alignment and deep interaction between speech and text features. To address the SQA challenge on LLMs, we initially curated the free-form and open-ended LibriSQA dataset from Librispeech, comprising Part I with natural conversational formats and Part II encompassing multiple-choice questions followed by answers and analytical segments. Both parts collectively include 107k SQA pairs that cover various topics. Given the evident paucity of existing speech-text LLMs, we propose a lightweight, end-to-end framework to execute the SQA task on the LibriSQA, witnessing significant results. By reforming ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks. Our empirical findings bolster the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs. The dataset and demo can be found at https://github.com/ZihanZhaoSJTU/LibriSQA.

4/19/2024

cs.CL

TIGQA: An Expert Annotated Question Answering Dataset in Tigrinya

Hailay Teklehaymanot, Dren Fazlija, Niloy Ganguly, Gourab K. Patro, Wolfgang Nejdl

The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert annotated educational dataset consisting of 2.68K question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pretrained models. The notable disparities between human performance and best model performance underscore the potential for further enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in the Tigrinya MRC.

4/29/2024

cs.CL