INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages

Read original: arXiv:2407.13522 - Published 7/19/2024 by Abhishek Kumar Singh, Rudra Murthy, Vishwajeet kumar, Jaydeep Sen, Ganesh Ramakrishnan

Overview

This paper introduces the INDIC QA BENCHMARK, a new multilingual benchmark for evaluating the question answering (QA) capabilities of large language models (LLMs) on Indic languages.
The benchmark covers 11 major Indian languages from two language families, including Hindi, Bengali, and Tamil, and includes over 15,000 question-answer pairs.
The goal is to provide a standardized evaluation framework to assess how well LLMs can understand and answer questions in diverse Indic languages, which are underrepresented in current AI research.

Plain English Explanation

The INDIC QA BENCHMARK is a new dataset that researchers can use to test how well AI language models can answer questions in a variety of Indic languages, such as Hindi, Bengali, and Tamil. Indic languages are commonly spoken across South Asia, but they are often overlooked in AI research, which tends to focus more on languages like English.

The benchmark includes over 15,000 questions and answers covering a wide range of topics. This gives researchers a standardized way to evaluate how capable different AI language models are at understanding and responding to questions in these Indic languages. This is important because as AI continues to advance, we want to make sure the technology works well for diverse languages and cultures, not just the most common ones.

By creating this benchmark, the researchers hope to spur more work on developing AI systems that can effectively communicate in Indic languages. This could have big implications, making AI-powered tools and services more accessible to the hundreds of millions of people who speak these languages.

Technical Explanation

The INDIC QA BENCHMARK is a new dataset designed to evaluate the context-grounded question answering (QA) capabilities of large language models (LLMs) on 11 major Indian languages from two language families, among them Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Odia, and Assamese.

The benchmark contains over 15,000 question-answer pairs spanning a diverse range of topics and comprises both extractive and abstractive question-answering tasks. It combines existing datasets, English QA datasets translated into Indian languages, and a synthetic dataset in which the Gemini model creates question-answer pairs from a given passage; the synthetic pairs are manually verified for quality assurance.

To establish baselines, the authors evaluated various multilingual LLMs and their instruction-fine-tuned variants on the INDIC QA BENCHMARK. The results show that these models perform poorly, particularly on low-resource languages, highlighting the need for more research in this area.
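This summary does not say which metrics the authors report. As a rough sketch only, the snippet below scores answers with exact match and token-level F1, two metrics commonly used for extractive QA, and averages them per language. The function names, the language codes, and the normalization choices (Unicode NFKC, punctuation removal, whitespace tokenization) are assumptions made for illustration, not details taken from the paper.

```python
import unicodedata
from collections import Counter, defaultdict


def normalize(text: str) -> list[str]:
    """Normalize Unicode, replace punctuation with spaces, split on whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Category "P*" covers punctuation, including the Devanagari danda.
    # Vowel signs and other combining marks are category "M*" and are kept.
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return cleaned.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_by_language(examples):
    """Average EM and F1 per language.

    `examples` is an iterable of (language_code, prediction, reference)
    tuples; this input layout is hypothetical.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for lang, prediction, reference in examples:
        entry = totals[lang]
        entry[0] += exact_match(prediction, reference)
        entry[1] += token_f1(prediction, reference)
        entry[2] += 1
    return {
        lang: {"em": em / n, "f1": f1 / n}
        for lang, (em, f1, n) in totals.items()
    }
```

Reporting scores per language rather than as one pooled number makes it easier to see where low-resource languages lag. Whitespace tokenization is a simplification, and token overlap is a poor fit for abstractive answers, which usually need a different metric.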

The authors also generate a synthetic dataset by using the Gemini model to create question-answer pairs from a given passage, and then manually verify the output for quality. This helps address the scarcity of QA data for these languages.
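Before synthetic question-answer pairs go to manual review, a cheap automatic check can catch obvious problems. The sketch below, for the extractive setting, flags any pair whose answer does not appear verbatim in its passage. The `QAPair` type and both function names are hypothetical, and nothing here is claimed to be the authors' pipeline.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class QAPair:
    """Hypothetical container for one synthetic example."""
    passage: str
    question: str
    answer: str


def _norm(text: str) -> str:
    """Apply NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def is_extractive(pair: QAPair) -> bool:
    """True if the non-empty answer occurs as a substring of the passage."""
    answer = _norm(pair.answer)
    return bool(answer) and answer in _norm(pair.passage)


def split_for_review(pairs):
    """Separate pairs that pass the span check from those needing a closer look."""
    passed, flagged = [], []
    for pair in pairs:
        (passed if is_extractive(pair) else flagged).append(pair)
    return passed, flagged
```

Flagged pairs are not necessarily wrong: an abstractive answer can be correct without matching the passage verbatim, so this check only prioritizes what a human reviewer looks at first.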

Critical Analysis

The INDIC QA BENCHMARK is a valuable contribution to the field of multilingual natural language processing. By focusing on Indic languages, which are often overlooked in AI research, the authors are helping to address an important gap and promote more inclusive development of language technology.

One potential limitation of the benchmark is its reliance on English QA datasets translated into Indian languages and on question-answer pairs generated by the Gemini model, either of which could introduce artifacts or biases. The synthetic pairs are manually verified for quality, but further analysis of the dataset's quality and representativeness may be warranted.

Additionally, while the baseline results highlight significant performance gaps for Indic languages, the authors do not provide a detailed error analysis or insights into the specific linguistic challenges these models face. Such analysis could help guide future research and model development efforts.

Overall, the INDIC QA BENCHMARK is a valuable contribution that should encourage more work on improving the QA capabilities of LLMs for underrepresented languages. Continued research in this area has the potential to make AI-powered tools and services more accessible to a wider global audience.

Conclusion

The INDIC QA BENCHMARK is an important new dataset that provides a standardized way to evaluate the question answering capabilities of large language models on a diverse set of Indic languages. By focusing on these underrepresented languages, the authors are helping to promote more inclusive development of AI technology.

The baseline results show that multilingual LLMs and their instruction-fine-tuned variants perform poorly on the benchmark, particularly for low-resource languages, underscoring the need for more research and investment in this area. The authors also supplement existing and translated datasets with manually verified synthetic question-answer pairs generated by the Gemini model, which helps address the scarcity of data.

Overall, the INDIC QA BENCHMARK is an important step forward in ensuring that the benefits of AI are accessible to speakers of diverse languages and cultures around the world. As AI continues to advance, it will be crucial to develop technologies that can effectively communicate in a wide range of global languages.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages

Abhishek Kumar Singh, Rudra Murthy, Vishwajeet kumar, Jaydeep Sen, Ganesh Ramakrishnan

Large Language Models (LLMs) have demonstrated remarkable zero-shot and few-shot capabilities in unseen tasks, including context-grounded question answering (QA) in English. However, the evaluation of LLMs' capabilities in non-English languages for context-based QA is limited by the scarcity of benchmarks in non-English languages. To address this gap, we introduce Indic-QA, the largest publicly available context-grounded question-answering dataset for 11 major Indian languages from two language families. The dataset comprises both extractive and abstractive question-answering tasks and includes existing datasets as well as English QA datasets translated into Indian languages. Additionally, we generate a synthetic dataset using the Gemini model to create question-answer pairs given a passage, which is then manually verified for quality assurance. We evaluate various multilingual Large Language Models and their instruction-fine-tuned variants on the benchmark and observe that their performance is subpar, particularly for low-resource languages. We hope that the release of this dataset will stimulate further research on the question-answering abilities of LLMs for low-resource languages.

7/19/2024

L3Cube-IndicQuest: A Benchmark Questing Answering Dataset for Evaluating Knowledge of LLMs in Indic Context

Pritika Rohera, Chaitrali Ginimav, Akanksha Salunke, Gayatri Sawant, Raviraj Joshi

Large Language Models (LLMs) have made significant progress in incorporating Indic languages within multilingual models. However, it is crucial to quantitatively assess whether these languages perform comparably to globally dominant ones, such as English. Currently, there is a lack of benchmark datasets specifically designed to evaluate the regional knowledge of LLMs in various Indic languages. In this paper, we present the L3Cube-IndicQuest, a gold-standard question-answering benchmark dataset designed to evaluate how well multilingual LLMs capture regional knowledge across various Indic languages. The dataset contains 200 question-answer pairs, each for English and 19 Indic languages, covering five domains specific to the Indic region. We aim for this dataset to serve as a benchmark, providing ground truth for evaluating the performance of LLMs in understanding and representing knowledge relevant to the Indian context. The IndicQuest can be used for both reference-based evaluation and LLM-as-a-judge evaluation. The dataset is shared publicly at https://github.com/l3cube-pune/indic-nlp .

9/16/2024

IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages

Harman Singh, Nitish Gupta, Shikhar Bharadwaj, Dinesh Tewari, Partha Talukdar

As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 billion people. To facilitate research on multilingual LLM evaluation, we release IndicGenBench, the largest benchmark for evaluating LLMs on user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families. IndicGenBench is composed of diverse generation tasks like cross-lingual summarization, machine translation, and cross-lingual question answering. IndicGenBench extends existing benchmarks to many Indic languages through human curation, providing multi-way parallel evaluation data for many under-represented Indic languages for the first time. We evaluate a wide range of proprietary and open-source LLMs including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM and LLaMA on IndicGenBench in a variety of settings. The largest PaLM-2 models perform the best on most tasks; however, there is a significant performance gap in all languages compared to English, showing that further research is needed for the development of more inclusive multilingual language models. IndicGenBench is released at www.github.com/google-research-datasets/indic-gen-bench

4/26/2024

Suvach -- Generated Hindi QA benchmark

Vaishak Narayanan, Prabin Raj KP, Saifudheen Nouphal

Current evaluation benchmarks for question answering (QA) in Indic languages often rely on machine translation of existing English datasets. This approach suffers from bias and inaccuracies inherent in machine translation, leading to datasets that may not reflect the true capabilities of EQA models for Indic languages. This paper proposes a new benchmark specifically designed for evaluating Hindi EQA models and discusses the methodology to do the same for any task. This method leverages large language models (LLMs) to generate a high-quality dataset in an extractive setting, ensuring its relevance for the target language. We believe this new resource will foster advancements in Hindi NLP research by providing a more accurate and reliable evaluation tool.

5/1/2024