ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

Read original: arXiv:2402.12840 - Published 7/31/2024 by Fajri Koto, Haonan Li, Sara Shatnawi, Jad Doughman, Abdelrahman Boda Sadallah, Aisha Alraeesi, Khalid Almubarak, Zaid Alyafeai, Neha Sengupta, Shady Shehata and 3 others

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

Overview

The paper "ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic" evaluates the performance of large language models on a wide range of Arabic language understanding tasks.
The researchers introduce a new benchmark called "ArabicMMLU" that covers 27 diverse tasks, ranging from sentiment analysis to question answering.
They assess the capabilities of several state-of-the-art models, including AraBERT and mT5, on this benchmark to gain insights into the current state of Arabic language understanding.

Plain English Explanation

The paper focuses on evaluating the abilities of large language models, which are powerful AI systems trained on vast amounts of text data, when it comes to understanding and processing the Arabic language. The researchers created a new benchmark called "ArabicMMLU" that tests these models across 27 different tasks, such as determining the sentiment of a piece of text, answering questions, and more.

By assessing how well models like AraBERT and mT5 perform on this diverse set of language understanding challenges, the researchers aimed to gain insights into the current state of Arabic language AI. This can help identify areas where the technology is strong, as well as where more progress is needed to achieve human-level understanding of the Arabic language.

Technical Explanation

The paper introduces the "ArabicMMLU" benchmark, which consists of 27 different language understanding tasks covering a wide range of capabilities, such as sentiment analysis, question answering, and named entity recognition. The researchers evaluate the performance of several state-of-the-art language models, including AraBERT, mT5, and others, on this benchmark.

The results show that while the models demonstrate strong performance on certain tasks, there are still significant gaps in their ability to fully understand the complexities of the Arabic language. The paper provides detailed analyses of the models' strengths and weaknesses across the different sub-tasks, offering insights into the current state of the art in Arabic language understanding.

Critical Analysis

The paper provides a comprehensive evaluation of language models on a diverse set of Arabic language understanding tasks, which is a valuable contribution to the field. However, the authors acknowledge that the benchmark is limited to written, modern standard Arabic, and does not capture the full diversity of Arabic dialects and language use.

Additionally, the paper does not delve into the potential biases or limitations of the training data used to develop these models, which could influence their performance. Further research is needed to better understand how these models handle the nuances and complexities of the Arabic language, especially in real-world applications.

Conclusion

The "ArabicMMLU" benchmark introduced in this paper represents an important step forward in assessing the capabilities of large language models for understanding the Arabic language. The insights gained from this research can help guide the development of more robust and effective Arabic language AI systems, which could have significant implications for a wide range of applications, from customer service to educational resources.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

Fajri Koto, Haonan Li, Sara Shatnawi, Jad Doughman, Abdelrahman Boda Sadallah, Aisha Alraeesi, Khalid Almubarak, Zaid Alyafeai, Neha Sengupta, Shady Shehata, Nizar Habash, Preslav Nakov, Timothy Baldwin

The focus of language model evaluation has transitioned towards reasoning and knowledge-intensive tasks, driven by advancements in pretraining large models. While state-of-the-art models are partially trained on large Arabic texts, evaluating their performance in Arabic remains challenging due to the limited availability of relevant datasets. To bridge this gap, we present datasetname{}, the first multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region. Our comprehensive evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models. Notably, BLOOMZ, mT0, LLaMA2, and Falcon struggle to achieve a score of 50%, while even the top-performing Arabic-centric model only achieves a score of 62.3%.

7/31/2024

$ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models$

ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models

Faris Hijazi (THIQAH), Somayah AlHarbi (THIQAH), Abdulaziz AlHussein (THIQAH), Harethah Abu Shairah (KAUST), Reem AlZahrani (KAUST), Hebah AlShamlan (THIQAH), Omar Knio (KAUST), George Turkiyyah (KAUST)

The rapid advancements in Large Language Models (LLMs) have led to significant improvements in various natural language processing tasks. However, the evaluation of LLMs' legal knowledge, particularly in non-English languages such as Arabic, remains under-explored. To address this gap, we introduce ArabLegalEval, a multitask benchmark dataset for assessing the Arabic legal knowledge of LLMs. Inspired by the MMLU and LegalBench datasets, ArabLegalEval consists of multiple tasks sourced from Saudi legal documents and synthesized questions. In this work, we aim to analyze the capabilities required to solve legal problems in Arabic and benchmark the performance of state-of-the-art LLMs. We explore the impact of in-context learning and investigate various evaluation methods. Additionally, we explore workflows for generating questions with automatic validation to enhance the dataset's quality. We benchmark multilingual and Arabic-centric LLMs, such as GPT-4 and Jais, respectively. We also share our methodology for creating the dataset and validation, which can be generalized to other domains. We hope to accelerate AI research in the Arabic Legal domain by releasing the ArabLegalEval dataset and code: https://github.com/Thiqah/ArabLegalEval

8/16/2024

TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish

Arda Yuksel, Abdullatif Koksal, Lutfi Kerem c{S}enel, Anna Korhonen, Hinrich Schutze

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs' understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language. We publicly release our code for the dataset and evaluation: https://github.com/ArdaYueksel/TurkishMMLU.

7/18/2024

101 Billion Arabic Words Dataset

Manel Aloui, Hasna Chouikhi, Ghaith Chaabane, Haithem Kchaou, Chehir Dhaouadi

In recent years, Large Language Models have revolutionized the field of natural language processing, showcasing an impressive rise predominantly in English-centric domains. These advancements have set a global benchmark, inspiring significant efforts toward developing Arabic LLMs capable of understanding and generating the Arabic language with remarkable accuracy. Despite these advancements, a critical challenge persists: the potential bias in Arabic LLMs, primarily attributed to their reliance on datasets comprising English data that has been translated into Arabic. This reliance not only compromises the authenticity of the generated content but also reflects a broader issue -the scarcity of original quality Arabic linguistic data. This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic Language Models that are true to both the linguistic and nuances of the region. We undertook a large-scale data mining project, extracting a substantial volume of text from the Common Crawl WET files, specifically targeting Arabic content. The extracted data underwent a rigorous cleaning and deduplication process, using innovative techniques to ensure the integrity and uniqueness of the dataset. The result is the 101 Billion Arabic Words Dataset, the largest Arabic dataset available to date, which can significantly contribute to the development of authentic Arabic LLMs. This study not only highlights the potential for creating linguistically and culturally accurate Arabic LLMs but also sets a precedent for future research in enhancing the authenticity of Arabic language models.

5/6/2024