Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?

Read original: arXiv:2404.06644 - Published 4/11/2024 by Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dastgheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Overview

This paper presents the Khayyam Challenge (also known as PersianMMLU), a benchmark for evaluating large language models (LLMs) on the Persian language.
The goal is to assess how well current LLMs understand Persian, using original rather than translated material so that cultural nuances are preserved.
The benchmark is a multi-task dataset of 20,192 four-choice questions drawn from 38 tasks in Persian examinations, covering subjects such as literary comprehension, mathematics, sciences, logic, and intelligence testing.

Plain English Explanation

The paper asks whether current large language models (LLMs) genuinely understand Persian. Persian has a rich cultural and linguistic heritage, and LLMs trained primarily on other languages may handle it poorly. Evaluation of non-English models also lags behind English, so such weaknesses can go unmeasured.

To measure this, the authors built the Khayyam Challenge: a collection of 20,192 four-choice questions taken from Persian examinations and organized into 38 tasks. The questions span literary comprehension, mathematics, sciences, logic, and intelligence testing, at levels from lower primary school to upper secondary school. Because the questions were written in Persian rather than translated, they avoid translation errors and retain cultural nuances.

By evaluating LLMs on this diverse set of tasks, the researchers aim to identify where current models succeed and where they fall short in Persian. The work matters for building LLMs that serve more of the world's languages, not only English and a few other major ones.

Technical Explanation

The paper introduces the Khayyam Challenge, a benchmark for evaluating large language models (LLMs) on Persian. The authors argue that evaluation of non-English LLMs lags behind English, which leaves LLMs for many languages absent or weak.

The Khayyam Challenge comprises 20,192 four-choice questions from 38 tasks extracted from Persian examinations. According to the abstract, its distinctive features are broad topic coverage across educational stages from lower primary to upper secondary school, rich metadata (human response rates, difficulty levels, and descriptive answers), new data chosen to avoid contamination, original non-translated content that preserves cultural nuances, and scalability to future data updates without special human effort.

The authors evaluate a wide range of existing LLMs that support Persian and provide statistical analyses and interpretations of their outputs across the different tasks. The results suggest that while the models show some capability with Persian content, they still struggle in several areas, such as culturally grounded questions and subject-specific problem solving.
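
Since the abstract describes the benchmark as four-choice questions grouped into tasks, a per-task accuracy score is a natural way to read such an evaluation. The following is a minimal sketch under that assumption, not the authors' evaluation code: the `Question` class, the `accuracy_by_task` function, the task names, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one four-choice question (not the paper's schema)."""
    task: str       # illustrative task label, e.g. "mathematics"
    choices: tuple  # the four answer options
    answer: int     # index (0-3) of the correct option


def accuracy_by_task(questions, predictions):
    """Return {task: fraction of that task's questions answered correctly}."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    correct = defaultdict(int)
    total = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        total[question.task] += 1
        correct[question.task] += int(predicted == question.answer)
    return {task: correct[task] / total[task] for task in total}


# With four options, uniform random guessing is right 25% of the time.
CHANCE = 0.25

# Toy data, invented for this example.
questions = [
    Question("mathematics", ("2", "3", "4", "5"), 2),
    Question("mathematics", ("10", "12", "14", "16"), 1),
    Question("literature", ("A", "B", "C", "D"), 0),
    Question("literature", ("A", "B", "C", "D"), 3),
]
predictions = [2, 0, 0, 3]  # option index chosen by some model

scores = accuracy_by_task(questions, predictions)
for task, accuracy in sorted(scores.items()):
    print(f"{task}: {accuracy:.2f} (chance = {CHANCE:.2f})")
```

Reporting each task next to the 25% chance level makes it easier to tell a model that has learned something about a subject from one that is effectively guessing.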

Critical Analysis

The Khayyam Challenge is a valuable contribution to the field of multilingual natural language processing, as it highlights the need to go beyond the predominant focus on a few major languages and develop LLMs that can effectively process a wider range of the world's languages.

One potential limitation of the study is the scope of the dataset. Its questions come from school examinations in a four-choice format, so the benchmark may not capture the full breadth and complexity of Persian language use. Expanding the dataset with more diverse content and task types could provide a more comprehensive evaluation of LLM capabilities.
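
One way to see why dataset size matters is to estimate how precisely an accuracy score can be measured on a given number of questions. The sketch below uses the standard normal approximation to the binomial and the totals quoted in the paper's abstract (20,192 questions, 38 tasks). The equal split across tasks is an assumption made only for illustration, since the actual per-task sizes are not given here, and `margin_95` is a hypothetical helper name.

```python
import math


def margin_95(accuracy: float, n_questions: int) -> float:
    """Approximate 95% margin of error for an accuracy measured on n questions.

    Normal approximation to the binomial: 1.96 * sqrt(p * (1 - p) / n).
    """
    return 1.96 * math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


TOTAL_QUESTIONS = 20_192  # from the paper's abstract
NUM_TASKS = 38            # from the paper's abstract

# Assumption for illustration only: every task has the same number of questions.
avg_per_task = TOTAL_QUESTIONS // NUM_TASKS

# p = 0.5 is the worst case, where the margin is widest.
whole = margin_95(0.5, TOTAL_QUESTIONS)
per_task = margin_95(0.5, avg_per_task)

print(f"whole benchmark ({TOTAL_QUESTIONS} questions): +/- {whole:.4f}")
print(f"average task ({avg_per_task} questions): +/- {per_task:.4f}")
```

Under these assumptions the overall score is pinned down to within about one percentage point, while a single average-sized task carries a margin of roughly four points, so small per-task differences between models should be read with caution.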

Additionally, the paper does not delve into the specific architectural choices or training approaches that may contribute to the observed performance limitations of LLMs on the Persian language tasks. Further research could investigate how LLM design and training strategies can be improved to better handle languages like Persian, which have unique grammatical structures, vocabulary, and cultural nuances.

Conclusion

The Khayyam Challenge presented in this paper is an important step towards understanding the limitations of current large language models when it comes to processing and understanding the Persian language. The multi-task dataset and the evaluation results provide valuable insights into the strengths and weaknesses of LLMs in this domain.

By showing that natural language processing research needs to look beyond English and a few other major languages, this work underscores the importance of developing LLMs that handle a wider range of the world's languages well. This could have significant implications for applications such as machine translation, conversational AI, and cross-cultural communication, ultimately contributing to more inclusive and accessible language technologies.

Related Papers

Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?

Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dastgheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Evaluating Large Language Models (LLMs) is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, resulting in the absence or weakness of LLMs for many languages. In response to this necessity, we introduce Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages. The primary objective of the Khayyam Challenge is to facilitate the rigorous evaluation of LLMs that support the Persian language. Distinctive features of the Khayyam Challenge are (i) its comprehensive coverage of various topics, including literary comprehension, mathematics, sciences, logic, intelligence testing, etc., aimed at assessing different facets of LLMs such as language comprehension, reasoning, and information retrieval across various educational stages, from lower primary school to upper secondary school (ii) its inclusion of rich metadata such as human response rates, difficulty levels, and descriptive answers (iii) its utilization of new data to avoid data contamination issues prevalent in existing frameworks (iv) its use of original, non-translated data tailored for Persian speakers, ensuring the framework is free from translation challenges and errors while encompassing cultural nuances (v) its inherent scalability for future data updates and evaluations without requiring special human effort. Previous works lacked an evaluation framework that combined all of these features into a single comprehensive benchmark. Furthermore, we evaluate a wide range of existing LLMs that support the Persian language, with statistical analyses and interpretations of their outputs.

4/11/2024

TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish

Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Şenel, Anna Korhonen, Hinrich Schütze

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs' understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language. We publicly release our code for the dataset and evaluation: https://github.com/ArdaYueksel/TurkishMMLU.

7/18/2024

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, Jimmy Huang

Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.

7/8/2024

PARIKSHA : A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data

Ishaan Watts, Varun Gumma, Aditya Yadavalli, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram

Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors -- the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings - pairwise comparison and direct assessment and analyse the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting but the agreement drops for direct assessment evaluation especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.

6/24/2024