TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish

Read original: arXiv:2407.12402 - Published 7/18/2024 by Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Şenel, Anna Korhonen, Hinrich Schütze

Overview

This paper introduces TurkishMMLU, the first multitask, multiple-choice question-answering benchmark for evaluating how well large language models understand Turkish.
The benchmark contains over 10,000 questions covering 9 subjects from Turkish high-school curricula, ranging from natural sciences and math to Turkish Literature and the history of the Turkish Republic.
The authors evaluate over 20 large language models on TurkishMMLU and analyze their Turkish capabilities and limitations.

Plain English Explanation

The researchers have created a new test of how well large language models understand Turkish. The test, called TurkishMMLU, consists of over 10,000 multiple-choice questions drawn from 9 subjects in the Turkish high-school curriculum, from natural sciences and math to Turkish Literature and the history of the Turkish Republic. Because the questions were written by curriculum experts rather than machine-translated from English, they sidestep the translation errors and cultural bias that automatically translated benchmarks can introduce.

The key idea is to test many school subjects at once, rather than one or two narrow tasks, which gives a more complete picture of a model's capabilities in Turkish. The researchers evaluated over 20 language models, including open-source multilingual, closed-source, and Turkish-adapted models, and reported where these models succeed and where they fall short.

Technical Explanation

The authors introduce TurkishMMLU, a multitask, multiple-choice question-answering benchmark for Turkish modeled on Massive Multitask Language Understanding (MMLU). It includes over 10,000 questions written by curriculum experts, covering 9 subjects from Turkish high-school curricula, from natural sciences and math to culturally representative topics such as Turkish Literature and the history of the Turkish Republic.

The authors evaluate over 20 large language models, including multilingual open-source models (e.g., Gemma, Llama, MT5), closed-source models (GPT 4o, Claude, Gemini), and Turkish-adapted models (e.g., Trendyol). The evaluation covers zero-shot and few-shot settings, chain-of-thought reasoning, and an analysis of question difficulty alongside model performance.
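The paper's abstract describes zero-shot and few-shot evaluation on multiple-choice questions. The following is a minimal sketch of how such an evaluation might be set up: assembling a k-shot prompt, extracting the predicted option letter, and scoring accuracy per subject. The field names (`question`, `choices`, `answer`, `subject`), the Turkish prompt template, the five-option assumption, and the toy questions are all illustrative assumptions, not the format of the authors' released code or data.

```python
import re
from collections import defaultdict

LETTERS = "ABCDE"  # assumption: up to five options per question


def format_item(item, with_answer):
    """Render one multiple-choice item as prompt text."""
    lines = [f"Soru: {item['question']}"]
    for letter, choice in zip(LETTERS, item["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append(f"Cevap: {item['answer']}" if with_answer else "Cevap:")
    return "\n".join(lines)


def build_prompt(shots, target):
    """Concatenate k solved examples, then the unsolved target item.

    Passing an empty list of shots yields a zero-shot prompt.
    """
    blocks = [format_item(s, with_answer=True) for s in shots]
    blocks.append(format_item(target, with_answer=False))
    return "\n\n".join(blocks)


def extract_letter(completion):
    """Return a leading option letter (A-E) from model output, else None.

    The letter must not be followed by another word character, so an
    answer that merely starts with "A" (e.g. "Ankara") is not counted.
    """
    match = re.match(r"\s*([A-Ea-e])(?!\w)", completion)
    return match.group(1).upper() if match else None


def accuracy_by_subject(items, predictions):
    """Fraction of correct predictions, grouped by the item's subject."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item["subject"]] += 1
        correct[item["subject"]] += int(pred == item["answer"])
    return {subject: correct[subject] / total[subject] for subject in total}


# Toy items invented for illustration only; they are not from TurkishMMLU.
SHOTS = [
    {
        "question": "Türkiye'nin başkenti neresidir?",
        "choices": ["İstanbul", "Ankara", "İzmir", "Bursa", "Adana"],
        "answer": "B",
        "subject": "Geography",
    }
]
TARGET = {
    "question": "2 + 3 kaçtır?",
    "choices": ["4", "5", "6", "7", "8"],
    "answer": "B",
    "subject": "Mathematics",
}
```

Calling `build_prompt(SHOTS, TARGET)` produces a one-shot prompt ending in an open `Cevap:` line for the model to complete. Matching only a leading option letter is a deliberately strict design choice; a real harness might instead compare the model's log-likelihood for each option.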

Key findings from the paper include:

Multilingual models generally outperform Turkish-adapted models on the TurkishMMLU benchmark, demonstrating the benefits of cross-lingual transfer learning.
There is still significant room for improvement in Turkish language understanding, with even the best-performing models struggling on certain tasks.
The authors identify specific areas, such as logical reasoning and multi-hop question answering, where Turkish language models lag behind their performance on other languages.

Critical Analysis

The authors of this paper have made a valuable contribution by creating the TurkishMMLU benchmark, which provides a broad assessment of language understanding in Turkish. This is an important step forward, because existing multilingual benchmarks often rely on automatic translation, which is error-prone and can introduce culturally biased questions, especially in the social sciences.

However, the authors acknowledge several limitations of their work. For example, the dataset sizes for some tasks are relatively small, which could impact the reliability of the results. Additionally, the authors note that the benchmark does not cover certain aspects of language understanding, such as multi-modal or grounded language tasks.
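The concern about small sample sizes can be made concrete with a confidence interval on accuracy. The sketch below uses the normal-approximation (Wald) interval for a proportion. The question counts and the 70% accuracy are hypothetical numbers chosen for illustration, not figures from the paper.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation (Wald) confidence interval for an accuracy.

    z=1.96 corresponds to a 95% interval. The result is clipped to [0, 1].
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical: the same 70% accuracy measured on 100 vs. 1,000 questions.
small = accuracy_interval(70, 100)     # about (0.610, 0.790)
large = accuracy_interval(700, 1000)   # about (0.672, 0.728)
```

With 100 questions the interval spans roughly 9 percentage points on either side of the estimate, against roughly 3 points with 1,000 questions, so differences of a few points between models on a small subject may not be meaningful.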

Another potential concern is the reliance on large language models, which can be opaque "black boxes" whose inner workings are not well understood. It would be valuable to see further research exploring the specific strengths, weaknesses, and biases of these models in the context of Turkish language understanding.

Overall, the TurkishMMLU benchmark represents an important step forward, but there is still significant room for improvement and further exploration in this area. Researchers and practitioners should continue to push the boundaries of Turkish language understanding, both by developing more robust benchmarks and by pursuing innovative approaches to model design and training.

Conclusion

The TurkishMMLU benchmark introduced in this paper provides a broad assessment of language understanding capabilities in Turkish. By evaluating over 20 large language models on questions from 9 high-school subjects, the authors have shed light on the current state of Turkish language understanding technology.

The results suggest that while multilingual models show promise, there is still significant room for improvement in this area. The authors have identified specific challenges, such as logical reasoning and multi-hop question answering, where Turkish language models lag behind their performance on other languages.

This research highlights the importance of developing robust benchmarks and pushing the boundaries of language understanding, not just for high-resource languages, but for underrepresented languages like Turkish as well. As the field of natural language processing continues to evolve, benchmarks like TurkishMMLU will be crucial for driving progress and ensuring that advancements in language technology are inclusive and beneficial for all.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish

Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Şenel, Anna Korhonen, Hinrich Schütze

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs' understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language. We publicly release our code for the dataset and evaluation: https://github.com/ArdaYueksel/TurkishMMLU.

7/18/2024

KMMLU: Measuring Massive Multitask Language Understanding in Korean

Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, Stella Biderman

We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM. While prior Korean benchmarks are translated from existing English benchmarks, KMMLU is collected from original Korean exams, capturing linguistic and cultural aspects of the Korean language. We test 27 public and proprietary LLMs and observe the best public model to score 50.5%, leaving significant room for improvement. This model was primarily trained for English and Chinese, not Korean. Current LLMs tailored to Korean, such as Polyglot-Ko, perform far worse. Surprisingly, even the most capable proprietary LLMs, e.g., GPT-4 and HyperCLOVA X do not exceed 60%. This suggests that further work is needed to improve LLMs for Korean, and we believe KMMLU offers the appropriate tool to track this progress. We make our dataset publicly available on the Hugging Face Hub and integrate the benchmark into EleutherAI's Language Model Evaluation Harness.

6/7/2024

Spanish and LLM Benchmarks: is MMLU Lost in Translation?

Irene Plaza, Nina Melero, Cristina del Pozo, Javier Conde, Pedro Reviriego, Marina Mayor-Rocher, María Grandury

The evaluation of Large Language Models (LLMs) is a key element in their continuous improvement process and many benchmarks have been developed to assess the performance of LLMs in different tasks and topics. As LLMs become adopted worldwide, evaluating them in languages other than English is increasingly important. However, most LLM benchmarks are simply translated using an automated tool and then run in the target language. This means that the results depend not only on the LLM performance in that language but also on the quality of the translation. In this paper, we consider the case of the well-known Massive Multitask Language Understanding (MMLU) benchmark. Selected categories of the benchmark are translated into Spanish using Azure Translator and ChatGPT4 and run on ChatGPT4. Next, the results are processed to identify the test items that produce different answers in Spanish and English. Those are then analyzed manually to understand if the automatic translation caused the change. The results show that a significant fraction of the failing items can be attributed to mistakes in the translation of the benchmark. These results make a strong case for improving benchmarks in languages other than English by at least revising the translations of the items and preferably by adapting the tests to the target language by experts.

6/27/2024

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

Fajri Koto, Haonan Li, Sara Shatnawi, Jad Doughman, Abdelrahman Boda Sadallah, Aisha Alraeesi, Khalid Almubarak, Zaid Alyafeai, Neha Sengupta, Shady Shehata, Nizar Habash, Preslav Nakov, Timothy Baldwin

The focus of language model evaluation has transitioned towards reasoning and knowledge-intensive tasks, driven by advancements in pretraining large models. While state-of-the-art models are partially trained on large Arabic texts, evaluating their performance in Arabic remains challenging due to the limited availability of relevant datasets. To bridge this gap, we present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region. Our comprehensive evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models. Notably, BLOOMZ, mT0, LLaMA2, and Falcon struggle to achieve a score of 50%, while even the top-performing Arabic-centric model only achieves a score of 62.3%.

7/31/2024