KMMLU: Measuring Massive Multitask Language Understanding in Korean

Read original: arXiv:2402.11548 - Published 6/7/2024 by Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, Stella Biderman

Overview

Researchers propose a new Korean language benchmark called KMMLU
KMMLU contains 35,030 expert-level multiple-choice questions across 45 subjects
Unlike prior Korean benchmarks, KMMLU is based on original Korean exams, capturing linguistic and cultural aspects
They test 27 public and proprietary large language models (LLMs) on KMMLU
The best public model scores 50.5%, suggesting significant room for improvement
Current Korean-focused LLMs like Polyglot-Ko perform poorly, and even top proprietary models like GPT-4 do not exceed 60%
This highlights the need for further work to improve LLMs for the Korean language

Plain English Explanation

Researchers have developed a new benchmark called KMMLU to evaluate how well large language models (LLMs) can understand the Korean language. Unlike previous Korean benchmarks that were translated from English, KMMLU is based on original Korean exam questions, which helps capture the unique linguistic and cultural aspects of the Korean language.

The KMMLU benchmark contains 35,030 expert-level multiple-choice questions across 45 different subjects, ranging from humanities to science and technology. The researchers tested 27 different public and proprietary LLMs on this benchmark. They found that the best-performing public model scored around 50.5%, leaving significant room for improvement.

Interestingly, even the most capable proprietary LLMs, such as GPT-4 and HyperCLOVA X, did not exceed 60% on the KMMLU benchmark. This suggests that current LLMs, including those specifically tailored for Korean like Polyglot-Ko, still struggle to fully understand the complexities of the Korean language.

The researchers believe that the KMMLU benchmark provides a valuable tool to track progress in improving LLM performance on Korean language understanding. By making the dataset publicly available on the Hugging Face Hub and integrating it into the EleutherAI Language Model Evaluation Harness, they hope to encourage further research and development in this area.

Technical Explanation

The researchers developed the KMMLU (Massive Multitask Language Understanding in Korean) benchmark, which contains 35,030 expert-level multiple-choice questions across 45 subjects. Unlike previous Korean benchmarks that were translated from existing English benchmarks such as MMLU, KMMLU is collected from original Korean exams, capturing the linguistic and cultural aspects of the Korean language.

The researchers tested 27 public and proprietary large language models (LLMs) on the KMMLU benchmark. They found that the best-performing public model scored 50.5%, suggesting significant room for improvement. This model was primarily trained for English and Chinese, not Korean.
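To put a score like 50.5% in context, it can help to compare it against random guessing. The sketch below is illustrative only. It assumes four answer options per question, which this summary does not state, and the helper name `chance_adjusted_accuracy` is hypothetical rather than anything taken from the paper.

```python
def chance_adjusted_accuracy(accuracy: float, num_options: int = 4) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0.

    num_options=4 is an assumption; the summary does not state the option count.
    """
    if num_options < 2:
        raise ValueError("need at least two answer options")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    chance = 1.0 / num_options
    return (accuracy - chance) / (1.0 - chance)


# Best public model reported in the summary: 50.5% raw accuracy.
best_public = chance_adjusted_accuracy(0.505)
print(f"{best_public:.3f}")  # prints 0.340
```

Under the four-option assumption, a raw score of 50.5% corresponds to roughly a third of the distance between guessing and a perfect score.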

Current LLMs tailored for Korean, such as Polyglot-Ko, performed far worse on the KMMLU benchmark. Surprisingly, even the most capable proprietary LLMs, like GPT-4 and HyperCLOVA X, did not exceed 60% on the benchmark.

These results highlight the need for further work to improve LLM performance on the Korean language. The researchers believe that the KMMLU benchmark provides a valuable tool to track progress in this area and have made the dataset publicly available on the Hugging Face Hub and integrated it into the EleutherAI Language Model Evaluation Harness.
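One common way to evaluate a model on multiple-choice items is to render each question as a prompt and pick the option the model rates most likely. The following is a generic sketch of that idea, not the Evaluation Harness API and not the paper's actual prompt template. The record layout, the four-letter option set, and the `toy_score` function (a stand-in for a real model's log-likelihood) are all assumptions made for illustration.

```python
from typing import Callable, Sequence

LETTERS = "ABCD"  # assumed option labels


def format_prompt(question: str, options: Sequence[str]) -> str:
    """Render one multiple-choice item as a plain-text prompt."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def pick_answer(prompt: str, num_options: int,
                score: Callable[[str, str], float]) -> str:
    """Return the option letter whose continuation the scorer rates highest."""
    candidates = LETTERS[:num_options]
    return max(candidates, key=lambda letter: score(prompt, f" {letter}"))


def toy_score(prompt: str, continuation: str) -> float:
    """Hypothetical stand-in for a model log-likelihood; always prefers 'B'."""
    return 0.0 if continuation.strip() == "B" else -1.0


# Made-up item for illustration; not taken from the dataset.
item = {"question": "2 + 3 = ?", "options": ["4", "5", "6", "7"]}
prompt = format_prompt(item["question"], item["options"])
print(prompt)
print(pick_answer(prompt, len(item["options"]), toy_score))  # prints B
```

In a real evaluation, `toy_score` would be replaced by a call that returns the model's log-probability of the continuation given the prompt.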

Critical Analysis

The researchers have acknowledged several limitations and areas for further research in their paper. One key limitation is the reliance on multiple-choice questions, which may not fully capture the depth of language understanding required in real-world applications. Additionally, the benchmark is focused solely on the Korean language, and it would be interesting to see how LLMs perform on cross-lingual or multilingual tasks involving Korean.

Another potential concern is the subjectivity inherent in the expert-level questions used in the KMMLU benchmark. While the researchers have taken steps to ensure the quality and consistency of the questions, there may be room for further refinement or validation of the benchmark.

Furthermore, the researchers have not delved deeply into the specific reasons why current LLMs struggle with the KMMLU benchmark. Investigating the linguistic and cultural factors that contribute to this performance gap could provide valuable insights for future model development.
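A simple first step toward that kind of investigation is to break accuracy down by subject and see where a model falls behind. The sketch below assumes predictions are available as (subject, predicted letter, gold letter) tuples. The subject names and records are made up for illustration and are not results from the paper, and the choice of an unweighted (macro) average is an assumption, not necessarily how the paper aggregates scores.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_subject(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Map each subject to the fraction of its questions answered correctly.

    Each record is (subject, predicted_letter, gold_letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    return {subject: correct[subject] / total[subject] for subject in total}


def macro_average(per_subject: Dict[str, float]) -> float:
    """Unweighted mean across subjects, so small subjects count as much as large ones."""
    return sum(per_subject.values()) / len(per_subject)


# Made-up records for illustration only; these are not results from the paper.
records = [
    ("korean_history", "A", "B"),
    ("korean_history", "C", "C"),
    ("math", "D", "D"),
    ("math", "B", "B"),
]
per_subject = accuracy_by_subject(records)
print(per_subject)                 # {'korean_history': 0.5, 'math': 1.0}
print(macro_average(per_subject))  # 0.75
```

Comparing per-subject scores between culturally specific subjects and more universal ones would be one way to probe whether the gap is driven by language, by local knowledge, or by both.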

Overall, the KMMLU benchmark represents an important step forward in evaluating how well LLMs handle Korean. By making the dataset publicly available and encouraging further research, the researchers have laid the groundwork for improving LLM performance in this domain.

Conclusion

KMMLU gives researchers a way to assess how well large language models handle expert-level questions written in Korean. Because its questions come from original Korean exams rather than translated content, the benchmark captures linguistic and cultural aspects of the Korean language.

The researchers' findings suggest that current LLMs, even the most capable proprietary models, still struggle to achieve high performance on the KMMLU benchmark. This highlights the need for further research and development to improve LLM performance on the Korean language.

By making the KMMLU dataset publicly available and integrating it into the EleutherAI Language Model Evaluation Harness, the researchers have created opportunities for the broader research community to contribute to advancing the state of the art in Korean language understanding. This work has the potential to lead to more accurate and culturally-aware language models, which could have significant implications for a wide range of applications involving the Korean language.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

KMMLU: Measuring Massive Multitask Language Understanding in Korean

Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, Stella Biderman

We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM. While prior Korean benchmarks are translated from existing English benchmarks, KMMLU is collected from original Korean exams, capturing linguistic and cultural aspects of the Korean language. We test 27 public and proprietary LLMs and observe the best public model to score 50.5%, leaving significant room for improvement. This model was primarily trained for English and Chinese, not Korean. Current LLMs tailored to Korean, such as Polyglot-Ko, perform far worse. Surprisingly, even the most capable proprietary LLMs, e.g., GPT-4 and HyperCLOVA X do not exceed 60%. This suggests that further work is needed to improve LLMs for Korean, and we believe KMMLU offers the appropriate tool to track this progress. We make our dataset publicly available on the Hugging Face Hub and integrate the benchmark into EleutherAI's Language Model Evaluation Harness.

6/7/2024

TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish

Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Şenel, Anna Korhonen, Hinrich Schütze

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs' understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language. We publicly release our code for the dataset and evaluation: https://github.com/ArdaYueksel/TurkishMMLU.

7/18/2024

Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark

Chanjun Park, Hyeonwoo Kim, Dahyun Kim, Seonghwan Cho, Sanghoon Kim, Sukyung Lee, Yungi Kim, Hwalsuk Lee

This paper introduces the Open Ko-LLM Leaderboard and the Ko-H5 Benchmark as vital tools for evaluating Large Language Models (LLMs) in Korean. Incorporating private test sets while mirroring the English Open LLM Leaderboard, we establish a robust evaluation framework that has been well integrated in the Korean LLM community. We perform data leakage analysis that shows the benefit of private test sets along with a correlation study within the Ko-H5 benchmark and temporal analyses of the Ko-H5 score. Moreover, we present empirical support for the need to expand beyond set benchmarks. We hope the Open Ko-LLM Leaderboard sets precedent for expanding LLM evaluation to foster more linguistic diversity.

8/20/2024

An Improved Traditional Chinese Evaluation Suite for Foundation Model

Zhi-Rui Tam, Ya-Ting Pai, Yen-Wei Lee, Jun-Da Chen, Wei-Min Chu, Sega Cheng, Hong-Han Shuai

We present TMMLU+, a new benchmark designed for Traditional Chinese language understanding. TMMLU+ is a multi-choice question-answering dataset with 66 subjects from elementary to professional level. It is six times larger and boasts a more balanced subject distribution than its predecessor, Taiwan Massive Multitask Language Understanding (TMMLU). We also benchmark closed-source models and 26 open-weight Chinese large language models (LLMs) of parameters ranging from 1.8B to 72B on the proposed TMMLU+. Our findings reveal that (1.) Traditional Chinese models still trail behind their Simplified Chinese counterparts, highlighting a need for more focused advancements in LLMs catering to Traditional Chinese. (2.) Current LLMs still fall short of human performance in average scores, indicating a potential need for future research to delve deeper into social science and humanities subjects. (3.) Among all the tokenization compression metrics examined, we identify that only the fertility score uniquely demonstrates strong correlations with our benchmark results. We foresee that TMMLU+ will pinpoint areas for future model improvement, thereby narrowing the gap between machine and human linguistic capabilities and supporting researchers in developing Traditional Chinese LLMs. Our dataset, along with the benchmark source code, is accessible at huggingface.co/datasets/ikala/tmmluplus.

7/12/2024