Measuring Taiwanese Mandarin Language Understanding

2403.20180

Published 4/1/2024 by Po-Heng Chen, Sijia Cheng, Wei-Lin Chen, Yen-Ting Lin, Yun-Nung Chen

Measuring Taiwanese Mandarin Language Understanding

Abstract

The evaluation of large language models (LLMs) has drawn substantial attention in the field recently. This work focuses on evaluating LLMs in a Chinese context, specifically, for Traditional Chinese which has been largely underrepresented in existing benchmarks. We present TMLU, a holistic evaluation suit tailored for assessing the advanced knowledge and reasoning capability in LLMs, under the context of Taiwanese Mandarin. TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels. In addition, we curate chain-of-thought-like few-shot explanations for each subject to facilitate the evaluation of complex reasoning skills. To establish a comprehensive baseline, we conduct extensive experiments and analysis on 24 advanced LLMs. The results suggest that Chinese open-weight models demonstrate inferior performance comparing to multilingual proprietary ones, and open-weight models tailored for Taiwanese Mandarin lag behind the Simplified-Chinese counterparts. The findings indicate great headrooms for improvement, and emphasize the goal of TMLU to foster the development of localized Taiwanese-Mandarin LLMs. We release the benchmark and evaluation scripts for the community to promote future research.

Create account to get full access

Introduction

The emergence of large language models (LLMs) has revolutionized natural language processing (NLP). Evaluating the performance of LLMs is crucial as they drive the development of AI. Conventional NLP benchmarks are no longer adequate as LLMs demonstrate human-level performance on these tasks. Recent focus in LLM evaluation is on assessing advanced world knowledge and complex reasoning capabilities.

Evaluation benchmarks for languages beyond English have also been introduced, parallel to the rise of multilingual LLMs and regionally-optimized LLMs. Many benchmarks have been proposed for assessing Chinese LLMs, but these have been focused on Simplified Chinese used in Mainland China.

In contrast, Traditional Chinese, including the Taiwanese Mandarin variant, has been underrepresented. Taiwanese Mandarin has linguistic, cultural, and written form differences from Chinese used in China, presenting unique challenges. The need to develop benchmarks specifically for Taiwanese Mandarin LLMs is highlighted.

This work presents TMLU, a comprehensive evaluation suite for assessing advanced knowledge and reasoning in Taiwanese Mandarin LLMs. TMLU covers a broad spectrum of subjects and includes manually curated Chain-of-Thought-inspired explanations. An evaluation of 24 advanced LLMs on TMLU establishes a baseline, indicating that proprietary multilingual models outperform open-source models developed primarily for Simplified Chinese. The findings suggest ample room for improvement in Taiwanese Mandarin LLMs.

Related Work

This text discusses the limitations of existing language model benchmarks, such as GLUE and SuperGLUE, in evaluating the advanced knowledge and reasoning capabilities of large language models (LLMs). It highlights the introduction of new benchmarks, such as MMLU, BIG-bench, C-Eval, and CMMLU, which aim to assess LLMs' abilities in a more comprehensive manner, including subjects across STEM and social sciences.

However, the text notes that these benchmarks are primarily constructed in English, limiting their usefulness in evaluating LLMs developed in the context of other languages, such as Chinese. To address this, the text introduces two Chinese benchmarks: TC-Eval and TMMLU-plus.

The key differences between the proposed TMLU benchmark and the existing TC-Eval and TMMLU-plus are:

TMLU questions are originally in Traditional Chinese, whereas TC-Eval includes translated datasets that may have Western biases.
TMLU minimizes the risk of test data contamination by collecting data from PDFs and Word documents, unlike TMMLU-plus which sources from a single website.
TMLU provides manual-constructed few-shot demonstrations to elicit reasoning and standardized mathematical expressions in LaTeX format, features not available in the other benchmarks.

In conclusion, the text argues that TMLU is better suited for objectively evaluating LLM capabilities due to its improved transparency, localization, and robustness compared to the existing Chinese benchmarks.

TMLU Benchmark

The TMLU benchmark is designed to evaluate the knowledge and reasoning capabilities of large language models in the context of Taiwanese Mandarin. It contains a wide range of multiple-choice questions spanning various disciplines and difficulty levels, from middle school to professional. The goal is to provide an accessible and easy-to-use evaluation suite for developers to gauge their models' performance in real-world Taiwanese Mandarin scenarios.

The benchmark data is sourced from reputable websites in Taiwan, such as those hosting standardized exams for Taiwanese students and professional/technical exams. To mitigate the risk of data contamination, the data is collected in the format of PDF or Microsoft Word documents instead of readily available website text. The questions are manually processed and validated for quality assurance.

The resulting TMLU benchmark consists of 2,981 multiple-choice questions across 37 subjects, categorized into five disciplines: social science, STEM, humanities, Taiwan-specific, and others. The questions are further divided into development and test sets. To facilitate the evaluation of reasoning capabilities, the benchmark also includes manually curated Chain-of-Thought-like explanations for the development set instances, sourced from various resources such as textbook websites and question-and-answer forums.

Experiments

This paper conducted a comprehensive evaluation of 24 different large language models (LLMs) on the Taiwan Mandarin Language Understanding (TMLU) benchmark. The models varied in size and were developed by different organizations, including both closed-source proprietary models accessed via APIs and open-source models. The evaluation used a few-shot setup, with five examples provided for each task, and compared performance on direct answer and Chain-of-Thought (CoT) prompting.

The results show that proprietary models generally outperformed open-source models, and models trained on Chinese data (Simplified Chinese, Taiwanese Mandarin, and multilingual) exhibited substantial improvements over models focused on English. For direct answer prompting, the Breeze model tailored for Taiwanese Mandarin outperformed the larger GPT-3.5 model. The top-performing open-source model, Yi, was competitive with proprietary models like Gemini-Pro and Claude-Instant.

In the CoT prompting, proprietary models outperformed the best open-source model by a significant margin of 7% to 27%, suggesting an opportunity to enhance the reasoning capabilities of open-source models. The authors conclude that TMLU provides a holistic benchmark for evaluating LLMs in the context of Taiwanese Mandarin, and the current performance of Taiwanese Mandarin LLMs still lags behind their Simplified Chinese counterparts, highlighting the need to foster the development of localized Taiwanese Mandarin LLMs.

Analysis

The text discusses analyzing the robustness of test data for large language models (LLMs). The researchers employ a reference-free method called Min-k% Prob to detect whether test data was present in the models' pre-training data. This method looks at the minimum probability of tokens in the input text as an indicator of whether the text was seen during pre-training.

The results show that the TMLU dataset has higher Min-k% Prob values than the TMMLU-plus dataset across various models, suggesting TMLU is less likely to contain data that was in the pre-training data. This aligns with the authors' hypothesis that sourcing data from downloaded documents, as done for TMLU, can reduce the risk of data contamination compared to web crawling, as was done for TMMLU-plus.

The text also discusses comparing direct answer prompting to chain-of-thought (CoT) prompting for evaluating LLM capabilities. The results indicate that larger, proprietary models like GPT-4 and Claude-2 benefit more from CoT prompting, perhaps due to the emergent nature of reasoning capabilities in LLMs. Interestingly, the Taiwan-LLM-7B-Chat model also showed improvement with CoT despite its smaller size.

Finally, the text examines model performance across questions from different years. Most models did not exhibit clear temporal trends, but GPT-4-turbo and Yi-34B-Chat showed improving performance on more recent questions.

Conclusion

This paper introduces TMLU, an evaluation suite designed to assess the advanced knowledge and reasoning abilities of large language models (LLMs) in the context of Taiwanese Mandarin. The experimental results and analysis suggest great opportunities for future developers, particularly for open-source models. The authors aim to address the scarcity of benchmarks for the Traditional Chinese community and envision TMLU to serve as a grounded testbed, fostering the development of localized Taiwanese-Mandarin LLMs.

Appendix A Limitations and Future Work

The text acknowledges that the TMLU benchmark, which evaluates language models' understanding capabilities in Taiwanese Mandarin, has limitations in generalizing to language generation tasks. To investigate this, the authors conduct a preliminary experiment calculating the perplexity (PPL) of generated answers for several models. The models trained exclusively on Taiwanese Mandarin, such as Taiwan-LLM, have lower PPL values, suggesting more consistent performance in generating fluent Taiwanese Mandarin text compared to multilingual models. The authors conclude that while TMLU provides valuable insights into language models' understanding capabilities, it should not be the sole indicator of their performance in generation tasks. Future research should focus on developing generation-oriented benchmarks and exploring the relationship between understanding and generation abilities to gain a more comprehensive view of language models in the context of Taiwanese Mandarin.

Appendix B Implementation Details of Data Contamination Analysis

The researchers constructed the input example x_x_x_italic_x by combining the question and corresponding choices for each sampled instance from the TMLU and TMMLU-plus datasets. This mirrors the actual scenario where the model would be queried.

To account for differences in input example length when computing the Min-k% Prob metric, the researchers filtered out the longest examples within each subject subset of the dataset. This was done because the average text length of TMLU is significantly longer than TMMLU-plus. This step ensures a fair and comparable analysis.

(a)

Figure 7: An example prompt for few-shot CoT evaluation on TMLU.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

Wenjing Yue, Xiaoling Wang, Wei Zhu, Ming Guan, Huanran Zheng, Pengfei Wang, Changzhi Sun, Xin Ma

Large language models (LLMs) have performed remarkably well in various natural language processing tasks by benchmarking, including in the Western medical domain. However, the professional evaluation benchmarks for LLMs have yet to be covered in the traditional Chinese medicine(TCM) domain, which has a profound history and vast influence. To address this research gap, we introduce TCM-Bench, an comprehensive benchmark for evaluating LLM performance in TCM. It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis. It covers the core components of TCMLE, including TCM basis and clinical practice. To evaluate LLMs beyond accuracy of question answering, we propose TCMScore, a metric tailored for evaluating the quality of answers generated by LLMs for TCM related questions. It comprehensively considers the consistency of TCM semantics and knowledge. After conducting comprehensive experimental analyses from diverse perspectives, we can obtain the following findings: (1) The unsatisfactory performance of LLMs on this benchmark underscores their significant room for improvement in TCM. (2) Introducing domain knowledge can enhance LLMs' performance. However, for in-domain models like ZhongJing-TCM, the quality of generated analysis text has decreased, and we hypothesize that their fine-tuning process affects the basic LLM capabilities. (3) Traditional metrics for text generation quality like Rouge and BertScore are susceptible to text length and surface semantic ambiguity, while domain-specific metrics such as TCMScore can further supplement and explain their evaluation results. These findings highlight the capabilities and limitations of LLMs in the TCM and aim to provide a more profound assistance to medical research.

6/4/2024

cs.CL cs.AI

Are LLMs Effective Backbones for Fine-tuning? An Experimental Investigation of Supervised LLMs on Chinese Short Text Matching

Shulin Liu, Chengcheng Xu, Hao Liu, Tinghao Yu, Tao Yang

The recent success of Large Language Models (LLMs) has garnered significant attention in both academia and industry. Prior research on LLMs has primarily focused on enhancing or leveraging their generalization capabilities in zero- and few-shot settings. However, there has been limited investigation into effectively fine-tuning LLMs for a specific natural language understanding task in supervised settings. In this study, we conduct an experimental analysis by fine-tuning LLMs for the task of Chinese short text matching. We explore various factors that influence performance when fine-tuning LLMs, including task modeling methods, prompt formats, and output formats.

4/1/2024

cs.CL

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Binhang Yuan, Wenhu Chen, Jie Fu, Ge Zhang

In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.

4/10/2024

cs.CL cs.AI

Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset

Jie Zhu, Junhui Li, Yalong Wen, Lifan Guo

In light of recent breakthroughs in large language models (LLMs) that have revolutionized natural language processing (NLP), there is an urgent need for new benchmarks to keep pace with the fast development of LLMs. In this paper, we propose CFLUE, the Chinese Financial Language Understanding Evaluation benchmark, designed to assess the capability of LLMs across various dimensions. Specifically, CFLUE provides datasets tailored for both knowledge assessment and application assessment. In knowledge assessment, it consists of 38K+ multiple-choice questions with associated solution explanations. These questions serve dual purposes: answer prediction and question reasoning. In application assessment, CFLUE features 16K+ test instances across distinct groups of NLP tasks such as text classification, machine translation, relation extraction, reading comprehension, and text generation. Upon CFLUE, we conduct a thorough evaluation of representative LLMs. The results reveal that only GPT-4 and GPT-4-turbo achieve an accuracy exceeding 60% in answer prediction for knowledge assessment, suggesting that there is still substantial room for improvement in current LLMs. In application assessment, although GPT-4 and GPT-4-turbo are the top two performers, their considerable advantage over lightweight LLMs is noticeably diminished. The datasets and scripts associated with CFLUE are openly accessible at https://github.com/aliyun/cflue.

5/20/2024

cs.CL cs.AI