An Improved Traditional Chinese Evaluation Suite for Foundation Model

Read original: arXiv:2403.01858 - Published 7/12/2024 by Zhi-Rui Tam, Ya-Ting Pai, Yen-Wei Lee, Jun-Da Chen, Wei-Min Chu, Sega Cheng, Hong-Han Shuai

Overview

This paper introduces TMMLU+, an improved Traditional Chinese evaluation suite for foundation language models.
The suite is a multiple-choice question-answering dataset spanning 66 subjects, from elementary to professional level, built to assess how well language models understand Traditional Chinese text.
The researchers aim to provide a comprehensive benchmark to drive the development of more capable Chinese language models.

Plain English Explanation

The paper presents an enhanced set of evaluation tasks and datasets for assessing the capabilities of large language models when working with Traditional Chinese. Traditional Chinese is a writing system used in regions like Taiwan, Hong Kong, and Macau, and it differs from the Simplified Chinese used in mainland China.

The researchers developed this improved evaluation suite, called TMMLU+, to test more thoroughly how well foundation models understand Traditional Chinese text. TMMLU+ is six times larger than its predecessor, Taiwan Massive Multitask Language Understanding (TMMLU), and has a more balanced distribution of subjects.

The goal is to provide a more comprehensive way to evaluate the capabilities of large language models when it comes to Traditional Chinese, which has unique linguistic properties compared to Simplified Chinese. This should help drive the development of more advanced models that can handle Traditional Chinese effectively across a range of real-world applications.

Technical Explanation

The paper introduces TMMLU+, a new benchmark for assessing the performance of foundation language models on Traditional Chinese language understanding.

TMMLU+ includes the following key components:

Task Overview: The suite is a multiple-choice question-answering benchmark covering 66 subjects, from elementary to professional level. The questions test a model's knowledge and understanding of Traditional Chinese text rather than its generation ability.
Datasets: The dataset is six times larger than its predecessor, TMMLU, and has a more balanced subject distribution. The dataset and benchmark source code are available at huggingface.co/datasets/ikala/tmmluplus.
Evaluation Metrics: Model performance is reported as average scores on the multiple-choice questions and compared against human performance.
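
Scoring a multiple-choice benchmark usually comes down to accuracy per subject plus an average across subjects. The following is a minimal sketch of that bookkeeping, not the paper's evaluation code. The record fields ("subject", "answer"), the function name score_by_subject, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def score_by_subject(records, predictions):
    """Return (per-subject accuracy, unweighted mean over subjects).

    `records` is a list of dicts with hypothetical keys "subject" and
    "answer" (a choice letter such as "A"); `predictions` holds the
    model's chosen letter for each record, in the same order.
    """
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        raise ValueError("nothing to score")

    correct = defaultdict(int)
    total = defaultdict(int)
    for rec, pred in zip(records, predictions):
        subject = rec["subject"]
        total[subject] += 1
        # Compare choice letters case-insensitively, ignoring whitespace.
        if pred.strip().upper() == rec["answer"].strip().upper():
            correct[subject] += 1

    per_subject = {s: correct[s] / total[s] for s in total}
    # Averaging over subjects (not questions) keeps large subjects
    # from dominating the headline number.
    macro_average = sum(per_subject.values()) / len(per_subject)
    return per_subject, macro_average


# Toy data invented for illustration; not taken from the benchmark.
records = [
    {"subject": "physics", "answer": "A"},
    {"subject": "physics", "answer": "C"},
    {"subject": "history", "answer": "B"},
]
predictions = ["A", "B", "b"]

per_subject, macro = score_by_subject(records, predictions)
print(per_subject, macro)
```

Whether to average over subjects or over questions is a design choice. The sketch averages over subjects, which weights every subject equally regardless of how many questions it contains.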

Through experiments on closed-source models and 26 open-weight Chinese LLMs ranging from 1.8B to 72B parameters, the researchers show that Traditional Chinese models still trail their Simplified Chinese counterparts, and that current LLMs fall short of human performance in average scores. This highlights the need for more focused development and evaluation of language models for Traditional Chinese.
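
The paper's abstract also reports that, among the tokenization compression metrics examined, only the fertility score correlates strongly with benchmark results. Fertility is commonly defined as the average number of tokens a tokenizer produces per word. Because Chinese text has no whitespace word boundaries, the sketch below measures tokens per character instead. That choice, the two toy tokenizers, and the sample strings are assumptions for illustration, and the paper's exact definition may differ.

```python
def fertility(tokenize, texts):
    """Average number of tokens produced per input character.

    `tokenize` is any callable mapping a string to a sequence of tokens.
    Lower values mean the tokenizer packs the text into fewer tokens.
    """
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_chars = sum(len(t) for t in texts)
    if n_chars == 0:
        raise ValueError("texts contain no characters")
    return n_tokens / n_chars


def char_tokenizer(text):
    # Toy tokenizer: one token per Unicode character.
    return list(text)


def byte_tokenizer(text):
    # Toy tokenizer: one token per UTF-8 byte, standing in for a
    # vocabulary with no dedicated entries for Chinese characters.
    return list(text.encode("utf-8"))


# Sample strings chosen for illustration; every character is a CJK
# ideograph that UTF-8 encodes in three bytes.
sample = ["繁體中文", "語言模型"]

char_fertility = fertility(char_tokenizer, sample)
byte_fertility = fertility(byte_tokenizer, sample)
print(char_fertility, byte_fertility)
```

With a real model, the tokenize argument would be replaced by that model's own tokenizer, and the resulting fertility values could then be compared against benchmark scores across models.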

Critical Analysis

The TMMLU+ benchmark fills an important gap in the field of Chinese language model evaluation. By focusing specifically on Traditional Chinese, the researchers have created a test suite that is more relevant to this setting than benchmarks built primarily on Simplified Chinese.

However, TMMLU+ has limitations. The dataset is still small relative to the scale of real-world Traditional Chinese text, and, as a multiple-choice text benchmark, it does not cover multimodal tasks that combine text with other modalities like images or audio.

Further research is needed to expand the scope and scale of TMMLU+, potentially incorporating more diverse data sources and task types. Exploring the specific linguistic and cultural differences between Traditional and Simplified Chinese, and how they affect model performance, could also yield valuable insights.

Conclusion

This paper presents TMMLU+, an improved evaluation suite for assessing how well foundation language models understand Traditional Chinese. By expanding the dataset to 66 subjects with a more balanced distribution, the researchers have created a more comprehensive benchmark to drive the development of more capable Chinese language models.

The results show that Traditional Chinese models still trail their Simplified Chinese counterparts, highlighting the need for more focused development and evaluation. As the use of Traditional Chinese continues in regions like Taiwan, Hong Kong, and Macau, tools like TMMLU+ will be increasingly important for building language AI systems that can effectively serve these communities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

An Improved Traditional Chinese Evaluation Suite for Foundation Model

Zhi-Rui Tam, Ya-Ting Pai, Yen-Wei Lee, Jun-Da Chen, Wei-Min Chu, Sega Cheng, Hong-Han Shuai

We present TMMLU+, a new benchmark designed for Traditional Chinese language understanding. TMMLU+ is a multi-choice question-answering dataset with 66 subjects from elementary to professional level. It is six times larger and boasts a more balanced subject distribution than its predecessor, Taiwan Massive Multitask Language Understanding (TMMLU). We also benchmark closed-source models and 26 open-weight Chinese large language models (LLMs) of parameters ranging from 1.8B to 72B on the proposed TMMLU+. Our findings reveal that (1.) Traditional Chinese models still trail behind their Simplified Chinese counterparts, highlighting a need for more focused advancements in LLMs catering to Traditional Chinese. (2.) Current LLMs still fall short of human performance in average scores, indicating a potential need for future research to delve deeper into social science and humanities subjects. (3.) Among all the tokenization compression metrics examined, we identify that only the fertility score uniquely demonstrates strong correlations with our benchmark results. We foresee that TMMLU+ will pinpoint areas for future model improvement, thereby narrowing the gap between machine and human linguistic capabilities and supporting researchers in developing Traditional Chinese LLMs. Our dataset, along with the benchmark source code, is accessible at huggingface.co/datasets/ikala/tmmluplus.

7/12/2024

Measuring Taiwanese Mandarin Language Understanding

Po-Heng Chen, Sijia Cheng, Wei-Lin Chen, Yen-Ting Lin, Yun-Nung Chen

The evaluation of large language models (LLMs) has drawn substantial attention in the field recently. This work focuses on evaluating LLMs in a Chinese context, specifically for Traditional Chinese, which has been largely underrepresented in existing benchmarks. We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capability in LLMs, under the context of Taiwanese Mandarin. TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels. In addition, we curate chain-of-thought-like few-shot explanations for each subject to facilitate the evaluation of complex reasoning skills. To establish a comprehensive baseline, we conduct extensive experiments and analysis on 24 advanced LLMs. The results suggest that Chinese open-weight models demonstrate inferior performance compared to multilingual proprietary ones, and open-weight models tailored for Taiwanese Mandarin lag behind the Simplified-Chinese counterparts. The findings indicate considerable headroom for improvement, and emphasize the goal of TMLU to foster the development of localized Taiwanese-Mandarin LLMs. We release the benchmark and evaluation scripts for the community to promote future research.

4/1/2024

TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

Wenjing Yue, Xiaoling Wang, Wei Zhu, Ming Guan, Huanran Zheng, Pengfei Wang, Changzhi Sun, Xin Ma

Large language models (LLMs) have performed remarkably well on benchmarks for various natural language processing tasks, including in the Western medical domain. However, professional evaluation benchmarks for LLMs have yet to cover the traditional Chinese medicine (TCM) domain, which has a profound history and vast influence. To address this research gap, we introduce TCM-Bench, a comprehensive benchmark for evaluating LLM performance in TCM. It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis. It covers the core components of TCMLE, including TCM basis and clinical practice. To evaluate LLMs beyond accuracy of question answering, we propose TCMScore, a metric tailored for evaluating the quality of answers generated by LLMs for TCM-related questions. It comprehensively considers the consistency of TCM semantics and knowledge. After conducting comprehensive experimental analyses from diverse perspectives, we obtain the following findings: (1) The unsatisfactory performance of LLMs on this benchmark underscores their significant room for improvement in TCM. (2) Introducing domain knowledge can enhance LLMs' performance. However, for in-domain models like ZhongJing-TCM, the quality of generated analysis text has decreased, and we hypothesize that their fine-tuning process affects the basic LLM capabilities. (3) Traditional metrics for text generation quality like Rouge and BertScore are susceptible to text length and surface semantic ambiguity, while domain-specific metrics such as TCMScore can further supplement and explain their evaluation results. These findings highlight the capabilities and limitations of LLMs in TCM and aim to provide more profound assistance to medical research.

6/4/2024

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu, Hua Huang

Multi-modal large language models (MLLMs) have achieved remarkable progress and demonstrated powerful knowledge comprehension and reasoning abilities. However, the mastery of domain-specific knowledge, which is essential for evaluating the intelligence of MLLMs, continues to be a challenge. Current multi-modal benchmarks for domain-specific knowledge concentrate on multiple-choice questions and are predominantly available in English, which imposes limitations on the comprehensiveness of the evaluation. To this end, we introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school. The questions can be categorized into 3 types: multiple-choice, multiple-response, and fill-in-the-blank, bringing greater challenges to MLLMs. In addition, we propose an evaluation strategy called Positional Error Variance for assessing multiple-choice questions. The strategy aims to perform a quantitative analysis of position bias. We evaluate seven open-source MLLMs along with GPT4-V, Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a significant challenge to the recent MLLMs. The data and code are available at https://github.com/FlagOpen/CMMU.

5/9/2024