CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models

Read original: arXiv:2405.12174 - Published 5/21/2024 by Haoxiang Shi, Jiaan Wang, Jiarong Xu, Cen Wang, Tetsuya Sakai

Overview

• This paper introduces CT-Eval, a benchmark for evaluating the text-to-table generation capabilities of large language models (LLMs) in Chinese.

• The authors build a dataset of 88.6K task samples drawn from a popular Chinese multidisciplinary online encyclopedia, covering 28 domains, and use it to assess how well open-source and closed-source LLMs generate tables from text.

• The results show that zero-shot LLMs, including GPT-4, still fall well short of human judgment, while fine-tuned open-source LLMs outperform GPT-4 by a large margin.

Plain English Explanation

Large language models (LLMs) such as ChatGPT and GPT-4 have become highly capable general task solvers, including in multilingual settings. However, existing text-to-table datasets are mostly English-oriented, which limits research on this task in other languages such as Chinese.

This paper aims to address this gap by introducing a new benchmark called CT-Eval (Chinese Text-to-Table Evaluation) for assessing how well LLMs can generate table content from Chinese text descriptions. The authors created a dataset of Chinese tables and their corresponding text descriptions, and used this to test the performance of several prominent LLMs.
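To make the task concrete, the sketch below shows what a text-to-table sample might look like and how a model's pipe-delimited output could be parsed into rows. The sample sentence, the table, and the `parse_pipe_table` helper are made up for illustration; they are not taken from CT-Eval, and the paper may use a different table format.

```python
def parse_pipe_table(raw: str) -> list[list[str]]:
    """Parse a pipe-delimited table (one row per line) into a list of rows."""
    rows = []
    for line in raw.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip markdown separator rows such as | --- | --- |
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    return rows


# Hypothetical input document and model output (not from the dataset).
sample_text = "李白（701年－762年），字太白，唐代诗人。"
model_output = """
| 姓名 | 朝代 | 职业 |
| --- | --- | --- |
| 李白 | 唐代 | 诗人 |
"""

table = parse_pipe_table(model_output)
header, body = table[0], table[1:]
print(header, body)
```

Parsing the output into rows first makes it possible to compare a generated table with a reference table cell by cell instead of as raw strings.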

The key finding is that zero-shot LLMs, including GPT-4, still fall well short of human judgment on Chinese text-to-table generation. After fine-tuning, however, open-source LLMs improve significantly and outperform GPT-4 by a large margin, which points to the value of task-specific training data.

Overall, this research provides valuable insights into the state of Chinese language understanding in LLMs, and lays the groundwork for future advancements in this important area of natural language processing.

Technical Explanation

The paper first reviews relevant prior work on text-to-table generation, benchmarks for Chinese language understanding, and techniques for multilingual model evaluation. This includes discussions of benchmarks like METAL and HELM, as well as Chinese-centric efforts such as the Chinese Tiny LLM model and the CFLUE benchmark.

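The authors then describe the CT-Eval benchmark in detail. A preliminary analysis of English text-to-table datasets points to two key factors for dataset construction: data diversity and data hallucination. For diversity, CT-Eval draws on a popular Chinese multidisciplinary online encyclopedia and covers 28 domains. To minimize hallucination, the authors first train an LLM to judge and filter out samples with hallucination, then employ human annotators to clean the validation and testing sets. The resulting dataset contains 88.6K task samples and is used to evaluate open-source and closed-source LLMs, including GPT-4, on generating tables from Chinese text.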
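The paper's abstract describes a two-stage cleaning process: a trained LLM judges and filters out hallucinated samples, then human annotators clean the validation and testing sets. A minimal sketch of that flow follows. The `Sample` class, the `filter_samples` function, and the substring-based `naive_hallucination_judge` are all hypothetical stand-ins; in particular, the authors use a trained LLM as the judge, not a substring check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    text: str                # source document
    table: list[list[str]]   # header row followed by body rows
    split: str               # "train", "validation", or "test"


def naive_hallucination_judge(sample: Sample) -> bool:
    """Stand-in judge: flag a sample if any body cell does not appear in the text.

    Assumption for illustration only; the paper trains an LLM for this step.
    """
    return any(cell not in sample.text for row in sample.table[1:] for cell in row)


def filter_samples(
    samples: list[Sample], judge: Callable[[Sample], bool]
) -> tuple[list[Sample], list[Sample]]:
    """Drop samples the judge flags, then queue validation/test samples for human cleaning."""
    kept = [s for s in samples if not judge(s)]
    needs_human_review = [s for s in kept if s.split in {"validation", "test"}]
    return kept, needs_human_review


# Made-up samples, not taken from CT-Eval.
samples = [
    Sample("李白是唐代诗人。", [["姓名", "朝代"], ["李白", "唐代"]], "train"),
    Sample("李白是唐代诗人。", [["姓名", "朝代"], ["李白", "宋代"]], "train"),
    Sample("杜甫是唐代诗人。", [["姓名", "朝代"], ["杜甫", "唐代"]], "test"),
]
kept, needs_human_review = filter_samples(samples, naive_hallucination_judge)
print(len(kept), len(needs_human_review))
```

Passing the judge in as a callable keeps the filtering logic independent of how the judgment is made, so the stand-in could be swapped for a model-backed judge without changing `filter_samples`.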

The key metrics used in the evaluation were table generation accuracy, table generation faithfulness, and overall table generation quality. The results show that zero-shot LLMs, including GPT-4, still have a significant performance gap compared with human judgment. After fine-tuning, open-source LLMs improve their text-to-table ability significantly and outperform GPT-4 by a large margin.
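This summary does not define how the metrics are computed. As one plausible illustration of a table accuracy score, the sketch below computes a cell-level F1 between a predicted table and a reference table by exact match on (column header, value) pairs. This is an assumption made for illustration, not the paper's stated metric, and `table_to_cells` and `cell_f1` are hypothetical names.

```python
from collections import Counter


def table_to_cells(table: list[list[str]]) -> Counter:
    """Flatten a table (header row + body rows) into a multiset of (header, value) pairs."""
    if not table:
        return Counter()
    header, *body = table
    return Counter((h, v) for row in body for h, v in zip(header, row))


def cell_f1(predicted: list[list[str]], reference: list[list[str]]) -> float:
    """Exact-match F1 over (header, value) cells; returns 0.0 when nothing overlaps."""
    pred, ref = table_to_cells(predicted), table_to_cells(reference)
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up tables: the prediction gets one of two cells right.
reference = [["姓名", "朝代"], ["李白", "唐代"]]
predicted = [["姓名", "朝代"], ["李白", "宋代"]]
score = cell_f1(predicted, reference)
print(score)
```

Exact matching is strict: a cell that paraphrases the reference counts as wrong, so a softer string similarity could be substituted inside `cell_f1` if partial credit is wanted.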

The paper also includes an analysis of the types of errors made by the models, as well as the relationship between model size/training data and performance. The authors conclude by discussing the implications of their findings and outlining potential directions for future research.

Critical Analysis

The CT-Eval benchmark and the authors' analysis of Chinese text-to-table generation in LLMs represent a valuable contribution to the field of natural language processing. By focusing on the unique challenges of the Chinese language, the research helps to shed light on the capabilities and limitations of current LLMs, and provides a robust framework for future evaluation and improvement.

That said, the work has caveats and limitations. The dataset, while large, is drawn from a single online encyclopedia, so it may not fully represent the diversity and complexity of real-world Chinese text-to-table scenarios. Additionally, the evaluation metrics, while well-designed, may not capture all the nuances of table generation quality.

Furthermore, human annotators cleaned hallucinations only in the validation and testing sets. The remaining samples were filtered by a trained LLM judge alone, so some hallucinated samples may remain in the training data. It would also be helpful to see a direct, like-for-like comparison with English text-to-table benchmarks to better understand how the task differs across languages.

Overall, the CT-Eval benchmark and the authors' findings represent an important step forward in understanding the state of Chinese language understanding in LLMs. However, there is still much work to be done to fully realize the potential of these models for real-world Chinese language applications.

Conclusion

This paper introduces a new benchmark called CT-Eval for evaluating the text-to-table generation capabilities of large language models in the Chinese language. The authors develop a comprehensive dataset of Chinese tables and their corresponding text descriptions, and use this to assess the performance of several prominent LLMs.

The key finding is that zero-shot LLMs, including GPT-4, still show a significant performance gap compared with human judgment on Chinese text-to-table generation. After fine-tuning, open-source LLMs improve significantly and outperform GPT-4 by a large margin.

By providing a robust evaluation framework and detailed insights into the strengths and limitations of current LLMs for Chinese text-to-table generation, this research lays the groundwork for future advancements in this important area of natural language processing. The CT-Eval benchmark and the authors' analysis offer valuable guidance for researchers and developers working to enhance the Chinese language capabilities of large language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models

Haoxiang Shi, Jiaan Wang, Jiarong Xu, Cen Wang, Tetsuya Sakai

Text-to-Table aims to generate structured tables to convey the key information from unstructured documents. Existing text-to-table datasets are typically English-oriented, limiting the research in non-English languages. Meanwhile, the emergence of large language models (LLMs) has shown great success as general task solvers in multi-lingual settings (e.g., ChatGPT), theoretically enabling text-to-table in other languages. In this paper, we propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task. Our preliminary analysis of English text-to-table datasets highlights two key factors for dataset construction: data diversity and data hallucination. Inspired by this, the CT-Eval dataset selects a popular Chinese multidisciplinary online encyclopedia as the source and covers 28 domains to ensure data diversity. To minimize data hallucination, we first train an LLM to judge and filter out the task samples with hallucination, then employ human annotators to clean the hallucinations in the validation and testing sets. After this process, CT-Eval contains 88.6K task samples. Using CT-Eval, we evaluate the performance of open-source and closed-source LLMs. Our results reveal that zero-shot LLMs (including GPT-4) still have a significant performance gap compared with human judgment. Furthermore, after fine-tuning, open-source LLMs can significantly improve their text-to-table ability, outperforming GPT-4 by a large margin. In short, CT-Eval not only helps researchers evaluate and quickly understand the Chinese text-to-table ability of existing LLMs but also serves as a valuable resource to significantly improve the text-to-table performance of LLMs.

5/21/2024

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, Guanglin Niu, Tongliang Li, Zhoujun Li

Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving comparable performance with GPT-3.5. Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.

8/20/2024

Measuring Taiwanese Mandarin Language Understanding

Po-Heng Chen, Sijia Cheng, Wei-Lin Chen, Yen-Ting Lin, Yun-Nung Chen

The evaluation of large language models (LLMs) has drawn substantial attention in the field recently. This work focuses on evaluating LLMs in a Chinese context, specifically for Traditional Chinese, which has been largely underrepresented in existing benchmarks. We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capability in LLMs, under the context of Taiwanese Mandarin. TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels. In addition, we curate chain-of-thought-like few-shot explanations for each subject to facilitate the evaluation of complex reasoning skills. To establish a comprehensive baseline, we conduct extensive experiments and analysis on 24 advanced LLMs. The results suggest that Chinese open-weight models demonstrate inferior performance compared to multilingual proprietary ones, and open-weight models tailored for Taiwanese Mandarin lag behind the Simplified-Chinese counterparts. The findings indicate considerable headroom for improvement, and emphasize the goal of TMLU to foster the development of localized Taiwanese-Mandarin LLMs. We release the benchmark and evaluation scripts for the community to promote future research.

4/1/2024

METAL: Towards Multilingual Meta-Evaluation

Rishav Hada, Varun Gumma, Mohamed Ahmed, Kalika Bali, Sunayana Sitaram

With the rising human-like precision of Large Language Models (LLMs) in numerous tasks, their utilization in a variety of real-world applications is becoming more prevalent. Several studies have shown that LLMs excel on many standard NLP benchmarks. However, it is challenging to evaluate LLMs due to test dataset contamination and the limitations of traditional metrics. Since human evaluations are difficult to collect, there is a growing interest in the community to use LLMs themselves as reference-free evaluators for subjective metrics. However, past work has shown that LLM-based evaluators can exhibit bias and have poor alignment with human judgments. In this study, we propose a framework for an end-to-end assessment of LLMs as evaluators in multilingual scenarios. We create a carefully curated dataset, covering 10 languages containing native speaker judgments for the task of summarization. This dataset is created specifically to evaluate LLM-based evaluators, which we refer to as meta-evaluation (METAL). We compare the performance of LLM-based evaluators created using GPT-3.5-Turbo, GPT-4, and PaLM2. Our results indicate that LLM-based evaluators based on GPT-4 perform the best across languages, while GPT-3.5-Turbo performs poorly. Additionally, we perform an analysis of the reasoning provided by LLM-based evaluators and find that it often does not match the reasoning provided by human judges.

4/3/2024