TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

2406.01126

Published 6/4/2024 by Wenjing Yue, Xiaoling Wang, Wei Zhu, Ming Guan, Huanran Zheng, Pengfei Wang, Changzhi Sun, Xin Ma

Abstract

Large language models (LLMs) have performed remarkably well in various natural language processing tasks by benchmarking, including in the Western medical domain. However, the professional evaluation benchmarks for LLMs have yet to be covered in the traditional Chinese medicine (TCM) domain, which has a profound history and vast influence. To address this research gap, we introduce TCM-Bench, a comprehensive benchmark for evaluating LLM performance in TCM. It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis. It covers the core components of TCMLE, including TCM basis and clinical practice. To evaluate LLMs beyond accuracy of question answering, we propose TCMScore, a metric tailored for evaluating the quality of answers generated by LLMs for TCM-related questions. It comprehensively considers the consistency of TCM semantics and knowledge. After conducting comprehensive experimental analyses from diverse perspectives, we obtain the following findings: (1) The unsatisfactory performance of LLMs on this benchmark underscores their significant room for improvement in TCM. (2) Introducing domain knowledge can enhance LLMs' performance. However, for in-domain models like ZhongJing-TCM, the quality of generated analysis text has decreased, and we hypothesize that their fine-tuning process affects the basic LLM capabilities. (3) Traditional metrics for text generation quality like Rouge and BertScore are susceptible to text length and surface semantic ambiguity, while domain-specific metrics such as TCMScore can further supplement and explain their evaluation results. These findings highlight the capabilities and limitations of LLMs in TCM and aim to provide more profound assistance to medical research.

Overview

This paper introduces TCMBench, a comprehensive benchmark for evaluating large language models (LLMs) in the domain of Traditional Chinese Medicine (TCM).
The benchmark is built on the TCM-ED dataset, 5,473 questions sourced from the TCM Licensing Exam (TCMLE) that cover TCM basis and clinical practice, and it adds TCMScore, a metric for the quality of generated answers.
The goal is to provide a standardized way to assess LLM performance in TCM, a domain that, according to the authors, has so far lacked a professional evaluation benchmark.

Plain English Explanation

The paper introduces a new benchmark called TCMBench that is designed to evaluate how well large language models (LLMs) can handle tasks related to Traditional Chinese Medicine (TCM). TCM is a centuries-old medical system that is still widely used in many parts of the world, particularly in East Asia. However, it can be challenging for AI systems to understand and work with the concepts and terminology used in TCM.

The TCMBench benchmark is built from questions drawn from the TCM Licensing Exam, covering both the foundations of TCM and clinical practice. By testing LLMs on these questions, and by scoring the explanations they generate as well as their answers, the researchers aim to provide a standardized way to measure how well these models understand and apply TCM knowledge. This matters because LLMs are increasingly being used in healthcare applications, and it is crucial to check that they can handle the unique aspects of traditional medicine systems like TCM.

Overall, the TCMBench benchmark is a valuable tool for advancing the use of AI in the field of traditional medicine, which could ultimately lead to improved healthcare outcomes for patients who rely on TCM.

Technical Explanation

The paper introduces the TCMBench benchmark, which is designed to evaluate the performance of large language models (LLMs) on questions related to Traditional Chinese Medicine (TCM). According to the abstract, the benchmark has two main components:

TCM-ED dataset: 5,473 questions sourced from the TCM Licensing Exam (TCMLE), of which 1,300 come with authoritative analysis. The questions cover the core components of the TCMLE, including TCM basis and clinical practice.
TCMScore: A metric tailored to the quality of LLM-generated answers to TCM-related questions, which considers the consistency of both TCM semantics and TCM knowledge.

Together, these let the authors assess LLMs beyond plain question-answering accuracy: a model is judged both on whether it answers correctly and on the quality of the analysis text it generates.
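As a rough illustration of how accuracy on exam-style multiple-choice questions might be scored, here is a minimal Python sketch. The `ExamQuestion` record, the `extract_choice` heuristic, and the option letters A to E are all assumptions made for illustration; they are not the paper's data format or evaluation code.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ExamQuestion:
    """Hypothetical record for one multiple-choice exam question."""
    question: str
    options: Dict[str, str]  # option letter -> option text
    answer: str              # gold option letter, e.g. "B"


# An uppercase letter A-E that is not part of a longer ASCII word.
_CHOICE_PATTERN = re.compile(r"(?<![A-Za-z])([A-E])(?![A-Za-z])")


def extract_choice(model_output: str) -> Optional[str]:
    """Heuristically pull the first standalone option letter from free text.

    This is a simple assumption-laden parser: a leading English article "A"
    would be read as option A, so real pipelines usually constrain the
    output format instead.
    """
    match = _CHOICE_PATTERN.search(model_output)
    return match.group(1) if match else None


def accuracy(questions: List[ExamQuestion], outputs: List[str]) -> float:
    """Fraction of questions whose extracted choice equals the gold answer."""
    if len(questions) != len(outputs):
        raise ValueError("questions and outputs must have the same length")
    if not questions:
        return 0.0
    correct = sum(
        extract_choice(out) == q.answer for q, out in zip(questions, outputs)
    )
    return correct / len(questions)
```

Unparseable outputs count as wrong in this sketch, which keeps the score conservative when a model ignores the requested answer format.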

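The researchers evaluated a range of LLMs on TCMBench, including in-domain models such as ZhongJing-TCM. According to the abstract, overall performance on the benchmark is unsatisfactory, which leaves significant room for improvement in TCM. Introducing domain knowledge improves performance, but the in-domain models produce lower-quality analysis text, and the authors hypothesize that fine-tuning degrades the underlying model's basic capabilities. They also report that traditional text-generation metrics such as Rouge and BertScore are sensitive to text length and surface semantic ambiguity, and that TCMScore can supplement and help explain those scores.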
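The abstract notes that traditional text-generation metrics such as Rouge are susceptible to text length. The sketch below is one way to see that effect with ROUGE-L, computed from the longest common subsequence (LCS) of tokens. The whitespace tokenization, the toy sentences, and the balanced F1 are assumptions for illustration, not the paper's evaluation setup.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: List[str], candidate: List[str]) -> Dict[str, float]:
    """ROUGE-L precision, recall, and balanced F1 from the token-level LCS."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: the same correct content, once concise and once padded with filler.
reference = "qi deficiency causes fatigue".split()
concise = list(reference)
padded = reference + ["and"] * 8

print(rouge_l(reference, concise))
print(rouge_l(reference, padded))
```

Both candidates contain the full reference, so recall is identical, yet the padded one is penalized on precision and F1 purely because of its length. A recall-only variant would show the opposite bias and reward long answers that happen to contain the reference tokens.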

The authors also discuss the limitations of their approach and suggest directions for future research, such as incorporating more diverse TCM data sources and exploring the use of multi-modal inputs (e.g., combining text with images of herbs or acupuncture points).

Critical Analysis

The TCMBench benchmark represents an important step forward in the field of AI-powered traditional Chinese medicine. By providing a standardized way to evaluate the performance of LLMs on TCM-specific tasks, the authors have laid the groundwork for more rigorous and comprehensive assessments of these models' capabilities in the context of traditional medicine.

One potential limitation of the current benchmark is the reliance on textual data alone. In practice, TCM practitioners often rely on a range of inputs, including patient observations, physical examinations, and traditional diagnostic techniques like pulse palpation and tongue inspection. Incorporating these multimodal inputs could lead to more realistic and meaningful evaluations of LLMs in the TCM domain.

Additionally, the authors acknowledge that their dataset, while extensive, may not fully capture the diversity of TCM knowledge and practices across different regions and traditions. Expanding the benchmark to include a wider range of TCM data sources and use cases could help ensure its relevance and applicability to a broader range of healthcare settings.

Overall, the TCMBench benchmark represents an important step forward in the development of AI systems for traditional medicine. By providing a standardized way to assess the performance of LLMs in this domain, the authors have opened the door to more robust and meaningful research on the integration of AI and traditional Chinese medicine.

Conclusion

The TCMBench benchmark introduced in this paper is a valuable tool for advancing the use of large language models (LLMs) in the field of Traditional Chinese Medicine (TCM). By providing a standardized way to evaluate the performance of these models on a range of TCM-specific tasks, the benchmark can help researchers and developers better understand the capabilities and limitations of LLMs in this important healthcare domain.

The benchmark's coverage of the core components of the TCM Licensing Exam, together with the TCMScore metric for generated answers, makes it a useful resource for the broader medical AI community. As LLMs continue to be applied in healthcare settings, tools like TCMBench will be essential for checking that these models can handle the unique challenges and nuances of traditional medical systems.

Overall, the TCMBench benchmark represents an important step forward in the integration of AI and traditional Chinese medicine, with the potential to ultimately improve healthcare outcomes for patients who rely on TCM.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CMB: A Comprehensive Medical Benchmark in Chinese

Xidong Wang, Guiming Hardy Chen, Dingjie Song, Zhiyi Zhang, Zhihong Chen, Qingying Xiao, Feng Jiang, Jianquan Li, Xiang Wan, Benyou Wang, Haizhou Li

Large Language Models (LLMs) provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in contextual incongruities to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provides first-hand experience in existing LLMs for medicine and also facilitates the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at https://github.com/FreedomIntelligence/CMB.

4/5/2024

cs.CL cs.AI

TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models

Ping Yu, Kaitao Song, Fengchen He, Ming Chen, Jianfeng Lu

The recently unprecedented advancements in Large Language Models (LLMs) have propelled the medical community by establishing advanced medical-domain models. However, due to the limited collection of medical datasets, there are only a few comprehensive benchmarks available to gauge progress in this area. In this paper, we introduce a new medical question-answering (QA) dataset that contains massive manual instruction for solving Traditional Chinese Medicine examination tasks, called TCMD. Specifically, our TCMD collects massive questions across diverse domains with their annotated medical subjects and thus supports us in comprehensively assessing the capability of LLMs in the TCM domain. Extensive evaluation of various general LLMs and medical-domain-specific LLMs is conducted. Moreover, we also analyze the robustness of current LLMs in solving TCM QA tasks by introducing randomness. The inconsistency of the experimental results also reveals the shortcomings of current LLMs in solving QA tasks. We also expect that our dataset can further facilitate the development of LLMs in the TCM area.

6/10/2024

cs.CL

Qibo: A Large Language Model for Traditional Chinese Medicine

Heyi Zhang, Xin Wang, Zhaopeng Meng, Zhe Chen, Pengwei Zhuang, Yongzhe Jia, Dawei Xu, Wenbin Guo

Large Language Models (LLMs) have made significant progress in a number of professional fields, including medicine, law, and finance. However, in traditional Chinese medicine (TCM), there are challenges such as the essential differences between theory and modern medicine, the lack of specialized corpus resources, and the fact that relying only on supervised fine-tuning may lead to overconfident predictions. To address these challenges, we propose a two-stage training approach that combines continuous pre-training and supervised fine-tuning. A notable contribution of our study is the processing of a 2GB corpus dedicated to TCM, constructing pre-training and instruction fine-tuning datasets for TCM, respectively. In addition, we have developed Qibo-Benchmark, a tool that evaluates the performance of LLMs in TCM on multiple dimensions, including subjective, objective, and three TCM NLP tasks. The medical LLM trained with our pipeline, named Qibo, exhibits significant performance boosts. Compared to the baselines, the average subjective win rate is 63%, the average objective accuracy improved by 23% to 58%, and the Rouge-L scores for the three TCM NLP tasks are 0.72, 0.61, and 0.55. Finally, we propose a pipeline to apply Qibo to TCM consultation and demonstrate the model performance through a case study.

6/26/2024

cs.CL cs.AI

Large Language Models in Healthcare: A Comprehensive Benchmark

Andrew Liu, Hongjian Zhou, Yining Hua, Omid Rohanian, Anshul Thakur, Lei Clifton, David A. Clifton

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and complex clinical tasks that are close to real-world practice, i.e., referral QA, treatment recommendation, hospitalization (long document) summarization, patient education, pharmacology QA and drug interaction for emerging drugs. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.

6/27/2024

cs.CL cs.AI