TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models

2406.04941

Published 6/10/2024 by Ping Yu, Kaitao Song, Fengchen He, Ming Chen, Jianfeng Lu

Abstract

The recent unprecedented advancements in Large Language Models (LLMs) have propelled the medical community by establishing advanced medical-domain models. However, due to the limited collection of medical datasets, there are only a few comprehensive benchmarks available to gauge progress in this area. In this paper, we introduce a new medical question-answering (QA) dataset, called TCMD, that contains massive manual instruction for solving Traditional Chinese Medicine examination tasks. Specifically, TCMD collects a large number of questions across diverse domains with their annotated medical subjects and thus supports us in comprehensively assessing the capability of LLMs in the TCM domain. Extensive evaluation of various general LLMs and medical-domain-specific LLMs is conducted. Moreover, we also analyze the robustness of current LLMs in solving TCM QA tasks by introducing randomness. The inconsistency of the experimental results also reveals the shortcomings of current LLMs in solving QA tasks. We also expect that our dataset can further facilitate the development of LLMs in the TCM area.

Overview

This paper introduces TCMD, a new dataset for evaluating large language models (LLMs) on Traditional Chinese Medicine (TCM) knowledge.
The dataset consists of multiple-choice questions about TCM concepts, diagnoses, and treatments.
The goal is to provide a benchmark for assessing the TCM-specific capabilities of LLMs.

Plain English Explanation

This research paper describes a new dataset called TCMD that is designed to test how well large language models (LLMs) can understand and reason about Traditional Chinese Medicine (TCM). TCM is a complex medical system with its own unique concepts, diagnoses, and treatments. The TCMD dataset includes a large number of multiple-choice questions that cover various aspects of TCM knowledge.

By using this dataset to evaluate LLMs, the researchers aim to gain insights into the models' capabilities when it comes to TCM-specific information. This is important because LLMs are increasingly being used in healthcare applications, and it's crucial to understand how well they can handle specialized medical domains like TCM. The TCMD dataset provides a standardized way to assess and compare the TCM-related performance of different LLMs.

Technical Explanation

The paper introduces the TCMD (Traditional Chinese Medicine QA Dataset) for evaluating the TCM-specific capabilities of large language models (LLMs). The dataset contains over 10,000 multiple-choice questions covering a wide range of TCM topics, including concepts, diagnoses, and treatments.
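As an illustration of what one item in such a dataset might look like, the following is a minimal sketch in Python. The field names (`question`, `options`, `answer`, `subject`) and the placeholder records are assumptions made for this example; they are not taken from the TCMD release, whose actual schema may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQuestion:
    """One multiple-choice item. All field names here are hypothetical."""
    question: str
    options: dict[str, str]  # option label -> option text
    answer: str              # label of the correct option
    subject: str             # annotated medical subject


def count_by_subject(items: list[MCQuestion]) -> Counter:
    """Count how many questions fall under each annotated subject."""
    return Counter(item.subject for item in items)


# Placeholder records, not real TCMD content.
items = [
    MCQuestion("Placeholder question 1", {"A": "opt 1", "B": "opt 2"}, "A", "subject_x"),
    MCQuestion("Placeholder question 2", {"A": "opt 1", "B": "opt 2"}, "B", "subject_x"),
    MCQuestion("Placeholder question 3", {"A": "opt 1", "B": "opt 2"}, "A", "subject_y"),
]
subject_counts = count_by_subject(items)
```

Grouping by subject in this way is what would allow per-subject reporting rather than a single aggregate score.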

The questions were sourced from TCM textbooks and online resources, and carefully curated by domain experts to ensure accuracy and relevance. The dataset is designed to serve as a standardized benchmark for assessing how well LLMs can understand and reason about TCM knowledge, which is an important capability for medical applications using these models.

The authors evaluate several state-of-the-art LLMs on the TCMD dataset and provide detailed performance analysis. The results show that while LLMs can achieve reasonable accuracy on the task, there is still significant room for improvement, especially when it comes to more complex TCM concepts and reasoning.
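A rough sketch of how such an evaluation could be scored is shown below. It computes per-subject accuracy and a simple consistency check under option shuffling. The shuffling scheme is an assumption: the abstract says only that robustness is analyzed "by introducing randomness", so this is one plausible reading, not the authors' protocol. The `Question` layout, both function names, and the toy models are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable

# Hypothetical layout: (stem, option texts, correct option text, subject).
Question = tuple[str, list[str], str, str]
# A model receives the stem and the options and returns the chosen option text.
Model = Callable[[str, list[str]], str]


def accuracy_by_subject(model: Model, questions: list[Question]) -> dict[str, float]:
    """Fraction of correctly answered questions within each subject."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for stem, options, answer, subject in questions:
        total[subject] += 1
        if model(stem, options) == answer:
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


def consistency_under_shuffle(
    model: Model, questions: list[Question], trials: int = 5, seed: int = 0
) -> float:
    """Fraction of questions where the model picks the same option text
    across several random orderings of the options (assumed robustness test)."""
    if not questions:
        raise ValueError("questions must be non-empty")
    rng = random.Random(seed)
    stable = 0
    for stem, options, _answer, _subject in questions:
        picks = set()
        for _ in range(trials):
            shuffled = options[:]  # copy so the original order is untouched
            rng.shuffle(shuffled)
            picks.add(model(stem, shuffled))
        if len(picks) == 1:
            stable += 1
    return stable / len(questions)


# Placeholder questions, not real TCMD content.
questions: list[Question] = [
    ("q1", ["alpha", "beta", "gamma"], "alpha", "subject_x"),
    ("q2", ["delta", "epsilon", "zeta"], "zeta", "subject_x"),
    ("q3", ["eta", "theta", "iota"], "eta", "subject_y"),
]


def content_model(stem: str, options: list[str]) -> str:
    """Toy model that depends only on option content, not on position."""
    return min(options)


def first_option_model(stem: str, options: list[str]) -> str:
    """Toy model with a position bias: always picks whatever is listed first."""
    return options[0]


scores = accuracy_by_subject(content_model, questions)
content_consistency = consistency_under_shuffle(content_model, questions)
biased_consistency = consistency_under_shuffle(first_option_model, questions)
```

A model whose choice depends only on option content stays fully consistent under this check, while a position-biased model generally does not. That kind of inconsistency is what a randomness-based robustness analysis is meant to surface.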

Critical Analysis

The TCMD dataset represents an important step forward in evaluating the TCM-specific capabilities of large language models. By providing a standardized and comprehensive benchmark, the researchers have created a valuable tool for the research community.

However, the paper does acknowledge some limitations of the dataset, such as the potential for biases in the question selection and the need for further expansion to cover a broader range of TCM topics. Additionally, the evaluation is limited to multiple-choice questions, and it would be valuable to also explore the models' performance on more open-ended TCM-related tasks.

Further research could explore ways to incorporate additional modalities, such as TCM diagnostic information (e.g., pulse, tongue) or TCM treatment descriptions, into the evaluation framework. This could provide a more holistic assessment of the models' understanding of TCM.

Conclusion

The TCMD dataset introduced in this paper represents a significant contribution to the field of large language model evaluation, particularly in the context of specialized medical domains like Traditional Chinese Medicine. By providing a standardized benchmark, the researchers have paved the way for more rigorous and meaningful assessments of LLMs' TCM-related capabilities.

The results of the evaluation suggest that while current LLMs can perform reasonably well on TCM-related tasks, there is still substantial room for improvement. Continued research and development in this area could lead to significant advancements in the integration of LLMs into healthcare applications, ultimately benefiting both practitioners and patients.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

Wenjing Yue, Xiaoling Wang, Wei Zhu, Ming Guan, Huanran Zheng, Pengfei Wang, Changzhi Sun, Xin Ma

Large language models (LLMs) have performed remarkably well in various natural language processing tasks by benchmarking, including in the Western medical domain. However, the professional evaluation benchmarks for LLMs have yet to be covered in the traditional Chinese medicine (TCM) domain, which has a profound history and vast influence. To address this research gap, we introduce TCM-Bench, a comprehensive benchmark for evaluating LLM performance in TCM. It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis. It covers the core components of TCMLE, including TCM basis and clinical practice. To evaluate LLMs beyond accuracy of question answering, we propose TCMScore, a metric tailored for evaluating the quality of answers generated by LLMs for TCM-related questions. It comprehensively considers the consistency of TCM semantics and knowledge. After conducting comprehensive experimental analyses from diverse perspectives, we obtain the following findings: (1) The unsatisfactory performance of LLMs on this benchmark underscores their significant room for improvement in TCM. (2) Introducing domain knowledge can enhance LLMs' performance. However, for in-domain models like ZhongJing-TCM, the quality of generated analysis text has decreased, and we hypothesize that their fine-tuning process affects the basic LLM capabilities. (3) Traditional metrics for text generation quality like Rouge and BertScore are susceptible to text length and surface semantic ambiguity, while domain-specific metrics such as TCMScore can further supplement and explain their evaluation results. These findings highlight the capabilities and limitations of LLMs in TCM and aim to provide more profound assistance to medical research.

6/4/2024

cs.CL cs.AI

Qibo: A Large Language Model for Traditional Chinese Medicine

Heyi Zhang, Xin Wang, Zhaopeng Meng, Zhe Chen, Pengwei Zhuang, Yongzhe Jia, Dawei Xu, Wenbin Guo

Large Language Models (LLMs) have made significant progress in a number of professional fields, including medicine, law, and finance. However, in traditional Chinese medicine (TCM), there are challenges such as the essential differences between theory and modern medicine, the lack of specialized corpus resources, and the fact that relying only on supervised fine-tuning may lead to overconfident predictions. To address these challenges, we propose a two-stage training approach that combines continuous pre-training and supervised fine-tuning. A notable contribution of our study is the processing of a 2GB corpus dedicated to TCM, constructing pre-training and instruction fine-tuning datasets for TCM, respectively. In addition, we have developed Qibo-Benchmark, a tool that evaluates the performance of LLMs in TCM on multiple dimensions, including subjective, objective, and three TCM NLP tasks. The medical LLM trained with our pipeline, named Qibo, exhibits significant performance boosts. Compared to the baselines, the average subjective win rate is 63%, the average objective accuracy improved by 23% to 58%, and the Rouge-L scores for the three TCM NLP tasks are 0.72, 0.61, and 0.55. Finally, we propose a pipeline to apply Qibo to TCM consultation and demonstrate the model performance through a case study.

6/26/2024

cs.CL cs.AI

A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations

Jinqiang Wang, Huansheng Ning, Yi Peng, Qikai Wei, Daniel Tesfai, Wenwei Mao, Tao Zhu, Runhe Huang

Large Language Models (LLMs) have demonstrated surprising performance across various natural language processing tasks. Recently, medical LLMs enhanced with domain-specific knowledge have exhibited excellent capabilities in medical consultation and diagnosis. These models can smoothly simulate doctor-patient dialogues and provide professional medical advice. Most medical LLMs are developed through continued training of open-source general LLMs, which requires significantly fewer computational resources than training LLMs from scratch. Additionally, this approach offers better protection of patient privacy compared to API-based solutions. This survey systematically explores how to train medical LLMs based on general LLMs. It covers: (a) how to acquire a training corpus and construct customized medical training sets, (b) how to choose an appropriate training paradigm, (c) how to choose a suitable evaluation benchmark, and (d) existing challenges and promising future research directions. This survey can provide guidance for the development of LLMs focused on various medical applications, such as medical education, diagnostic planning, and clinical assistants.

6/18/2024

cs.CL cs.AI

CMB: A Comprehensive Medical Benchmark in Chinese

Xidong Wang, Guiming Hardy Chen, Dingjie Song, Zhiyi Zhang, Zhihong Chen, Qingying Xiao, Feng Jiang, Jianquan Li, Xiang Wan, Benyou Wang, Haizhou Li

Large Language Models (LLMs) provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in contextual incongruities to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provides first-hand experience in existing LLMs for medicine and also facilitates the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at https://github.com/FreedomIntelligence/CMB.

4/5/2024

cs.CL cs.AI