MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Read original: arXiv:2407.10990 - Published 7/17/2024 by Mianxin Liu, Jinru Ding, Jie Xu, Weiguo Hu, Xiaoyang Li, Lifeng Zhu, Zhian Bai, Xiaoming Shi, Benyou Wang, Haitao Song and 9 others

Overview

The paper introduces MedBench, a comprehensive benchmarking system for evaluating Chinese medical large language models (LLMs) before real-world deployment.
MedBench assembles the largest evaluation dataset (300,901 questions) covering 43 clinical specialties and performs multi-faceted evaluation on medical LLMs.
MedBench provides a standardized, fully automated cloud-based evaluation infrastructure that physically separates questions from ground truth.
MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization.
Applying MedBench to popular general and medical LLMs yields unbiased and reproducible evaluation results that largely align with medical professionals' perspectives.

Plain English Explanation

Before medical large language models (LLMs) can be used in the real world, it's crucial to ensure they are effective and beneficial to people. However, there isn't yet a widely accepted and accessible way to evaluate these models, especially in the Chinese context.

This research introduces a tool called MedBench. MedBench is a comprehensive, standardized, and reliable system for evaluating Chinese medical LLMs.

First, MedBench assembles what the authors describe as the currently largest evaluation dataset of its kind (300,901 questions), covering 43 clinical specialties. This breadth lets MedBench test medical LLMs from multiple angles.

Second, MedBench provides an automated cloud-based system to evaluate these models. This system keeps the questions and the correct answers physically separate, to prevent the models from simply memorizing the answers.

Third, MedBench uses dynamic evaluation mechanisms so that models cannot score well by exploiting shortcuts or recalling memorized answers. This is intended to keep the evaluation results unbiased and reliable.

When the researchers used MedBench to test popular general and medical LLMs, the results largely matched the perspectives of medical professionals. This suggests MedBench can be a useful tool for preparing medical LLMs for real-world use.

Technical Explanation

The paper introduces MedBench, a comprehensive benchmarking system for evaluating Chinese medical large language models (LLMs) before real-world deployment.

MedBench first assembles the currently largest evaluation dataset, comprising 300,901 questions covering 43 clinical specialties. This broad coverage allows for multi-faceted evaluation of medical LLMs.
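To make "multi-faceted" concrete, the sketch below shows one simple way results could be broken down by clinical specialty. It is an illustration only: the summary does not describe MedBench's actual metrics or data format, so the function name `accuracy_by_specialty`, the `(specialty, is_correct)` record shape, and the toy data are all assumptions.

```python
from collections import defaultdict


def accuracy_by_specialty(results):
    """Aggregate (specialty, is_correct) pairs into per-specialty accuracy.

    The record shape is hypothetical, not MedBench's real output format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for specialty, is_correct in results:
        totals[specialty] += 1
        correct[specialty] += int(is_correct)
    return {s: correct[s] / totals[s] for s in totals}


# Made-up graded outputs for two specialties.
results = [
    ("cardiology", True),
    ("cardiology", False),
    ("pediatrics", True),
    ("pediatrics", True),
]
per_specialty = accuracy_by_specialty(results)

# Macro average: every specialty counts equally, regardless of question count.
macro_average = sum(per_specialty.values()) / len(per_specialty)
```

Reporting a per-specialty breakdown alongside a single overall score keeps a weak specialty from being hidden by strong performance elsewhere.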

The paper then describes the standardized, cloud-based evaluation infrastructure of MedBench. This infrastructure physically separates the questions and ground truth, preventing models from simply memorizing the answers. MedBench also implements dynamic evaluation mechanisms to preclude shortcut learning.
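The two ideas in this paragraph can be pictured with a small sketch. It rests on assumptions: the summary does not say how MedBench implements its dynamic mechanisms, so option shuffling is used here as a stand-in technique, and the names `Question`, `shuffle_options`, and `score` are hypothetical. The answer key lives in a separate mapping that the model-facing object never contains.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Model-facing question: deliberately carries no ground truth."""
    qid: str
    stem: str
    options: tuple[str, ...]


def shuffle_options(question: Question, rng: random.Random) -> tuple[Question, list[int]]:
    """Return a copy with options in random order, plus the permutation.

    permutation[i] is the original index of the option now at position i.
    """
    permutation = list(range(len(question.options)))
    rng.shuffle(permutation)
    shuffled = tuple(question.options[i] for i in permutation)
    return Question(question.qid, question.stem, shuffled), permutation


def score(answer_key: dict[str, int], qid: str,
          permutation: list[int], chosen_position: int) -> bool:
    """Evaluator-side check: map the chosen position back to the original index."""
    return permutation[chosen_position] == answer_key[qid]


# Hypothetical toy data, not taken from MedBench.
question = Question(
    "q1",
    "Which vitamin deficiency causes scurvy?",
    ("Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"),
)
answer_key = {"q1": 1}  # original index of the correct option; evaluator only

shuffled, permutation = shuffle_options(question, random.Random(0))
chosen = shuffled.options.index("Vitamin C")  # a model that knows the content
is_correct = score(answer_key, "q1", permutation, chosen)
```

Because the option order can change on every run, a model that has only memorized a fixed answer letter for a question gains nothing from it, while a model that recognizes the correct content is scored the same regardless of order.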

Applying MedBench to evaluate popular general and medical LLMs, the researchers observe unbiased and reproducible results that largely align with medical professionals' perspectives. This establishes a strong foundation for the practical application of Chinese medical LLMs.

Critical Analysis

The paper provides a comprehensive and well-designed solution for evaluating Chinese medical LLMs through the MedBench system. However, the authors acknowledge that MedBench's evaluation dataset, while the largest of its kind, may still not cover the full breadth of medical knowledge.

Additionally, while MedBench's dynamic evaluation mechanisms aim to prevent shortcut learning, there could be other potential biases or blind spots in the system that the authors have not addressed. Further research may be needed to fully understand the limitations and edge cases of the MedBench evaluation.

The paper also does not discuss the potential ethical implications of deploying medical LLMs, such as issues around data privacy, algorithmic bias, and the accountability of AI systems in healthcare. These are important considerations that should be explored in future work.

Conclusion

This paper presents MedBench, a comprehensive and standardized benchmarking system for evaluating Chinese medical large language models (LLMs) before real-world deployment. MedBench assembles a large evaluation dataset, provides a reliable cloud-based infrastructure, and implements dynamic mechanisms to ensure unbiased and reproducible results.

By establishing this robust evaluation framework, the researchers have laid a strong foundation for the practical application of Chinese medical LLMs. This work is a significant step towards ensuring the general efficacy and safety of these models for use in healthcare settings.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Mianxin Liu, Jinru Ding, Jie Xu, Weiguo Hu, Xiaoyang Li, Lifeng Zhu, Zhian Bai, Xiaoming Shi, Benyou Wang, Haitao Song, Pengfei Liu, Xiaofan Zhang, Shanshan Wang, Kang Li, Haofen Wang, Tong Ruan, Xuanjing Huang, Xin Sun, Shaoting Zhang

Ensuring the general efficacy and goodness for human beings from medical large language models (LLM) before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLM, especially in the Chinese context, remains to be established. In this work, we introduce MedBench, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM. First, MedBench assembles the currently largest evaluation dataset (300,901 questions) to cover 43 clinical specialties and performs multi-facet evaluation on medical LLM. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separations for question and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer remembering. Applying MedBench to popular general and medical LLMs, we observe unbiased, reproducible evaluation results largely aligning with medical professionals' perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.

7/17/2024

CMB: A Comprehensive Medical Benchmark in Chinese

Xidong Wang, Guiming Hardy Chen, Dingjie Song, Zhiyi Zhang, Zhihong Chen, Qingying Xiao, Feng Jiang, Jianquan Li, Xiang Wan, Benyou Wang, Haizhou Li

Large Language Models (LLMs) provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in contextual incongruities to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provides first-hand experience in existing LLMs for medicine and also facilitates the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at https://github.com/FreedomIntelligence/CMB.

4/5/2024

TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

Wenjing Yue, Xiaoling Wang, Wei Zhu, Ming Guan, Huanran Zheng, Pengfei Wang, Changzhi Sun, Xin Ma

Large language models (LLMs) have performed remarkably well in various natural language processing tasks by benchmarking, including in the Western medical domain. However, the professional evaluation benchmarks for LLMs have yet to be covered in the traditional Chinese medicine (TCM) domain, which has a profound history and vast influence. To address this research gap, we introduce TCM-Bench, a comprehensive benchmark for evaluating LLM performance in TCM. It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis. It covers the core components of TCMLE, including TCM basis and clinical practice. To evaluate LLMs beyond accuracy of question answering, we propose TCMScore, a metric tailored for evaluating the quality of answers generated by LLMs for TCM related questions. It comprehensively considers the consistency of TCM semantics and knowledge. After conducting comprehensive experimental analyses from diverse perspectives, we can obtain the following findings: (1) The unsatisfactory performance of LLMs on this benchmark underscores their significant room for improvement in TCM. (2) Introducing domain knowledge can enhance LLMs' performance. However, for in-domain models like ZhongJing-TCM, the quality of generated analysis text has decreased, and we hypothesize that their fine-tuning process affects the basic LLM capabilities. (3) Traditional metrics for text generation quality like Rouge and BertScore are susceptible to text length and surface semantic ambiguity, while domain-specific metrics such as TCMScore can further supplement and explain their evaluation results. These findings highlight the capabilities and limitations of LLMs in TCM and aim to provide a more profound assistance to medical research.

6/4/2024

Towards Evaluating and Building Versatile Large Language Models for Medicine

Chaoyi Wu, Pengcheng Qiu, Jinxin Liu, Hongfei Gu, Na Li, Ya Zhang, Yanfeng Wang, Weidi Xie

In this study, we present MedS-Bench, a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts. Unlike existing benchmarks that focus on multiple-choice question answering, MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation, among others. We evaluated six leading LLMs, e.g., MEDITRON, Mistral, InternLM 2, Llama 3, GPT-4, and Claude-3.5 using few-shot prompting, and found that even the most sophisticated models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction tuning dataset for medicine. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks. To demonstrate the dataset's utility, we conducted a proof-of-concept experiment by performing instruction tuning on a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models across nearly all clinical tasks. To promote further advancements in the application of LLMs to clinical challenges, we have made the MedS-Ins dataset fully accessible and invite the research community to contribute to its expansion. Additionally, we have launched a dynamic leaderboard for MedS-Bench, which we plan to regularly update the test set to track progress and enhance the adaptation of general LLMs to the medical domain. Leaderboard: https://henrychur.github.io/MedS-Bench/. Github: https://github.com/MAGIC-AI4Med/MedS-Ins.

9/6/2024