CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare

Read original: arXiv:2407.19705 - Published 7/31/2024 by Jingwei Zhu, Minghuan Tan, Min Yang, Ruixue Li, Hamid Alinejad-Rokny

Overview

CollectiveSFT is a research paper on fine-tuning large language models (LLMs) for a Chinese medical benchmark using a collection of healthcare instruction data.
The paper targets the Comprehensive Medical Benchmark in Chinese (CMB), which evaluates the capabilities of LLMs in the Chinese medical domain.
The researchers propose an approach called "Collective Supervised Fine-Tuning" (CollectiveSFT), which relies on a diverse, well-distributed fine-tuning dataset to raise scores on CMB.

Plain English Explanation

The CollectiveSFT paper addresses the challenge of making large language models (LLMs) more effective at medical tasks in Chinese. LLMs are AI models that can understand and generate human-like text, but they often struggle with specialized domains like healthcare.

To address this, the researchers developed a technique called "Collective Supervised Fine-Tuning" (CollectiveSFT). The key idea is to train the LLM on a collection of diverse healthcare-related instructions rather than on a single task. This "collective" approach is meant to give the model broader coverage of medical concepts and tasks, which the authors link to better performance on the Comprehensive Medical Benchmark in Chinese (CMB).

CMB is designed to evaluate LLMs on medical knowledge and tasks within a native Chinese linguistic and cultural framework; traditional Chinese medicine is part of the evaluation but not all of it. By applying CollectiveSFT, the researchers report that a smaller base model reached scores comparable to those of larger models on this benchmark.

Technical Explanation

The CollectiveSFT paper presents an approach for adapting large language models (LLMs) to the Comprehensive Medical Benchmark in Chinese (CMB), a benchmark for evaluating the capabilities of LLMs in the Chinese medical domain.

The key innovation of the CollectiveSFT technique is the use of "collective instructions" during fine-tuning. Instead of fine-tuning the LLM on a single task, the researchers exposed the model to a diverse set of healthcare-related instructions covering a wide range of medical topics and tasks. The authors argue that this breadth helps the LLM generalize across medical scenarios and reduces the impact of quality inconsistencies in any single data source, leading to better performance on CMB.
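The idea of pooling heterogeneous instruction sources can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the unified `instruction`/`input`/`output` schema, the two converter functions, the source names, and the toy records are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical unified schema: instruction, input, output, source.
Example = Dict[str, str]


def from_multiple_choice(row: dict) -> Example:
    """Convert an exam-style item (assumed field names) into an instruction pair."""
    options = "\n".join(
        f"{key}. {text}" for key, text in sorted(row["options"].items())
    )
    return {
        "instruction": "Choose the correct option(s) for the following question.",
        "input": f"{row['question']}\n{options}",
        "output": row["answer"],
    }


def from_dialogue(row: dict) -> Example:
    """Convert a patient/doctor exchange (assumed field names) into an instruction pair."""
    return {
        "instruction": "Answer the patient's question as a doctor.",
        "input": row["patient"],
        "output": row["doctor"],
    }


def build_collective_set(
    sources: Dict[str, Tuple[List[dict], Callable[[dict], Example]]],
    seed: int = 0,
) -> List[Example]:
    """Normalize every source to one schema, tag its origin, and shuffle."""
    merged: List[Example] = []
    for name, (rows, convert) in sources.items():
        for row in rows:
            example = convert(row)
            example["source"] = name  # keep provenance for later analysis
            merged.append(example)
    random.Random(seed).shuffle(merged)  # seeded so the mix is reproducible
    return merged


# Toy records, invented purely to exercise the functions above.
toy_sources = {
    "exam_questions": (
        [{
            "question": "Which vitamin deficiency causes scurvy?",
            "options": {"A": "Vitamin A", "B": "Vitamin C"},
            "answer": "B",
        }],
        from_multiple_choice,
    ),
    "consultations": (
        [{
            "patient": "I have had a dry cough for two weeks.",
            "doctor": "Please see a clinician for an examination.",
        }],
        from_dialogue,
    ),
}
collective = build_collective_set(toy_sources, seed=0)
```

Keeping a `source` field on every example is a design choice of this sketch: it makes it possible to inspect or rebalance the mixture later without re-reading the raw datasets.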

The researchers evaluated the approach by fine-tuning a smaller base model with CollectiveSFT and scoring it on CMB. According to the paper's abstract, the resulting model achieves scores comparable to those of larger models, which the authors attribute to the diversity and distribution of the fine-tuning data rather than to model size.
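For readers unfamiliar with how benchmark scores of this kind are typically computed, here is a minimal Python sketch of exact-match accuracy on multiple-choice answers. It is an assumption that items are answered with option letters A to E and that a model reply is short enough for a simple pattern to pick the letters out; the function names are hypothetical and this is not the official CMB evaluation code.

```python
import re
from typing import List

# Option letters that are not part of a longer ASCII word, e.g. "B" in "答案是B".
_CHOICE_PATTERN = re.compile(r"(?<![A-Za-z])[A-E]+(?![A-Za-z])")


def extract_choices(text: str) -> str:
    """Return the sorted, de-duplicated option letters found in a reply."""
    letters = "".join(_CHOICE_PATTERN.findall(text))
    return "".join(sorted(set(letters)))


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Exact-match accuracy: all chosen letters must equal the reference set."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    correct = sum(
        extract_choices(pred) == extract_choices(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Toy replies and gold answers, invented for illustration.
toy_predictions = ["答案是B", "A, C", "D"]
toy_references = ["B", "AC", "E"]
toy_accuracy = accuracy(toy_predictions, toy_references)
```

The pattern is deliberately naive: a reply that begins with the English article "A" would be misread as option A, so a real harness would need stricter answer parsing.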

Critical Analysis

The CollectiveSFT paper presents a compelling approach to improving the performance of large language models on Chinese medical tasks. The key strength of the research is the novel CollectiveSFT technique, which leverages the collective power of diverse healthcare-related instructions to enhance the LLM's understanding of medical concepts and reasoning.

However, the paper does not provide a detailed analysis of the specific types of instructions used during the CollectiveSFT process. It would be helpful to understand the breadth and depth of the instruction set, as well as how the researchers selected and curated the instructions to ensure they cover a comprehensive range of medical tasks and topics.
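One way such an analysis could be reported is a per-source share table plus a single evenness score. The Python sketch below is only an illustration of that idea: the `mixture_profile` helper, the source labels, and the choice of normalized Shannon entropy as the evenness measure are assumptions, not anything described in the paper.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def mixture_profile(labels: Iterable[str]) -> Tuple[Dict[str, float], float]:
    """Return each label's share of the data and a 0-to-1 evenness score.

    Evenness is Shannon entropy divided by its maximum for the number of
    labels seen: 1.0 means perfectly balanced, 0.0 means a single label.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    shares = {label: count / total for label, count in counts.items()}
    entropy = -sum(p * math.log(p) for p in shares.values())
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0
    return shares, entropy / max_entropy


# Toy labels, invented for illustration: one label per training example.
toy_labels = ["exam"] * 2 + ["dialogue"] * 2
toy_shares, toy_evenness = mixture_profile(toy_labels)
```

A table of shares shows breadth at a glance, while the evenness score flags mixtures dominated by one source; neither says anything about the quality of the individual instructions.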

Additionally, the paper could have explored the potential limitations of the CollectiveSFT approach, such as the computational and resource requirements needed to fine-tune LLMs with a large collection of instructions, or the potential challenges in scaling the technique to even larger language models or more diverse medical benchmarks.

Overall, the CollectiveSFT paper makes a valuable contribution to the field of large language model adaptation for specialized domains, particularly in the context of Chinese medical AI. The proposed technique represents a promising step forward in enhancing the capabilities of LLMs in the healthcare sector.

Conclusion

The CollectiveSFT paper presents an approach for adapting large language models to the Comprehensive Medical Benchmark in Chinese (CMB). The key innovation is the use of "collective instructions" during fine-tuning, which is intended to give the LLM broader coverage of medical concepts and tasks.

The reported results indicate that a smaller base model fine-tuned with CollectiveSFT can reach scores comparable to those of larger models, which highlights the role of dataset diversity and distribution in fine-tuning. This research is a step toward stronger LLMs in the Chinese medical domain, and its insights may also apply to other specialized domains and benchmarks.

As the field of large language models continues to evolve, the CollectiveSFT paper serves as a valuable contribution, showcasing the potential of collective instruction-based fine-tuning to scale LLMs for complex, domain-specific tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare

Jingwei Zhu, Minghuan Tan, Min Yang, Ruixue Li, Hamid Alinejad-Rokny

The rapid progress in Large Language Models (LLMs) has prompted the creation of numerous benchmarks to evaluate their capabilities. This study focuses on the Comprehensive Medical Benchmark in Chinese (CMB), showcasing how dataset diversity and distribution in supervised fine-tuning (SFT) may enhance LLM performance. Remarkably, we successfully trained a smaller base model to achieve scores comparable to larger models, indicating that a diverse and well-distributed dataset can optimize performance regardless of model size. This study suggests that even smaller models may reach high performance levels with carefully curated and varied datasets. By integrating a wide range of instructional content, our approach addresses potential issues such as data quality inconsistencies. Our results imply that a broader spectrum of training data may enhance a model's ability to generalize and perform effectively across different medical scenarios, highlighting the importance of dataset quality and diversity in fine-tuning processes. We open-source the model for future research at https://github.com/CAS-SIAT-XinHai/CollectiveSFT

7/31/2024

CMB: A Comprehensive Medical Benchmark in Chinese

Xidong Wang, Guiming Hardy Chen, Dingjie Song, Zhiyi Zhang, Zhihong Chen, Qingying Xiao, Feng Jiang, Jianquan Li, Xiang Wan, Benyou Wang, Haizhou Li

Large Language Models (LLMs) provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in contextual incongruities to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark will provide first-hand experience in existing LLMs for medicine and also facilitate the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at https://github.com/FreedomIntelligence/CMB.

4/5/2024

Towards Evaluating and Building Versatile Large Language Models for Medicine

Chaoyi Wu, Pengcheng Qiu, Jinxin Liu, Hongfei Gu, Na Li, Ya Zhang, Yanfeng Wang, Weidi Xie

In this study, we present MedS-Bench, a comprehensive benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts. Unlike existing benchmarks that focus on multiple-choice question answering, MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation, among others. We evaluated six leading LLMs, e.g., MEDITRON, Mistral, InternLM 2, Llama 3, GPT-4, and Claude-3.5 using few-shot prompting, and found that even the most sophisticated models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction tuning dataset for medicine. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks. To demonstrate the dataset's utility, we conducted a proof-of-concept experiment by performing instruction tuning on a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models across nearly all clinical tasks. To promote further advancements in the application of LLMs to clinical challenges, we have made the MedS-Ins dataset fully accessible and invite the research community to contribute to its expansion. Additionally, we have launched a dynamic leaderboard for MedS-Bench, whose test set we plan to update regularly to track progress and enhance the adaptation of general LLMs to the medical domain. Leaderboard: https://henrychur.github.io/MedS-Bench/. Github: https://github.com/MAGIC-AI4Med/MedS-Ins.

9/6/2024

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Mianxin Liu, Jinru Ding, Jie Xu, Weiguo Hu, Xiaoyang Li, Lifeng Zhu, Zhian Bai, Xiaoming Shi, Benyou Wang, Haitao Song, Pengfei Liu, Xiaofan Zhang, Shanshan Wang, Kang Li, Haofen Wang, Tong Ruan, Xuanjing Huang, Xin Sun, Shaoting Zhang

Ensuring the general efficacy and goodness for human beings from medical large language models (LLM) before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLM, especially in the Chinese context, remains to be established. In this work, we introduce MedBench, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM. First, MedBench assembles the currently largest evaluation dataset (300,901 questions) to cover 43 clinical specialties and performs multi-facet evaluation on medical LLM. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separations for question and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer remembering. Applying MedBench to popular general and medical LLMs, we observe unbiased, reproducible evaluation results largely aligning with medical professionals' perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.

7/17/2024