AlpaCare: Instruction-tuned Large Language Models for Medical Application

2310.14558

Published 6/11/2024 by Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, Linda Ruth Petzold

Abstract

Instruction-finetuning (IFT) has become crucial in aligning Large Language Models (LLMs) with diverse human needs and has shown great potential in medical applications. However, previous studies mainly fine-tune LLMs on biomedical datasets with limited diversity, which often rely on benchmarks or narrow task scopes, and hence significantly limit the effectiveness on their medical instruction-following ability and generalizability. To bridge this gap, we propose creating a diverse, machine-generated medical IFT dataset, MedInstruct-52k, using GPT-4 and ChatGPT with a high-quality expert-curated seed set. We then fine-tune LLaMA-series models on the dataset to develop AlpaCare. Despite using a smaller domain-specific dataset than previous medical LLMs, AlpaCare not only demonstrates superior performance on medical applications, with up to 38.1% absolute gain over best baselines in medical free-form instruction evaluations, but also achieves 6.7% absolute gains averaged over multiple general domain benchmarks. Human evaluation further shows that AlpaCare consistently outperforms best baselines in terms of both correctness and helpfulness. We offer public access to our data, model, and codebase in https://github.com/XZhang97666/AlpaCare.

Overview

Instruction-finetuning (IFT) has become crucial in aligning Large Language Models (LLMs) with diverse human needs, especially in medical applications.
Previous studies have mainly fine-tuned LLMs on limited biomedical datasets, which often rely on narrow benchmarks or task scopes, limiting their medical instruction-following ability and generalizability.
To address this, the researchers propose creating a diverse, machine-generated medical IFT dataset, MedInstruct-52k, using GPT-4 and ChatGPT with a high-quality expert-curated seed set.
They then fine-tune LLaMA-series models on this dataset to develop AlpaCare, a medical LLM that outperforms previous models on medical applications and general domain benchmarks.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. Researchers have been working to align these models with diverse human needs, especially in the medical field, through a process called instruction-finetuning (IFT).

In the past, researchers have mainly fine-tuned LLMs on limited biomedical datasets, which often focused on specific benchmarks or tasks. This has limited the models' ability to follow a wide range of medical instructions and apply their knowledge more broadly.

To address this, the researchers in this study created a diverse, machine-generated medical IFT dataset called MedInstruct-52k. They used advanced language models like GPT-4 and ChatGPT, along with a high-quality set of expert-curated seed examples, to generate this dataset.

The researchers then used this dataset to fine-tune LLaMA-series models, creating a new medical LLM called AlpaCare. Despite using a smaller domain-specific dataset than previous medical LLMs, AlpaCare demonstrates superior performance on medical applications, with up to a 38.1% absolute gain over the best baselines in medical free-form instruction evaluations. It also achieves a 6.7% absolute gain averaged over multiple general-domain benchmarks, showing that its capabilities extend beyond medicine.

Human evaluation further shows that AlpaCare consistently outperforms the best baselines in terms of both correctness and helpfulness when following medical instructions. The researchers have made the data, model, and code publicly available, which could be valuable for others adapting large language models to healthcare and other specialized domains.

Technical Explanation

The researchers first identified the limitations of previous studies that fine-tuned LLMs on narrow biomedical datasets, which often relied on specific benchmarks or task scopes. This significantly limited the models' medical instruction-following ability and generalizability.

To address this, the researchers created a diverse, machine-generated medical IFT dataset called MedInstruct-52k. They used advanced language models like GPT-4 and ChatGPT, along with a high-quality set of expert-curated examples, to generate a wide range of medical instructions and responses. This dataset was designed to better capture the nuances and diversity of real-world medical scenarios.
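The seed-driven generation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the seed tasks, the prompt wording, the `<noinput>` marker, and the word-overlap filter are all assumptions made here, and the call to the generator model (GPT-4 in the paper) is deliberately left out.

```python
import random

# Hypothetical seed tasks; the real expert-curated seed set is not shown in this summary.
SEED_TASKS = [
    {"instruction": "Explain the common side effects of metformin.", "input": ""},
    {"instruction": "Summarize the following discharge note for the patient.", "input": "<clinical note>"},
    {"instruction": "List differential diagnoses for the described symptoms.", "input": "<symptom description>"},
    {"instruction": "Rewrite this medication label in plain language.", "input": "<label text>"},
]


def build_generation_prompt(seed_tasks, num_examples=3, rng=None):
    """Sample seed tasks and format them as few-shot examples for a generator LLM."""
    rng = rng or random.Random()
    examples = rng.sample(seed_tasks, num_examples)
    # Assumed prompt wording, for illustration only.
    lines = ["Write one new, different medical task in the same format.", ""]
    for i, task in enumerate(examples, start=1):
        lines.append(f"Task {i}")
        lines.append(f"Instruction: {task['instruction']}")
        lines.append(f"Input: {task['input'] or '<noinput>'}")
        lines.append("")
    # Leave the last task open so the generator completes it.
    lines.append(f"Task {num_examples + 1}")
    lines.append("Instruction:")
    return "\n".join(lines)


def is_near_duplicate(candidate, existing, threshold=0.7):
    """Flag a candidate whose word-level Jaccard overlap with any kept instruction is high."""
    cand = set(candidate.lower().split())
    for text in existing:
        other = set(text.lower().split())
        union = cand | other
        if union and len(cand & other) / len(union) >= threshold:
            return True
    return False


if __name__ == "__main__":
    print(build_generation_prompt(SEED_TASKS, rng=random.Random(0)))
```

Sampling different seed tasks for each prompt and discarding near-duplicate outputs is one simple way to push a machine-generated dataset toward diversity; the threshold of 0.7 is an arbitrary choice for this sketch.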

The researchers then fine-tuned LLaMA-series models on the MedInstruct-52k dataset to develop AlpaCare, a medical LLM. Despite using a smaller domain-specific dataset than previous medical LLMs, AlpaCare demonstrates superior performance on medical applications, with up to a 38.1% absolute gain over the best baseline models in medical free-form instruction evaluations.
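To make the fine-tuning input concrete, the sketch below turns one instruction record into a prompt and a target string. It assumes an Alpaca-style template and `instruction`/`input`/`output` field names; the summary does not state the exact format AlpaCare uses, and the example record is hypothetical.

```python
# Alpaca-style templates (assumed here; not confirmed by this summary).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def format_example(record):
    """Return (prompt, target) strings for one instruction-tuning record."""
    instruction = record["instruction"]
    context = record.get("input", "").strip()
    if context:
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    # During supervised fine-tuning the model learns to produce the target after the prompt.
    return prompt, record["output"]


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    example = {
        "instruction": "Explain what a complete blood count measures.",
        "input": "",
        "output": "A complete blood count measures red cells, white cells, and platelets.",
    }
    prompt, target = format_example(example)
    print(prompt + target)
```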

Furthermore, AlpaCare also achieves a 6.7% absolute gain averaged over multiple general-domain benchmarks, showcasing its broad capabilities beyond just medical tasks. The researchers attribute this to the diverse and high-quality nature of the MedInstruct-52k dataset, which allowed AlpaCare to learn more generalizable medical knowledge and reasoning skills.
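Because the reported numbers are absolute gains, it may help to see how an absolute gain differs from a relative one, and how a gain is averaged over benchmarks. The scores and benchmark names below are made-up placeholders, not results from the paper.

```python
def absolute_gain(model_score, baseline_score):
    """Difference in percentage points between model and baseline."""
    return model_score - baseline_score


def relative_gain(model_score, baseline_score):
    """Improvement expressed as a percentage of the baseline score."""
    return 100.0 * (model_score - baseline_score) / baseline_score


def average_absolute_gain(model_scores, baseline_scores):
    """Mean of per-benchmark absolute gains over the benchmarks both dicts share."""
    shared = sorted(set(model_scores) & set(baseline_scores))
    gains = [absolute_gain(model_scores[name], baseline_scores[name]) for name in shared]
    return sum(gains) / len(gains)


if __name__ == "__main__":
    # Hypothetical scores (percent) on two placeholder benchmarks.
    model = {"benchmark_a": 50.0, "benchmark_b": 60.0}
    baseline = {"benchmark_a": 45.0, "benchmark_b": 52.0}
    print("average absolute gain:", average_absolute_gain(model, baseline))
    print("relative gain on benchmark_a:", relative_gain(50.0, 45.0))
```

An absolute gain of a few points can correspond to a much larger relative gain when the baseline score is low, so the two figures should not be compared directly.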

Human evaluation further confirms the effectiveness of AlpaCare, as it consistently outperforms the best baseline models in terms of both correctness and helpfulness when following medical instructions. This suggests that the researchers' approach of leveraging advanced language models and a curated dataset can effectively develop healthcare-focused language models that are better aligned with diverse human needs in the medical domain.

Critical Analysis

While the researchers have made a significant contribution by creating the MedInstruct-52k dataset and developing the AlpaCare model, there are a few potential areas for further research and improvement:

Dataset Limitations: The researchers acknowledge that the MedInstruct-52k dataset, although diverse, is still machine-generated and may not fully capture the nuances and complexities of real-world medical scenarios. Incorporating more human-generated data or collaborating with medical experts to further refine the dataset could be an area for future work.
Model Generalization: While AlpaCare demonstrates strong performance on medical applications and general-domain benchmarks, it would be interesting to evaluate it in more diverse settings, for example the massively multilingual setting targeted by models such as MaLA-500, to further validate its generalization capabilities.
Ethical Considerations: As with any powerful language model, there are potential ethical concerns around the use of AlpaCare in sensitive medical domains. The researchers should consider addressing issues such as data privacy, bias, and the responsible deployment of the model in clinical settings.
Comparison to Specialized Medical LLMs: It would be valuable to conduct a more in-depth comparison between AlpaCare and other specialized medical language models, such as those developed by major healthcare organizations, to better understand the strengths and weaknesses of each approach.

Overall, the researchers have presented a compelling approach to developing a more capable and generalizable medical language model. By addressing the limitations of previous studies and leveraging advanced language models and curated datasets, they have taken an important step forward in aligning LLMs with diverse human needs in the medical domain.

Conclusion

The researchers in this study have made a significant contribution to the field of large language model (LLM) development for medical applications. By creating the diverse, machine-generated MedInstruct-52k dataset and fine-tuning LLaMA-series models to develop AlpaCare, they have demonstrated a novel approach to improving the medical instruction-following ability and generalizability of LLMs.

Despite using a smaller domain-specific dataset than previous medical LLMs, AlpaCare outperforms the best baseline models by a substantial margin on medical applications, while also achieving strong results on general-domain benchmarks. This suggests that the researchers' approach of leveraging advanced language models and curated datasets can effectively create healthcare-focused language models that are better aligned with diverse human needs in the medical domain.

The public release of the MedInstruct-52k dataset, the AlpaCare model, and the accompanying codebase could have far-reaching implications, enabling further research and development in localizing large language models for specific domains, such as healthcare. By continuing to push the boundaries of LLM capabilities in medical applications, the researchers have taken an important step towards more effective and personalized healthcare solutions powered by advanced language technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks

Yanis Labrak, Mickael Rouvier, Richard Dufour

We evaluate four state-of-the-art instruction-tuned large language models (LLMs) -- ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca -- on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English, such as named-entity recognition (NER), question-answering (QA), relation extraction (RE), etc. Our overall results demonstrate that the evaluated LLMs begin to approach performance of state-of-the-art models in zero- and few-shot scenarios for most tasks, and particularly well for the QA task, even though they have never seen examples from these tasks before. However, we observed that the classification and RE tasks perform below what can be achieved with a specifically trained model for the medical field, such as PubMedBERT. Finally, we noted that no LLM outperforms all the others on all the studied tasks, with some models being better suited for certain tasks than others.

6/11/2024

cs.CL cs.AI cs.LG

BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing

Hieu Tran, Zhichao Yang, Zonghai Yao, Hong Yu

This work aims to enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles. We created BioInstruct, comprising 25,005 instructions, to instruction-tune LLMs (LLaMA 1 & 2, 7B & 13B versions). The instructions were created by prompting the GPT-4 language model with three seed samples randomly drawn from 80 human-curated instructions. We employed Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into three major categories: question answering (QA), information extraction (IE), and text generation (GEN). We also examined whether categories of instructions (e.g., QA, IE, and generation) impact model performance. Compared with LLMs without instruction tuning, our instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA, 5.7% in IE, and 96% in generation tasks. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive with or even surpassed other LLMs in the biomedical domain that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. Our findings align with the observations of multi-task learning, suggesting synergies between tasks. The BioInstruct dataset serves as a valuable resource, and instruction-tuned LLMs lead to the best-performing BioNLP applications.

6/10/2024

cs.CL cs.AI

LlamaCare: A Large Medical Language Model for Enhancing Healthcare Knowledge Sharing

Maojun Sun

Large language models (LLMs) have shown amazing capabilities in knowledge memorization and the present. However, when it comes to domain-specific knowledge and downstream tasks like medical, general LLMs are often unable to give precise answers. In addition, when people want LLMs to answer classification questions, they usually go through instruction tuning first. However, LLMs do not always give a direct index of the categorization after instruction tuning. In this paper, we proposed LlamaCare, a fine-tuned medical language model, and Extended Classification Integration (ECI), a module to handle classification problems of LLMs. Our contributions are: (i) We fine-tuned a large language model of medical knowledge with very low carbon emissions and achieved similar performance with ChatGPT by a 24G GPU. (ii) We solved the problem of redundant categorical answers and improved the performance of LLMs by proposing a new module called Extended Classification Integration. (iii) We released our processed data for one-shot and few-shot training for some benchmarks such as PubMedQA and USMLE 1-3 step. Our method achieves a close performance comparable to some state-of-the-art models with the same quantity of parameters on benchmarks, while being more environmentally friendly by using less GPU computation time. Our models, codes, and datasets can be found at https://github.com/Stephen-SMJ/LLamaCare.

6/6/2024

cs.CL cs.AI

Me LLaMA: Foundation Large Language Models for Medical Applications

Qianqian Xie, Qingyu Chen, Aokun Chen, Cheng Peng, Yan Hu, Fongci Lin, Xueqing Peng, Jimin Huang, Jeffrey Zhang, Vipina Keloth, Xinyu Zhou, Huan He, Lucila Ohno-Machado, Yonghui Wu, Hua Xu, Jiang Bian

Recent advancements in large language models (LLMs) such as ChatGPT and LLaMA have hinted at their potential to revolutionize medical applications, yet their application in clinical settings often reveals limitations due to a lack of specialized training on medical-specific data. In response to this challenge, this study introduces Me-LLaMA, a novel medical LLM family that includes foundation models - Me-LLaMA 13/70B, along with their chat-enhanced versions - Me-LLaMA 13/70B-chat, developed through continual pre-training and instruction tuning of LLaMA2 using large medical datasets. Our methodology leverages a comprehensive domain-specific data suite, including a large-scale, continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) across six critical medical tasks with 12 datasets. Our extensive evaluation using the MIBE shows that Me-LLaMA models achieve overall better performance than existing open-source medical LLMs in zero-shot, few-shot and supervised learning abilities. With task-specific instruction tuning, Me-LLaMA models outperform ChatGPT on 7 out of 8 datasets and GPT-4 on 5 out of 8 datasets. In addition, we investigated the catastrophic forgetting problem, and our results show that Me-LLaMA models outperform other open-source medical LLMs in mitigating this issue. Me-LLaMA is one of the largest open-source medical foundation LLMs that use both biomedical and clinical data. It exhibits superior performance across both general and medical tasks compared to other open-source medical LLMs, rendering it an attractive choice for medical AI applications. We release our models, datasets, and evaluation scripts at: https://github.com/BIDS-Xu-Lab/Me-LLaMA.

4/12/2024

cs.CL cs.AI