BioInstruct: Instruction Tuning of Large Language Models for Biomedical Natural Language Processing

arXiv: 2310.19975

Published 6/10/2024 by Hieu Tran, Zhichao Yang, Zonghai Yao, Hong Yu

Abstract

This study aims to enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles. We created BioInstruct, comprising 25,005 instructions, to instruction-tune LLMs (LLaMA 1 & 2, 7B & 13B versions). The instructions were created by prompting the GPT-4 language model with three seed samples randomly drawn from 80 human-curated instructions. We employed Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into three major categories: question answering (QA), information extraction (IE), and text generation (GEN). We also examined whether categories (e.g., QA, IE, and generation) of instructions impact model performance. Compared with LLMs without instruction tuning, our instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA, 5.7% in IE, and 96% in generation tasks. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive with or even surpassed other LLMs in the biomedical domain that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. Our findings align with the observations of multi-task learning, suggesting synergies between the two tasks. The BioInstruct dataset serves as a valuable resource, and instruction-tuned LLMs lead to the best-performing BioNLP applications.

Overview

The researchers aimed to enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) tasks.
They introduced a domain-specific instruction dataset called BioInstruct, comprising 25,005 instructions to instruction-tune LLMs (LLaMA 1 & 2, 7B & 13B versions).
They employed Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning.
They evaluated the instruction-tuned LLMs on various BioNLP tasks, including question answering (QA), information extraction (IE), and text generation (GEN).
They also examined the impact of different instruction categories on model performance.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. However, their performance in specific domains, like biomedical natural language processing (BioNLP), can be limited.

The researchers in this study wanted to improve the performance of LLMs in BioNLP tasks. To do this, they created a dataset of instructions related to biomedical topics, called BioInstruct. They used this dataset to "instruction-tune" the LLaMA 1 and LLaMA 2 language models, each in its 7B- and 13B-parameter size.

Instruction-tuning is a technique where the model is trained on pairs of instructions and desired responses, which helps it better understand and follow specific types of tasks. The researchers used a parameter-efficient fine-tuning method called LoRA to tune the LLaMA models.

After instruction-tuning, the researchers tested the models on various BioNLP tasks, such as answering questions, extracting information, and generating text. They found that the instruction-tuned models significantly outperformed the original LLaMA models, with a 17.3% improvement in question answering, 5.7% in information extraction, and 96% in text generation.

Interestingly, the researchers also discovered that the performance gains were higher when the instructions used for tuning were closely related to the tasks being evaluated. This matches what is seen in multi-task learning, where training on related tasks tends to help more than training on unrelated ones.

Overall, this study shows that instruction-tuning can be a powerful way to enhance the performance of LLMs in specialized domains, like biomedical natural language processing. The BioInstruct dataset and the instruction-tuned LLaMA models developed in this research can be valuable resources for future BioNLP applications.

Technical Explanation

The researchers created the BioInstruct dataset, which contains 25,005 instructions related to biomedical topics. They generated these instructions by prompting the GPT-4 language model with three seed samples randomly drawn from a pool of 80 human-curated instructions.
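
The seed-sampling step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_generation_prompt`, the placeholder seed texts, and the prompt wording are all hypothetical, and the call that would send the prompt to GPT-4 is deliberately left out.

```python
import random


def build_generation_prompt(seed_pool, num_seeds=3, rng=None):
    """Draw seed instructions at random and format them into one prompt.

    The prompt wording is an illustrative assumption, not the paper's text.
    """
    rng = rng or random.Random()
    # random.Random.sample draws without replacement, so the seeds are distinct.
    seeds = rng.sample(seed_pool, num_seeds)
    lines = ["Here are example biomedical instructions:"]
    for index, seed in enumerate(seeds, start=1):
        lines.append(f"{index}. {seed}")
    lines.append("Write new, different biomedical instructions in the same style.")
    return seeds, "\n".join(lines)


# Placeholder pool standing in for the 80 human-curated instructions.
seed_pool = [f"Seed instruction {i}" for i in range(1, 81)]

# A fixed seed keeps the draw reproducible for this example.
seeds, prompt = build_generation_prompt(seed_pool, rng=random.Random(0))
print(prompt)
```

Repeating the draw for each request means every prompt shows the generator a different trio of examples, which is one plausible way to encourage variety across the generated instructions.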

To fine-tune the LLaMA 1 and 2 models (7B and 13B versions) with the BioInstruct dataset, the researchers employed Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method. LoRA keeps the pre-trained weights frozen and trains only small low-rank matrices added alongside them, so the number of trainable parameters is a small fraction of the full model. This reduces the computational cost and memory footprint of the fine-tuning process.
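
To make the low-rank idea concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. It is an illustration under stated assumptions rather than the authors' training code: the helper names (`matmul`, `lora_forward`, `trainable_params`), the rank of 8, and the 4096 x 4096 layer size are chosen for the example and are not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, frozen_w, down, up, alpha, rank):
    """Return x @ W + (alpha / rank) * (x @ down @ up).

    frozen_w is the pre-trained weight and is never updated;
    only the small matrices down (d_in x rank) and up (rank x d_out) are trained.
    """
    base = matmul(x, frozen_w)
    update = matmul(matmul(x, down), up)
    scale = alpha / rank
    return [
        [b + scale * u for b, u in zip(base_row, update_row)]
        for base_row, update_row in zip(base, update)
    ]


def trainable_params(d_in, d_out, rank):
    """Number of trainable values in the two low-rank matrices."""
    return rank * (d_in + d_out)


x = [[1.0, 2.0]]
frozen_w = [[1.0, 0.0], [0.0, 1.0]]
down = [[0.5], [0.5]]

# With the up-projection initialised to zero, the adapted layer
# starts out identical to the frozen layer.
print(lora_forward(x, frozen_w, down, [[0.0, 0.0]], alpha=2, rank=1))  # [[1.0, 2.0]]

# After training moves the up-projection, the output shifts.
print(lora_forward(x, frozen_w, down, [[1.0, -1.0]], alpha=2, rank=1))  # [[4.0, -1.0]]

# Illustrative sizes: a 4096 x 4096 layer adapted with rank 8.
print(trainable_params(4096, 4096, 8), "trainable vs", 4096 * 4096, "frozen")
```

For the illustrative sizes above, the adapter trains 65,536 values against 16,777,216 frozen ones, which is where the savings in compute and memory come from.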

The researchers then evaluated the instruction-tuned LLMs on various BioNLP tasks, which they grouped into three major categories: question answering (QA), information extraction (IE), and text generation (GEN). They also examined whether the category of instructions (e.g., QA, IE, and generation) had an impact on the model's performance.

Compared to LLMs without instruction-tuning, the researchers' instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA, 5.7% in IE, and 96% in generation tasks. Their 7B-parameter instruction-tuned LLaMA 1 model was competitive with, or even surpassed, other biomedical LLMs that were also fine-tuned from LLaMA 1 using vast domain-specific data or a variety of tasks.

The researchers' findings also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks, aligning with the observations of multi-task learning. This suggests that the synergies between the instructions and the target tasks can lead to greater performance improvements.

Critical Analysis

The researchers provide a comprehensive evaluation of their instruction-tuned LLMs on various BioNLP tasks, demonstrating the effectiveness of their approach. However, the paper does not discuss potential limitations or caveats of the study.

One potential area for further research could be to explore the generalization of the instruction-tuning approach to other specialized domains beyond biomedical natural language processing. It would be interesting to see if similar performance gains can be achieved in other fields, such as legal or financial text processing.

Additionally, the researchers do not provide much insight into the specific types of instructions that were most effective in improving the models' performance. A deeper analysis of the instruction categories and their impact on different BioNLP tasks could yield valuable insights for future instruction-tuning efforts.

Despite these minor limitations, the researchers' work represents an important contribution to the field of large language model adaptation and customization for domain-specific applications. The BioInstruct dataset and the instruction-tuned LLaMA models developed in this study can serve as valuable resources for researchers and practitioners in the biomedical natural language processing community.

Conclusion

This study demonstrates the effectiveness of instruction-tuning in enhancing the performance of large language models (LLMs) in the domain of biomedical natural language processing (BioNLP). By introducing the BioInstruct dataset and employing a parameter-efficient fine-tuning method (LoRA), the researchers were able to significantly improve the performance of LLaMA models on various BioNLP tasks, including question answering, information extraction, and text generation.

The findings of this research suggest that instruction-tuning can be a powerful technique for adapting LLMs to specialized domains, leveraging the synergies between the instructions and the target tasks. The BioInstruct dataset and the instruction-tuned LLaMA models developed in this study can serve as valuable resources for future BioNLP applications, potentially leading to more accurate and efficient biomedical text processing systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Towards Robust Instruction Tuning on Multimodal Large Language Models

Wei Han, Hui Chen, Soujanya Poria

Fine-tuning large language models (LLMs) on multi-task instruction-following data has been proven to be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent works about high-quality instruction-following data generation and selection require amounts of human labor to conceive model-understandable instructions for the given tasks and carefully filter the LLM-generated data. In this work, we introduce an automatic instruction augmentation method named INSTRAUG in multimodal tasks. It starts from a handful of basic and straightforward meta instructions but can expand an instruction-following dataset by 30 times. Results on two popular multimodal instruction-following benchmarks MULTIINSTRUCT and InstructBLIP show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks, which is even equivalent to the benefits of scaling up training data multiple times.

6/17/2024

cs.CL cs.AI

A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks

Yanis Labrak, Mickael Rouvier, Richard Dufour

We evaluate four state-of-the-art instruction-tuned large language models (LLMs) -- ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca -- on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English, such as named-entity recognition (NER), question-answering (QA), relation extraction (RE), etc. Our overall results demonstrate that the evaluated LLMs begin to approach performance of state-of-the-art models in zero- and few-shot scenarios for most tasks, and particularly well for the QA task, even though they have never seen examples from these tasks before. However, we observed that the classification and RE tasks perform below what can be achieved with a specifically trained model for the medical field, such as PubMedBERT. Finally, we noted that no LLM outperforms all the others on all the studied tasks, with some models being better suited for certain tasks than others.

6/11/2024

cs.CL cs.AI cs.LG

AlpaCare: Instruction-tuned Large Language Models for Medical Application

Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, Linda Ruth Petzold

Instruction-finetuning (IFT) has become crucial in aligning Large Language Models (LLMs) with diverse human needs and has shown great potential in medical applications. However, previous studies mainly fine-tune LLMs on biomedical datasets with limited diversity, which often rely on benchmarks or narrow task scopes, and hence significantly limit the effectiveness on their medical instruction-following ability and generalizability. To bridge this gap, we propose creating a diverse, machine-generated medical IFT dataset, MedInstruct-52k, using GPT-4 and ChatGPT with a high-quality expert-curated seed set. We then fine-tune LLaMA-series models on the dataset to develop AlpaCare. Despite using a smaller domain-specific dataset than previous medical LLMs, AlpaCare not only demonstrates superior performance on medical applications, with up to 38.1% absolute gain over best baselines in medical free-form instruction evaluations, but also achieves 6.7% absolute gains averaged over multiple general domain benchmarks. Human evaluation further shows that AlpaCare consistently outperforms best baselines in terms of both correctness and helpfulness. We offer public access to our data, model, and codebase in https://github.com/XZhang97666/AlpaCare.

6/11/2024

cs.CL cs.AI

From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning

Xuansheng Wu, Wenlin Yao, Jianshu Chen, Xiaoman Pan, Xiaoyang Wang, Ninghao Liu, Dong Yu

Large Language Models (LLMs) have achieved remarkable success, where instruction tuning is the critical step in aligning LLMs with user intentions. In this work, we investigate how the instruction tuning adjusts pre-trained models with a focus on intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution, and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective of the model shifts on a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts of user prompts, and promotes the response generation constantly conditioned on the instructions. 2) It encourages the self-attention heads to capture more word-word relationships about instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work that aims at explaining and optimizing LLMs for various applications. Our code and data are publicly available at https://github.com/JacksonWuxs/Interpret_Instruction_Tuning_LLMs.

4/5/2024

cs.CL cs.AI cs.LG