Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction

Read original: arXiv:2405.06729 - Published 5/14/2024 by Aleix Lafita, Ferran Gonzalez, Mahmoud Hossam, Paul Smyth, Jacob Deasy, Ari Allyn-Feuer, Daniel Seaton, Stephen Young


Overview

This paper explores how fine-tuning protein language models with deep mutational scanning data can improve the accuracy of predicting the effects of genetic variants.
The researchers developed a method to fine-tune large language models for protein sequences using experimental data on the functional impacts of mutations.
The fine-tuned models showed consistent improvements in variant effect prediction on a held-out protein test set and on independent benchmarks from ProteinGym and ClinVar.

Plain English Explanation

Proteins are the molecular machines that carry out most of the essential functions in our cells. Slight changes in the DNA sequence that encodes a protein, known as genetic variants, can alter how the protein functions. Predicting the effects of these genetic variants is crucial for understanding diseases and developing new treatments.

In this study, the researchers used a technique called "deep mutational scanning" to systematically measure the functional impacts of thousands of mutations in various proteins. They then used this experimental data to fine-tune large language models that had been pre-trained on a vast amount of protein sequence data.

The key insight is that by incorporating this direct experimental data on mutation effects, the language models could learn more accurate representations of protein structure and function. This allowed the models to make better predictions about the consequences of genetic variants, outperforming existing computational methods that rely more on indirect information.

More accurate variant effect predictions could accelerate progress in fields like molecular biology and drug discovery, where understanding the impacts of genetic changes is so important.

Technical Explanation

The researchers started with a pre-trained protein language model, specifically the ESM-1b transformer model. They then fine-tuned this model using deep mutational scanning data from several proteins, which provides detailed experimental measurements of how each possible amino acid substitution affects protein function.
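One common way to turn a masked protein language model's output into a variant effect score is the log-odds ratio between the mutant and wild-type residue at a position. The sketch below is illustrative only: it is not the authors' fine-tuning code or their NLR head, the logits are invented rather than produced by a real model, and the names `AMINO_ACIDS`, `log_softmax`, and `log_odds_scores` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def log_odds_scores(wild_type, position, logits):
    """Score every substitution at `position` (0-based) as
    log p(mutant) - log p(wild type). Negative means less likely than wild type."""
    log_probs = log_softmax(logits)
    wt_residue = wild_type[position]
    wt_index = AMINO_ACIDS.index(wt_residue)
    scores = {}
    for i, aa in enumerate(AMINO_ACIDS):
        if i == wt_index:
            continue  # skip the wild-type residue itself
        variant = f"{wt_residue}{position + 1}{aa}"  # e.g. "K2R", 1-based position
        scores[variant] = log_probs[i] - log_probs[wt_index]
    return scores

# Assumption: made-up logits for position 2 of a toy sequence,
# standing in for what a real model would output.
wild_type = "MKT"
logits = [0.0] * len(AMINO_ACIDS)
logits[AMINO_ACIDS.index("K")] = 2.0  # the model favours the wild-type K
logits[AMINO_ACIDS.index("R")] = 1.0  # R is the next most plausible residue
scores = log_odds_scores(wild_type, 1, logits)
```

A deep mutational scan supplies a measured effect for each of these 19 substitutions per position, which is what makes it usable as a supervised training signal for scores of this kind.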

By incorporating this direct, high-quality data on mutation effects, the fine-tuned models were able to learn more nuanced representations of protein structure and dynamics. The researchers then evaluated the models' ability to predict the functional impacts of genetic variants, comparing them to existing computational methods.

The results showed that the fine-tuned models consistently outperformed the baselines on a held-out protein test set and on independent DMS and clinical variant annotation benchmarks from ProteinGym and ClinVar, demonstrating the value of integrating experimental data into language model training. The authors suggest that this approach could be applied more broadly to enhance the predictive power of protein language models.
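Comparisons of this kind are often made by rank-correlating predicted scores with the measured DMS scores. The sketch below assumes Spearman rank correlation as the metric, which this summary does not specify, and uses invented numbers. The helper names `average_ranks`, `pearson`, and `spearman` are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(measured))

# Assumption: toy values, not data from the paper.
predicted = [-1.0, -2.0, -0.5, -3.0]  # model scores for four variants
measured = [0.8, 0.4, 0.9, 0.1]       # DMS fitness for the same variants
rho = spearman(predicted, measured)
```

Rank correlation suits this setting because model scores and assay readouts sit on different scales, so only the ordering of variants is compared.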

Critical Analysis

The paper provides a compelling demonstration of how fine-tuning language models with high-quality experimental data can substantially improve their performance on key tasks. However, a few caveats and areas for further research are worth noting:

The study focused on a relatively small number of proteins, so it will be important to validate the approach on a wider range of proteins and protein families to assess its generalizability.
The deep mutational scanning data used for fine-tuning may not be available for many proteins of interest, so techniques for leveraging sparse or noisy data could be important for expanding the applicability of this approach.
The paper did not explore the potential for selective fine-tuning approaches, which may be able to further enhance the models' performance while minimizing the amount of fine-tuning data required.

Overall, this work represents an important step forward in integrating experimental data and language modeling for protein engineering and design. Further research to address the limitations and expand the scope of this approach could yield significant benefits for fields that rely on accurate prediction of protein function.

Conclusion

This paper demonstrates that fine-tuning protein language models with deep mutational scanning data can substantially improve the accuracy of predicting the functional effects of genetic variants. By incorporating high-quality experimental measurements of mutation impacts, the language models were able to learn more nuanced representations of protein structure and dynamics, leading to better variant effect predictions compared to existing computational methods.

This approach has the potential to accelerate progress in fields like molecular biology and drug discovery, where understanding the consequences of genetic changes is crucial. While some limitations and avenues for further research remain, this work represents an important step forward in leveraging the power of language modeling for protein engineering and design.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction

Aleix Lafita, Ferran Gonzalez, Mahmoud Hossam, Paul Smyth, Jacob Deasy, Ari Allyn-Feuer, Daniel Seaton, Stephen Young

Protein Language Models (PLMs) have emerged as performant and scalable tools for predicting the functional impact and clinical significance of protein-coding variants, but they still lag experimental accuracy. Here, we present a novel fine-tuning approach to improve the performance of PLMs with experimental maps of variant effects from Deep Mutational Scanning (DMS) assays using a Normalised Log-odds Ratio (NLR) head. We find consistent improvements in a held-out protein test set, and on independent DMS and clinical variant annotation benchmarks from ProteinGym and ClinVar. These findings demonstrate that DMS is a promising source of sequence diversity and supervised training data for improving the performance of PLMs for variant effect prediction.

5/14/2024

Design Proteins Using Large Language Models: Enhancements and Comparative Analyses

Kamyar Zeinalipour, Neda Jamshidi, Monica Bianchini, Marco Maggini, Marco Gori

Pre-trained LLMs have demonstrated substantial capabilities across a range of conventional natural language processing (NLP) tasks, such as summarization and entity recognition. In this paper, we explore the application of LLMs in the generation of high-quality protein sequences. Specifically, we adopt a suite of pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B, to produce valid protein sequences. All of these models are publicly available. Unlike previous work in this field, our approach utilizes a relatively small dataset comprising 42,000 distinct human protein sequences. We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures. Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models such as ProGen varieties, ProtGPT2, and ProLLaMA, which were trained on millions of protein sequences. To validate and quantify the performance of our models, we conduct comparative analyses employing standard metrics such as pLDDT, RMSD, TM-score, and REU. Furthermore, we commit to making the trained versions of all four models publicly available, fostering greater transparency and collaboration in the field of computational biology.

8/14/2024

A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding

Yiqing Shen, Zan Chen, Michail Mamalakis, Luhan He, Haiyang Xia, Tianbin Li, Yanzhou Su, Junjun He, Yu Guang Wang

The parallels between protein sequences and natural language in their sequential structures have inspired the application of large language models (LLMs) to protein understanding. Despite the success of LLMs in NLP, their effectiveness in comprehending protein sequences remains an open question, largely due to the absence of datasets linking protein sequences to descriptive text. Researchers have then attempted to adapt LLMs for protein understanding by integrating a protein sequence encoder with a pre-trained LLM. However, this adaptation raises a fundamental question: Can LLMs, originally designed for NLP, effectively comprehend protein sequences as a form of language? Current datasets fall short in addressing this question due to the lack of a direct correlation between protein sequences and corresponding text descriptions, limiting the ability to train and evaluate LLMs for protein understanding effectively. To bridge this gap, we introduce ProteinLMDataset, a dataset specifically designed for further self-supervised pretraining and supervised fine-tuning (SFT) of LLMs to enhance their capability for protein sequence comprehension. Specifically, ProteinLMDataset includes 17.46 billion tokens for pretraining and 893,000 instructions for SFT. Additionally, we present ProteinLMBench, the first benchmark dataset consisting of 944 manually verified multiple-choice questions for assessing the protein understanding capabilities of LLMs. ProteinLMBench incorporates protein-related details and sequences in multiple languages, establishing a new standard for evaluating LLMs' abilities in protein comprehension. The large language model InternLM2-7B, pretrained and fine-tuned on the ProteinLMDataset, outperforms GPT-4 on ProteinLMBench, achieving the highest accuracy score.

7/9/2024

Enhancing Fault Detection for Large Language Models via Mutation-Based Confidence Smoothing

Qiang Hu, Jin Wen, Maxime Cordy, Yuheng Huang, Xiaofei Xie, Lei Ma

Large language models (LLMs) achieved great success in multiple application domains and attracted huge attention from different research communities recently. Unfortunately, even for the best LLM, there still exist many faults that LLM cannot correctly predict. Such faults will harm the usability of LLMs. How to quickly reveal them in LLMs is important, but challenging. The reasons are twofold, 1) the heavy labeling effort for preparing the test data, and 2) accessing closed-source LLMs such as GPT4 is money-required. To handle this problem, in the traditional deep learning testing field, test selection methods have been proposed for efficiently testing deep learning models by prioritizing faults. However, the usefulness of these methods on LLMs is unclear and under exploration. In this paper, we first study the effectiveness of existing fault detection methods for LLMs. Experimental results on four different tasks (including both code tasks and natural language processing tasks) and four LLMs (e.g., LLaMA and GPT4) demonstrated that existing fault detection methods cannot perform well on LLMs (e.g., seven out of eight methods perform worse than random selection on LLaMA). To enhance existing fault detection methods, we propose MuCS, a prompt Mutation-based prediction Confidence Smoothing method for LLMs. Concretely, we mutate the prompts and compute the average prediction confidence of all mutants as the input of fault detection methods. The results show that our proposed solution significantly enhances existing methods with the improvement of test relative coverage by up to 97.64%.

4/24/2024