On Recovering Higher-order Interactions from Protein Language Models

Read original: arXiv:2405.06645 - Published 5/14/2024 by Darin Tsui, Amirali Aghazadeh

Overview

Protein language models can predict 3D protein structures and variant properties with state-of-the-art performance by leveraging evolutionary information.
However, understanding the complex mutational interactions that drive these model predictions is challenging, as it requires querying a vast sequence space.
While approaches to reduce computational complexity exist, they often limit the interpretability of the model to single and pairwise interactions.
This paper presents a framework for systematically analyzing the Fourier properties of the protein language model ESM2 and demonstrates that the framework can recover higher-order interactions in a computationally efficient manner.

Plain English Explanation

Proteins are the building blocks of life, and their 3D structures and properties are crucial for understanding how they function. Protein language models have become a powerful tool for predicting these properties, as they can leverage the wealth of evolutionary information stored in protein sequences.

However, the complex web of interactions between different parts of a protein (known as mutational interactions) that govern the model's predictions can be difficult to understand. Traditionally, researchers have had to exhaustively test all possible combinations of mutations, which quickly becomes computationally intractable as the number of sites increases.

While some approaches have been developed to reduce the computational complexity, they often come at the cost of interpretability, limiting the analysis to only single or pairwise interactions.

In this paper, the researchers present a new framework that allows them to systematically analyze the Fourier properties of the protein language model ESM2. By doing so, they can identify regions where the model's predictions are dominated by sparse, easy-to-interpret interactions, as well as regions with more complex, dense interactions.

Through validation on two sample proteins, the researchers demonstrate that they can recover the relevant interactions using only about 7 million of the roughly 10^13 possible sequences for 10 sites, which they report cuts computational time by a factor of 15,000. This could lead to a better understanding of how these language models make their predictions, and in turn to more robust and interpretable models for protein engineering and design.

Technical Explanation

The researchers developed a framework to perform a systematic Fourier analysis of the protein language model ESM2, which has demonstrated state-of-the-art performance in 3D structure and zero-shot variant prediction.

Extracting and explaining the complex mutational interactions that govern ESM2's predictions is challenging, as it would require querying the entire amino acid space for n sites using 20^n sequences, which is computationally expensive even for moderate values of n (e.g., n~10).
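To make the growth of the query space concrete, the short Python sketch below evaluates 20^n for a few values of n. The function name and the chosen values of n are illustrative only; the 20-letter alphabet and the n~10 regime come from the text above.

```python
# Size of the exhaustive query space for n mutated sites,
# assuming the standard 20-letter amino acid alphabet.
ALPHABET_SIZE = 20


def num_sequences(n_sites: int) -> int:
    """Number of distinct sequences over n_sites positions (20^n)."""
    return ALPHABET_SIZE ** n_sites


for n in (2, 5, 10):
    print(f"n = {n:2d}: {num_sequences(n):,} sequences")
# n =  2: 400 sequences
# n =  5: 3,200,000 sequences
# n = 10: 10,240,000,000,000 sequences
```

At n = 10 an exhaustive scan already needs about 10^13 model queries, which is the scale quoted in the paper's abstract.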

While approaches to lower the sample complexity exist, they often limit the interpretability of the model to just single and pairwise interactions.

The researchers hypothesized that by analyzing the Fourier properties of ESM2, they could identify regions in the sparsity-ruggedness plane where the model's predictions are dominated by sparse, easy-to-interpret interactions, as well as regions with more complex, dense interactions.
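As a rough illustration of what "sparse in the Fourier domain" can mean, the sketch below builds a toy landscape over a 4-letter alphabet and 3 sites, takes its n-dimensional discrete Fourier transform with NumPy, and counts nonzero coefficients by interaction order (the number of sites a coefficient involves). Everything here is an assumption made for illustration: the toy landscape, the alphabet size, and the helper names are hypothetical, and this is not the paper's implementation or its definition of sparsity or ruggedness.

```python
import itertools

import numpy as np

q, n = 4, 3  # toy alphabet size and number of sites (the ESM2 setting uses q = 20)


def toy_landscape(seq):
    """Hypothetical score: one single-site effect plus one pairwise interaction."""
    single = 1.0 if seq[0] == 1 else 0.0
    pair = 2.0 if (seq[1] == 2 and seq[2] == 3) else 0.0
    return single + pair


# Evaluate the landscape on all q^n sequences, stored as an array of shape (q,) * n.
f = np.zeros((q,) * n)
for seq in itertools.product(range(q), repeat=n):
    f[seq] = toy_landscape(seq)

# n-dimensional DFT over Z_q^n, normalized by the number of sequences.
F = np.fft.fftn(f) / f.size


def order_counts(coeffs, tol=1e-9):
    """Count nonzero Fourier coefficients grouped by interaction order."""
    counts = {}
    for k in itertools.product(range(q), repeat=n):
        if abs(coeffs[k]) > tol:
            order = sum(1 for ki in k if ki != 0)  # sites with nonzero frequency
            counts[order] = counts.get(order, 0) + 1
    return counts


counts = order_counts(F)
print(counts, "nonzero:", sum(counts.values()), "of", f.size)
```

For this toy landscape, 19 of the 64 coefficients are nonzero, and none has order 3, because the function contains no three-way interaction. A sparse Fourier algorithm aims to find such nonzero coefficients without evaluating every sequence.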

To test this, they applied the ESM2 model to three different proteins (GFP, TP53, and GB1) across various sites, covering a total of 228 experiments. They found that ESM2's predictions were dominated by three distinct regions in the sparsity-ruggedness plane, two of which were better suited for sparse Fourier transforms.

Validations on two sample proteins demonstrated recovery of all interactions with R^2=0.72 in the more sparse region and R^2=0.66 in the more dense region, using only 7 million of the 20^10 ≈ 10^13 possible ESM2 samples. The authors report that this reduces computational time by a factor of 15,000.
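The sample budget can be put in perspective with a quick calculation. The sketch below uses only the figures quoted above (7 million samples, 20^10 total sequences); the variable names are illustrative.

```python
# Fraction of the full 10-site sequence space covered by 7 million samples.
total = 20 ** 10        # all sequences over 10 sites
sampled = 7_000_000     # samples reported in the paper's abstract

fraction = sampled / total
ratio = total / sampled

print(f"total sequences : {total:.3e}")
print(f"fraction sampled: {fraction:.2e}")
print(f"sample ratio    : about 1 in {ratio:,.0f}")
```

The sampled set is roughly one sequence in 1.46 million. This sample ratio is a different quantity from the reported 15,000-fold reduction in computational time, and the summary does not say how the timing figure was measured.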

Critical Analysis

The researchers provide a compelling approach to systematically analyze the Fourier properties of protein language models like ESM2, allowing them to identify regions where the model's predictions are dominated by sparse, interpretable interactions. This is a significant advancement over previous methods that were limited to single and pairwise interactions.

However, the researchers acknowledge that the assumption of sparsity may not always hold, and there may be other metrics beyond sparsity that are needed to assess the utility of Fourier algorithms for extracting interactions from language models.

Additionally, the analysis covered only three proteins (GFP, TP53, and GB1), with validation on two sample proteins, so it would be valuable to see the framework applied to a wider range of proteins to assess its generalizability.

Future research could also explore the potential of this approach to inform the design of more interpretable protein language models, perhaps by incorporating the Fourier properties directly into the model architecture or training process.

Conclusion

This paper presents a novel framework for systematically analyzing the Fourier properties of protein language models, which enables the efficient recovery of complex mutational interactions that drive the models' predictions. By identifying regions in the sparsity-ruggedness plane where the model is dominated by sparse, interpretable interactions, the researchers were able to dramatically reduce the computational complexity required to extract these insights.

The work could improve understanding of how protein language models make their predictions and support more robust and interpretable models for protein engineering and design. Recovering the interactions that govern a model's behavior quickly and accurately matters for applications ranging from drug discovery to synthetic biology.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

On Recovering Higher-order Interactions from Protein Language Models

Darin Tsui, Amirali Aghazadeh

Protein language models leverage evolutionary information to perform state-of-the-art 3D structure and zero-shot variant prediction. Yet, extracting and explaining all the mutational interactions that govern model predictions remains difficult as it requires querying the entire amino acid space for $n$ sites using $20^n$ sequences, which is computationally expensive even for moderate values of $n$ (e.g., $n\sim10$). Although approaches to lower the sample complexity exist, they often limit the interpretability of the model to just single and pairwise interactions. Recently, computationally scalable algorithms relying on the assumption of sparsity in the Fourier domain have emerged to learn interactions from experimental data. However, extracting interactions from language models poses unique challenges: it's unclear if sparsity is always present or if it is the only metric needed to assess the utility of Fourier algorithms. Herein, we develop a framework to do a systematic Fourier analysis of the protein language model ESM2 applied on three proteins, green fluorescent protein (GFP), tumor protein P53 (TP53), and G domain B1 (GB1), across various sites for 228 experiments. We demonstrate that ESM2 is dominated by three regions in the sparsity-ruggedness plane, two of which are better suited for sparse Fourier transforms. Validations on two sample proteins demonstrate recovery of all interactions with $R^2=0.72$ in the more sparse region and $R^2=0.66$ in the more dense region, using only 7 million out of $20^{10}\sim10^{13}$ ESM2 samples, reducing the computational time by a staggering factor of 15,000. All codes and data are available on our GitHub repository https://github.com/amirgroup-codes/InteractionRecovery.

5/14/2024

Ranking protein-protein models with large language models and graph neural networks

Xiaotong Xu, Alexandre M. J. J. Bonvin

Protein-protein interactions (PPIs) are associated with various diseases, including cancer, infections, and neurodegenerative disorders. Obtaining three-dimensional structural information on these PPIs serves as a foundation to interfere with those or to guide drug design. Various strategies can be followed to model those complexes, all typically resulting in a large number of models. A challenging step in this process is the identification of good models (near-native PPI conformations) from the large pool of generated models. To address this challenge, we previously developed DeepRank-GNN-esm, a graph-based deep learning algorithm for ranking modelled PPI structures harnessing the power of protein language models. Here, we detail the use of our software with examples. DeepRank-GNN-esm is freely available at https://github.com/haddocking/DeepRank-GNN-esm

7/24/2024

Boltzmann machine learning and regularization methods for inferring evolutionary fields and couplings from a multiple sequence alignment

Sanzo Miyazawa

The inverse Potts problem to infer a Boltzmann distribution for homologous protein sequences from their single-site and pairwise amino acid frequencies has recently attracted a great deal of attention in studies of protein structure and evolution. We study regularization and learning methods and how to tune regularization parameters to correctly infer interactions in Boltzmann machine learning. Using $L_2$ regularization for fields, group $L_1$ for couplings is shown to be very effective for sparse couplings in comparison with $L_2$ and $L_1$. Two regularization parameters are tuned to yield equal values for both the sample and ensemble averages of evolutionary energy. Both averages smoothly change and converge, but their learning profiles are very different between learning methods. The Adam method is modified to make stepsize proportional to the gradient for sparse couplings. It is shown by first inferring interactions from protein sequences and then from Monte Carlo samples that the fields and couplings can be well recovered, but that recovering the pairwise correlations in the resolution of a total energy is harder for the natural proteins than for the protein-like sequences. Selective temperature for folding/structural constraints in protein evolution is also estimated.

7/23/2024

Multi-level Interaction Modeling for Protein Mutational Effect Prediction

Yuanle Mo, Xin Hong, Bowen Gao, Yinjun Jia, Yanyan Lan

Protein-protein interactions are central mediators in many biological processes. Accurately predicting the effects of mutations on interactions is crucial for guiding the modulation of these interactions, thereby playing a significant role in therapeutic development and drug discovery. Mutations generally affect interactions hierarchically across three levels: mutated residues exhibit different sidechain conformations, which lead to changes in the backbone conformation, eventually affecting the binding affinity between proteins. However, existing methods typically focus only on sidechain-level interaction modeling, resulting in suboptimal predictions. In this work, we propose a self-supervised multi-level pre-training framework, ProMIM, to fully capture all three levels of interactions with well-designed pretraining objectives. Experiments show ProMIM outperforms all the baselines on the standard benchmark, especially on mutations where significant changes in backbone conformations may occur. In addition, leading results from zero-shot evaluations for SARS-CoV-2 mutational effect prediction and antibody optimization underscore the potential of ProMIM as a powerful next-generation tool for developing novel therapeutic approaches and new drugs.

5/29/2024