Boltzmann machine learning and regularization methods for inferring evolutionary fields and couplings from a multiple sequence alignment

Read original: arXiv:1909.05006 - Published 7/23/2024 by Sanzo Miyazawa

Overview

The paper addresses the inverse Potts problem: inferring a Boltzmann distribution for homologous protein sequences from their single-site and pairwise amino acid frequencies.
It studies regularization and learning methods, and how to tune the regularization parameters so that interactions are inferred correctly in Boltzmann machine learning.
Combining L2 regularization for fields with group L1 regularization for couplings proved very effective for sparse couplings, compared with using plain L2 or L1.
The Adam method is modified so that the step size is proportional to the gradient, a change aimed at sparse couplings.

Plain English Explanation

The paper looks at a problem in biology called the "inverse Potts problem." This problem is about figuring out the underlying Boltzmann distribution for a group of related protein sequences, based on the frequency of amino acids at individual positions and how often pairs of amino acids appear together.

The researchers tested different regularization techniques and learning methods to see which ones most accurately recover the interactions between amino acids in these protein sequences. They found that combining L2 regularization for the fields (the single-site terms) with group L1 regularization for the couplings (the interactions between pairs of sites) was very effective, especially when the couplings were sparse (i.e., only a few pairs of sites interact appreciably).

The paper also describes a modification of Adam, a popular optimization algorithm, for these sparse couplings: the step size is made proportional to the gradient.

Overall, this research helps biologists better understand the complex evolutionary forces that shape the structures and functions of proteins, which are the building blocks of life.

Technical Explanation

The researchers used Boltzmann machine learning to infer a Boltzmann distribution for a set of homologous protein sequences. The distribution is parameterized by fields (single-site terms) and couplings (pairwise terms), which are fitted so that the model reproduces the single-site and pairwise amino acid frequencies observed in the multiple sequence alignment.
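
As an illustration of the input statistics, the single-site and pairwise frequencies can be computed from an integer-encoded alignment along the following lines. This is a minimal sketch, not the paper's code: the function name `site_and_pair_frequencies`, the toy alignment, and the three-letter alphabet are assumptions made for the example, and it omits sequence weighting and pseudocounts, which real analyses commonly add.

```python
import numpy as np

def site_and_pair_frequencies(msa, q):
    """Return single-site and pairwise state frequencies of an alignment.

    msa: integer array of shape (n_seq, length) with states in 0..q-1.
    q:   number of states per site (hypothetical small alphabet here).
    """
    msa = np.asarray(msa)
    n_seq = msa.shape[0]
    onehot = np.eye(q)[msa]                  # shape (n_seq, length, q)
    f_i = onehot.mean(axis=0)                # shape (length, q)
    # f_ij[i, j, a, b] = fraction of sequences with state a at i and b at j
    f_ij = np.einsum("nia,njb->ijab", onehot, onehot) / n_seq
    return f_i, f_ij

# Toy alignment: 4 sequences, 3 sites, 3 states (illustrative data only)
toy_msa = np.array([[0, 1, 2],
                    [0, 1, 1],
                    [1, 0, 2],
                    [0, 1, 2]])
f_i, f_ij = site_and_pair_frequencies(toy_msa, q=3)
```

Summing `f_ij[i, j]` over its last axis gives back `f_i[i]`, which is a quick consistency check on any implementation of these statistics.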

To learn the fields and couplings that define this Boltzmann distribution, the researchers experimented with different regularization techniques, including L2 and group L1 regularization. They found that using L2 regularization for the fields and group L1 for the couplings was very effective, especially for sparse couplings.
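
The difference between these penalties can be sketched in a few lines. The code below assumes that each site pair's block of couplings forms one group for the group L1 penalty, and that couplings are stored as an array `J[i, j, a, b]`; the function names and the regularization weights `lam_h` and `lam_J` are hypothetical names chosen for this example, not taken from the paper.

```python
import numpy as np

def l2_fields_penalty(h, lam_h):
    """L2 penalty on the fields h, shape (length, q)."""
    return lam_h * np.sum(h ** 2)

def group_l1_couplings_penalty(J, lam_J):
    """Group L1 penalty: sum over site pairs i<j of the Euclidean norm
    of the whole q-by-q block J[i, j]."""
    iu, ju = np.triu_indices(J.shape[0], k=1)
    block_norms = np.sqrt(np.sum(J[iu, ju] ** 2, axis=(1, 2)))
    return lam_J * np.sum(block_norms)

def elementwise_l1_penalty(J, lam_J):
    """Plain L1 penalty over individual coupling entries, for comparison."""
    iu, ju = np.triu_indices(J.shape[0], k=1)
    return lam_J * np.sum(np.abs(J[iu, ju]))

# Toy parameters: 3 sites, 2 states, only the pair (0, 1) is coupled
h = np.ones((3, 2))
J = np.zeros((3, 3, 2, 2))
J[0, 1] = [[3.0, 0.0], [0.0, 4.0]]

pen_fields = l2_fields_penalty(h, lam_h=0.5)
pen_group = group_l1_couplings_penalty(J, lam_J=1.0)
pen_plain = elementwise_l1_penalty(J, lam_J=1.0)
```

Because the group penalty acts on the norm of a whole block, it tends to switch entire site pairs on or off rather than individual entries, which is the usual motivation for applying it when only a few pairs are expected to interact.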

The researchers also modified the Adam optimization algorithm so that the step size is proportional to the gradient, a change aimed at sparse couplings.
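
To make the contrast with standard Adam concrete, the sketch below implements the usual per-parameter update alongside one possible variant whose step is proportional to the gradient. The variant divides every parameter's first-moment estimate by a single shared scale (here the largest second-moment estimate); that choice, the function name `adam_step`, and the `shared_scale` flag are assumptions for illustration, and the paper's exact update rule may differ.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999,
              eps=1e-8, shared_scale=False):
    """One Adam update; returns (new_theta, new_m, new_v).

    shared_scale=False: standard Adam, each parameter is normalized by
    its own second-moment estimate, so step sizes are nearly equal.
    shared_scale=True:  illustrative variant, all parameters share one
    normalizer, so each step is proportional to its gradient estimate.
    """
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    if shared_scale:
        denom = np.sqrt(v_hat.max()) + eps   # one scalar for all parameters
    else:
        denom = np.sqrt(v_hat) + eps         # one value per parameter
    return theta - lr * m_hat / denom, m, v

# Toy gradient with one large, one small, and one zero component
grad = np.array([1.0, 0.01, 0.0])
zeros = np.zeros(3)

theta_std, _, _ = adam_step(zeros, grad, zeros, zeros, t=1)
theta_mod, _, _ = adam_step(zeros, grad, zeros, zeros, t=1, shared_scale=True)
step_std = -theta_std   # starting point is zero, so the step is -theta
step_mod = -theta_mod
```

In this toy case standard Adam moves the large-gradient and small-gradient parameters by almost the same amount, whereas the shared-scale variant keeps their steps in the same 100:1 ratio as the gradients, so parameters with negligible gradients barely move.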

By first inferring the interactions from the protein sequences and then from Monte Carlo samples, the researchers were able to show that the fields and couplings could be well recovered. However, they found that recovering the pairwise correlations in the resolution of a total energy was harder for the natural proteins than for the protein-like sequences.
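
Monte Carlo samples for a test of this kind are typically drawn from a model with known fields and couplings by a Markov chain sampler. The following is a minimal Metropolis sketch under stated assumptions, not the paper's sampler: the energy is taken as minus the sum of fields and couplings, the couplings are stored symmetrically so that `J[i, j, a, b] == J[j, i, b, a]`, and the function names and toy parameters are hypothetical.

```python
import numpy as np

def potts_energy(seq, h, J):
    """Energy of one sequence: minus the summed fields and i<j couplings."""
    length = len(seq)
    e = -sum(h[i, seq[i]] for i in range(length))
    e -= sum(J[i, j, seq[i], seq[j]]
             for i in range(length) for j in range(i + 1, length))
    return e

def metropolis_sample(h, J, n_samples, n_sweeps=5, seed=0):
    """Draw sequences with single-site Metropolis moves.

    Assumes J is stored symmetrically: J[i, j, a, b] == J[j, i, b, a].
    """
    rng = np.random.default_rng(seed)
    length, q = h.shape
    seq = rng.integers(q, size=length)
    samples = []
    for _ in range(n_samples):
        for _ in range(n_sweeps * length):
            i = rng.integers(length)
            old, new = seq[i], rng.integers(q)
            # Energy change from replacing state `old` by `new` at site i
            d = -(h[i, new] - h[i, old])
            for j in range(length):
                if j != i:
                    d -= J[i, j, new, seq[j]] - J[i, j, old, seq[j]]
            if d <= 0 or rng.random() < np.exp(-d):
                seq[i] = new
        samples.append(seq.copy())
    return np.array(samples)

# Toy model: 4 sites, 3 states, no couplings, fields favoring state 0
h = np.zeros((4, 3))
h[:, 0] = 5.0
J = np.zeros((4, 4, 3, 3))
samples = metropolis_sample(h, J, n_samples=200)
```

Samples produced this way can be fed back into the same inference procedure, which makes it possible to compare the recovered fields and couplings with the known ones that generated the data.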

The paper also discusses the concept of selective temperature for folding/structural constraints in protein evolution, which the researchers were able to estimate.

Critical Analysis

The paper presents a thorough investigation of regularization and learning methods for the inverse Potts problem, which is an important problem in the study of protein structure and evolution. The researchers' use of both protein sequences and Monte Carlo samples to test their methods is a strength, as it allows them to evaluate the performance of their approach on both real-world and simulated data.

One potential limitation of the study is that the researchers only focused on pairwise interactions between amino acids, rather than higher-order interactions. While pairwise interactions are often the most important, higher-order interactions can also play a significant role in protein structure and function. Future research could explore methods for inferring these higher-order interactions as well.

Additionally, the paper does not provide much detail on the specific protein sequences used in the experiments or the characteristics of the Monte Carlo samples. More information on the data sources and their properties could help readers better understand the context and generalizability of the results.

Overall, this paper makes a valuable contribution to the field of protein structure and evolution by demonstrating effective techniques for inferring the underlying Boltzmann distribution from protein sequence data. The insights and methods presented could be useful for researchers working on a variety of problems in computational biology and biophysics.

Conclusion

This paper addresses the inverse Potts problem, which is a key challenge in understanding the evolutionary forces that shape protein structures and functions. The researchers explored different regularization and learning methods to accurately infer the interactions between amino acids in homologous protein sequences.

Their findings suggest that combining L2 regularization for fields with group L1 regularization for couplings is a highly effective approach, especially for sparse couplings. The researchers also modified the Adam optimization algorithm to make the step size proportional to the gradient for these sparse interactions.

By testing their methods on both real-world protein sequences and simulated data, the researchers were able to gain valuable insights into the strengths and limitations of their approach. While the paper focuses on pairwise interactions, future research could explore methods for inferring higher-order interactions as well.

Overall, this work represents an important step forward in our understanding of protein evolution and structure, with potential applications in fields like computational biology, biophysics, and drug discovery.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Boltzmann machine learning and regularization methods for inferring evolutionary fields and couplings from a multiple sequence alignment

Sanzo Miyazawa

The inverse Potts problem to infer a Boltzmann distribution for homologous protein sequences from their single-site and pairwise amino acid frequencies has recently attracted a great deal of attention in studies of protein structure and evolution. We study regularization and learning methods and how to tune regularization parameters to correctly infer interactions in Boltzmann machine learning. Using $L_2$ regularization for fields, group $L_1$ for couplings is shown to be very effective for sparse couplings in comparison with $L_2$ and $L_1$. Two regularization parameters are tuned to yield equal values for both the sample and ensemble averages of evolutionary energy. Both averages smoothly change and converge, but their learning profiles are very different between learning methods. The Adam method is modified to make stepsize proportional to the gradient for sparse couplings. It is shown by first inferring interactions from protein sequences and then from Monte Carlo samples that the fields and couplings can be well recovered, but that recovering the pairwise correlations in the resolution of a total energy is harder for the natural proteins than for the protein-like sequences. Selective temperature for folding/structural constraints in protein evolution is also estimated.

7/23/2024

Conditional Normalizing Flows for Active Learning of Coarse-Grained Molecular Representations

Henrik Schopmans, Pascal Friederich

Efficient sampling of the Boltzmann distribution of molecular systems is a long-standing challenge. Recently, instead of generating long molecular dynamics simulations, generative machine learning methods such as normalizing flows have been used to learn the Boltzmann distribution directly, without samples. However, this approach is susceptible to mode collapse and thus often does not explore the full configurational space. In this work, we address this challenge by separating the problem into two levels, the fine-grained and coarse-grained degrees of freedom. A normalizing flow conditioned on the coarse-grained space yields a probabilistic connection between the two levels. To explore the configurational space, we employ coarse-grained simulations with active learning which allows us to update the flow and make all-atom potential energy evaluations only when necessary. Using alanine dipeptide as an example, we show that our methods obtain a speedup to molecular dynamics simulations of approximately 15.9 to 216.2 compared to the speedup of 4.5 of the current state-of-the-art machine learning approach.

5/27/2024

Physics-Informed Weakly Supervised Learning for Interatomic Potentials

Makoto Takamoto, Viktor Zaverkin, Mathias Niepert

Machine learning plays an increasingly important role in computational chemistry and materials science, complementing computationally intensive ab initio and first-principles methods. Despite their utility, machine-learning models often lack generalization capability and robustness during atomistic simulations, yielding unphysical energy and force predictions that hinder their real-world applications. We address this challenge by introducing a physics-informed, weakly supervised approach for training machine-learned interatomic potentials (MLIPs). We introduce two novel loss functions, extrapolating the potential energy via a Taylor expansion and using the concept of conservative forces. Our approach improves the accuracy of MLIPs applied to training tasks with sparse training data sets and reduces the need for pre-training computationally demanding models with large data sets. Particularly, we perform extensive experiments demonstrating reduced energy and force errors -- often lower by a factor of two -- for various baseline models and benchmark data sets. Finally, we show that our approach facilitates MLIPs' training in a setting where the computation of forces is infeasible at the reference level, such as those employing complete-basis-set extrapolation.

8/13/2024

On Recovering Higher-order Interactions from Protein Language Models

Darin Tsui, Amirali Aghazadeh

Protein language models leverage evolutionary information to perform state-of-the-art 3D structure and zero-shot variant prediction. Yet, extracting and explaining all the mutational interactions that govern model predictions remains difficult as it requires querying the entire amino acid space for $n$ sites using $20^n$ sequences, which is computationally expensive even for moderate values of $n$ (e.g., $n\sim10$). Although approaches to lower the sample complexity exist, they often limit the interpretability of the model to just single and pairwise interactions. Recently, computationally scalable algorithms relying on the assumption of sparsity in the Fourier domain have emerged to learn interactions from experimental data. However, extracting interactions from language models poses unique challenges: it's unclear if sparsity is always present or if it is the only metric needed to assess the utility of Fourier algorithms. Herein, we develop a framework to do a systematic Fourier analysis of the protein language model ESM2 applied on three proteins, green fluorescent protein (GFP), tumor protein P53 (TP53), and G domain B1 (GB1), across various sites for 228 experiments. We demonstrate that ESM2 is dominated by three regions in the sparsity-ruggedness plane, two of which are better suited for sparse Fourier transforms. Validations on two sample proteins demonstrate recovery of all interactions with $R^2=0.72$ in the more sparse region and $R^2=0.66$ in the more dense region, using only 7 million out of $20^{10}\sim10^{13}$ ESM2 samples, reducing the computational time by a staggering factor of 15,000. All codes and data are available on our GitHub repository https://github.com/amirgroup-codes/InteractionRecovery.

5/14/2024