Importance Weighted Expectation-Maximization for Protein Sequence Design

Read original: arXiv:2305.00386 - Published 7/18/2024 by Zhenqiao Song, Lei Li

Overview

Designing proteins with desired biological function is crucial in biology and chemistry
Recent machine learning methods use a surrogate sequence-function model to replace expensive wet-lab validation
The paper proposes IsEM-Pro, an approach to efficiently generate diverse and novel protein sequences with high fitness

Plain English Explanation

Proteins are the building blocks of life, and being able to design new proteins with specific functions is very important in fields like biology and chemistry. Traditional methods for designing new proteins are time-consuming and expensive, requiring a lot of lab work. Recent machine learning techniques have tried to address this by developing models that can predict the function of a protein based on its sequence.

The paper introduces a new approach called IsEM-Pro that can generate novel protein sequences that are predicted to have high "fitness" or function. IsEM-Pro uses a combination of a generative model to create diverse sequences, and a set of features based on Markov random fields to guide the generation towards high-fitness regions. The result is a system that can efficiently produce many new protein sequences that are predicted to be functional, without requiring as much expensive wet-lab testing.

Technical Explanation

At the core, IsEM-Pro is a latent generative model that can generate novel protein sequences. This is augmented by incorporating combinatorial structure features learned from a separate Markov random field (MRF) model. The authors develop a Monte Carlo Expectation-Maximization (MCEM) method to train the overall IsEM-Pro model.
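The general shape of an importance-weighted Monte Carlo EM loop can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the latent generative model and MRF features are replaced by a site-independent categorical model, and `toy_fitness`, the temperature `tau`, the pseudo-count `smooth`, and all other names are hypothetical. The Monte Carlo step draws sequences from the current model and weights them toward a fitness-tilted target proportional to exp(f(x) / tau); the maximization step refits the model by weighted maximum likelihood.

```python
import numpy as np

N_LETTERS = 20  # size of the amino-acid alphabet


def toy_fitness(seqs, target):
    """Hypothetical surrogate: fraction of positions that match `target`."""
    return (seqs == target).mean(axis=1)


def sample(theta, n, rng):
    """Draw n integer-encoded sequences from the site-independent model."""
    return np.stack(
        [rng.choice(theta.shape[1], size=n, p=row) for row in theta], axis=1
    )


def log_prob(theta, seqs):
    """Log-probability of each row of `seqs` under the model."""
    positions = np.arange(theta.shape[0])
    return np.log(theta[positions, seqs]).sum(axis=1)


def iw_em_step(theta, fitness_fn, n, tau, rng, smooth=1e-3):
    seqs = sample(theta, n, rng)
    # Monte Carlo E-step: self-normalized importance weights,
    # w(x) proportional to exp(f(x) / tau) / q_theta(x).
    log_w = fitness_fn(seqs) / tau - log_prob(theta, seqs)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # M-step: weighted maximum likelihood per position, with a small
    # pseudo-count so that no letter ends up with probability zero.
    new_theta = np.full(theta.shape, smooth)
    for i in range(theta.shape[0]):
        new_theta[i] += np.bincount(
            seqs[:, i], weights=w, minlength=theta.shape[1]
        )
    return new_theta / new_theta.sum(axis=1, keepdims=True)


def run(length=8, n=2000, tau=0.1, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    target = rng.integers(N_LETTERS, size=length)
    fitness_fn = lambda s: toy_fitness(s, target)
    theta = np.full((length, N_LETTERS), 1.0 / N_LETTERS)  # uniform start
    before = fitness_fn(sample(theta, n, rng)).mean()
    for _ in range(iters):
        theta = iw_em_step(theta, fitness_fn, n, tau, rng)
    after = fitness_fn(sample(theta, n, rng)).mean()
    return theta, before, after


theta, before, after = run()
print(f"mean toy fitness of samples: {before:.3f} -> {after:.3f}")
```

In this toy setting a lower `tau` concentrates the target on high-fitness sequences but makes the importance weights more uneven, so more samples per iteration are needed for a stable update.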

During inference, IsEM-Pro samples from the latent space of the generative model to enhance diversity, while the MRF features guide the exploration towards regions of high predicted fitness. Experiments on eight protein sequence design tasks show that IsEM-Pro outperforms the previous best methods by at least 55% on average fitness score, while also generating more diverse and novel protein sequences.
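This summary does not define how diversity and novelty are measured. One common convention, assumed here rather than taken from the paper, treats diversity as the mean pairwise Hamming distance among generated sequences and novelty as the mean distance from each generated sequence to its nearest neighbor in a reference set. A minimal sketch for equal-length sequences, with hypothetical function names:

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def diversity(seqs):
    """Mean pairwise Hamming distance within a set of generated sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def novelty(generated, reference):
    """Mean Hamming distance from each generated sequence to its nearest
    sequence in the reference (for example, training) set."""
    return sum(
        min(hamming(g, r) for r in reference) for g in generated
    ) / len(generated)


generated = ["AAAA", "AAAT", "TTTT"]
reference = ["AAAA", "CCCC"]
print(diversity(generated), novelty(generated, reference))
```

Both scores grow with sequence length, so dividing by the length is a reasonable variant when comparing tasks with different protein sizes.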

Critical Analysis

The paper provides a novel and promising approach for efficiently generating diverse protein sequences with high predicted functionality. The use of a generative model combined with MRF features is an interesting technical contribution that seems to improve upon prior methods.

However, the paper does not address some important limitations. For example, the accuracy of the predicted fitness scores relies on the quality of the underlying sequence-function model, which is not the focus of this work. Additionally, the experimental tasks are relatively narrow, so further validation on a broader range of protein design problems would be valuable.

It would also be important to understand how the generated sequences perform in actual wet-lab experiments, beyond just the predicted fitness scores. Ultimately, the true test of the approach's utility will be its ability to expedite the discovery of novel, functional proteins in real-world applications.

Conclusion

This paper introduces IsEM-Pro, a novel method for efficiently generating diverse and high-fitness protein sequences using a combination of latent generative modeling and structure features learned by a Markov random field. The reported results show an improvement of at least 55% in average fitness score over the previous best methods, suggesting IsEM-Pro could be a useful tool to accelerate protein design research.

However, the approach still has room for further development and validation, particularly in bridging the gap between computational predictions and experimental realities. Continued advancements in this area have the potential to greatly impact fields like biology, medicine, and biotechnology by streamlining the discovery of new functional proteins.

Related Papers

Importance Weighted Expectation-Maximization for Protein Sequence Design

Zhenqiao Song, Lei Li

Designing protein sequences with desired biological function is crucial in biology and chemistry. Recent machine learning methods use a surrogate sequence-function model to replace the expensive wet-lab validation. How can we efficiently generate diverse and novel protein sequences with high fitness? In this paper, we propose IsEM-Pro, an approach to generate protein sequences towards a given fitness criterion. At its core, IsEM-Pro is a latent generative model, augmented by combinatorial structure features from separately learned Markov random fields (MRFs). We develop a Monte Carlo Expectation-Maximization method (MCEM) to learn the model. During inference, sampling from its latent space enhances diversity while its MRF features guide the exploration in high fitness regions. Experiments on eight protein sequence design tasks show that our IsEM-Pro outperforms the previous best methods by at least 55% on average fitness score and generates more diverse and novel protein sequences.

7/18/2024

Reinforcement Learning for Sequence Design Leveraging Protein Language Models

Jithendaraa Subramanian, Shivakanth Sujit, Niloy Irtisam, Umong Sain, Derek Nowrouzezahrai, Samira Ebrahimi Kahou, Riashat Islam

Protein sequence design, determined by amino acid sequences, is essential to protein engineering problems in drug discovery. Prior approaches have resorted to evolutionary strategies or Monte-Carlo methods for protein design, but often fail to exploit the structure of the combinatorial search space and to generalize to unseen sequences. In the context of discrete black box optimization over large search spaces, learning a mutation policy to generate novel sequences with reinforcement learning is appealing. Recent advances in protein language models (PLMs) trained on large corpora of protein sequences offer a potential solution to this problem by scoring proteins according to their biological plausibility (such as the TM-score). In this work, we propose to use PLMs as a reward function to generate new sequences. Yet the PLM can be computationally expensive to query due to its large size. To this end, we propose an alternative paradigm where optimization can be performed on scores from a smaller proxy model that is periodically finetuned, jointly while learning the mutation policy. We perform extensive experiments on various sequence lengths to benchmark RL-based approaches, and provide comprehensive evaluations along biological plausibility and diversity of the protein. Our experimental results include favorable evaluations of the proposed sequences, along with high diversity scores, demonstrating that RL is a strong candidate for biological sequence design. Finally, we provide a modular open source implementation that can be easily integrated in most RL training loops, with support for replacing the reward model with other PLMs, to spur further research in this domain. The code for all experiments is provided in the supplementary material.

7/4/2024

Robust Model-Based Optimization for Challenging Fitness Landscapes

Saba Ghaffari, Ehsan Saleh, Alexander G. Schwing, Yu-Xiong Wang, Martin D. Burke, Saurabh Sinha

Protein design, a grand challenge of the day, involves optimization on a fitness landscape, and leading methods adopt a model-based approach where a model is trained on a training set (protein sequences and fitness) and proposes candidates to explore next. These methods are challenged by sparsity of high-fitness samples in the training set, a problem that has been in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum is in a region that is not only poorly represented in training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of separation in the design space is a significant bottleneck in existing model-based optimization tools and propose a new approach that uses a novel VAE as its search model to overcome the problem. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets as well as solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at https://github.com/sabagh1994/PGVAE.

7/1/2024

Progressive Multi-Modality Learning for Inverse Protein Folding

Jiangbin Zheng, Stan Z. Li

While deep generative models show promise for learning inverse protein folding directly from data, the lack of publicly available structure-sequence pairings limits their generalization. Previous improvements and data augmentation efforts to overcome this bottleneck have been insufficient. To further address this challenge, we propose a novel protein design paradigm called MMDesign, which leverages multi-modality transfer learning. To our knowledge, MMDesign is the first framework that combines a pretrained structural module with a pretrained contextual module, using an auto-encoder (AE) based language model to incorporate prior protein semantic knowledge. Experimental results, only training with the small dataset, demonstrate that MMDesign consistently outperforms baselines on various public benchmarks. To further assess the biological plausibility, we present systematic quantitative analysis techniques that provide interpretability and reveal more about the laws of protein design.

7/23/2024