Improved motif-scaffolding with SE(3) flow matching

Read original: arXiv:2401.04082 - Published 7/22/2024 by Jason Yim, Andrew Campbell, Emile Mathieu, Andrew Y. K. Foong, Michael Gastegger, José Jiménez-Luna, Sarah Lewis, Victor Garcia Satorras, Bastiaan S. Veeling, Frank Noé and 2 others

Overview

Protein design often starts from a motif, a structural fragment associated with a desired function, and the goal is to construct a functional protein around that motif.
Generative models have recently made breakthroughs in designing scaffolds for a range of motifs.
However, the generated scaffolds tend to lack structural diversity, which can hinder success in wet-lab validation.
This work extends an existing model, FrameFlow, to perform motif-scaffolding using two complementary approaches: motif amortization and motif guidance.

Plain English Explanation

Designing new proteins from scratch is a complex challenge in biology. Researchers often start with a specific 3D shape or "motif" that they want the protein to have, and then try to build a full protein structure around that motif.

Recently, advanced machine learning models have shown success in generating potential protein scaffolds to fit desired motifs. However, the scaffolds produced by these models tend to lack diversity - they all look quite similar to each other.

This lack of diversity can be a problem when trying to validate the designs in real-world experiments, since a set of near-identical designs offers fewer distinct chances for one of them to work. The authors of this paper set out to address this by extending an existing model called FrameFlow in two ways:

Motif Amortization: Training FrameFlow to generate scaffolds directly from the desired motif, using a data augmentation strategy to construct the training examples.
Motif Guidance: Steering the already-trained FrameFlow model toward scaffolds that fit the motif, using an estimate of the conditional score (roughly, the direction in which a structure becomes more probable given the motif), without any additional training.

The researchers tested these approaches on a benchmark of 24 biologically meaningful motifs and found that their method produced 2.5 times more designable and unique scaffolds than previous state-of-the-art methods. This could lead to more successful experiments in the lab when trying to create new proteins from scratch.

Technical Explanation

The paper presents two complementary approaches for motif-scaffolding using an SE(3) flow matching model called FrameFlow:

Motif Amortization: The authors train FrameFlow to generate protein backbones with the desired motif supplied as an input. A data augmentation strategy is used to build the motif-conditioned training examples.
Motif Guidance: No additional training is needed. Instead, the authors use the unconditional FrameFlow model to estimate the conditional score, i.e. the gradient of the log-probability of a structure given the motif, and use that estimate to steer sampling toward scaffolds that contain the desired motif.
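The motif amortization idea can be illustrated with a small sketch of how motif-conditioned training examples might be built from an existing backbone. This summary does not describe the paper's actual augmentation procedure, so everything here is an assumption made for illustration: the function names `sample_motif_mask` and `make_training_example`, the choice of random contiguous segments as motifs, the size bounds, and the use of plain C-alpha coordinates in place of the SE(3) frames that FrameFlow operates on.

```python
import numpy as np


def sample_motif_mask(n_res, rng, min_frac=0.05, max_frac=0.5, max_segments=4):
    """Return a boolean mask of length n_res; True marks motif residues.

    Hypothetical scheme (not taken from the paper): choose a motif size,
    split it into a few contiguous segments, and place them at random.
    """
    lo = max(1, int(min_frac * n_res))
    hi = max(lo, int(max_frac * n_res))
    n_motif = int(rng.integers(lo, hi + 1))
    k = int(rng.integers(1, min(max_segments, n_motif) + 1))

    # Split n_motif residues into k segments, each at least one residue long.
    seg_lens = 1 + rng.multinomial(n_motif - k, np.full(k, 1.0 / k))
    # Spread the remaining (scaffold) residues over the k + 1 gaps.
    gaps = rng.multinomial(n_res - n_motif, np.full(k + 1, 1.0 / (k + 1)))

    mask = np.zeros(n_res, dtype=bool)
    pos = 0
    for gap, seg in zip(gaps[:-1], seg_lens):
        pos += int(gap)
        mask[pos:pos + int(seg)] = True
        pos += int(seg)
    return mask


def make_training_example(ca_coords, rng):
    """Build one toy (input, mask, target) triple from a known backbone.

    Assumption: motif residues keep their true coordinates in the model
    input, while scaffold residues are replaced by noise to be denoised.
    """
    mask = sample_motif_mask(len(ca_coords), rng)
    noise = rng.normal(size=ca_coords.shape)
    model_input = np.where(mask[:, None], ca_coords, noise)
    return model_input, mask, ca_coords


rng = np.random.default_rng(0)
backbone = rng.normal(size=(100, 3))  # stand-in for real C-alpha coordinates
model_input, mask, target = make_training_example(backbone, rng)
print(int(mask.sum()), "motif residues out of", len(mask))
```

Because a new mask is drawn every time a protein is visited, a single structure can yield many different motif-scaffold pairs, which is the general sense in which such a scheme acts as data augmentation.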

The researchers evaluate their methods on a benchmark of 24 biologically meaningful motifs, and show that their approach achieves 2.5 times more designable and unique motif-scaffolds compared to the previous state-of-the-art. This suggests their methods are more effective at generating diverse, high-quality protein backbones that match the target motifs.
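To make the "designable and unique" count concrete, here is a minimal sketch of how such a metric could be computed once each generated scaffold has been scored and clustered. The `Design` fields, the thresholds (2.0 Å self-consistency RMSD, 1.0 Å motif RMSD), and the assumption that cluster labels come from a separate structural clustering step are illustrative choices, not details taken from this summary. The input values are made-up toy data, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Design:
    sc_rmsd: float     # self-consistency RMSD in angstroms (hypothetical field)
    motif_rmsd: float  # RMSD of the scaffolded motif to the target motif
    cluster_id: int    # label from an external structural clustering step


def count_unique_designable(designs, sc_rmsd_max=2.0, motif_rmsd_max=1.0):
    """Count structural clusters that contain at least one designable scaffold.

    A design counts as designable here if both RMSD values fall below the
    (assumed) thresholds; uniqueness is approximated by distinct cluster ids.
    """
    clusters = {
        d.cluster_id
        for d in designs
        if d.sc_rmsd < sc_rmsd_max and d.motif_rmsd < motif_rmsd_max
    }
    return len(clusters)


# Toy inputs for illustration only.
toy_designs = [
    Design(sc_rmsd=1.2, motif_rmsd=0.5, cluster_id=0),
    Design(sc_rmsd=1.5, motif_rmsd=0.8, cluster_id=0),  # same cluster as above
    Design(sc_rmsd=3.0, motif_rmsd=0.4, cluster_id=1),  # fails sc_rmsd
    Design(sc_rmsd=1.0, motif_rmsd=0.6, cluster_id=2),
    Design(sc_rmsd=1.1, motif_rmsd=1.5, cluster_id=3),  # fails motif_rmsd
]
n_unique = count_unique_designable(toy_designs)
print(n_unique)
```

Counting clusters rather than individual designs is what rewards diversity: a method that produces many passing scaffolds from the same structural family scores no higher than one that produces a single member of that family.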

Critical Analysis

The paper demonstrates a solid technical approach and clear experimental results showing the advantages of the proposed motif-scaffolding methods. However, a few potential limitations or areas for further research are worth noting:

The benchmark only includes 24 motifs, so it would be valuable to see how the methods scale to a larger and more diverse set of targets.
The results summarized here come from a computational benchmark, and no wet-lab success rates for the generated scaffolds are reported. Experimental evidence on the practical utility of these designs would strengthen the claims.
While the authors show their methods outperform previous state-of-the-art, it would be helpful to understand how the computational cost and runtime of their approaches compare to other techniques.
Exploring ways to further increase the structural diversity of the generated scaffolds, beyond what is achieved through data augmentation, could lead to even more successful protein design outcomes.

Overall, this work represents a promising advance in the field of computational protein design, and the insights gained could have meaningful impacts on future wet-lab experiments and real-world applications.

Conclusion

This paper presents two innovative approaches, motif amortization and motif guidance, for generating diverse protein scaffolds that match desired structural motifs. By extending the FrameFlow model, the researchers were able to significantly outperform previous state-of-the-art methods on a benchmark of 24 biologically relevant motifs.

These advances in computational protein design could lead to more successful wet-lab validation and a greater ability to create new proteins from scratch. While some limitations and areas for further research remain, this work is an important step forward and could have valuable implications for medicine, biotechnology, and materials science.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improved motif-scaffolding with SE(3) flow matching

Jason Yim, Andrew Campbell, Emile Mathieu, Andrew Y. K. Foong, Michael Gastegger, José Jiménez-Luna, Sarah Lewis, Victor Garcia Satorras, Bastiaan S. Veeling, Frank Noé, Regina Barzilay, Tommi S. Jaakkola

Protein design often begins with the knowledge of a desired function from a motif which motif-scaffolding aims to construct a functional protein around. Recently, generative models have achieved breakthrough success in designing scaffolds for a range of motifs. However, generated scaffolds tend to lack structural diversity, which can hinder success in wet-lab validation. In this work, we extend FrameFlow, an SE(3) flow matching model for protein backbone generation, to perform motif-scaffolding with two complementary approaches. The first is motif amortization, in which FrameFlow is trained with the motif as input using a data augmentation strategy. The second is motif guidance, which performs scaffolding using an estimate of the conditional score from FrameFlow without additional training. On a benchmark of 24 biologically meaningful motifs, we show our method achieves 2.5 times more designable and unique motif-scaffolds compared to state-of-the-art. Code: https://github.com/microsoft/protein-frame-flow

7/22/2024

Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation

Guillaume Huguet, James Vuckovic, Kilian Fatras, Eric Thibodeau-Laufer, Pablo Lemos, Riashat Islam, Cheng-Hao Liu, Jarrid Rector-Brooks, Tara Akhound-Sadegh, Michael Bronstein, Alexander Tong, Avishek Joey Bose

Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 presents substantial new architectural features over the previous FoldFlow family of models including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric transformer based decoder. To increase diversity and novelty of generated samples -- crucial for de-novo drug design -- we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures achieved through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g. increasing secondary structures diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFDiffusion in terms of unconditional generation across all metrics including designability, diversity, and novelty across all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.

5/31/2024

SE(3)-Stochastic Flow Matching for Protein Backbone Generation

Avishek Joey Bose, Tara Akhound-Sadegh, Guillaume Huguet, Kilian Fatras, Jarrid Rector-Brooks, Cheng-Hao Liu, Andrei Cristian Nica, Maksym Korablyov, Michael Bronstein, Alexander Tong

The computational design of novel protein structures has the potential to impact numerous scientific disciplines greatly. Toward this goal, we introduce FoldFlow, a series of novel generative models of increasing modeling power based on the flow-matching paradigm over $3\mathrm{D}$ rigid motions -- i.e. the group $\text{SE}(3)$ -- enabling accurate modeling of protein backbones. We first introduce FoldFlow-Base, a simulation-free approach to learning deterministic continuous-time dynamics and matching invariant target distributions on $\text{SE}(3)$. We next accelerate training by incorporating Riemannian optimal transport to create FoldFlow-OT, leading to the construction of both more simple and stable flows. Finally, we design FoldFlow-SFM, coupling both Riemannian OT and simulation-free training to learn stochastic continuous-time dynamics over $\text{SE}(3)$. Our family of FoldFlow generative models offers several key advantages over previous approaches to the generative modeling of proteins: they are more stable and faster to train than diffusion-based approaches, and our models enjoy the ability to map any invariant source distribution to any invariant target distribution over $\text{SE}(3)$. Empirically, we validate FoldFlow on protein backbone generation of up to $300$ amino acids leading to high-quality designable, diverse, and novel samples.

4/12/2024

Learning to Extend Molecular Scaffolds with Structural Motifs

Krzysztof Maziarz, Henry Jackson-Flux, Pashmina Cameron, Finton Sirockin, Nadine Schneider, Nikolaus Stiefl, Marwin Segler, Marc Brockschmidt

Recent advancements in deep learning-based modeling of molecules promise to accelerate in silico drug discovery. A plethora of generative models is available, building molecules either atom-by-atom and bond-by-bond or fragment-by-fragment. However, many drug discovery projects require a fixed scaffold to be present in the generated molecule, and incorporating that constraint has only recently been explored. Here, we propose MoLeR, a graph-based model that naturally supports scaffolds as initial seed of the generative procedure, which is possible because it is not conditioned on the generation history. Our experiments show that MoLeR performs comparably to state-of-the-art methods on unconstrained molecular optimization tasks, and outperforms them on scaffold-based tasks, while being an order of magnitude faster to train and sample from than existing approaches. Furthermore, we show the influence of a number of seemingly minor design choices on the overall performance.

5/14/2024