Energy Rank Alignment: Using Preference Optimization to Search Chemical Space at Scale

Read original: arXiv:2405.12961 - Published 5/22/2024 by Shriram Chennakesavalu, Frank Hu, Sebastian Ibarraran, Grant M. Rotskoff


Overview

Searching for new chemical compounds is a major challenge due to the vast number of possible molecules.
Large autoregressive models trained on chemical data can generate new molecules, but there is still no robust strategy for steering them toward molecules with desired properties.
The authors introduce an algorithm called "Energy Rank Alignment" (ERA) that optimizes autoregressive models to generate molecules with specific properties.
ERA is closely related to Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), but its minimizer converges to an ideal Gibbs-Boltzmann distribution in which the reward plays the role of an energy function.
The algorithm performs well on chemical search and on an AI-supervised LLM alignment task, which the authors take as evidence that it is scalable and general.

Plain English Explanation

The number of possible chemical compounds grows combinatorially with the number of atoms, so exhaustively searching for molecules with specific properties is infeasible. Large AI language models trained on chemical data can generate new molecules, but they do not reliably produce compounds with the properties we want.

The researchers introduce a new algorithm called "Energy Rank Alignment" (ERA) to address this challenge. ERA starts from an explicit reward function that scores how well a molecule matches the desired properties. It then optimizes the language model to favor molecules that score well under this reward, guiding the model toward the most promising areas of chemical space.
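
One way to picture how an explicit reward can stand in for human preference labels is the toy sketch below. It is an illustration only, not the authors' pipeline: the `energy` function (squared distance from a target property value), the candidate names, and the property values are all made up for the example. Lower energy is treated as "preferred".

```python
from itertools import combinations


def energy(value, target):
    """Hypothetical energy: squared distance of a property value from its target."""
    return (value - target) ** 2


def preference_pairs(properties, target):
    """Return (preferred, rejected) name pairs, where lower energy is preferred.

    Ties are skipped because neither candidate is preferred.
    """
    pairs = []
    for a, b in combinations(sorted(properties), 2):
        e_a = energy(properties[a], target)
        e_b = energy(properties[b], target)
        if e_a < e_b:
            pairs.append((a, b))
        elif e_b < e_a:
            pairs.append((b, a))
    return pairs


# Made-up property values for three placeholder candidates.
candidates = {"mol_a": 0.9, "mol_b": 0.4, "mol_c": 0.55}
pairs = preference_pairs(candidates, target=0.5)
print(pairs)
```

Because the reward is cheap to evaluate, every pair of sampled candidates can be labeled automatically, which is the main practical difference from settings where preferences come from human annotators.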

Theoretically, ERA is closely related to other techniques like Proximal Policy Optimization and Direct Preference Optimization, but has some key advantages. For example, it is highly scalable and doesn't require reinforcement learning, which can be data-hungry and unstable.

The researchers show that ERA performs well not only on chemical search tasks but also on an AI-supervised task for LLM alignment. This suggests the method is not specific to chemistry and could apply to other alignment-style optimization problems.

Technical Explanation

The paper introduces an algorithm called "Energy Rank Alignment" (ERA) that leverages an explicit reward function to produce a gradient-based objective for optimizing autoregressive models, so that they generate molecules with desired properties. The molecular search problem closely resembles the "alignment problem" for large language models, with one difference: for many chemical tasks, a specific and easily evaluated reward function is available.

Theoretically, the authors show that ERA is related to techniques like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), but has some key advantages. Notably, ERA's minimizer converges to an ideal Gibbs-Boltzmann distribution, with the reward function playing the role of an "energy" function. This provides a principled framework for the optimization.
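
To make the Gibbs-Boltzmann target concrete, here is a minimal sketch over a small finite set of candidates, assuming the standard form p_i = exp(-beta * E_i) / Z. The inverse temperature `beta` and the energy values are assumptions chosen for illustration; the paper's exact parameterization (for example, any regularization toward a reference model) is not reproduced here.

```python
import math


def boltzmann(energies, beta=1.0):
    """Gibbs-Boltzmann probabilities p_i = exp(-beta * E_i) / Z over a finite set."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)  # partition function of the shifted weights
    return [w / z for w in weights]


energies = [0.0, 1.0, 2.0]  # made-up energies; lower means "better"
for beta in (0.1, 1.0, 10.0):
    probs = boltzmann(energies, beta)
    print(beta, [round(p, 3) for p in probs])
```

Small `beta` spreads probability almost evenly across candidates, while large `beta` concentrates it on the lowest-energy one. A target of this shape rewards good molecules without collapsing onto a single output, which is consistent with the diversity the authors report.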

Additionally, ERA is highly scalable and does not require reinforcement learning, which can be data-hungry and unstable. The authors show that ERA performs well relative to DPO when the number of preference observations per pairing is small, a common situation in real-world preference datasets.
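
The low-data regime matters because DPO's per-pair loss is a logistic loss on a margin between the preferred and rejected outputs. With a single hard label per pair, that loss keeps decreasing as the margin grows, so nothing stops the margin from diverging. The sketch below contrasts this with a soft target derived from an energy gap. The soft-target loss is an illustrative stand-in and an assumption of this example, not the ERA objective itself, and the gap value of 2.0 is made up.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_loss(margin, target):
    """Binary cross-entropy between sigmoid(margin) and a target preference probability."""
    p = sigmoid(margin)
    eps = 1e-12  # guards against log(0)
    return -(target * math.log(p + eps) + (1.0 - target) * math.log(1.0 - p + eps))


def best_margin(target, grid):
    """Margin on the grid with the lowest loss for the given target."""
    return min(grid, key=lambda m: pairwise_loss(m, target))


grid = [i * 0.5 for i in range(21)]  # candidate margins 0.0, 0.5, ..., 10.0

# Hard label (one observation says "first output wins"): the optimum sits at the grid edge.
print(best_margin(1.0, grid))

# Soft label from a hypothetical energy gap of 2.0: the optimum is finite.
print(best_margin(sigmoid(2.0), grid))
```

Widening the grid moves the hard-label optimum further out, whereas the soft-label optimum stays at the energy gap. This is one intuition for why an objective built on an explicit reward can behave better when each pair is observed only once.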

The researchers deploy ERA to align molecular transformer models to generate molecules with externally specified properties. They find that ERA does so robustly while searching through diverse parts of chemical space. The authors also report excellent results on an AI-supervised task for LLM alignment, which indicates that the method applies beyond chemistry.

Critical Analysis

The paper makes a compelling case for the Energy Rank Alignment (ERA) algorithm as a powerful approach to the challenging problem of molecular search and optimization. The theoretical analysis connecting ERA to techniques like PPO and DPO is insightful and provides a solid foundation for the method.

One potential limitation is that the paper does not delve deeply into the specific properties of the reward functions used in the experiments. The authors mention that they used "externally specified properties," but more details on how these reward functions were constructed and their potential biases would be helpful for understanding the method's real-world applicability.

Additionally, while the authors demonstrate ERA's effectiveness on both chemical search and an LLM alignment task, it would be valuable to see more comparisons to other state-of-the-art techniques in these domains. This would help readers better understand ERA's strengths and weaknesses relative to alternative approaches.

The paper also does not address potential issues around the interpretability and explainability of the generated molecules. As these models become more powerful, understanding the reasoning behind their outputs will be crucial for building trust and deploying them in safety-critical applications.

Overall, Energy Rank Alignment appears to be a promising approach to molecular search and optimization, supported by a strong theoretical foundation and encouraging empirical results. Further research on the method's real-world applicability and robustness would be a valuable contribution to the field.

Conclusion

The paper presents a novel algorithm called Energy Rank Alignment (ERA) that leverages an explicit reward function to optimize autoregressive models for generating molecules with desired properties. Theoretically, ERA is closely related to techniques like Proximal Policy Optimization and Direct Preference Optimization, but has some key advantages in terms of scalability and stability.

The authors demonstrate ERA's effectiveness on both chemical search tasks and an AI-supervised LLM alignment task, showing that the method applies beyond chemistry. While the paper leaves some questions around the specifics of the reward functions and comparisons to other state-of-the-art approaches, it provides a strong foundation for further research and development in molecular discovery and optimization.

As the field of AI continues to advance, tools like ERA will become increasingly crucial for tackling complex, high-dimensional search problems with practical real-world implications. The authors' work represents an important step forward in this direction, with the potential to unlock new frontiers in materials science, drug discovery, and beyond.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Energy Rank Alignment: Using Preference Optimization to Search Chemical Space at Scale

Shriram Chennakesavalu, Frank Hu, Sebastian Ibarraran, Grant M. Rotskoff

Searching through chemical space is an exceptionally challenging problem because the number of possible molecules grows combinatorially with the number of atoms. Large, autoregressive models trained on databases of chemical compounds have yielded powerful generators, but we still lack robust strategies for generating molecules with desired properties. This molecular search problem closely resembles the alignment problem for large language models, though for many chemical tasks we have a specific and easily evaluable reward function. Here, we introduce an algorithm called energy rank alignment (ERA) that leverages an explicit reward function to produce a gradient-based objective that we use to optimize autoregressive policies. We show theoretically that this algorithm is closely related to proximal policy optimization (PPO) and direct preference optimization (DPO), but has a minimizer that converges to an ideal Gibbs-Boltzmann distribution with the reward playing the role of an energy function. Furthermore, this algorithm is highly scalable, does not require reinforcement learning, and performs well relative to DPO when the number of preference observations per pairing is small. We deploy this approach to align molecular transformers to generate molecules with externally specified properties and find that it does so robustly, searching through diverse parts of chemical space. While our focus here is on chemical search, we also obtain excellent results on an AI supervised task for LLM alignment, showing that the method is scalable and general.

5/22/2024

Robust Preference Optimization through Reward Model Distillation

Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant

Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, typical preference datasets have only a single, or at most a few, annotation per preference pair, which causes DPO to overconfidently assign rewards that trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM to produce probabilities that match the distribution induced by a reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.

5/30/2024

Preference Optimization for Molecule Synthesis with Conditional Residual Energy-based Models

Songtao Liu, Hanjun Dai, Yue Zhao, Peng Liu

Molecule synthesis through machine learning is one of the fundamental problems in drug discovery. Current data-driven strategies employ one-step retrosynthesis models and search algorithms to predict synthetic routes in a top-bottom manner. Despite their effective performance, these strategies face limitations in the molecule synthetic route generation due to a greedy selection of the next molecule set without any lookahead. Furthermore, existing strategies cannot control the generation of synthetic routes based on possible criteria such as material costs, yields, and step count. In this work, we propose a general and principled framework via conditional residual energy-based models (EBMs), that focus on the quality of the entire synthetic route based on the specific criteria. By incorporating an additional energy-based function into our probabilistic model, our proposed algorithm can enhance the quality of the most probable synthetic routes (with higher probabilities) generated by various strategies in a plug-and-play fashion. Extensive experiments demonstrate that our framework can consistently boost performance across various strategies and outperforms previous state-of-the-art top-1 accuracy by a margin of 2.5%. Code is available at https://github.com/SongtaoLiu0823/CREBM.

6/5/2024


Entropy-Reinforced Planning with Large Language Models for Drug Discovery

Xuefeng Liu, Chih-chan Tien, Peng Ding, Songhao Jiang, Rick L. Stevens

The objective of drug discovery is to identify chemical compounds that possess specific pharmaceutical properties toward a binding target. Existing large language models (LLMs) can achieve high token matching scores in terms of likelihood for molecule generation. However, relying solely on LLM decoding often results in the generation of molecules that are either invalid due to a single misused token, or suboptimal due to unbalanced exploration and exploitation as a consequence of the LLM's prior experience. Here we propose ERP, Entropy-Reinforced Planning for Transformer Decoding, which employs an entropy-reinforced planning algorithm to enhance the Transformer decoding process and strike a balance between exploitation and exploration. ERP aims to achieve improvements in multiple properties compared to direct sampling from the Transformer. We evaluated ERP on the SARS-CoV-2 virus (3CLPro) and human cancer cell target protein (RTCB) benchmarks and demonstrated that, in both benchmarks, ERP consistently outperforms the current state-of-the-art algorithm by 1-5 percent, and baselines by 5-10 percent, respectively. Moreover, such improvement is robust across Transformer models trained with different objectives. Finally, to further illustrate the capabilities of ERP, we tested our algorithm on three code generation benchmarks and outperformed the current state-of-the-art approach as well. Our code is publicly available at: https://github.com/xuefeng-cs/ERP.

6/12/2024