$\texttt{MiniMol}$: A Parameter-Efficient Foundation Model for Molecular Learning

Read original: arXiv:2404.14986 - Published 4/24/2024 by Kerstin Klaser, Błażej Banaszewski, Samuel Maddrell-Mander, Callum McLean, Luis Muller, Ali Parviz, Shenyang Huang, Andrew Fitzgibbon

Overview

In biological tasks, data is often scarce due to the difficulty of collecting measurements
Transferring knowledge from pre-trained foundation models to low-data downstream tasks is a promising approach
Designing effective foundation models for molecular learning remains an open question
This work proposes MiniMol, a small foundational model for molecular learning with 10 million parameters
MiniMol is pre-trained on a diverse set of graph- and node-level tasks from quantum chemistry and biology
The model is evaluated on downstream tasks from the Therapeutic Data Commons (TDC) ADMET group, showing significant improvements over prior state-of-the-art models
MiniMol will be made publicly available for future research

Plain English Explanation

In many biology-related tasks, collecting the necessary data can be challenging. This means that researchers often have to work with relatively small datasets, which can make it difficult to train powerful machine learning models. One promising solution is to use transfer learning. The idea is to first train a large, general-purpose model (called a "foundation model") on a vast amount of available data, and then fine-tune this model on the specific task of interest.

However, designing effective foundation models for molecular learning has been an open problem. Most existing approaches have focused on using very large models with many parameters. In this work, the researchers propose a different approach - a smaller foundational model called MiniMol that has only 10 million parameters.

MiniMol is pre-trained on a diverse set of graph-level and node-level tasks from both quantum chemistry and biology. The pre-training dataset contains around 6 million molecules and 500 million labels. The researchers then evaluate how well MiniMol transfers to downstream tasks that predict drug properties (the ADMET tasks) from the Therapeutic Data Commons. They report significant improvements over the prior state-of-the-art foundation model across 17 ADMET tasks.

The key insight here is that you don't necessarily need a massive model to effectively capture the important patterns in molecular data. By carefully designing the pre-training tasks and dataset, the researchers were able to create a relatively small model that can still perform very well on a wide range of downstream applications. MiniMol will be made publicly available, allowing other researchers to build upon this work.

Technical Explanation

The researchers propose a foundation model called MiniMol that has only 10 million parameters, in contrast to existing approaches, which typically focus on models with large parameter capacities. MiniMol is pre-trained on a mix of approximately 3,300 sparsely defined graph-level and node-level tasks from both quantum chemistry and biology. The pre-training dataset includes around 6 million molecules and 500 million labels.
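
To put these figures in proportion, the short calculation below works out the average number of labels per molecule and how much of a fully labeled molecule-by-task grid the dataset would fill. It uses only the approximate counts quoted above. Treating every task as one label per molecule is a simplifying assumption, since node-level tasks carry a label per atom, so the fill fraction is a rough estimate only.

```python
# Approximate figures quoted in the summary.
num_molecules = 6_000_000
num_labels = 500_000_000
num_tasks = 3_300

# Average number of labels attached to each molecule.
labels_per_molecule = num_labels / num_molecules

# Assumption: one label per molecule per task (graph-level only).
# Node-level tasks have one label per atom, so this is a rough estimate.
dense_grid = num_molecules * num_tasks
fill_fraction = num_labels / dense_grid

print(f"average labels per molecule: {labels_per_molecule:.1f}")
print(f"fraction of a dense molecule-by-task grid: {fill_fraction:.2%}")
```

Under that assumption, each molecule has roughly 83 labels on average, and the labels cover only about 2.5% of the dense grid.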

The key innovation is the design of the pre-training tasks and dataset. Rather than relying on a single large pre-training task, the researchers curated a mix of smaller, more specific tasks covering a wide range of molecular properties. The mix spans tasks of a quantum nature, such as quantum mechanical properties, and tasks of a biological nature, such as biological assays.
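
The paper's abstract describes the pre-training tasks as sparsely defined, which suggests that most molecules carry labels for only some of the tasks. One common way to train on such data is to skip the missing entries when computing the loss. The sketch below shows that idea in plain Python. Marking a missing label with NaN and the function name `masked_mse` are assumptions made for illustration; this is not taken from the MiniMol code.

```python
import math

def masked_mse(predictions, labels):
    """Mean squared error over the labels that are present (not NaN)."""
    squared_errors = [
        (p - y) ** 2
        for pred_row, label_row in zip(predictions, labels)
        for p, y in zip(pred_row, label_row)
        if not math.isnan(y)
    ]
    if not squared_errors:
        return 0.0  # no labels at all: nothing to learn from
    return sum(squared_errors) / len(squared_errors)

nan = float("nan")

# Toy data: 3 molecules x 4 tasks, NaN marks a missing label.
labels = [
    [1.0, nan, nan, 0.5],
    [nan, 2.0, nan, nan],
    [0.0, nan, nan, 1.5],
]
predictions = [[0.0] * 4 for _ in labels]

loss = masked_mse(predictions, labels)
print(loss)
```

Only the five present labels contribute to the average, so the third task, which has no labels in this toy batch, has no effect on the loss.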

To evaluate the generalizability of MiniMol, the researchers test it on downstream tasks from the Therapeutic Data Commons (TDC) ADMET group. ADMET tasks focus on predicting properties that matter for drug candidates: absorption, distribution, metabolism, excretion, and toxicity. Across 17 ADMET tasks, MiniMol shows significant improvements over the prior state-of-the-art foundation model, which supports the effectiveness of the pre-training approach.
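
As a rough picture of what transfer to a low-data task can look like, the sketch below trains a small regression head on top of fixed molecule embeddings. Everything in it is an assumption for illustration: the embeddings are random vectors standing in for the output of a pre-trained encoder, the target is synthetic, and the ridge-regression head is one simple choice rather than the setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen embeddings from a pre-trained encoder (hypothetical:
# random vectors here, not MiniMol outputs). 40 molecules, 16 dimensions.
embeddings = rng.normal(size=(40, 16))

# Synthetic downstream target: a noisy linear function of the embeddings.
true_weights = rng.normal(size=16)
targets = embeddings @ true_weights + 0.1 * rng.normal(size=40)

def fit_ridge(features, y, l2=1.0):
    """Closed-form ridge regression: w = (X^T X + l2 * I)^-1 X^T y."""
    dim = features.shape[1]
    gram = features.T @ features + l2 * np.eye(dim)
    return np.linalg.solve(gram, features.T @ y)

# Only the small head is trained; the embeddings stay fixed.
weights = fit_ridge(embeddings, targets)

train_mse = float(np.mean((embeddings @ weights - targets) ** 2))
baseline_mse = float(np.mean((targets - targets.mean()) ** 2))
print(f"ridge head MSE: {train_mse:.3f}  mean-predictor MSE: {baseline_mse:.3f}")
```

The point of the design is that the head has only 16 trainable weights, so a few dozen labeled molecules are enough to fit it, while the representation itself comes from pre-training.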

Critical Analysis

The researchers make a compelling case for the benefits of using a smaller, more targeted foundation model like MiniMol for molecular learning tasks. By carefully curating the pre-training dataset and tasks, they achieve strong performance with a model that has far fewer parameters than the large-capacity models that existing approaches typically favor.

However, the paper does not provide a detailed analysis of the limitations or potential drawbacks of the MiniMol approach. For example, it would be interesting to understand how the performance of MiniMol compares to larger models when the amount of downstream data is increased, or how the model's performance scales with the size of the pre-training dataset.

Additionally, the researchers do not address potential concerns around the generalizability of MiniMol beyond the ADMET tasks considered in this work. It would be valuable to see how the model performs on a wider range of molecular learning benchmarks, including tasks that may be more distant from the pre-training data.

Overall, the MiniMol approach is a promising direction for efficient and effective molecular learning, but further research is needed to fully understand its capabilities and limitations.

Conclusion

In this work, the researchers propose MiniMol, a small foundational model for molecular learning with only 10 million parameters. MiniMol is pre-trained on a diverse set of graph- and node-level tasks from quantum chemistry and biology, and is shown to outperform previous state-of-the-art foundation models on a range of downstream ADMET tasks.

The key insight is that you don't necessarily need a massive model to effectively capture the important patterns in molecular data. By carefully designing the pre-training tasks and dataset, the researchers were able to create a relatively small model that can still perform very well on a wide range of applications.

MiniMol will be made publicly available, allowing other researchers to build upon this work and explore the potential of small, targeted foundation models for molecular learning. This approach could have significant implications for fields like drug discovery and development, where data is often scarce and computational resources are limited.

Related Papers

$\texttt{MiniMol}$: A Parameter-Efficient Foundation Model for Molecular Learning

Kerstin Klaser, Błażej Banaszewski, Samuel Maddrell-Mander, Callum McLean, Luis Muller, Ali Parviz, Shenyang Huang, Andrew Fitzgibbon

In biological tasks, data is rarely plentiful as it is generated from hard-to-gather measurements. Therefore, pre-training foundation models on large quantities of available data and then transfer to low-data downstream tasks is a promising direction. However, how to design effective foundation models for molecular learning remains an open question, with existing approaches typically focusing on models with large parameter capacities. In this work, we propose $\texttt{MiniMol}$, a foundational model for molecular learning with 10 million parameters. $\texttt{MiniMol}$ is pre-trained on a mix of roughly 3300 sparsely defined graph- and node-level tasks of both quantum and biological nature. The pre-training dataset includes approximately 6 million molecules and 500 million labels. To demonstrate the generalizability of $\texttt{MiniMol}$ across tasks, we evaluate it on downstream tasks from the Therapeutic Data Commons (TDC) ADMET group showing significant improvements over the prior state-of-the-art foundation model across 17 tasks. $\texttt{MiniMol}$ will be a public and open-sourced model for future research.

4/24/2024

Uni-Mol2: Exploring Molecular Pretraining Model at Scale

Xiaohong Ji, Zhen Wang, Zhifeng Gao, Hang Zheng, Linfeng Zhang, Guolin Ke, Weinan E

In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The significant advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, research exploring scaling law in molecular pretraining models remains unexplored. In this work, we present Uni-Mol2, an innovative molecular pretraining model that leverages a two-track transformer to effectively integrate features at the atomic level, graph level, and geometry structure level. Along with this, we systematically investigate the scaling law within molecular pretraining models, characterizing the power-law correlations between validation loss and model size, dataset size, and computational resources. Consequently, we successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date. Extensive experiments show consistent improvement in the downstream tasks as the model size grows. The Uni-Mol2 with 1.1B parameters also outperforms existing methods, achieving an average 27% improvement on the QM9 and 14% on COMPAS-1D dataset.

7/2/2024

A Large Encoder-Decoder Family of Foundation Models For Chemical Language

Eduardo Soares, Victor Shirasuna, Emilio Vital Brazil, Renato Cerqueira, Dmitry Zubarev, Kristin Schmidt

Large-scale pre-training methodologies for chemical language models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Typically, this involves pre-training on unlabeled data followed by fine-tuning on specific tasks, reducing dependence on annotated datasets and broadening chemical language representation understanding. This paper introduces a large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, which is equivalent to 4 billion of molecular tokens. The proposed foundation model supports different complex tasks, including quantum property prediction, and offer flexibility with two main variants (289M and $8\times289M$). Our experiments across multiple benchmark datasets validate the capacity of the proposed model in providing state-of-the-art results for different tasks. We also provide a preliminary assessment of the compositionality of the embedding space as a prerequisite for the reasoning tasks. We demonstrate that the produced latent space is separable compared to the state-of-the-art with few-shot learning capabilities.

7/31/2024

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

Khiem Le, Zhichun Guo, Kaiwen Dong, Xiaobao Huang, Bozhao Nan, Roshni Iyer, Xiangliang Zhang, Olaf Wiest, Wei Wang, Nitesh V. Chawla

Large Language Models (LLMs) with their strong task-handling capabilities have shown remarkable advancements across a spectrum of fields, moving beyond natural language understanding. However, their proficiency within the chemistry domain remains restricted, especially in solving professional molecule-related tasks. This challenge is attributed to their inherent limitations in comprehending molecules using only common textual representations, i.e., SMILES strings. In this study, we seek to enhance the ability of LLMs to comprehend molecules by equipping them with a multi-modal external module, namely MolX. In particular, instead of directly using a SMILES string to represent a molecule, we utilize specific encoders to extract fine-grained features from both SMILES string and 2D molecular graph representations for feeding into an LLM. Moreover, a handcrafted molecular fingerprint is incorporated to leverage its embedded domain knowledge. Then, to establish an alignment between MolX and the LLM's textual input space, the whole model in which the LLM is frozen, is pre-trained with a versatile strategy including a diverse set of tasks. Experimental evaluations show that our proposed method outperforms baselines across 4 downstream molecule-related tasks ranging from molecule-to-text translation to retrosynthesis, with and without fine-tuning the LLM, while only introducing a small number of trainable parameters 0.53% and 0.82%, respectively.

8/23/2024