SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction

Read original: arXiv:2408.05696 - Published 8/13/2024 by Bohao Xu, Yingzhou Lu, Chenhao Li, Ling Yue, Xiao Wang, Nan Hao, Tianfan Fu, Jim Chen

Overview

This paper presents SMILES-Mamba, a set of chemical foundation models that can predict drug ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties.
The model is first pre-trained with self-supervision on a large corpus of unlabeled SMILES strings, then fine-tuned on smaller labeled datasets for specific ADMET tasks.
The authors report competitive performance across 22 ADMET datasets, with the highest score on 14 of them.

Plain English Explanation

The paper introduces SMILES-Mamba, a new set of machine learning models that can predict important properties of drug molecules. These properties, known as ADMET, describe how a drug behaves in the body, such as how well it is absorbed, how it moves around, how it is broken down, and whether it is toxic.

SMILES is a way of writing chemical structures as text, and SMILES-Mamba works directly on these strings. The researchers trained it in two stages: the model first learns general chemical patterns from a large collection of unlabeled SMILES strings, and is then fine-tuned on smaller datasets labeled with ADMET measurements.
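To make the idea of "chemical structures as text" concrete, here is a minimal sketch of splitting a SMILES string into tokens. The regular expression is a simplified illustration and the function name `tokenize_smiles` is hypothetical; the paper's actual tokenizer is not described in this summary and may differ.

```python
import re

# Simplified pattern (an assumption, not the paper's tokenizer):
# bracket atoms, two-letter halogens, two-digit ring closures,
# single-letter atoms, ring-closure digits, then bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-\\/.:@~*$]"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If any character was skipped by the pattern, the round trip will not match.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


# Aspirin, written as a SMILES string.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# Two-letter atoms and bracket atoms stay together as single tokens.
print(tokenize_smiles("ClCC[NH4+]Br"))
```

Keeping `Cl`, `Br`, and bracketed atoms as single tokens avoids splitting one atom into unrelated characters, which is the main reason to prefer a pattern like this over plain character splitting.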

Pre-training on unlabeled SMILES strings lets the model learn chemical patterns from far more molecules than have measured ADMET data, which it can then apply to new drug candidates. The authors report that SMILES-Mamba is competitive with existing models across 22 standard ADMET datasets and achieves the best score on 14 of them.

Technical Explanation

The paper introduces SMILES-Mamba, a chemical foundation model for predicting drug ADMET properties. Training proceeds in two stages: self-supervised pre-training on a large corpus of unlabeled SMILES strings, followed by fine-tuning on smaller labeled datasets covering absorption, distribution, metabolism, excretion, and toxicity tasks.
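The paper's abstract describes self-supervised pre-training on unlabeled SMILES strings before fine-tuning on labeled ADMET data. One common self-supervised objective for sequence models is next-token prediction; whether SMILES-Mamba uses exactly this objective is an assumption here. The sketch below shows how training pairs for such an objective can be built from unlabeled strings. It uses character-level tokens for brevity, and the names `build_vocab` and `next_token_pairs` are hypothetical.

```python
BOS, EOS = "<bos>", "<eos>"  # special start/end tokens (illustrative choice)


def build_vocab(smiles_list):
    """Map every character seen in the corpus, plus BOS/EOS, to an integer id."""
    symbols = sorted({ch for s in smiles_list for ch in s})
    return {tok: i for i, tok in enumerate([BOS, EOS] + symbols)}


def next_token_pairs(smiles, vocab):
    """Return (inputs, targets) where targets are the inputs shifted by one."""
    ids = [vocab[BOS]] + [vocab[ch] for ch in smiles] + [vocab[EOS]]
    return ids[:-1], ids[1:]


# A tiny stand-in for a large unlabeled corpus; no ADMET labels are needed.
corpus = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocab(corpus)
inputs, targets = next_token_pairs("CCO", vocab)
print(inputs)   # ids the model reads
print(targets)  # ids the model is asked to predict at each position
```

Because the targets come from the strings themselves, this stage can use any molecule with a known structure. Fine-tuning would then reuse the pre-trained weights and replace the prediction target with a measured ADMET value.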

As the name indicates, the core of SMILES-Mamba is a Mamba model, a selective state space sequence architecture, rather than a Transformer. It takes the SMILES string of a molecule as input and, after fine-tuning, outputs the predicted ADMET property for the task at hand. Pre-training lets the model capture chemical structure and relationships from the SMILES corpus before it sees any labels.
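The "Mamba" in the model's name refers to a family of selective state space models, which process a sequence by updating a hidden state one step at a time. The sketch below shows only the underlying linear recurrence, with fixed parameters and a diagonal state. It is a simplification for illustration: in Mamba the parameters depend on the input at each step, and nothing here is taken from the authors' implementation. The function name `ssm_scan` is hypothetical.

```python
def ssm_scan(xs, a, b, c):
    """Run a diagonal linear state-space recurrence over a 1-D input sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state dimensions)
    y_t = sum(c * h_t)
    """
    h = [0.0] * len(a)  # hidden state starts at zero
    ys = []
    for x in xs:
        # Each state dimension decays by a_i and takes in the new input via b_i.
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        # The output is a weighted readout of the current state.
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


# Feed a single impulse and watch the state decay at two different rates.
ys = ssm_scan([1.0, 0.0, 0.0], a=[0.5, 0.9], b=[1.0, 1.0], c=[1.0, 1.0])
print(ys)
```

The cost of this loop grows linearly with sequence length, which is one reason state space models are considered for long token sequences.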

Across 22 ADMET datasets, the authors report that SMILES-Mamba performs competitively with prior methods and achieves the highest score on 14 tasks.

The authors attribute this performance to self-supervised pre-training, which lets the model capture relevant chemical patterns from SMILES representations and reduces its dependence on large labeled datasets.

Critical Analysis

The paper provides a well-structured study of foundation models for drug ADMET prediction. The authors combine self-supervised pre-training with task-specific fine-tuning and benchmark the result across 22 ADMET datasets.

One potential limitation is the reliance on SMILES strings as the sole input representation. While SMILES are a convenient textual format, they may not fully capture the 3D structural information that can be important for certain ADMET properties. Future work could explore incorporating additional molecular graph-based or 3D structural features.

Additionally, the paper does not delve deeply into the interpretability of the SMILES-Mamba models. Understanding the specific chemical and biological insights learned by the models could help build trust and enable further scientific discoveries. Techniques like attribution analysis or concept activation could provide more transparency.
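As an illustration of what attribution analysis could look like for a SMILES-based model, here is a minimal sketch of occlusion attribution: each token is masked in turn and the change in the prediction is recorded. The predictor `toy_predict` is a hypothetical stand-in that simply counts halogen tokens; it is not SMILES-Mamba, and the mask symbol is an arbitrary choice.

```python
def occlusion_attribution(tokens, predict, mask="*"):
    """Score each token by how much the prediction drops when it is masked."""
    base = predict(tokens)
    scores = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        scores.append(base - predict(occluded))
    return scores


def toy_predict(tokens):
    """Hypothetical stand-in model: the score is the number of halogen tokens."""
    return float(sum(tok in {"F", "Cl", "Br", "I"} for tok in tokens))


tokens = ["Cl", "C", "C", "Br"]
scores = occlusion_attribution(tokens, toy_predict)
for tok, score in zip(tokens, scores):
    print(f"{tok:>2}  {score:+.1f}")
```

With a real model in place of `toy_predict`, tokens with large scores would point to the substructures that most influence a given ADMET prediction.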

Overall, this is a strong contribution to the field of computational drug discovery, demonstrating the power of large-scale foundation models for accelerating ADMET prediction. The SMILES-Mamba models have the potential to significantly streamline the drug development process and help identify promising drug candidates earlier on.

Conclusion

The SMILES-Mamba paper presents a novel approach to predicting the ADMET properties of drug molecules using advanced deep learning models. By combining self-supervised pre-training on unlabeled SMILES strings with a Mamba-based architecture, the authors show that their model is competitive across 22 ADMET datasets and achieves the highest score on 14 of them.

This work highlights the potential of foundation models, trained on extensive chemical data, to accelerate the drug discovery process. By providing accurate and reliable ADMET predictions, SMILES-Mamba could help pharmaceutical researchers identify promising drug candidates early on and reduce the time and resources needed for experimental testing.

While the study has some limitations, such as the reliance on 1D SMILES representations, the overall contribution is significant and demonstrates the power of large-scale machine learning for computational drug discovery. As the field continues to evolve, we can expect to see more innovative applications of foundation models in this domain, further enhancing our ability to develop safer and more effective therapeutic agents.

Related Papers

SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction

Bohao Xu, Yingzhou Lu, Chenhao Li, Ling Yue, Xiao Wang, Nan Hao, Tianfan Fu, Jim Chen

In drug discovery, predicting the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of small-molecule drugs is critical for ensuring safety and efficacy. However, the process of accurately predicting these properties is often resource-intensive and requires extensive experimental data. To address this challenge, we propose SMILES-Mamba, a two-stage model that leverages both unlabeled and labeled data through a combination of self-supervised pretraining and fine-tuning strategies. The model first pre-trains on a large corpus of unlabeled SMILES strings to capture the underlying chemical structure and relationships, before being fine-tuned on smaller, labeled datasets specific to ADMET tasks. Our results demonstrate that SMILES-Mamba exhibits competitive performance across 22 ADMET datasets, achieving the highest score in 14 tasks, highlighting the potential of self-supervised learning in improving molecular property prediction. This approach not only enhances prediction accuracy but also reduces the dependence on large, labeled datasets, offering a promising direction for future research in drug discovery.

8/13/2024

Accelerating Drug Safety Assessment using Bidirectional-LSTM for SMILES Data

K. Venkateswara Rao, Dr. Kunjam Nageswara Rao, Dr. G. Sita Ratnam

Computational methods are useful in accelerating the pace of drug discovery. Drug discovery involves several steps, such as target identification and validation, lead discovery, and lead optimisation. In the lead optimisation phase, the absorption, distribution, metabolism, excretion, and toxicity properties of lead compounds are assessed. This work addresses the prediction of toxicity and solubility for lead compounds represented in Simplified Molecular Input Line Entry System (SMILES) notation. Among the different approaches that work on SMILES data, the proposed model was built using a sequence-based approach. The proposed Bi-Directional Long Short Term Memory (BiLSTM) is a variant of Recurrent Neural Network (RNN) that processes input molecular sequences in both forward and backward directions for a comprehensive examination of the structural features of molecules. The proposed work aims to understand the sequential patterns encoded in the SMILES strings, which are then utilised for predicting the toxicity of the molecules. On the ClinTox dataset, the proposed model surpasses previous approaches such as Trimnet and Pre-training Graph neural networks (GNN) by achieving a ROC accuracy of 0.96. BiLSTM also outperforms the previous model on the FreeSolv dataset, with a low RMSE value of 1.22 in solubility prediction.

7/30/2024

Can Large Language Models Understand Molecules?

Shaghayegh Sadeghi, Alan Bui, Ali Forooghi, Jianguo Lu, Alioune Ngom

Purpose: Large Language Models (LLMs) like GPT (Generative Pre-trained Transformer) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in the field of cheminformatics, particularly in understanding Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs also have the ability to decode SMILES strings into vector representations. Method: We investigate the performance of GPT and LLaMA compared to pre-trained models on SMILES in embedding SMILES strings on downstream tasks, focusing on two key applications: molecular property prediction and drug-drug interaction prediction. Results: We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to pre-trained models on SMILES in molecular prediction tasks and outperform the pre-trained models for the DDI prediction tasks. Conclusion: The performance of LLMs in generating SMILES embeddings shows great potential for further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding, motivating additional research into the potential of LLMs in the molecular representation field. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT

5/22/2024

Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking

Thomas Le Menestrel, Manuel Rivas

Docking is a crucial component in drug discovery aimed at predicting the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, the availability of comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remains limited. We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking. We created a framework combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, giving us more than 25 million protein-ligand binding scores. The dataset leverages a wide range of high-accuracy AlphaFold protein models, encompasses a diverse set of biologically relevant compounds and enables researchers to benchmark all major approaches for ML-based docking such as Graph, Transformer and CNN-based methods. We also introduce a novel Transformer-based architecture for docking scores prediction and set it as an initial benchmark for our dataset. Our dataset and code are publicly available to support the development of novel ML-based methods for molecular docking to advance scientific research in this field.

6/11/2024