Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking

Read original: arXiv:2406.05738 - Published 6/11/2024 by Thomas Le Menestrel, Manuel Rivas

Overview

Introduces a new open-source dataset called Smiles2Dock for machine learning-based molecular docking
Provides a large-scale, multi-task dataset for training and evaluating docking models
Aims to advance the field of computational drug discovery and design

Plain English Explanation

The paper introduces a new dataset called Smiles2Dock that is designed to help improve machine learning models for molecular docking. Molecular docking is a computational technique used in drug discovery to predict how a candidate drug molecule will bind to a target protein.

The Smiles2Dock dataset contains a large collection of chemical compounds and information about how they interact with various target proteins. This data can be used to train and test machine learning models that aim to accurately predict the binding interactions between molecules and proteins.

By making this dataset openly available, the researchers hope to accelerate progress in the field of computational drug discovery. More accurate docking models could streamline the process of identifying promising drug candidates, potentially leading to new treatments for diseases.

Technical Explanation

The paper introduces the Smiles2Dock dataset, a large-scale, multi-task dataset for training and evaluating machine learning-based molecular docking models. According to the paper's abstract, the authors docked 1.7 million ligands from the ChEMBL database against 15 AlphaFold protein structures, yielding more than 25 million protein-ligand binding scores.
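The headline figures in the paper's abstract can be sanity-checked with a line of arithmetic. The sketch below assumes every ligand was docked against every protein, which the abstract does not state explicitly; the ligand count is also a rounded figure, so the result is only an approximate upper bound.

```python
# Figures quoted in the paper's abstract.
NUM_LIGANDS = 1_700_000  # "1.7 million ligands" (rounded in the abstract)
NUM_PROTEINS = 15        # "15 AlphaFold proteins"

# Assumption: a full cross product, i.e. each ligand docked against each protein.
max_pairs = NUM_LIGANDS * NUM_PROTEINS

print(f"{max_pairs:,} ligand-protein pairs")  # prints "25,500,000 ligand-protein pairs"
```

Under that assumption the product is consistent with the "more than 25 million" binding scores the abstract reports.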

The dataset was built with a framework that combines P2Rank, a binding-pocket prediction tool, with AutoDock Vina, a docking program that scores ligand poses. The ligands come from the ChEMBL database and the protein structures are AlphaFold models. The molecules are represented using the SMILES string format, and the docking data consists of binding scores between the molecules and the target proteins.
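The paper's abstract describes a framework that combines P2Rank and AutoDock Vina. A minimal sketch of how such a pairing could be wired together is shown here: take the centre of the top-ranked predicted pocket and use it to place the Vina search box. This is not the authors' code. The CSV layout (columns `center_x`, `center_y`, `center_z`, best pocket listed first), the example values, the file names, and the 20 Å box size are all assumptions made for illustration.

```python
import csv
import io
from typing import List, Tuple

# Hypothetical pocket-prediction output; real files may contain more columns.
EXAMPLE_CSV = """name, rank, score, center_x, center_y, center_z
pocket1, 1, 12.5, 10.0, -4.25, 7.5
pocket2, 2, 3.1, 0.0, 1.0, 2.0
"""

def top_pocket_center(predictions_csv: str) -> Tuple[float, float, float]:
    """Return the (x, y, z) centre of the first-listed pocket."""
    reader = csv.reader(io.StringIO(predictions_csv))
    header = [h.strip() for h in next(reader)]   # headers may be space-padded
    first = [v.strip() for v in next(reader)]    # assumed to be the best pocket
    row = dict(zip(header, first))
    return (float(row["center_x"]), float(row["center_y"]), float(row["center_z"]))

def vina_command(receptor: str, ligand: str,
                 center: Tuple[float, float, float],
                 box_size: float = 20.0,
                 out: str = "docked.pdbqt") -> List[str]:
    """Build an AutoDock Vina argument list for a cubic search box."""
    x, y, z = center
    size = str(box_size)  # box edge length in angstroms (assumed value)
    return [
        "vina",
        "--receptor", receptor,
        "--ligand", ligand,
        "--center_x", str(x), "--center_y", str(y), "--center_z", str(z),
        "--size_x", size, "--size_y", size, "--size_z", size,
        "--out", out,
    ]

center = top_pocket_center(EXAMPLE_CSV)
cmd = vina_command("protein.pdbqt", "ligand.pdbqt", center)
print(" ".join(cmd))
```

The argument list could be passed to `subprocess.run` on a machine with Vina installed. Preparing the receptor and ligand PDBQT files from AlphaFold structures and SMILES strings is a separate step that this sketch leaves out.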

The dataset is designed to let researchers benchmark the major approaches to ML-based docking, including Graph, Transformer, and CNN-based methods. The authors also introduce a Transformer-based architecture for docking score prediction and report it as an initial benchmark for the dataset.
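Because the molecules are stored as SMILES strings, a sequence model has to split each string into tokens and map them to integer ids before training. The sketch below shows one common, simplified way to do that with a regular expression. It is an illustration only, not the tokenizer used in the paper, and the function names are hypothetical.

```python
import re

# Simplified SMILES token pattern: bracket atoms, two-letter halogens,
# two-digit ring closures, then any single remaining character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)

def build_vocab(smiles_list):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i + 1 for i, tok in enumerate(tokens)}

def encode(smiles, vocab, max_len):
    """Convert a SMILES string to a fixed-length list of token ids."""
    ids = [vocab[tok] for tok in tokenize(smiles)][:max_len]
    return ids + [0] * (max_len - len(ids))

molecules = ["ClCCBr", "C[C@H](N)C(=O)O"]
vocab = build_vocab(molecules)
encoded = [encode(s, vocab, max_len=16) for s in molecules]
print(tokenize("C[C@H](N)C(=O)O"))
```

Treating `Cl` and `Br` as single tokens keeps chlorine from being read as a carbon followed by a stray `l`, and keeping bracket atoms whole preserves stereochemistry markers such as `@`.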

Critical Analysis

The Smiles2Dock dataset represents a significant contribution to the field of computational drug discovery, as it provides a large-scale, multi-task dataset that can be used to develop and evaluate new machine learning-based docking models.

One potential limitation of the dataset is that it is primarily focused on small molecule-protein interactions, and may not fully capture the complexities of larger, more diverse biomolecular systems. Additionally, the binding scores are computed with existing docking software (AutoDock Vina) on predicted AlphaFold structures rather than measured experimentally, so they inherit the biases and limitations of those methods.

However, the researchers acknowledge these limitations and encourage the community to further expand and refine the dataset. They also note that the dataset can be used to identify areas for improvement in current docking and machine learning approaches, potentially leading to the development of more accurate and reliable models.

Conclusion

The Smiles2Dock dataset represents an important step forward in the field of computational drug discovery, providing a large-scale, multi-task dataset that can be used to develop and evaluate machine learning-based docking models. By making this dataset openly available, the researchers hope to accelerate progress in this critical area of research, ultimately leading to the discovery of new and more effective drug therapies.

Related Papers

Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking

Thomas Le Menestrel, Manuel Rivas

Docking is a crucial component in drug discovery aimed at predicting the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, the availability of comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remains limited. We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking. We created a framework combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, giving us more than 25 million protein-ligand binding scores. The dataset leverages a wide range of high-accuracy AlphaFold protein models, encompasses a diverse set of biologically relevant compounds and enables researchers to benchmark all major approaches for ML-based docking such as Graph, Transformer and CNN-based methods. We also introduce a novel Transformer-based architecture for docking scores prediction and set it as an initial benchmark for our dataset. Our dataset and code are publicly available to support the development of novel ML-based methods for molecular docking to advance scientific research in this field.

6/11/2024

Uni-Mol Docking V2: Towards Realistic and Accurate Binding Pose Prediction

Eric Alcaide, Zhifeng Gao, Guolin Ke, Yaqi Li, Linfeng Zhang, Hang Zheng, Gengmo Zhou

In recent years, machine learning (ML) methods have emerged as promising alternatives for molecular docking, offering the potential for high accuracy without incurring prohibitive computational costs. However, recent studies have indicated that these ML models may overfit to quantitative metrics while neglecting the physical constraints inherent in the problem. In this work, we present Uni-Mol Docking V2, which demonstrates a remarkable improvement in performance, accurately predicting the binding poses of 77+% of ligands in the PoseBusters benchmark with an RMSD value of less than 2.0 Å, and 75+% passing all quality checks. This represents a significant increase from the 62% achieved by the previous Uni-Mol Docking model. Notably, our Uni-Mol Docking approach generates chemically accurate predictions, circumventing issues such as chirality inversions and steric clashes that have plagued previous ML models. Furthermore, we observe enhanced performance in terms of high-quality predictions (RMSD values of less than 1.0 Å and 1.5 Å) and physical soundness when Uni-Mol Docking is combined with more physics-based methods like Uni-Dock. Our results represent a significant advancement in the application of artificial intelligence for scientific research, adopting a holistic approach to ligand docking that is well-suited for industrial applications in virtual screening and drug design. The code, data and service for Uni-Mol Docking are publicly available for use and further development in https://github.com/dptech-corp/Uni-Mol.

5/21/2024

Deep Learning for Protein-Ligand Docking: Are We There Yet?

Alex Morehead, Nabin Giri, Jian Liu, Jianlin Cheng

The effects of ligand binding on protein structures and their in vivo functions carry numerous implications for modern biomedical research and biotechnology development efforts such as drug discovery. Although several deep learning (DL) methods and benchmarks designed for protein-ligand docking have recently been introduced, to date no prior works have systematically studied the behavior of docking methods within the practical context of (1) using predicted (apo) protein structures for docking (e.g., for broad applicability); (2) docking multiple ligands concurrently to a given target protein (e.g., for enzyme design); and (3) having no prior knowledge of binding pockets (e.g., for pocket generalization). To enable a deeper understanding of docking methods' real-world utility, we introduce PoseBench, the first comprehensive benchmark for practical protein-ligand docking. PoseBench enables researchers to rigorously and systematically evaluate DL docking methods for apo-to-holo protein-ligand docking and protein-ligand structure generation using both single and multi-ligand benchmark datasets, the latter of which we introduce for the first time to the DL community. Empirically, using PoseBench, we find that all recent DL docking methods but one fail to generalize to multi-ligand protein targets and also that template-based docking algorithms perform equally well or better for multi-ligand docking as recent single-ligand DL docking methods, suggesting areas of improvement for future work. Code, data, tutorials, and benchmark results are available at https://github.com/BioinfoMachineLearning/PoseBench.

7/9/2024

MoleculeCLA: Rethinking Molecular Benchmark via Computational Ligand-Target Binding Analysis

Shikun Feng, Jiaxin Zheng, Yinjun Jia, Yanwen Huang, Fengfeng Zhou, Wei-Ying Ma, Yanyan Lan

Molecular representation learning is pivotal for various molecular property prediction tasks related to drug discovery. Robust and accurate benchmarks are essential for refining and validating current methods. Existing molecular property benchmarks derived from wet experiments, however, face limitations such as data volume constraints, unbalanced label distribution, and noisy labels. To address these issues, we construct a large-scale and precise molecular representation dataset of approximately 140,000 small molecules, meticulously designed to capture an extensive array of chemical, physical, and biological properties, derived through a robust computational ligand-target binding analysis pipeline. We conduct extensive experiments on various deep learning models, demonstrating that our dataset offers significant physicochemical interpretability to guide model development and design. Notably, the dataset's properties are linked to binding affinity metrics, providing additional insights into model performance in drug-target interaction tasks. We believe this dataset will serve as a more accurate and reliable benchmark for molecular representation learning, thereby expediting progress in the field of artificial intelligence-driven drug discovery.

6/27/2024