Regression with Large Language Models for Materials and Molecular Property Prediction

Read original: arXiv:2409.06080 - Published 9/11/2024 by Ryan Jacobs, Maciej P. Polak, Lane E. Schultz, Hamed Mahdavi, Vasant Honavar, Dane Morgan

Overview

Large language models (LLMs) can now perform material and molecular property regression tasks, going beyond their traditional use cases.
The researchers benchmarked the LLaMA 3 model on molecular properties in the QM9 dataset and 24 materials properties.
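Only composition-based input strings were used as model input, with molecules written in the SMILES representation, and LLaMA 3 was fine-tuned on the generative loss alone.
The results show that LLaMA 3 can provide useful regression results that rival standard property prediction models such as random forest or fully connected neural networks on QM9, though its errors are 5-10 times higher than those of state-of-the-art models trained on more granular molecular representations.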
Interestingly, LLaMA 3 gives better predictions than GPT-3.5 and GPT-4o on these tasks.

Plain English Explanation

Large language models (LLMs) are powerful AI systems trained on vast amounts of text data. Traditionally, these models have been used for tasks like language translation and text generation. However, this new research demonstrates that LLMs can also be used for predicting the properties of materials and molecules.

The researchers tested the LLaMA 3 model, a large language model developed by Meta AI, on two types of property prediction tasks: several molecular properties from the QM9 dataset and 24 different materials properties. To do this, they fine-tuned LLaMA 3 on text inputs only, with molecules written in the SMILES representation (a text-based way of describing molecular structure), and trained it on nothing but the model's generative loss.

Surprisingly, the fine-tuned LLaMA 3 model was able to produce reasonably accurate predictions of these material and molecular properties, rivaling the performance of standard property prediction models like random forests and fully connected neural networks on the QM9 dataset. This suggests that LLMs like LLaMA 3 can be versatile tools capable of tackling complex physical and chemical phenomena, beyond their traditional language-based applications.

The researchers also found that LLaMA 3 gave better predictions than GPT-3.5 and GPT-4o on these tasks, which suggests that an openly available model can be a competitive choice for property regression in fields like chemistry and materials science.

Technical Explanation

The researchers conducted experiments to evaluate the ability of the LLaMA 3 large language model to perform material and molecular property regression tasks. They benchmarked LLaMA 3 on two datasets: the QM9 dataset of molecular properties and a set of 24 materials properties.

For the input, the researchers used only composition-based strings, with molecules written in the SMILES representation, a text-based way of describing molecular structure. They fine-tuned the LLaMA 3 model on the generative loss alone, without any additional input features like atom types or coordinates.
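To make the setup above concrete, here is a minimal sketch of how a regression target can be cast as text for generative fine-tuning and then read back as a number. The prompt template, the `make_record` and `parse_prediction` helpers, the number of decimals, and the example value are all assumptions made for illustration; the summary does not state the wording or formatting the authors actually used.

```python
import re
from typing import Optional

# Hypothetical prompt template, not taken from the paper.
PROMPT_TEMPLATE = "What is the {prop} of {smiles}?"

# Matches an optionally signed decimal number, with an optional exponent.
_NUMBER = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def make_record(smiles: str, prop: str, value: float, decimals: int = 4) -> dict:
    """Turn one labelled molecule into a prompt/completion text pair.

    The target value is written out as a fixed-precision decimal string,
    so the model learns to generate the number token by token.
    """
    return {
        "prompt": PROMPT_TEMPLATE.format(prop=prop, smiles=smiles),
        "completion": f"{value:.{decimals}f}",
    }


def parse_prediction(generated: str) -> Optional[float]:
    """Return the first number found in generated text, or None if absent."""
    match = _NUMBER.search(generated)
    return float(match.group()) if match else None


# Illustrative value only; this is not a QM9 label.
record = make_record("CCO", "HOMO-LUMO gap in eV", 7.1234567)
print(record)
print(parse_prediction(" 7.1235 eV"))
```

Returning `None` when no number is found keeps malformed generations visible to the caller, which can then decide whether to drop them or count them as failures when scoring.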

The results showed that the fine-tuned LLaMA 3 model was able to produce useful regression results on both the molecular and materials property prediction tasks. The model's performance rivaled that of standard materials property prediction models like random forests and fully connected neural networks on the QM9 dataset.

However, the researchers found that the LLaMA 3 model's errors were 5-10 times higher than state-of-the-art models that used more detailed molecular representations (e.g., atom types and coordinates) for the same tasks. This suggests that the text-based SMILES input is a less informative representation for these property prediction problems compared to the more granular molecular features used by specialized models.
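The error comparison above can be illustrated with a short worked example. This sketch assumes mean absolute error (MAE) as the metric, which the summary does not specify, and every number in it is invented purely so that the ratio falls inside the reported 5-10x band; none of it is data from the paper.

```python
from statistics import fmean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


# Made-up reference values and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
llm_pred = [1.5, 1.5, 3.5, 3.5]                  # text-only model
specialist_pred = [1.0625, 1.9375, 3.0625, 3.9375]  # structure-aware model

llm_mae = mean_absolute_error(y_true, llm_pred)
specialist_mae = mean_absolute_error(y_true, specialist_pred)

# An error ratio of 8 would sit inside the 5-10x range quoted above.
error_ratio = llm_mae / specialist_mae
print(llm_mae, specialist_mae, error_ratio)
```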

Interestingly, the researchers found that LLaMA 3 gave better predictions than GPT-3.5 and GPT-4o on these material and molecular property prediction tasks. One possible reading is that LLaMA 3's architecture and training procedure suit this kind of physical and chemical regression better, though the reported results do not by themselves establish the cause.

Critical Analysis

The research demonstrates the versatility of large language models and their potential to tackle complex physical and chemical problems, going beyond their traditional language-based applications. The ability of LLaMA 3 to produce reasonably accurate predictions of material and molecular properties using only the SMILES representation is a notable achievement.

However, the researchers acknowledge that the performance of LLaMA 3 is still inferior to specialized models that use more detailed molecular representations. This suggests that while LLMs can be leveraged for these tasks, there is still room for improvement in terms of model architecture and training to better capture the underlying physical and chemical principles.

Additionally, the researchers only tested the models on a limited set of properties and datasets. Further research is needed to understand the breadth and limits of LLM capabilities in materials science and chemistry, as well as how they might compare to other state-of-the-art approaches in these domains.

It would also be valuable to explore ways of incorporating more domain-specific information and constraints into the LLM fine-tuning process, potentially leading to even better performance on these types of regression tasks.

Conclusion

This research demonstrates the potential of large language models to go beyond their traditional language-based applications and tackle complex material and molecular property prediction tasks. The ability of the LLaMA 3 model to produce useful regression results using only the SMILES representation of molecules suggests that LLMs can be versatile tools for scientific domains like chemistry and materials science.

While the performance of LLaMA 3 is still inferior to specialized models, the findings highlight the promise of LLM-based approaches and open up new avenues for future research and applications in these fields. As LLM architectures and training methods continue to evolve, the gap to specialized models in physical and chemical property prediction may narrow.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Regression with Large Language Models for Materials and Molecular Property Prediction

Ryan Jacobs, Maciej P. Polak, Lane E. Schultz, Hamed Mahdavi, Vasant Honavar, Dane Morgan

We demonstrate the ability of large language models (LLMs) to perform material and molecular property regression tasks, a significant deviation from the conventional LLM use case. We benchmark the Large Language Model Meta AI (LLaMA) 3 on several molecular properties in the QM9 dataset and 24 materials properties. Only composition-based input strings are used as the model input and we fine tune on only the generative loss. We broadly find that LLaMA 3, when fine-tuned using the SMILES representation of molecules, provides useful regression results which can rival standard materials property prediction models like random forest or fully connected neural networks on the QM9 dataset. Not surprisingly, LLaMA 3 errors are 5-10x higher than those of the state-of-the-art models that were trained using far more granular representation of molecules (e.g., atom types and their coordinates) for the same task. Interestingly, LLaMA 3 provides improved predictions compared to GPT-3.5 and GPT-4o. This work highlights the versatility of LLMs, suggesting that LLM-like generative models can potentially transcend their traditional applications to tackle complex physical phenomena, thus paving the way for future research and applications in chemistry, materials science and other scientific domains.

9/11/2024

Can Large Language Models Understand Molecules?

Shaghayegh Sadeghi, Alan Bui, Ali Forooghi, Jianguo Lu, Alioune Ngom

Purpose: Large Language Models (LLMs) like GPT (Generative Pre-trained Transformer) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in the field of cheminformatics, particularly in understanding Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs also have the ability to decode SMILES strings into vector representations. Method: We investigate the performance of GPT and LLaMA compared to pre-trained models on SMILES in embedding SMILES strings on downstream tasks, focusing on two key applications: molecular property prediction and drug-drug interaction prediction. Results: We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to pre-trained models on SMILES in molecular prediction tasks and outperform the pre-trained models for the DDI prediction tasks. Conclusion: The performance of LLMs in generating SMILES embeddings shows great potential for further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding, motivating additional research into the potential of LLMs in the molecular representation field. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT

5/22/2024

The Role of Model Architecture and Scale in Predicting Molecular Properties: Insights from Fine-Tuning RoBERTa, BART, and LLaMA

Lee Youngmin, Lang S. I. D. Andrew, Cai Duoduo, Wheat R. Stephen

This study introduces a systematic framework to compare the efficacy of Large Language Models (LLMs) for fine-tuning across various cheminformatics tasks. Employing a uniform training methodology, we assessed three well-known models (RoBERTa, BART, and LLaMA) on their ability to predict molecular properties using the Simplified Molecular Input Line Entry System (SMILES) as a universal molecular representation format. Our comparative analysis involved pre-training 18 configurations of these models, with varying parameter sizes and dataset scales, followed by fine-tuning them on six benchmarking tasks from DeepChem. We maintained consistent training environments across models to ensure reliable comparisons. This approach allowed us to assess the influence of model type, size, and training dataset size on model performance. Specifically, we found that LLaMA-based models generally offered the lowest validation loss, suggesting their superior adaptability across tasks and scales. However, we observed that absolute validation loss is not a definitive indicator of model performance, which contradicts previous research, at least for fine-tuning tasks; instead, model size plays a crucial role. Through rigorous replication and validation, involving multiple training and fine-tuning cycles, our study not only delineates the strengths and limitations of each model type but also provides a robust methodology for selecting the most suitable LLM for specific cheminformatics applications. This research underscores the importance of considering model architecture and dataset characteristics in deploying AI for molecular property prediction, paving the way for more informed and effective utilization of AI in drug discovery and related fields.

5/3/2024

Cross-Modal Learning for Chemistry Property Prediction: Large Language Models Meet Graph Machine Learning

Sakhinana Sagar Srinivas, Venkataramana Runkana

In the field of chemistry, the objective is to create novel molecules with desired properties, facilitating accurate property predictions for applications such as material design and drug screening. However, existing graph deep learning methods face limitations that curb their expressive power. To address this, we explore the integration of vast molecular domain knowledge from Large Language Models (LLMs) with the complementary strengths of Graph Neural Networks (GNNs) to enhance performance in property prediction tasks. We introduce a Multi-Modal Fusion (MMF) framework that synergistically harnesses the analytical prowess of GNNs and the linguistic generative and predictive abilities of LLMs, thereby improving accuracy and robustness in predicting molecular properties. Our framework combines the effectiveness of GNNs in modeling graph-structured data with the zero-shot and few-shot learning capabilities of LLMs, enabling improved predictions while reducing the risk of overfitting. Furthermore, our approach effectively addresses distributional shifts, a common challenge in real-world applications, and showcases the efficacy of learning cross-modal representations, surpassing state-of-the-art baselines on benchmark datasets for property prediction tasks.

8/28/2024