Generative Pre-Trained Transformer for Symbolic Regression Base In-Context Reinforcement Learning

2404.06330

Published 4/10/2024 by Yanjie Li, Weijun Li, Lina Yu, Min Wu, Jingyi Liu, Wenqiang Li, Meilan Hao, Shu Wei, Yusong Deng

Abstract

Mathematical formulas are the human language for describing nature and the essence of scientific research. Finding mathematical formulas from observational data is a major demand of scientific research and a major challenge for artificial intelligence. This area is called symbolic regression (SR). Originally, symbolic regression was often formulated as a combinatorial optimization problem and solved using genetic programming (GP) or reinforcement learning algorithms. These two kinds of algorithms are robust to noise and versatile. However, inference usually takes a long time, so their search efficiency is relatively low. Later, methods based on large-scale pre-training were proposed; they use a large number of synthetic pairs of data points and expressions to train a Generative Pre-Trained Transformer (GPT). Such a GPT needs only one forward pass to obtain a result, so inference is very fast. However, its performance depends heavily on the training data, and it performs poorly on data outside the training set, which leads to poor noise robustness and versatility. So, can we combine the advantages of these two categories of SR algorithms? In this paper, we propose FormulaGPT, which trains a GPT using massive sparse-reward learning histories of reinforcement learning-based SR algorithms as training data. After training, the reinforcement learning-based SR algorithm is distilled into a Transformer. When new test data arrives, FormulaGPT can directly generate a reinforcement learning process and automatically update the learning policy in context. Tested on more than ten datasets including SRBench, FormulaGPT achieves state-of-the-art fitting performance compared with four baselines. In addition, it achieves satisfactory results in noise robustness, versatility, and inference efficiency.

Overview

This paper introduces FormulaGPT, a novel approach for symbolic regression using a Generative Pre-Trained Transformer (GPT) model that performs reinforcement learning in context.
The proposed method aims to combine the strengths of two families of symbolic regression techniques: search-based methods such as genetic programming and reinforcement learning, which are robust but slow, and pre-trained Transformers, which are fast but generalize poorly beyond their training data.
The authors evaluate their approach on more than ten datasets, including SRBench, and compare it with four baselines.

Plain English Explanation

The paper presents a new way to solve symbolic regression problems using a type of AI model called a Generative Pre-Trained Transformer (GPT). Symbolic regression is the process of finding mathematical equations that best fit a set of data. Traditional methods, like genetic programming, have limitations, so the researchers wanted to try a different approach using large language models.

Large language models, like GPT, are very good at understanding and generating human-like text. The researchers trained a GPT model to learn how to generate symbolic mathematical expressions that fit the data, using a technique called in-context reinforcement learning. This means the model learns by getting feedback on whether its generated expressions are accurate, and it gradually improves.

The key idea is to leverage the powerful sequence generation capabilities of GPT to solve symbolic regression problems, which are traditionally quite challenging. The researchers report that their approach fits data better than four baseline methods on more than ten datasets.

Technical Explanation

The paper proposes FormulaGPT, a Generative Pre-Trained Transformer (GPT) based approach for symbolic regression built on in-context reinforcement learning. According to the abstract, the key novelty is the training data: rather than text, the GPT is trained on a large collection of sparse-reward learning histories produced by reinforcement learning-based symbolic regression algorithms, which distills the reinforcement learning algorithm itself into the Transformer.
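The abstract describes training on learning histories of reinforcement learning-based SR algorithms. One way such a history could be flattened into a single training sequence is sketched below. The marker tokens (`<expr>`, `<r0>` to `<r9>`), the reward bucketing, and the function name are all assumptions made for illustration; the paper's actual encoding is not described in this summary.

```python
def serialize_history(episodes, reward_bins=10):
    """Flatten an RL learning history into one token sequence.

    episodes: list of (expression_tokens, reward) pairs in the order the
    RL run produced them, with each reward assumed to lie in [0, 1].
    All marker tokens here are hypothetical.
    """
    sequence = []
    for tokens, reward in episodes:
        if not 0.0 <= reward <= 1.0:
            raise ValueError("reward must lie in [0, 1]")
        # Quantize the reward so it can be written as a discrete token.
        bucket = min(int(reward * reward_bins), reward_bins - 1)
        sequence.append("<expr>")
        sequence.extend(tokens)
        sequence.append(f"<r{bucket}>")
    return sequence


# A toy history: a poor first guess, then a better expression.
history = [(["x"], 0.25), (["mul", "x", "x"], 1.0)]
flat = serialize_history(history)
```

Keeping episodes in chronological order is the point of the design: a model trained to continue such sequences sees rewards that tend to improve over time, which is what would let it imitate the improvement process rather than a single final answer.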

The authors first define a domain-specific language (DSL) to represent the space of candidate symbolic expressions. They then train a GPT model to generate expressions in this DSL, using a reinforcement learning algorithm that provides feedback on the accuracy of the generated expressions.
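To make the idea of a restricted expression language concrete, here is a minimal sketch in which an expression is a list of tokens in prefix notation. The symbol set (`add`, `sub`, `mul`, `sin`, `cos`, a single variable `x`, numeric constants) and the function name `eval_prefix` are assumptions chosen for illustration, not the vocabulary used in the paper.

```python
import math

# Hypothetical symbol library; the paper's actual vocabulary may differ.
BINARY = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}
UNARY = {"sin": math.sin, "cos": math.cos}


def eval_prefix(tokens, x):
    """Evaluate a prefix-notation expression at a scalar input x."""

    def walk(pos):
        # Returns (value, index of the next unread token).
        tok = tokens[pos]
        if tok in BINARY:
            left, pos = walk(pos + 1)
            right, pos = walk(pos)
            return BINARY[tok](left, right), pos
        if tok in UNARY:
            arg, pos = walk(pos + 1)
            return UNARY[tok](arg), pos
        if tok == "x":
            return x, pos + 1
        return float(tok), pos + 1  # anything else is a numeric constant

    value, end = walk(0)
    if end != len(tokens):
        raise ValueError("expression has trailing tokens")
    return value


# x*x + sin(x) written in prefix form.
expr = ["add", "mul", "x", "x", "sin", "x"]
value_at_2 = eval_prefix(expr, 2.0)
```

A prefix encoding is convenient for a sequence model because each token sequence corresponds to at most one expression tree, so no parentheses are needed.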

Specifically, the model is trained on a set of symbolic regression tasks, where it receives the input data and a target function, and must generate the symbolic expression that best fits the data. The model is rewarded for generating expressions that minimize the error between the predicted and target functions, and it iteratively improves its expression generation capabilities.
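The reward described above can be illustrated with a squashed normalized error, a common choice in reinforcement learning-based symbolic regression. This is a sketch under that assumption: the summary does not state which reward FormulaGPT uses, and the function names are hypothetical.

```python
import math


def nrmse(y_true, y_pred):
    """Root-mean-square error divided by the standard deviation of y_true."""
    n = len(y_true)
    mean = sum(y_true) / n
    variance = sum((y - mean) ** 2 for y in y_true) / n
    if variance == 0.0:
        raise ValueError("targets are constant; NRMSE is undefined")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    return math.sqrt(mse) / math.sqrt(variance)


def reward(y_true, y_pred):
    """Map the error into (0, 1]; a perfect fit gets reward 1."""
    return 1.0 / (1.0 + nrmse(y_true, y_pred))


targets = [1.0, 2.0, 3.0, 4.0]
perfect = reward(targets, targets)
mean_guess = reward(targets, [2.5, 2.5, 2.5, 2.5])
```

Under this definition, an expression that only predicts the mean of the targets has an NRMSE of 1 and so receives a reward of about 0.5, which gives a natural reference point for judging candidates.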

The authors evaluate their approach on more than ten datasets, including SRBench, and compare it with four baselines. According to the abstract, FormulaGPT achieves state-of-the-art fitting performance against these baselines, along with satisfactory noise robustness, versatility, and inference efficiency.

Critical Analysis

The paper presents a promising approach for symbolic regression, leveraging the powerful text generation capabilities of large language models. However, the authors acknowledge several limitations and areas for future research.

One key limitation is the dependence on the pre-defined domain-specific language (DSL) to represent the space of candidate expressions. While the DSL was designed to be expressive, it may still limit the model's ability to discover truly novel mathematical expressions. Exploring more open-ended generation approaches, like those used in generative software engineering, could be an interesting direction for future work.

Additionally, the in-context reinforcement learning approach used to train the model relies on the availability of a large number of symbolic regression tasks for the model to learn from. In practical scenarios, such a diverse dataset may not always be available. Investigating ways to adapt the model to new tasks with limited data, or to leverage pre-training on broader datasets, could improve the model's applicability.

Overall, the paper presents an intriguing approach that demonstrates the potential of large language models for symbolic regression. Further research to address the identified limitations and explore the broader implications of this work could lead to significant advancements in the field of automated mathematical reasoning and symbolic computing.

Conclusion

This paper introduces a novel Generative Pre-Trained Transformer (GPT) based approach for symbolic regression, trained using in-context reinforcement learning. The key contribution is the use of a powerful language model to generate symbolic mathematical expressions that fit given datasets, addressing the limitations of traditional symbolic regression techniques.

The results show that the proposed method outperforms state-of-the-art baselines on a range of symbolic regression tasks, highlighting the potential of large language models for automated mathematical reasoning. While the paper identifies several areas for future research, the work demonstrates the exciting possibilities at the intersection of machine learning and symbolic computation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Symbolic Framework for Evaluating Mathematical Reasoning and Generalisation with Transformers

Jordan Meadows, Marco Valentino, Damien Teney, Andre Freitas

This paper proposes a methodology for generating and perturbing detailed derivations of equations at scale, aided by a symbolic engine, to evaluate the generalisability of Transformers to out-of-distribution mathematical reasoning problems. Instantiating the framework in the context of sequence classification tasks, we compare the capabilities of GPT-4, GPT-3.5, and a canon of fine-tuned BERT models, exploring the relationship between specific operators and generalisation failure via the perturbation of reasoning aspects such as symmetry and variable surface forms. Surprisingly, our empirical evaluation reveals that the average in-distribution performance of fine-tuned models surpasses GPT-3.5, and rivals GPT-4. However, perturbations to input reasoning can reduce their performance by up to 80 F1 points. Overall, the results suggest that the in-distribution performance of smaller open-source models may potentially rival GPT by incorporating appropriately structured derivation dependencies during training, and highlight a shared weakness between BERT and GPT involving a relative inability to decode indirect references to mathematical entities. We release the full codebase, constructed datasets, and fine-tuned models to encourage future progress in the field.

4/9/2024

cs.CL cs.LG

In-Context Symbolic Regression: Leveraging Language Models for Function Discovery

Matteo Merler, Nicola Dainese, Katsiaryna Haitsiukevich

Symbolic Regression (SR) is a task which aims to extract the mathematical expression underlying a set of empirical observations. Transformer-based methods trained on SR datasets hold the current state-of-the-art in this task, while the application of Large Language Models (LLMs) to SR remains unexplored. This work investigates the integration of pre-trained LLMs into the SR pipeline, utilizing an approach that iteratively refines a functional form based on the prediction error it achieves on the observation set, until it reaches convergence. Our method leverages LLMs to propose an initial set of possible functions based on the observations, exploiting their strong pre-training prior. These functions are then iteratively refined by the model itself and by an external optimizer for their coefficients. The process is repeated until the results are satisfactory. We then analyze Vision-Language Models in this context, exploring the inclusion of plots as visual inputs to aid the optimization process. Our findings reveal that LLMs are able to successfully recover good symbolic equations that fit the given data, outperforming SR baselines based on Genetic Programming, with the addition of images in the input showing promising results for the most complex benchmarks.

5/1/2024

cs.CL cs.LG

A Comparison of Recent Algorithms for Symbolic Regression to Genetic Programming

Yousef A. Radwan, Gabriel Kronberger, Stephan Winkler

Symbolic regression is a machine learning method whose goal is to produce interpretable results. Unlike other machine learning methods such as random forests or neural networks, which are opaque, symbolic regression aims to model and map data in a way that can be understood by scientists. Recent advancements have attempted to bridge the gap between these two fields; new methodologies attempt to fuse the mapping power of neural networks and deep learning techniques with the explanatory power of symbolic regression. In this paper, we examine these new emerging systems and test the performance of an end-to-end transformer model for symbolic regression versus the reigning traditional methods based on genetic programming that have spearheaded symbolic regression throughout the years. We compare these systems on novel datasets to avoid bias toward older methods that were improved on well-known benchmark datasets. Our results show that traditional GP methods, as implemented e.g. by Operon, still remain superior to two recently published symbolic regression methods.

6/7/2024

cs.LG cs.AI

A Neural-Guided Dynamic Symbolic Network for Exploring Mathematical Expressions from Data

Wenqiang Li, Weijun Li, Lina Yu, Min Wu, Linjun Sun, Jingyi Liu, Yanjie Li, Shu Wei, Yusong Deng, Meilan Hao

Symbolic regression (SR) is a powerful technique for discovering the underlying mathematical expressions from observed data. Inspired by the success of deep learning, recent deep generative SR methods have shown promising results. However, these methods face difficulties in processing high-dimensional problems and learning constants due to the large search space, and they don't scale well to unseen problems. In this work, we propose DySymNet, a novel neural-guided Dynamic Symbolic Network for SR. Instead of searching for expressions within a large search space, we explore symbolic networks with various structures, guided by reinforcement learning, and optimize them to identify expressions that better fit the data. Based on extensive numerical experiments on low-dimensional public standard benchmarks and the well-known SRBench with more variables, DySymNet shows clear superiority over several representative baseline models. Open source code is available at https://github.com/AILWQ/DySymNet.

6/4/2024

cs.LG cs.AI