MMSR: Symbolic Regression is a Multi-Modal Information Fusion Task

Read original: arXiv:2402.18603 - Published 9/20/2024 by Yanjie Li, Jingyi Liu, Weijun Li, Lina Yu, Min Wu, Wenqiang Li, Meilan Hao, Su Wei, Yusong Deng

Overview

Symbolic regression is a machine learning task that aims to discover mathematical expressions from data.
The paper argues that symbolic regression is a multi-modal task: the observed data and the expression skeleton are treated as two modalities, much like image and text, rather than as two languages to be translated between.
The authors propose a method called MMSR, which uses contrastive learning to align the two modalities before fusing their features.

Plain English Explanation

The paper explores the idea that symbolic regression, the process of discovering mathematical formulas from data, is a multi-modal task. Earlier neural approaches treated the step from data to formula as translation, as if a table of numbers and a formula were two languages. The authors point out that data and expression skeletons lack the clear word-for-word correspondences that two languages have, and behave more like two different modalities, such as image and text.

To tackle this, the researchers developed MMSR. It uses contrastive learning to pull the features of matching data and expressions together, then fuses the two sets of features to produce the formula.
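As a concrete picture of the two objects involved in symbolic regression, the sketch below holds a small set of sampled points (the data) next to an expression skeleton (the formula with its constants left as placeholders). The prefix-notation token format, the `C` placeholder, and the helper names `evaluate` and `mse` are assumptions made for illustration and are not taken from the paper.

```python
import math

# Hypothetical operator vocabulary for this illustration.
BINARY = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
UNARY = {"sin": math.sin, "cos": math.cos}


def evaluate(tokens, x, constants):
    """Evaluate a prefix-notation skeleton at x, filling each 'C' from constants in order."""
    consts = iter(constants)
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok in BINARY:
            left = parse()
            right = parse()
            return BINARY[tok](left, right)
        if tok in UNARY:
            return UNARY[tok](parse())
        if tok == "C":
            return next(consts)
        if tok == "x":
            return x
        raise ValueError(f"unknown token: {tok}")

    value = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return value


def mse(tokens, constants, points):
    """Mean squared error of the filled-in skeleton on (x, y) points."""
    return sum((evaluate(tokens, x, constants) - y) ** 2 for x, y in points) / len(points)


# Data: points sampled from y = 2*x + sin(x).
data = [(k / 2, 2 * (k / 2) + math.sin(k / 2)) for k in range(-4, 5)]
# Expression skeleton for C*x + sin(x), written in prefix order.
skeleton = ["add", "mul", "C", "x", "sin", "x"]

print(mse(skeleton, [2.0], data))  # error with the matching constant
print(mse(skeleton, [1.0], data))  # error with a wrong constant
```

The list of points and the list of tokens share no element-by-element correspondence, which is the intuition behind treating them as two modalities rather than two languages.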

Technical Explanation

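The paper argues that symbolic regression should be solved as a pure multi-modal problem. Symbolic regression was originally formulated as a combinatorial optimization problem and solved with Genetic Programming (GP) or reinforcement learning; GP is sensitive to hyperparameters, and both families of algorithms are inefficient. Later work recast the mapping from data to expressions as a translation problem and introduced large-scale pre-trained models. The authors' objection is that data and expression skeletons do not have clear word correspondences as two languages do.

To address this, the authors propose MMSR. Contrastive learning is introduced during training to align the data and expression modalities, which facilitates the later fusion of their features. The contrastive loss and the other training losses are optimized at the same time in a single training step, rather than training the contrastive loss first and the remaining losses afterwards. The authors report that training them together lets the feature extraction module and the feature fusion module adapt to each other better.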
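The paper's abstract describes training a contrastive alignment loss together with the other losses in one step. A minimal sketch of that idea follows, assuming a symmetric InfoNCE-style contrastive loss and a single token cross-entropy term standing in for the "other losses". The function names, the temperature, and the loss weight are hypothetical; the abstract does not specify them, and the authors' actual implementation may differ.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(data_emb, expr_emb, temperature=0.1):
    """Symmetric InfoNCE: the i-th data embedding should match the i-th expression embedding."""
    d = [normalize(v) for v in data_emb]
    e = [normalize(v) for v in expr_emb]
    n = len(d)
    # Scaled cosine similarity between every data/expression pair in the batch.
    sim = [[dot(d[i], e[j]) / temperature for j in range(n)] for i in range(n)]
    data_to_expr = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    expr_to_data = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (data_to_expr + expr_to_data)


def joint_loss(data_emb, expr_emb, token_logits, token_targets, weight=1.0):
    """One-step objective: token prediction loss plus a weighted contrastive loss."""
    generation = sum(
        cross_entropy(logits, t) for logits, t in zip(token_logits, token_targets)
    ) / len(token_targets)
    return generation + weight * contrastive_loss(data_emb, expr_emb)


# Toy batch of two (data, expression) pairs with 2-dimensional embeddings.
data_emb = [[1.0, 0.0], [0.0, 1.0]]
expr_emb = [[0.9, 0.1], [0.2, 0.8]]
print(joint_loss(data_emb, expr_emb, token_logits=[[2.0, 0.0]], token_targets=[0]))
```

Because both terms are summed into one scalar, a single backward pass would update the feature extractors and the fusion module together, which is the contrast with a two-stage schedule that aligns first and trains the rest afterwards.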

Critical Analysis

The paper makes a reasonable point: numerical data and symbolic expressions are different kinds of objects, and treating them as two modalities to be aligned is a natural alternative to the translation framing.

According to the abstract, MMSR is compared with multiple large-scale pre-trained baselines and achieves state-of-the-art results on several mainstream datasets, including SRBench. The abstract does not break these results down, so the size of the gains, and how much of them comes from contrastive alignment rather than the rest of the architecture, should be checked against the experiments in the full paper.

The abstract also does not describe limitations of the multi-modal approach, such as sensitivity to how the contrastive loss is weighted against the other losses, or the trade-off between goodness-of-fit and expression complexity.

Conclusion

This paper presents a novel perspective on symbolic regression, arguing that it should be viewed as a multi-modal task in which data and expressions are separate modalities. The authors propose MMSR, which aligns the two with contrastive learning and trains all of its losses jointly in one step.

The reported results on SRBench and other mainstream datasets suggest the framing is useful in practice, though the full paper should be consulted for the details behind them. This work opens up new avenues for research in this important machine learning domain, and the authors' code is open source at https://github.com/1716757342/MMSR

Related Papers

MMSR: Symbolic Regression is a Multi-Modal Information Fusion Task

Yanjie Li, Jingyi Liu, Weijun Li, Lina Yu, Min Wu, Wenqiang Li, Meilan Hao, Su Wei, Yusong Deng

Mathematical formulas are the crystallization of human wisdom in exploring the laws of nature for thousands of years. Describing the complex laws of nature with a concise mathematical formula is a constant pursuit of scientists and a great challenge for artificial intelligence. This field is called symbolic regression (SR). Symbolic regression was originally formulated as a combinatorial optimization problem, and Genetic Programming (GP) and Reinforcement Learning algorithms were used to solve it. However, GP is sensitive to hyperparameters, and these two types of algorithms are inefficient. To solve this problem, researchers treat the mapping from data to expressions as a translation problem, and corresponding large-scale pre-trained models have been introduced. However, the data and expression skeletons do not have very clear word correspondences as two languages do. Instead, they are more like two modalities (e.g., image and text). Therefore, in this paper, we propose MMSR. The SR problem is solved as a pure multi-modal problem, and contrastive learning is also introduced in the training process for modal alignment to facilitate later modal feature fusion. It is worth noting that, to better promote the modal feature fusion, we adopt the strategy of training the contrastive learning loss and the other losses at the same time, which needs only one-step training, instead of training the contrastive learning loss first and then training the other losses. This is because our experiments show that training them together lets the feature extraction module and the feature fusion module adapt to each other better. Experimental results show that, compared with multiple large-scale pre-training baselines, MMSR achieves the most advanced results on multiple mainstream datasets including SRBench. Our code is open source at https://github.com/1716757342/MMSR

9/20/2024

MLLM-SR: Conversational Symbolic Regression base Multi-Modal Large Language Models

Yanjie Li, Weijun Li, Lina Yu, Min Wu, Jingyi Liu, Wenqiang Li, Shu Wei, Yusong Deng

Formulas are the language of communication between humans and nature. It is an important research topic of artificial intelligence to find expressions from observed data to reflect the relationship between each variable in the data, which is called a symbolic regression problem. The existing symbolic regression methods directly generate expressions according to the given observation data, and we cannot require the algorithm to generate expressions that meet specific requirements according to the known prior knowledge. For example, the expression needs to contain $\sin$ or be symmetric, and so on. Even if it can, it often requires very complex operations, which is very inconvenient. In this paper, based on multi-modal large language models, we propose MLLM-SR, a conversational symbolic regression method that can generate expressions that meet the requirements simply by describing the requirements with natural language instructions. By experimenting on the Nguyen dataset, we can demonstrate that MLLM-SR leads the state-of-the-art baselines in fitting performance. More notably, we experimentally demonstrate that MLLM-SR can well understand the prior knowledge we add to the natural language instructions. Moreover, the addition of prior knowledge can effectively guide MLLM-SR to generate correct expressions.

6/11/2024

Multi-View Symbolic Regression

Etienne Russeil, Fabrício Olivetti de França, Konstantin Malanchev, Bogdan Burlacu, Emille E. O. Ishida, Marion Leroux, Clément Michelin, Guillaume Moinard, Emmanuel Gangler

Symbolic regression (SR) searches for analytical expressions representing the relationship between a set of explanatory and response variables. Current SR methods assume a single dataset extracted from a single experiment. Nevertheless, frequently, the researcher is confronted with multiple sets of results obtained from experiments conducted with different setups. Traditional SR methods may fail to find the underlying expression since the parameters of each experiment can be different. In this work we present Multi-View Symbolic Regression (MvSR), which takes into account multiple datasets simultaneously, mimicking experimental environments, and outputs a general parametric solution. This approach fits the evaluated expression to each independent dataset and returns a parametric family of functions $f(x; \theta)$ simultaneously capable of accurately fitting all datasets. We demonstrate the effectiveness of MvSR using data generated from known expressions, as well as real-world data from astronomy, chemistry and economy, for which an a priori analytical expression is not available. Results show that MvSR obtains the correct expression more frequently and is robust to hyperparameters change. In real-world data, it is able to grasp the group behavior, recovering known expressions from the literature as well as promising alternatives, thus enabling the use of SR to a large range of experimental scenarios.

7/22/2024

In-Context Symbolic Regression: Leveraging Language Models for Function Discovery

Matteo Merler, Katsiaryna Haitsiukevich, Nicola Dainese, Pekka Marttinen

State of the art Symbolic Regression (SR) methods currently build specialized models, while the application of Large Language Models (LLMs) remains largely unexplored. In this work, we introduce the first comprehensive framework that utilizes LLMs for the task of SR. We propose In-Context Symbolic Regression (ICSR), an SR method which iteratively refines a functional form with an LLM and determines its coefficients with an external optimizer. ICSR leverages LLMs' strong mathematical prior both to propose an initial set of possible functions given the observations and to refine them based on their errors. Our findings reveal that LLMs are able to successfully find symbolic equations that fit the given data, matching or outperforming the overall performance of the best SR baselines on four popular benchmarks, while yielding simpler equations with better out of distribution generalization.

7/18/2024