A Symbolic Framework for Evaluating Mathematical Reasoning and Generalisation with Transformers

arXiv:2305.12563

Published 4/9/2024 by Jordan Meadows, Marco Valentino, Damien Teney, Andre Freitas

Abstract

This paper proposes a methodology for generating and perturbing detailed derivations of equations at scale, aided by a symbolic engine, to evaluate the generalisability of Transformers to out-of-distribution mathematical reasoning problems. Instantiating the framework in the context of sequence classification tasks, we compare the capabilities of GPT-4, GPT-3.5, and a canon of fine-tuned BERT models, exploring the relationship between specific operators and generalisation failure via the perturbation of reasoning aspects such as symmetry and variable surface forms. Surprisingly, our empirical evaluation reveals that the average in-distribution performance of fine-tuned models surpasses GPT-3.5, and rivals GPT-4. However, perturbations to input reasoning can reduce their performance by up to 80 F1 points. Overall, the results suggest that the in-distribution performance of smaller open-source models may potentially rival GPT by incorporating appropriately structured derivation dependencies during training, and highlight a shared weakness between BERT and GPT involving a relative inability to decode indirect references to mathematical entities. We release the full codebase, constructed datasets, and fine-tuned models to encourage future progress in the field.

Overview

The paper proposes a methodology for generating and perturbing detailed mathematical derivations at scale to evaluate the generalizability of Transformer language models like GPT-4, GPT-3.5, and fine-tuned BERT models to out-of-distribution mathematical reasoning problems.
The authors compare the performance of these models on sequence classification tasks, exploring how specific mathematical operators and reasoning aspects like symmetry and variable surface forms impact their generalization.
The results suggest that fine-tuned models can rival the in-distribution performance of GPT, but are still vulnerable to large performance drops when the input reasoning is perturbed.
The authors release the codebase, datasets, and models to encourage further research in this area.

Plain English Explanation

The researchers wanted to see how well large language models like GPT-4 and GPT-3.5 can handle mathematical reasoning problems, especially when the problems are different from the ones they were trained on. To do this, they developed a way to automatically generate and tweak detailed mathematical equations and derivations.

They then used these perturbed derivations to test the models on sequence classification tasks - for example, determining whether an equation is correct or not. Surprisingly, they found that smaller, open-source models fine-tuned on appropriately structured training data outperformed GPT-3.5 on average and rivaled GPT-4 on in-distribution problems. However, when the input reasoning was perturbed, the fine-tuned models' performance dropped by up to 80 F1 points.

This suggests that while these smaller models can be trained to perform well on standard math problems, they still struggle to generalize their mathematical reasoning abilities, much like the larger GPT models. The researchers believe this reflects a shared weakness in how these language models handle indirect references to mathematical concepts.

By releasing all their code, datasets, and trained models, the researchers hope to spur further progress in developing language models that can truly excel at flexible, generalizable mathematical reasoning.

Technical Explanation

The paper proposes a framework for systematically generating and perturbing detailed mathematical derivations to assess the generalization capabilities of Transformer-based language models like GPT-4, GPT-3.5, and fine-tuned BERT models.

The authors instantiate this framework in the context of sequence classification tasks, where models must determine whether a given mathematical expression or derivation is correct. They compare the performance of the language models, exploring how perturbations to the input reasoning - such as changes to symmetry or variable surface forms - impact their generalization.
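The two perturbation types named above can be illustrated with a minimal Python sketch. It is not the authors' released code, which relies on a symbolic engine; the helper names `rename_variables` and `swap_equality`, the string-based representation, and the toy derivation are all assumptions made for illustration.

```python
import re
from typing import List, Mapping


def rename_variables(step: str, mapping: Mapping[str, str]) -> str:
    """Variable surface-form perturbation (hypothetical helper).

    Replaces whole-word variable names in a single pass, so a mapping
    such as {"x": "y", "y": "x"} swaps the names instead of chaining.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], step)


def swap_equality(step: str) -> str:
    """Symmetry perturbation (hypothetical helper): 'a = b' becomes 'b = a'."""
    lhs, rhs = step.split("=", 1)
    return f"{rhs.strip()} = {lhs.strip()}"


# Toy two-step derivation, invented for this example.
derivation: List[str] = ["y = x + 2", "y**2 = (x + 2)**2"]

renamed = [rename_variables(s, {"x": "alpha", "y": "beta"}) for s in derivation]
swapped = [swap_equality(s) for s in derivation]

print(renamed)
print(swapped)
```

Both perturbations leave the mathematics unchanged, so a model that has learned the underlying reasoning should classify the perturbed derivation the same way as the original.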

Surprisingly, the results show that the average in-distribution performance of fine-tuned BERT models surpasses that of GPT-3.5 and rivals that of GPT-4. However, the fine-tuned models are still highly vulnerable to perturbations, with their performance dropping by up to 80 F1 points. The results also highlight a shared weakness between BERT and GPT: a relative inability to decode indirect references to mathematical entities.
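To make the unit concrete: F1 is the harmonic mean of precision and recall, and a drop in "F1 points" is the difference between two F1 scores on a 0-100 scale. The sketch below computes such a drop on made-up labels and predictions; the data and the function name `f1_score` are assumptions for illustration, and the numbers are not results from the paper.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1 for the given positive label, returned in the range [0, 1]."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented gold labels and predictions (1 = correct derivation, 0 = incorrect).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds_in_distribution = [1, 1, 1, 1, 0, 0, 0, 1]
preds_perturbed = [1, 0, 0, 0, 0, 1, 1, 1]

f1_in = f1_score(labels, preds_in_distribution)
f1_pert = f1_score(labels, preds_perturbed)
drop_points = (f1_in - f1_pert) * 100  # difference on a 0-100 scale

print(f"in-distribution F1: {100 * f1_in:.1f}")
print(f"perturbed F1:       {100 * f1_pert:.1f}")
print(f"drop in F1 points:  {drop_points:.1f}")
```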

The authors release the full codebase, datasets, and fine-tuned models to encourage further research into enhancing the mathematical reasoning capabilities of large language models and improving their generalization to out-of-distribution problems.

Critical Analysis

The paper provides a compelling and rigorous framework for evaluating the generalization capabilities of Transformer-based language models in the domain of mathematical reasoning. By systematically perturbing the input derivations, the authors are able to uncover key weaknesses in the models' ability to handle indirect references and changes to the underlying reasoning.

However, the paper does not delve deeply into the potential reasons for these generalization failures. It would be helpful to have a more thorough analysis of the specific modeling and architectural choices that may be contributing to the observed limitations. Additionally, the authors could explore potential approaches for improving the models' robustness, such as using more diverse training data or incorporating stronger inductive biases related to mathematical reasoning.

Furthermore, the paper focuses solely on sequence classification tasks, which may not fully capture the depth and nuance of mathematical problem-solving. It would be valuable to extend the evaluation to more open-ended mathematical tasks, such as proof generation or equation solving, to better understand the models' broader capabilities and limitations.

Despite these minor limitations, the paper makes a significant contribution to the field by providing a rigorous framework for evaluating the mathematical reasoning capabilities of large language models and highlighting crucial areas for future research and development.

Conclusion

This paper presents a methodology for systematically generating and perturbing mathematical derivations to assess the generalization capabilities of Transformer-based language models. The results suggest that while fine-tuned models can rival the in-distribution performance of GPT, their performance can fall by up to 80 F1 points when the underlying reasoning is perturbed, and that BERT and GPT share a weakness in decoding indirect references to mathematical entities.

By releasing the codebase, datasets, and trained models, the authors have provided a valuable resource for the research community to further explore the mathematical reasoning capabilities of large language models and develop more robust and generalizable approaches. This work represents an important step towards enhancing the ability of AI systems to engage in flexible, contextual mathematical problem-solving, which has far-reaching implications for fields ranging from scientific discovery to education and beyond.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Generative Pre-Trained Transformer for Symbolic Regression Base In-Context Reinforcement Learning

Yanjie Li, Weijun Li, Lina Yu, Min Wu, Jingyi Liu, Wenqiang Li, Meilan Hao, Shu Wei, Yusong Deng

The mathematical formula is the human language for describing nature and is the essence of scientific research. Finding mathematical formulas from observational data is a major demand of scientific research and a major challenge for artificial intelligence. This area is called symbolic regression (SR). Originally, symbolic regression was often formulated as a combinatorial optimization problem and solved using genetic programming (GP) or reinforcement learning algorithms. These two kinds of algorithms have strong noise robustness and good versatility. However, inference usually takes a long time, so the search efficiency is relatively low. Later, methods based on large-scale pre-training were proposed; these use a large number of synthetic data points and expression pairs to train a Generative Pre-Trained Transformer (GPT). This GPT needs only one forward pass to obtain the result, so inference is very fast. However, its performance is very dependent on the training data, and it performs poorly on data outside the training set, which leads to poor noise robustness and versatility. So, can we combine the advantages of the above two categories of SR algorithms? In this paper, we propose FormulaGPT, which trains a GPT using massive sparse reward learning histories of reinforcement learning-based SR algorithms as training data. After training, the SR algorithm based on reinforcement learning is distilled into a Transformer. When new test data comes, FormulaGPT can directly generate a reinforcement learning process and automatically update the learning policy in context. Tested on more than ten datasets including SRBench, FormulaGPT achieves state-of-the-art performance in fitting ability compared with four baselines. In addition, it achieves satisfactory results in noise robustness, versatility, and inference efficiency.

4/10/2024

cs.LG cs.AI

Transformers in the Service of Description Logic-based Contexts

Angelos Poulis, Eleni Tsalapati, Manolis Koubarakis

Recent advancements in transformer-based models have initiated research interests in investigating their ability to learn to perform reasoning tasks. However, most of the contexts used for this purpose are in practice very simple: generated from short (fragments of) first-order logic sentences with only a few logical operators and quantifiers. In this work, we construct the natural language dataset, DELTA$_D$, using the description logic language $\mathcal{ALCQ}$. DELTA$_D$ contains 384K examples, and increases in two dimensions: i) reasoning depth, and ii) linguistic complexity. In this way, we systematically investigate the reasoning ability of a supervised fine-tuned DeBERTa-based model and of two large language models (GPT-3.5, GPT-4) with few-shot prompting. Our results demonstrate that the DeBERTa-based model can master the reasoning task and that the performance of GPTs can improve significantly even when a small number of samples is provided (9 shots). We open-source our code and datasets.

4/29/2024

cs.CL cs.AI

Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

Boshi Wang, Xiang Yue, Yu Su, Huan Sun

We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalization also vary across reasoning types: when faced with out-of-distribution examples, transformers fail to systematically generalize for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanism behind grokking, such as the formation of the generalizing circuit and its relation to the relative efficiency of generalizing and memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.

5/28/2024

cs.CL

When can transformers reason with abstract symbols?

Enric Boix-Adsera, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, Joshua Susskind

We investigate the capabilities of transformer models on relational reasoning tasks. In these tasks, models are trained on a set of strings encoding abstract relations, and are then tested out-of-distribution on data that contains symbols that did not appear in the training dataset. We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relations and generalize to the test set when trained by gradient descent on sufficiently large quantities of training data. This is in contrast to classical fully-connected networks, which we prove fail to learn to reason. Our results inspire modifications of the transformer architecture that add only two trainable parameters per head, and that we empirically demonstrate improve data efficiency for learning to reason.

4/17/2024

cs.CL cs.AI cs.LG