Arithmetic with Language Models: from Memorization to Computation

Read original: arXiv:2308.01154 - Published 8/6/2024 by Davide Maltoni, Matteo Ferrara

Overview

This research investigates how a language model trained to predict the next token can perform arithmetic computations beyond its training data.
Binary addition and multiplication are used as a testbed, as they require a small vocabulary and exhibit relevant input/output discontinuities that make smooth interpolation ineffective for novel data.
The researchers successfully trained a lightweight language model to learn these tasks and ran experiments to examine its extrapolation capabilities and internal information processing.
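
To make the setup concrete, here is a minimal sketch of how binary arithmetic could be turned into next-token training text. The token set, the `a+b=c` prompt layout, the helper names, and the random train/test split are all assumptions made for illustration; the summary does not specify the paper's actual format.

```python
import itertools
import random

# Assumed token set for illustration; the paper's vocabulary may differ.
VOCAB = ["0", "1", "+", "*", "="]

def to_bits(value: int, width: int) -> str:
    """Render a non-negative integer as a fixed-width binary string."""
    return format(value, f"0{width}b")

def make_example(a: int, b: int, op: str, in_bits: int) -> str:
    """Build one training string such as '011+001=0100' (hypothetical layout)."""
    # A sum of two n-bit numbers fits in n+1 bits, a product in 2n bits.
    out_bits = in_bits + 1 if op == "+" else 2 * in_bits
    result = a + b if op == "+" else a * b
    return f"{to_bits(a, in_bits)}{op}{to_bits(b, in_bits)}={to_bits(result, out_bits)}"

def build_split(in_bits: int, op: str, train_fraction: float, seed: int = 0):
    """Split all operand pairs into disjoint train and held-out sets."""
    pairs = list(itertools.product(range(2 ** in_bits), repeat=2))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    train = [make_example(a, b, op, in_bits) for a, b in pairs[:cut]]
    test = [make_example(a, b, op, in_bits) for a, b in pairs[cut:]]
    return train, test

print(make_example(3, 1, "+", 3))  # 011+001=0100
print(make_example(3, 3, "*", 2))  # 11*11=1001
```

Holding out whole operand pairs, as `build_split` does, is one simple way to ask whether a model produces correct outputs for inputs it has never seen.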

Plain English Explanation

The researchers wanted to understand how language models, which are trained only to predict the next token in a sequence, can carry out arithmetic on inputs that never appeared in their training data. They chose binary addition and multiplication as a test case for two reasons. First, the tasks need only a very small vocabulary of tokens. Second, the mapping from inputs to outputs is discontinuous: a small change in the input can produce a large change in the output, so a model cannot handle new inputs simply by smoothly interpolating between training examples it has already seen.
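
The discontinuity can be seen in a single carry chain. The sketch below is an illustration written for this summary, not code from the paper, and the helper names are hypothetical. It compares two inputs that differ in one bit and counts how many output bits differ.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    return sum(cx != cy for cx, cy in zip(x, y))

def add_bits(a: str, b: str) -> str:
    """Add two binary strings, returning a result one bit wider than the inputs."""
    width = max(len(a), len(b)) + 1
    return format(int(a, 2) + int(b, 2), f"0{width}b")

a1, a2, b = "0110", "0111", "0001"
out1, out2 = add_bits(a1, b), add_bits(a2, b)

print(out1, out2)           # 00111 01000
print(hamming(a1, a2))      # 1  (inputs differ in one bit)
print(hamming(out1, out2))  # 4  (outputs differ in four bits)
```

Flipping the lowest bit of the first operand triggers a carry that rewrites four of the five output bits, so neighboring inputs do not have neighboring outputs.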

The researchers trained a lightweight language model to learn these tasks. They then ran experiments to examine how well the model extrapolates beyond its training data and how it processes information internally while computing.

The key finding is that the language model appears to work as an Encoding-Regression-Decoding machine: the input tokens are first mapped to an appropriate internal representation, and the computation then takes place in the "value space". This suggests the model is not simply memorizing or interpolating training examples but is learning to carry out the computation itself.

Technical Explanation

The researchers trained a lightweight language model on binary addition and multiplication. These tasks need a very small vocabulary yet exhibit input/output discontinuities that make smooth interpolation over the inputs ineffective for novel data. A series of experiments then probed the model's extrapolation capabilities and its internal information processing.

The findings support the hypothesis that the language model works as an Encoding-Regression-Decoding machine, in which the computation takes place in the value space once the input token representation has been mapped to an appropriate internal representation. Under this reading, the model computes results rather than recalling or interpolating memorized ones.
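
The three stages named in the hypothesis can be mimicked with a hand-written pipeline. This is only an analogy under stated assumptions: the function names `encode`, `regress`, `decode`, and `erd` are hypothetical, the prompt layout is invented for illustration, and a trained network would learn these mappings rather than have them coded explicitly.

```python
def encode(bits: str) -> float:
    """Encoding: map a token sequence of '0'/'1' to a point in value space."""
    return float(sum(int(tok) * 2 ** i for i, tok in enumerate(reversed(bits))))

def regress(a_val: float, b_val: float, op: str) -> float:
    """Regression: compute on values, not on tokens."""
    return a_val + b_val if op == "+" else a_val * b_val

def decode(value: float, width: int) -> str:
    """Decoding: map the resulting value back to output tokens."""
    return format(int(round(value)), f"0{width}b")

def erd(prompt: str, width: int) -> str:
    """Run a prompt such as '011+001=' through the three stages."""
    op = "+" if "+" in prompt else "*"
    left, right = prompt.rstrip("=").split(op)
    return decode(regress(encode(left), encode(right), op), width)

print(erd("011+001=", 4))  # 0100
print(erd("11*11=", 4))    # 1001
```

The point of the analogy is that the discontinuity lives in the token encoding: once operands are values, addition and multiplication are smooth functions, and only the final decoding step reintroduces the bit pattern.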

The researchers report that the model performs these computations for novel operand combinations that were not present in the training data. This extrapolation capability is what separates computation from memorization, and the authors frame a better understanding of it as important for improving language models and broadening their applicability.
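
The difference between memorizing and computing shows up only on held-out inputs. The toy comparison below is a sketch written for this summary, not the paper's evaluation protocol; the split sizes, the exact-match metric, and both predictors are assumptions chosen for illustration.

```python
import itertools
import random

def exact_match_accuracy(predict, examples):
    """Fraction of examples where the prediction equals the target exactly."""
    hits = sum(predict(a, b) == target for (a, b), target in examples)
    return hits / len(examples)

def split_pairs(n_values=8, n_train=48, seed=0):
    """Label every operand pair with its sum, then split into train and held-out."""
    pairs = list(itertools.product(range(n_values), repeat=2))
    random.Random(seed).shuffle(pairs)
    labelled = [((a, b), a + b) for a, b in pairs]
    return labelled[:n_train], labelled[n_train:]

train, held_out = split_pairs()
lookup = dict(train)

def memorizer(a, b):
    """Pure lookup: returns None for any pair it has not stored."""
    return lookup.get((a, b))

def computer(a, b):
    """Stand-in for a model that has learned the operation itself."""
    return a + b

print(exact_match_accuracy(memorizer, train))     # 1.0
print(exact_match_accuracy(memorizer, held_out))  # 0.0
print(exact_match_accuracy(computer, held_out))   # 1.0
```

Both predictors are perfect on the training pairs, so training accuracy alone cannot tell them apart; only the held-out pairs do.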

Critical Analysis

The research provides valuable insights into the inner workings of language models and their ability to perform arithmetic computations beyond their initial training. However, the experiments were limited to binary addition and multiplication, which are relatively simple tasks compared to the more complex mathematical reasoning that would be required for real-world problem-solving.

It would be interesting to see how the model's capabilities scale to more advanced numerical operations, as well as to tasks that require a deeper understanding of mathematical concepts and reasoning. Additionally, the researchers did not explore the model's performance on tasks that involve a mix of natural language and numerical computation, which would be an important area for further investigation.

Overall, this research is an important step in understanding the emergent computational capabilities of large language models, but there is still much work to be done to fully harness their potential for mathematical problem-solving and reasoning.

Conclusion

This research shows that a language model trained to predict the next token can also perform basic arithmetic, specifically binary addition and multiplication, by mapping input tokens to an internal representation in which the computation is carried out. The result suggests that language models could be applied to computational tasks beyond natural language processing, and that understanding this mechanism may help improve them and broaden their applicability.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Arithmetic with Language Models: from Memorization to Computation

Davide Maltoni, Matteo Ferrara

A better understanding of the emergent computation and problem-solving capabilities of recent large language models is of paramount importance to further improve them and broaden their applicability. This work investigates how a language model, trained to predict the next token, can perform arithmetic computations generalizing beyond training data. Binary addition and multiplication constitute a good testbed for this purpose, since they require a very small vocabulary and exhibit relevant input/output discontinuities making smooth input interpolation ineffective for novel data. We successfully trained a light language model to learn these tasks and ran a number of experiments to investigate the extrapolation capabilities and internal information processing. Our findings support the hypothesis that the language model works as an Encoding-Regression-Decoding machine where the computation takes place in the value space once the input token representation is mapped to an appropriate internal representation.

8/6/2024

Language Models Implement Simple Word2Vec-style Vector Arithmetic

Jack Merullo, Carsten Eickhoff, Ellie Pavlick

A primary criticism towards language models (LMs) is their inscrutability. This paper presents evidence that, despite their size and complexity, LMs sometimes exploit a simple vector arithmetic style mechanism to solve some relational tasks using regularities encoded in the hidden space of the model (e.g., Poland:Warsaw::China:Beijing). We investigate a range of language model sizes (from 124M parameters to 176B parameters) in an in-context learning setting, and find that for a variety of tasks (involving capital cities, uppercasing, and past-tensing) a key part of the mechanism reduces to a simple additive update typically applied by the feedforward (FFN) networks. We further show that this mechanism is specific to tasks that require retrieval from pretraining memory, rather than retrieval from local context. Our results contribute to a growing body of work on the interpretability of LMs, and offer reason to be optimistic that, despite the massive and non-linear nature of the models, the strategies they ultimately use to solve tasks can sometimes reduce to familiar and even intuitive algorithms.

4/4/2024

Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks

Andrew Gambardella, Yusuke Iwasawa, Yutaka Matsuo

The ability (and inability) of large language models (LLMs) to perform arithmetic tasks has been the subject of much theoretical and practical debate. We show that LLMs are frequently able to correctly and confidently predict the first digit of n-digit by m-digit multiplication tasks without using chain of thought reasoning, despite these tasks requiring compounding operations to solve. Simultaneously, LLMs in practice often fail to correctly or confidently predict the last digit of an n-digit by m-digit multiplication, a task equivalent to 1-digit by 1-digit multiplication which can be easily learned or memorized. We show that the latter task can be solved more robustly when the LLM is conditioned on all of the correct higher-order digits, which on average increases the confidence of the correct last digit on 5-digit by 5-digit multiplication tasks using Llama 2-13B by over 230% (0.13 to 0.43) and Mistral-7B by 150% (0.22 to 0.55).

6/5/2024

Interpreting and Improving Large Language Models in Arithmetic Calculation

Wei Zhang, Chaoqun Wan, Yonggang Zhang, Yiu-ming Cheung, Xinmei Tian, Xu Shen, Jieping Ye

Large language models (LLMs) have demonstrated remarkable potential across numerous applications and have shown an emergent ability to tackle complex reasoning tasks, such as mathematical computations. However, even for the simplest arithmetic calculations, the intrinsic mechanisms behind LLMs remain mysterious, making it challenging to ensure reliability. In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations. Through comprehensive experiments, we find that LLMs frequently involve a small fraction (< 5%) of attention heads, which play a pivotal role in focusing on operands and operators during calculation processes. Subsequently, the information from these operands is processed through multi-layer perceptrons (MLPs), progressively leading to the final solution. These pivotal heads/MLPs, though identified on a specific dataset, exhibit transferability across different datasets and even distinct tasks. This insight prompted us to investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs' computational performance. We empirically find that such precise tuning can yield notable enhancements on mathematical prowess, without compromising the performance on non-mathematical tasks. Our work serves as a preliminary exploration into the arithmetic calculation abilities inherent in LLMs, laying a solid foundation to reveal more intricate mathematical tasks.

9/4/2024