Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?

2401.18070

Published 6/18/2024 by Andreas Opedal, Alessandro Stolfo, Haruki Shirakami, Ying Jiao, Ryan Cotterell, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan

cs.CL cs.AI cs.LG

Abstract

There is increasing interest in employing large language models (LLMs) as cognitive models. For such purposes, it is central to understand which properties of human cognition are well-modeled by LLMs, and which are not. In this work, we study the biases of LLMs in relation to those known in children when solving arithmetic word problems. Surveying the learning science literature, we posit that the problem-solving process can be split into three distinct steps: text comprehension, solution planning and solution execution. We construct tests for each one in order to understand whether current LLMs display the same cognitive biases as children in these steps. We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features. We find evidence that LLMs, with and without instruction-tuning, exhibit human-like biases in both the text-comprehension and the solution-planning steps of the solving process, but not in the final step, in which the arithmetic expressions are executed to obtain the answer.

Overview

This research examines how well large language models (LLMs) can mimic the cognitive biases observed in human children when solving arithmetic word problems.
The researchers split the problem-solving process into three steps: text comprehension, solution planning, and solution execution.
They designed tests to evaluate LLMs' performance in each step and compared the results to the known biases of human children.

Plain English Explanation

The researchers were interested in understanding how well large language models can model human cognition. Specifically, they wanted to see if LLMs would exhibit the same biases as children when solving math word problems.

Math word problems can be tricky because they involve multiple steps: first, you need to understand the text and what the problem is asking. Then, you need to plan how to solve it. Finally, you have to actually do the calculations to get the answer.

The researchers looked at each of these steps separately. They created special test problems to see how well LLMs performed at text comprehension, solution planning, and solution execution.

They found that LLMs showed biases similar to those of children in the first two steps: understanding the text and planning the solution. However, the LLMs did not display the same biases in the final step of actually calculating the answer.

This suggests that LLMs can mimic some aspects of human cognition, such as how we interpret language and plan a solution, but that they do not reproduce the biases children show when carrying out the arithmetic itself. There's still more to learn about the similarities and differences between how machines and humans think.

Technical Explanation

The researchers were interested in investigating which aspects of human cognition are well-modeled by large language models (LLMs). To do this, they focused on how LLMs perform on arithmetic word problems, and compared their biases to those observed in children.

They posited that the problem-solving process can be broken down into three distinct steps: text comprehension, solution planning, and solution execution. The researchers constructed separate tests to evaluate LLM performance in each of these steps.
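As a rough illustration of this three-step decomposition, the sketch below walks one toy problem through comprehension, planning, and execution. The `Transfer` schema, the function names, and the keyword-based parsing are all assumptions made for this example; they are not the tests used in the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class Transfer:
    """Structured reading of a one-step transfer problem (hypothetical schema)."""
    start: int    # quantity the agent begins with
    change: int   # quantity gained or lost
    gained: bool  # True if the agent receives items, False if they give items away


def comprehend(text: str) -> Transfer:
    """Step 1, text comprehension: map the wording to a structured representation."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    # Toy heuristic (assumption): "gives her/him" means the agent receives items.
    gained = "gives her" in text or "gives him" in text
    return Transfer(start=numbers[0], change=numbers[1], gained=gained)


def plan(facts: Transfer) -> str:
    """Step 2, solution planning: choose the arithmetic expression to evaluate."""
    op = "+" if facts.gained else "-"
    return f"{facts.start} {op} {facts.change}"


def execute(expression: str) -> int:
    """Step 3, solution execution: carry out the arithmetic."""
    left, op, right = expression.split()
    return int(left) + int(right) if op == "+" else int(left) - int(right)


problem = "Alice has 5 apples. Bob gives her 3 apples. How many apples does Alice have now?"
facts = comprehend(problem)
expression = plan(facts)
answer = execute(expression)
print(expression, "=", answer)  # prints "5 + 3 = 8"
```

Keeping the steps as separate functions mirrors the idea that an error, or a bias, can be attributed to one step at a time.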

Using a neuro-symbolic approach, they generated a novel set of word problems that allowed for fine-grained control over the problem features. This enabled them to systematically explore the biases exhibited by LLMs, both with and without instruction-tuning, and compare them to the known biases of children.
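To make "fine-grained control over the problem features" more concrete, here is a minimal sketch of the symbolic half of such a generator: a structured specification is rendered to text, and one surface feature is flipped while the answer stays fixed. The `ProblemSpec` fields, the `consistent` flag, and the fixed templates are assumptions chosen for illustration; this summary does not describe the paper's actual pipeline, which presumably also involves a learned component.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProblemSpec:
    """Symbolic description of a one-step comparison problem (hypothetical schema)."""
    agent_a: str
    agent_b: str
    entity: str
    quantity_a: int
    difference: int
    consistent: bool  # True: the relational word ("more") matches the operation (+)


def render(spec: ProblemSpec) -> str:
    """Turn the symbolic spec into text using fixed templates (assumption)."""
    first = f"{spec.agent_a} has {spec.quantity_a} {spec.entity}."
    if spec.consistent:
        second = f"{spec.agent_b} has {spec.difference} more {spec.entity} than {spec.agent_a}."
    else:
        second = f"{spec.agent_a} has {spec.difference} fewer {spec.entity} than {spec.agent_b}."
    question = f"How many {spec.entity} does {spec.agent_b} have?"
    return " ".join([first, second, question])


def solve(spec: ProblemSpec) -> int:
    """The ground-truth answer depends only on the symbolic content, not the wording."""
    return spec.quantity_a + spec.difference


base = ProblemSpec("Alice", "Bob", "apples", 5, 3, consistent=True)
flipped = replace(base, consistent=False)  # change one surface feature only

for spec in (base, flipped):
    print(render(spec), "->", solve(spec))
```

Because both variants share one specification, any difference in a model's accuracy between them can be attributed to the wording rather than to the underlying arithmetic.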

The results showed that LLMs display human-like biases in the text comprehension and solution planning stages of problem-solving. However, they did not exhibit the same biases in the final step of executing the arithmetic calculations to obtain the answer.
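One simple way to quantify such a bias, sketched here as an assumption rather than as the paper's statistical analysis, is to compare accuracy on matched problem pairs that differ only in the manipulated feature. The outcome lists below are made-up placeholders used only to exercise the function; they are not results from the study.

```python
def accuracy(outcomes: list[bool]) -> float:
    """Fraction of problems answered correctly."""
    return sum(outcomes) / len(outcomes)


def bias_effect(baseline: list[bool], manipulated: list[bool]) -> float:
    """Accuracy drop from the baseline wording to the manipulated wording.

    A positive value means the manipulated variant is harder for the solver,
    which is the direction a human-like bias would predict.
    """
    if len(baseline) != len(manipulated):
        raise ValueError("conditions must contain matched problem pairs")
    return accuracy(baseline) - accuracy(manipulated)


# Placeholder outcomes (NOT data from the paper): one entry per matched pair.
toy_baseline = [True, True, True, False]
toy_manipulated = [True, False, False, False]

effect = bias_effect(toy_baseline, toy_manipulated)
print(effect)  # prints 0.5
```

An effect near zero in the execution step, alongside clear effects in the comprehension and planning steps, would match the pattern the summary describes.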

These findings suggest that LLMs capture some aspects of human cognitive processes, namely the biases that arise while reading a problem and planning its solution, but diverge from children in the execution step. This limits how well they model the full problem-solving process. Further research is needed to understand the extent to which LLMs can accurately emulate deductive reasoning and the nuances of human cognition.

Critical Analysis

The researchers present a well-designed study that explores the cognitive biases of LLMs in relation to human problem-solving. By breaking the process down into distinct steps, they were able to gain insights into the strengths and limitations of current language models.

One notable limitation is that the study focused only on arithmetic word problems, which may not fully capture the breadth of human cognitive abilities. It would be valuable to expand the analysis to other types of reasoning tasks to see if the findings generalize.

Additionally, the researchers acknowledge that the neuro-symbolic approach used to generate the test problems may have introduced biases of its own. Further validation using alternative problem generation methods could help strengthen the conclusions.

It is also worth considering whether the observed differences in the final solution execution step are truly indicative of a fundamental gap in LLM capabilities, or if they could be overcome with additional training or architectural changes. Continued research in this area could shed light on the specific challenges that LLMs face in emulating human-level mathematical reasoning.

Overall, this study provides valuable insights into the strengths and limitations of current language models in modeling human cognition. The findings encourage us to think critically about the capabilities and biases of AI systems, and to continue exploring how they can be better aligned with human intelligence.

Conclusion

This research examines the cognitive biases exhibited by large language models when solving arithmetic word problems. The key takeaway is that LLMs show child-like biases in text comprehension and solution planning but not in solution execution, so they do not fully capture the problem-solving process observed in children.

These findings suggest that there is still much to be learned about the nuances of human cognition and how AI systems can be designed to more closely approximate our problem-solving abilities. Continued research in this area could lead to important advancements in the development of language models that better reflect our own cognitive capacities.

As we strive to create AI systems that can seamlessly interact with and assist humans, understanding the similarities and differences between machine and human reasoning will be crucial. This study represents an important step in that direction, and encourages us to think critically about the strengths and limitations of current language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Do Large Language Models Mirror Cognitive Language Processing?

Yuqi Ren, Renren Jin, Tongxuan Zhang, Deyi Xiong

Large Language Models (LLMs) have demonstrated remarkable abilities in text comprehension and logical reasoning, indicating that the text representations learned by LLMs can facilitate their language processing capabilities. In cognitive science, brain cognitive processing signals are typically utilized to study human language processing. Therefore, it is natural to ask how well the text embeddings from LLMs align with the brain cognitive processing signals, and how training strategies affect the LLM-brain alignment? In this paper, we employ Representational Similarity Analysis (RSA) to measure the alignment between 23 mainstream LLMs and fMRI signals of the brain to evaluate how effectively LLMs simulate cognitive language processing. We empirically investigate the impact of various factors (e.g., pre-training data size, model scaling, alignment training, and prompts) on such LLM-brain alignment. Experimental results indicate that pre-training data size and model scaling are positively correlated with LLM-brain similarity, and alignment training can significantly improve LLM-brain similarity. Explicit prompts contribute to the consistency of LLMs with brain cognitive language processing, while nonsensical noisy prompts may attenuate such alignment. Additionally, the performance of a wide range of LLM evaluations (e.g., MMLU, Chatbot Arena) is highly correlated with the LLM-brain similarity.

5/29/2024

cs.AI cs.CL

Language Models Trained to do Arithmetic Predict Human Risky and Intertemporal Choice

Jian-Qiao Zhu, Haijiang Yan, Thomas L. Griffiths

The observed similarities in the behavior of humans and Large Language Models (LLMs) have prompted researchers to consider the potential of using LLMs as models of human cognition. However, several significant challenges must be addressed before LLMs can be legitimately regarded as cognitive models. For instance, LLMs are trained on far more data than humans typically encounter, and may have been directly trained on human data in specific cognitive tasks or aligned with human preferences. Consequently, the origins of these behavioral similarities are not well understood. In this paper, we propose a novel way to enhance the utility of LLMs as cognitive models. This approach involves (i) leveraging computationally equivalent tasks that both an LLM and a rational agent need to master for solving a cognitive problem and (ii) examining the specific task distributions required for an LLM to exhibit human-like behaviors. We apply this approach to decision-making -- specifically risky and intertemporal choice -- where the key computationally equivalent task is the arithmetic of expected value calculations. We show that an LLM pretrained on an ecologically valid arithmetic dataset, which we call Arithmetic-GPT, predicts human behavior better than many traditional cognitive models. Pretraining LLMs on ecologically valid arithmetic datasets is sufficient to produce a strong correspondence between these models and human decision-making. Our results also suggest that LLMs used as cognitive models should be carefully investigated via ablation studies of the pretraining data.

5/30/2024

cs.AI cs.CL

Large Language Models for Mathematical Reasoning: Progresses and Challenges

Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, Wenpeng Yin

Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive capabilities of human intelligence. In recent times, there has been a notable surge in the development of Large Language Models (LLMs) geared towards the automated resolution of mathematical problems. However, the landscape of mathematical problem types is vast and varied, with LLM-oriented techniques undergoing evaluation across diverse datasets and settings. This diversity makes it challenging to discern the true advancements and obstacles within this burgeoning field. This survey endeavors to address four pivotal dimensions: i) a comprehensive exploration of the various mathematical problems and their corresponding datasets that have been investigated; ii) an examination of the spectrum of LLM-oriented techniques that have been proposed for mathematical problem-solving; iii) an overview of factors and concerns affecting LLMs in solving math; and iv) an elucidation of the persisting challenges within this domain. To the best of our knowledge, this survey stands as one of the first extensive examinations of the landscape of LLMs in the realm of mathematics, providing a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.

4/8/2024

cs.CL

Evaluating the Deductive Competence of Large Language Models

Spencer M. Seals, Valerie L. Shalin

The development of highly fluent large language models (LLMs) has prompted increased interest in assessing their reasoning and problem-solving capabilities. We investigate whether several LLMs can solve a classic type of deductive reasoning problem from the cognitive science literature. The tested LLMs have limited abilities to solve these problems in their conventional form. We performed follow up experiments to investigate if changes to the presentation format and content improve model performance. We do find performance differences between conditions; however, they do not improve overall performance. Moreover, we find that performance interacts with presentation format and content in unexpected ways that differ from human performance. Overall, our results suggest that LLMs have unique reasoning biases that are only partially predicted from human reasoning performance and the human-generated language corpora that informs them.

4/16/2024

cs.CL