A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models

Read original: arXiv:2311.00445 - Published 4/12/2024 by Tiwalayo Eisape, MH Tessler, Ishita Dasgupta, Fei Sha, Sjoerd van Steenkiste, Tal Linzen

Overview

Examines whether language models trained on human-generated text replicate human biases in logical reasoning
Focuses on the case of syllogisms - inferences from two simple premises
Finds that, within the PaLM2 family, larger language models are more logical than smaller ones, and also more logical than humans
Shows that even the largest models still make systematic errors, some of which mirror human reasoning biases

Plain English Explanation

The paper explores how language models, which are AI systems trained on human-written text, can reproduce the flaws and biases of human logical reasoning. The researchers focus on a specific type of logical inference called syllogisms, where you draw a conclusion from two simple statements.

The key finding is that larger language models tend to be better at logical reasoning than smaller models, and even better than humans. However, even the most advanced models still make certain systematic errors that match common mistakes people make. For example, the models can be influenced by the order of the variables in the syllogism, even though that order shouldn't matter logically. They also sometimes reach confident but incorrect conclusions, similar to human syllogistic fallacies.

Overall, the results show that while language models can surpass human-level logical reasoning in many cases, they still have some of the same biases and flaws as the human-written text they were trained on. This highlights the importance of carefully evaluating the reasoning capabilities of these powerful language models, and finding ways to help them overcome human-like reasoning mistakes.

Technical Explanation

The paper investigates whether large language models, which are trained on vast amounts of human-generated text, are able to overcome the biases and mistakes that humans commonly make in logical reasoning. Focusing on the case of syllogisms - simple logical inferences from two premises - the researchers tested models of several sizes from the PaLM2 family.

Their experiments showed that as the language models increased in size, they became progressively more accurate in their syllogistic reasoning, eventually outperforming average human performance. This suggests that the larger models are better able to learn and apply the formal rules of logic, going beyond the flaws present in their training data.
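
To make "accurate" concrete: a syllogism is valid when its conclusion holds in every situation where both premises hold. The following is a minimal Python sketch of that check, not the paper's evaluation code. It brute-forces every way of populating the eight Venn regions of three terms. The names holds and is_valid, the tuple encoding of statements, and the existential_import switch (which assumes every term has at least one member, the traditional reading) are illustrative assumptions, not taken from the paper.

```python
from itertools import product

# Each of the 8 Venn regions is a triple of booleans: (in A, in B, in C).
REGIONS = list(product([False, True], repeat=3))
IDX = {"A": 0, "B": 1, "C": 2}


def holds(stmt, model):
    """Evaluate a statement (mood, X, Y) in a model.

    A model is the list of Venn regions that contain at least one individual.
    """
    mood, x, y = stmt
    xs = [r for r in model if r[IDX[x]]]  # populated regions inside X
    if mood == "all":        # All X are Y
        return all(r[IDX[y]] for r in xs)
    if mood == "no":         # No X are Y
        return not any(r[IDX[y]] for r in xs)
    if mood == "some":       # Some X are Y
        return any(r[IDX[y]] for r in xs)
    if mood == "some_not":   # Some X are not Y
        return any(not r[IDX[y]] for r in xs)
    raise ValueError(f"unknown mood: {mood}")


def is_valid(premises, conclusion, existential_import=True):
    """True if the conclusion holds in every model satisfying the premises."""
    for flags in product([False, True], repeat=len(REGIONS)):
        model = [r for r, populated in zip(REGIONS, flags) if populated]
        # Assumption: under existential import, skip models where a term is empty.
        if existential_import and not all(
            any(r[i] for r in model) for i in range(3)
        ):
            continue
        if all(holds(p, model) for p in premises) and not holds(conclusion, model):
            return False  # counterexample found
    return True
```

For example, is_valid([("all", "A", "B"), ("all", "B", "C")], ("all", "A", "C")) returns True, while the same call with the second premise changed to ("all", "C", "B") returns False, since the two premises then say nothing about how A and C relate.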

However, the researchers also found that even the largest models still exhibited certain systematic biases and errors in their reasoning. For example, the models were influenced by the superficial order of the variables in the syllogisms, despite this being logically irrelevant. The models also sometimes drew confident but incorrect conclusions, mirroring the syllogistic fallacies that humans are prone to.
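
One way to see why term order can be logically irrelevant is to check which statement types keep their meaning when their two terms are swapped. The sketch below is an illustration only: it does not reproduce the paper's stimuli or its ordering manipulation, and the function names are hypothetical.

```python
from itertools import product

# Four Venn regions for two terms: (in X, in Y).
REGIONS = list(product([False, True], repeat=2))


def truth(mood, first, second, model):
    """Truth of '<mood> first are second'; first/second are 0 (X) or 1 (Y)."""
    members = [r for r in model if r[first]]
    if mood == "all":
        return all(r[second] for r in members)
    if mood == "no":
        return not any(r[second] for r in members)
    if mood == "some":
        return any(r[second] for r in members)
    if mood == "some_not":
        return any(not r[second] for r in members)
    raise ValueError(f"unknown mood: {mood}")


def order_invariant(mood):
    """True if swapping the two terms never changes the statement's truth."""
    for flags in product([False, True], repeat=len(REGIONS)):
        model = [r for r, populated in zip(REGIONS, flags) if populated]
        if truth(mood, 0, 1, model) != truth(mood, 1, 0, model):
            return False
    return True


symmetric = {m: order_invariant(m) for m in ("all", "some", "no", "some_not")}
print(symmetric)
```

The result is True for "some" and "no" and False for "all" and "some_not". So "Some A are C" and "Some C are A" are true in exactly the same situations, and a reasoner that accepts one but rejects the other is responding to order rather than to logic.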

These findings suggest that while language models can surpass human-level logical reasoning in many cases, they are not yet fully reliable deductive reasoners. Further research is needed to understand the limitations of these models and find ways to help them reason more accurately and robustly across a range of logical reasoning tasks.

Critical Analysis

The paper provides a thoughtful and well-designed investigation into the logical reasoning capabilities of large language models. By focusing on the specific case of syllogistic reasoning, the researchers were able to conduct a nuanced analysis that revealed both the strengths and limitations of these models.

One notable strength of the work is the careful comparison between model performance and human performance. This helps put the language models' abilities into context and highlights the areas where they surpass human-level reasoning. The finding that larger models outperform smaller ones, and even outperform humans, is a significant result.

However, the paper also rightly acknowledges the remaining systematic biases and errors in the models' reasoning. The fact that they can be influenced by superficial features like variable ordering, and that they still exhibit syllogistic fallacies, suggests that there is still work to be done to help these models fully overcome human-like reasoning flaws.

One potential limitation of the study is the focus on a single type of logical reasoning (syllogisms). While this allows for a detailed and rigorous analysis, it's unclear how well the findings would generalize to other forms of deductive, inductive and abductive reasoning. Further research exploring a broader range of reasoning capabilities would help paint a more complete picture.

Overall, this paper makes a valuable contribution to our understanding of the logical reasoning abilities of large language models. By identifying both the strengths and limitations of these models, it highlights the importance of careful evaluation and ongoing development to ensure they can reason as accurately and robustly as possible.

Conclusion

This paper examines how large language models trained on human-generated text can both replicate and overcome the logical reasoning biases of their training data. By focusing on the specific case of syllogistic reasoning, the researchers found that larger models tend to outperform smaller models and even outperform humans in terms of logical accuracy.

However, the paper also revealed that even the most advanced language models still exhibit certain systematic errors and biases that mirror common human reasoning flaws, such as being influenced by superficial features and drawing confident but incorrect conclusions. This highlights the ongoing need to carefully evaluate the reasoning capabilities of these powerful language models, and find ways to help them overcome human-like reasoning mistakes as they continue to advance.

Overall, this research provides valuable insights into the logical reasoning abilities of large language models, and the importance of going beyond just accuracy to fully understand their capabilities and limitations.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models

Tiwalayo Eisape, MH Tessler, Ishita Dasgupta, Fei Sha, Sjoerd van Steenkiste, Tal Linzen

A central component of rational behavior is logical inference: the process of determining which conclusions follow from a set of premises. Psychologists have documented several ways in which humans' inferences deviate from the rules of logic. Do language models, which are trained on text generated by humans, replicate such human biases, or are they able to overcome them? Focusing on the case of syllogisms -- inferences from two simple premises -- we show that, within the PaLM2 family of transformer language models, larger models are more logical than smaller ones, and also more logical than humans. At the same time, even the largest models make systematic errors, some of which mirror human reasoning biases: they show sensitivity to the (irrelevant) ordering of the variables in the syllogism, and draw confident but incorrect inferences from particular syllogisms (syllogistic fallacies). Overall, we find that language models often mimic the human biases included in their training data, but are able to overcome them in some cases.

4/12/2024

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

Leonardo Bertolazzi, Albert Gatt, Raffaella Bernardi

The reasoning abilities of Large Language Models (LLMs) are becoming a central focus of study in NLP. In this paper, we consider the case of syllogistic reasoning, an area of deductive reasoning studied extensively in logic and cognitive psychology. Previous research has shown that pre-trained LLMs exhibit reasoning biases, such as content effects, avoid answering that no conclusion follows, display human-like difficulties, and struggle with multi-step reasoning. We contribute to this research line by systematically investigating the effects of chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) on syllogistic reasoning, considering syllogisms with conclusions that support or violate world knowledge, as well as ones with multiple premises. Crucially, we go beyond the standard focus on accuracy, with an in-depth analysis of the conclusions generated by the models. Our results suggest that the behavior of pre-trained LLMs can be explained by heuristics studied in cognitive science and that both ICL and SFT improve model performance on valid inferences, although only the latter mitigates most reasoning biases without harming model consistency.

6/18/2024

Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning

Philipp Mondorf, Barbara Plank

Deductive reasoning plays a pivotal role in the formulation of sound and cohesive arguments. It allows individuals to draw conclusions that logically follow, given the truth value of the information provided. Recent progress in the domain of large language models (LLMs) has showcased their capability in executing deductive reasoning tasks. Nonetheless, a significant portion of research primarily assesses the accuracy of LLMs in solving such tasks, often overlooking a deeper analysis of their reasoning behavior. In this study, we draw upon principles from cognitive psychology to examine inferential strategies employed by LLMs, through a detailed evaluation of their responses to propositional logic problems. Our findings indicate that LLMs display reasoning patterns akin to those observed in humans, including strategies like supposition following or chain construction. Moreover, our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning, with more advanced models tending to adopt strategies more frequently than less sophisticated ones. Importantly, we assert that a model's accuracy, that is the correctness of its final conclusion, does not necessarily reflect the validity of its reasoning process. This distinction underscores the necessity for more nuanced evaluation procedures in the field.

6/4/2024

Language models show human-like content effects on reasoning tasks

Ishita Dasgupta, Andrew K. Lampinen, Stephanie C. Y. Chan, Hannah R. Sheahan, Antonia Creswell, Dharshan Kumaran, James L. McClelland, Felix Hill

Reasoning is a key ability for an intelligent system. Large language models (LMs) achieve above-chance performance on abstract reasoning tasks, but exhibit many imperfections. However, human abstract reasoning is also imperfect. For example, human reasoning is affected by our real-world knowledge and beliefs, and shows notable content effects; humans reason more reliably when the semantic content of a problem supports the correct logical inferences. These content-entangled reasoning patterns play a central role in debates about the fundamental nature of human intelligence. Here, we investigate whether language models -- whose prior expectations capture some aspects of human knowledge -- similarly mix content into their answers to logical problems. We explored this question across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task. We evaluate state of the art large language models, as well as humans, and find that the language models reflect many of the same patterns observed in humans across these tasks -- like humans, models answer more accurately when the semantic content of a task supports the logical inferences. These parallels are reflected both in answer patterns, and in lower-level features like the relationship between model answer distributions and human response times. Our findings have implications for understanding both these cognitive effects in humans, and the factors that contribute to language model performance.

7/19/2024