A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

arXiv:2406.11341

Published 6/18/2024 by Leonardo Bertolazzi, Albert Gatt, Raffaella Bernardi

Abstract

The reasoning abilities of Large Language Models (LLMs) are becoming a central focus of study in NLP. In this paper, we consider the case of syllogistic reasoning, an area of deductive reasoning studied extensively in logic and cognitive psychology. Previous research has shown that pre-trained LLMs exhibit reasoning biases, such as content effects, avoid answering that "no conclusion follows", display human-like difficulties, and struggle with multi-step reasoning. We contribute to this research line by systematically investigating the effects of chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) on syllogistic reasoning, considering syllogisms with conclusions that support or violate world knowledge, as well as ones with multiple premises. Crucially, we go beyond the standard focus on accuracy, with an in-depth analysis of the conclusions generated by the models. Our results suggest that the behavior of pre-trained LLMs can be explained by heuristics studied in cognitive science and that both ICL and SFT improve model performance on valid inferences, although only the latter mitigates most reasoning biases without harming model consistency.

Overview

This paper presents a systematic analysis of how large language models (LLMs) perform on syllogistic reasoning tasks, which involve drawing logical inferences from given premises.
The researchers study how chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) affect these models as "soft reasoners," and they analyze the conclusions the models generate rather than accuracy alone.
The study builds on prior work on evaluating the deductive competence of LLMs and comparing the inferential strategies of humans and LLMs.
Related work includes LogicBench, a benchmark by Parmar et al. for systematically evaluating the logical reasoning abilities of LLMs; it is not a contribution of this paper.

Plain English Explanation

The paper examines how well large language models, such as those used in chatbots and virtual assistants, can perform logical reasoning tasks. Specifically, the researchers looked at how these models handle syllogisms, which are logical arguments with two premises that lead to a conclusion.
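To make the idea of a valid syllogism concrete, here is a minimal sketch of a brute-force validity checker. It is an illustration only, not the paper's evaluation code: the statement encoding, the function names `holds` and `is_valid`, and the `existential_import` flag are all assumptions introduced here. The checker enumerates every possible "world" over three terms and reports a conclusion as valid only if no world makes the premises true and the conclusion false.

```python
from itertools import product

TERMS = ("A", "B", "C")
# Each kind of individual is identified by the set of terms it belongs to (8 kinds).
KINDS = [frozenset(t for t, bit in zip(TERMS, bits) if bit)
         for bits in product((0, 1), repeat=len(TERMS))]

def holds(statement, model):
    """Check one quantified statement, e.g. ("all", "A", "B"), in a model.

    A model is the list of kinds of individuals that exist in that world.
    """
    quantifier, x, y = statement
    xs = [kind for kind in model if x in kind]
    if quantifier == "all":
        return all(y in kind for kind in xs)
    if quantifier == "no":
        return not any(y in kind for kind in xs)
    if quantifier == "some":
        return any(y in kind for kind in xs)
    if quantifier == "some_not":
        return any(y not in kind for kind in xs)
    raise ValueError(f"unknown quantifier: {quantifier}")

def is_valid(premises, conclusion, existential_import=True):
    """True if the conclusion holds in every model that satisfies the premises.

    With existential_import=True (an assumption matching traditional
    syllogistic logic), worlds in which some term is empty are skipped.
    """
    for mask in product((0, 1), repeat=len(KINDS)):
        model = [kind for kind, bit in zip(KINDS, mask) if bit]
        if existential_import and not all(
            any(term in kind for kind in model) for term in TERMS
        ):
            continue
        if all(holds(p, model) for p in premises) and not holds(conclusion, model):
            return False  # found a counterexample world
    return True

# "All A are B; all B are C; therefore all A are C" is valid.
print(is_valid([("all", "A", "B"), ("all", "B", "C")], ("all", "A", "C")))
# "All A are B; all C are B; therefore all A are C" is not.
print(is_valid([("all", "A", "B"), ("all", "C", "B")], ("all", "A", "C")))
```

Because a syllogism involves only three terms, 256 candidate worlds are enough to decide validity exhaustively, which keeps the sketch short at the cost of not scaling to richer logics.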

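The researchers tested how three techniques change the models' reasoning: chain-of-thought reasoning, in-context learning (showing worked examples in the prompt), and supervised fine-tuning (further training on the task). They looked beyond accuracy at the conclusions the models actually produced, which allowed them to identify the strengths, weaknesses, and biases of the models when it comes to making logical inferences. The study builds on previous work that has looked at how well language models can handle deductive reasoning and how their reasoning strategies compare to those of humans.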

A related benchmark, LogicBench (Parmar et al.), is listed under Related Papers. It was introduced in separate work, not in this paper, and provides a way to systematically evaluate the logical reasoning capabilities of LLMs across many inference patterns.

Technical Explanation

The paper presents a systematic analysis of how large language models (LLMs) perform on syllogistic reasoning tasks, which involve drawing logical conclusions from given premises. It examines the effects of chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT), treating the models as "soft reasoners."

The study builds on prior work on evaluating the deductive competence of LLMs and comparing the inferential strategies of humans and LLMs. LogicBench, a benchmark for systematically evaluating the logical reasoning abilities of LLMs, comes from separate work by Parmar et al. and is not introduced in this paper.

The researchers conducted a series of experiments to assess the syllogistic reasoning of pre-trained LLMs under chain-of-thought reasoning, ICL, and SFT. The test items include syllogisms whose conclusions support or violate world knowledge, as well as inferences with multiple premises. Beyond accuracy, the authors analyzed the conclusions the models generated, looking for reasoning biases such as content effects (often called belief bias) and a reluctance to answer that no conclusion follows.
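One common way to quantify a content effect (belief bias) is to compare accuracy on items where believability agrees with logical validity against items where the two conflict. The sketch below illustrates that idea under stated assumptions: the record fields `valid`, `believable`, and `predicted_valid`, the function names, and the toy data are hypothetical, and this is not claimed to be the metric used in the paper.

```python
from collections import defaultdict

def accuracy_by_condition(records):
    """Accuracy for each (gold validity, believability) cell."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in records:
        cell = (r["valid"], r["believable"])
        totals[cell] += 1
        correct[cell] += int(r["predicted_valid"] == r["valid"])
    return {cell: correct[cell] / totals[cell] for cell in totals}

def content_effect(records):
    """Mean accuracy on congruent cells minus mean accuracy on incongruent cells.

    Congruent: valid and believable, or invalid and unbelievable.
    A score near 0 means answers do not depend on believability;
    a large positive score means answers follow world knowledge.
    """
    acc = accuracy_by_condition(records)
    congruent = [a for (valid, believable), a in acc.items() if valid == believable]
    incongruent = [a for (valid, believable), a in acc.items() if valid != believable]
    if not congruent or not incongruent:
        raise ValueError("need both congruent and incongruent items")
    return sum(congruent) / len(congruent) - sum(incongruent) / len(incongruent)

# Hypothetical toy data: a "model" that accepts exactly the believable conclusions.
belief_driven = [
    {"valid": v, "believable": b, "predicted_valid": b}
    for v in (True, False) for b in (True, False)
]
# Hypothetical toy data: a "model" that always answers according to validity.
logical = [
    {"valid": v, "believable": b, "predicted_valid": v}
    for v in (True, False) for b in (True, False)
]
print(content_effect(belief_driven))  # 1.0
print(content_effect(logical))        # 0.0
```

Splitting accuracy by condition, rather than reporting a single number, is what lets an evaluation separate a model that reasons poorly from one that reasons well only when the conclusion matches world knowledge.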

The results suggest that the behavior of pre-trained LLMs can be explained by heuristics studied in cognitive science. Both ICL and SFT improved model performance on valid inferences, but only SFT mitigated most reasoning biases without harming model consistency.

The researchers also found that the models' performance depended strongly on the specific prompting and task framing, which suggests that their reasoning abilities are sensitive to the context and the way the problem is presented.

Critical Analysis

The paper provides a valuable contribution to the ongoing research on the reasoning abilities of large language models. The authors' systematic approach to evaluating the models on syllogistic reasoning tasks, together with their analysis of the generated conclusions in light of heuristics from cognitive science, offers important insights into the strengths and limitations of these models as "soft reasoners."

One potential limitation of the study is that it focuses solely on syllogistic reasoning, which may not fully capture the breadth of logical reasoning abilities required in real-world scenarios. Further research on the models' performance in a wider range of logical reasoning tasks would be valuable to provide a more comprehensive understanding of their capabilities.

Additionally, the study highlights the significant impact that prompting and task framing can have on the models' performance. This suggests that more work is needed to understand the underlying mechanisms and biases that shape the reasoning capabilities of LLMs, and to develop strategies for improving their logical reasoning abilities.

Conclusion

This paper presents a systematic analysis of how large language models (LLMs) perform on syllogistic reasoning tasks and of how chain-of-thought reasoning, in-context learning, and supervised fine-tuning change that performance.

According to the abstract, pre-trained LLMs behave in ways that heuristics from cognitive science can explain. In-context learning and supervised fine-tuning both help on valid inferences, yet only supervised fine-tuning reduces most reasoning biases while keeping the models consistent. The summary above also points to sensitivity to prompting and task framing, highlighting the need for further research on the mechanisms and biases that shape the reasoning capabilities of these models.

Overall, this paper contributes to our understanding of the strengths and limitations of LLMs as logical reasoners, and provides a valuable framework for the continued development and evaluation of AI systems' reasoning abilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models

Tiwalayo Eisape, MH Tessler, Ishita Dasgupta, Fei Sha, Sjoerd van Steenkiste, Tal Linzen

A central component of rational behavior is logical inference: the process of determining which conclusions follow from a set of premises. Psychologists have documented several ways in which humans' inferences deviate from the rules of logic. Do language models, which are trained on text generated by humans, replicate such human biases, or are they able to overcome them? Focusing on the case of syllogisms -- inferences from two simple premises -- we show that, within the PaLM2 family of transformer language models, larger models are more logical than smaller ones, and also more logical than humans. At the same time, even the largest models make systematic errors, some of which mirror human reasoning biases: they show sensitivity to the (irrelevant) ordering of the variables in the syllogism, and draw confident but incorrect inferences from particular syllogisms (syllogistic fallacies). Overall, we find that language models often mimic the human biases included in their training data, but are able to overcome them in some cases.

4/12/2024

cs.CL cs.AI cs.LG

Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning

Philipp Mondorf, Barbara Plank

Deductive reasoning plays a pivotal role in the formulation of sound and cohesive arguments. It allows individuals to draw conclusions that logically follow, given the truth value of the information provided. Recent progress in the domain of large language models (LLMs) has showcased their capability in executing deductive reasoning tasks. Nonetheless, a significant portion of research primarily assesses the accuracy of LLMs in solving such tasks, often overlooking a deeper analysis of their reasoning behavior. In this study, we draw upon principles from cognitive psychology to examine inferential strategies employed by LLMs, through a detailed evaluation of their responses to propositional logic problems. Our findings indicate that LLMs display reasoning patterns akin to those observed in humans, including strategies like "supposition following" or "chain construction". Moreover, our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning, with more advanced models tending to adopt strategies more frequently than less sophisticated ones. Importantly, we assert that a model's accuracy, that is the correctness of its final conclusion, does not necessarily reflect the validity of its reasoning process. This distinction underscores the necessity for more nuanced evaluation procedures in the field.

6/4/2024

cs.CL cs.AI

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, Chitta Baral

Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really reason over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logical reasoning' has remained underexplored. Existing work investigating this reasoning ability of LLMs has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. Addressing the above limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning over propositional, first-order, and non-monotonic logics. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset focusing on the use of a single inference rule. We conduct detailed analysis with a range of LLMs such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes overlook contextual information necessary for reasoning to arrive at the correct conclusion. We believe that our work and findings facilitate future research for evaluating and enhancing the logical reasoning ability of LLMs. Data and code are available at https://github.com/Mihir3009/LogicBench.

6/7/2024

cs.CL cs.AI

Evaluating the Deductive Competence of Large Language Models

Spencer M. Seals, Valerie L. Shalin

The development of highly fluent large language models (LLMs) has prompted increased interest in assessing their reasoning and problem-solving capabilities. We investigate whether several LLMs can solve a classic type of deductive reasoning problem from the cognitive science literature. The tested LLMs have limited abilities to solve these problems in their conventional form. We performed follow up experiments to investigate if changes to the presentation format and content improve model performance. We do find performance differences between conditions; however, they do not improve overall performance. Moreover, we find that performance interacts with presentation format and content in unexpected ways that differ from human performance. Overall, our results suggest that LLMs have unique reasoning biases that are only partially predicted from human reasoning performance and the human-generated language corpora that informs them.

4/16/2024

cs.CL