Are LLMs classical or nonmonotonic reasoners? Lessons from generics

Read original: arXiv:2406.06590 - Published 6/13/2024 by Alina Leidinger, Robert van Rooij, Ekaterina Shutova

Overview

This paper asks whether large language models (LLMs) behave more like classical or nonmonotonic reasoners, using generics as the test case.
Generics are general statements about categories, like "birds fly," which can have exceptions. Understanding how LLMs handle generics can provide insights into their reasoning abilities.
The paper evaluates seven state-of-the-art LLMs on one abstract and one commonsense reasoning task featuring generics and their exceptions, and compares the models' behavior to classical and nonmonotonic reasoning.

Plain English Explanation

Large language models (LLMs) are artificial intelligence systems that can generate human-like text on a wide range of topics. Researchers are trying to understand how these models "think" and reason, to see if they are more similar to classical logic-based systems or more flexible, "nonmonotonic" reasoning approaches.

To do this, the researchers looked at how LLMs handle statements known as "generics." Generics are general claims about categories, like "birds fly." Such statements tolerate exceptions: not every bird can fly, yet the claim still counts as true. Classical logic handles this poorly because it is monotonic: adding new information never retracts an earlier conclusion, so a rule read as "all birds fly" is simply contradicted by a penguin. Nonmonotonic reasoning systems are designed to withdraw or revise a conclusion when an exception comes to light.

By testing LLMs on tasks involving generics, the researchers hoped to get a better sense of the models' underlying reasoning abilities. Do they struggle with exceptions and nuance like classical systems, or can they handle the flexibility required for generic statements? The results could shed light on the fundamental nature of how LLMs reason and make sense of the world.

Technical Explanation

The paper examines the reasoning capabilities of large language models (LLMs) by focusing on how they handle generic statements. Generics are general claims about categories, like "birds fly," that allow for exceptions.

Classical logic-based reasoning systems have difficulty capturing the flexibility of generics, as they are designed to work with strict, universal rules. In contrast, nonmonotonic reasoning approaches are better equipped to handle the nuanced, defeasible nature of generic statements.
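The contrast can be made concrete with a toy example. The sketch below is not taken from the paper; the function names and the hard-coded exception list are assumptions chosen for illustration. It reads "birds fly" once as a strict universal rule and once as a default that an exception can defeat, then adds the fact "penguin" to show that only the default reading retracts its conclusion.

```python
from typing import Set


def classical_flies(facts: Set[str]) -> bool:
    """Strict reading: 'birds fly' as a universal rule, so every bird flies."""
    return "bird" in facts


def default_flies(facts: Set[str]) -> bool:
    """Default reading: a bird flies unless a known exception applies."""
    exceptions = {"penguin"}  # hypothetical exception list, for illustration only
    return "bird" in facts and not (facts & exceptions)


before = {"bird"}              # all we know: the animal is a bird
after = before | {"penguin"}   # new information: it is also a penguin

# Monotonic: the conclusion survives the added fact.
print(classical_flies(before), classical_flies(after))  # True True

# Nonmonotonic: the added fact defeats the earlier conclusion.
print(default_flies(before), default_flies(after))      # True False
```

The strict rule keeps concluding that the penguin flies, which is the failure the paragraph above describes; the default rule gives up the conclusion once the exception is known.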

The researchers tested seven state-of-the-art LLMs on one abstract and one commonsense reasoning task featuring generics, such as "Birds fly," and exceptions, such as "Penguins don't fly." They also checked whether a model's judgment about a generic changed when supporting examples ("Owls fly") or unrelated information ("Lions have manes") were added. The results were compared to the expected behavior of classical and nonmonotonic reasoners.

The findings suggest that LLMs show reasoning patterns in line with human nonmonotonic reasoning, but that they fail to maintain stable beliefs about the truth conditions of generics when supporting examples or unrelated information are added. In other words, the models show some of the flexibility and context-sensitivity characteristic of nonmonotonic approaches, yet their judgments are not consistent.
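A stability check of this kind could be sketched as follows. This is an illustrative outline, not the authors' evaluation code: the prompt wording, the helper names `probe_generic` and `unstable_conditions`, and the stand-in `toy_model` are all assumptions. In practice `ask` would wrap a call to a real model and parse its answer into a yes/no verdict.

```python
from typing import Callable, Dict, List


def probe_generic(
    ask: Callable[[str], bool], generic: str, contexts: Dict[str, str]
) -> Dict[str, bool]:
    """Collect the model's verdict on `generic` alone and after each added statement."""
    question = f"Is the following statement true? '{generic}'"  # assumed prompt wording
    verdicts = {"baseline": ask(question)}
    for label, statement in contexts.items():
        verdicts[label] = ask(f"{statement}. {question}")
    return verdicts


def unstable_conditions(verdicts: Dict[str, bool]) -> List[str]:
    """List the conditions under which the verdict differs from the baseline."""
    base = verdicts["baseline"]
    return [k for k, v in verdicts.items() if k != "baseline" and v != base]


def toy_model(prompt: str) -> bool:
    """Hypothetical stand-in for an LLM: rejects the generic whenever lions are mentioned."""
    return "Lions" not in prompt


contexts = {
    "exception": "Penguins don't fly",
    "supporting": "Owls fly",
    "unrelated": "Lions have manes",
}
verdicts = probe_generic(toy_model, "Birds fly", contexts)
flips = unstable_conditions(verdicts)
print(flips)  # ['unrelated']
```

A verdict that flips under the supporting or unrelated condition is the kind of instability described above, since neither statement gives a reason to revise the generic.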

Critical Analysis

The paper provides valuable insights into the reasoning capabilities of large language models, but it also acknowledges several limitations and areas for further research.

One key limitation is the scope of the tasks and datasets used in the experiments. While the generics tasks provide a useful testbed, LLMs may face other reasoning and knowledge representation challenges that this study did not capture.

Additionally, the paper notes that the LLMs' performance may be sensitive to factors like training data, model architecture, and task formulation. More research is needed to understand how these variables influence the models' reasoning behaviors and whether the findings generalize to real-world applications.

The authors also highlight the need for more systematic evaluations of LLMs' logical reasoning abilities, going beyond just accuracy metrics to examine the underlying reasoning processes and behaviors.

Conclusion

This paper takes an important step in understanding the reasoning capabilities of large language models by examining how they handle generic statements. The results suggest that LLMs show reasoning patterns consistent with human nonmonotonic reasoning, yet their judgments about generics do not stay stable when supporting or unrelated information is added.

These findings have broader implications for the development and deployment of LLMs. According to the authors, they highlight pitfalls in attributing human reasoning behaviours to LLMs and in assessing general capabilities while consistent reasoning remains elusive. Continued research in this area, combined with more comprehensive evaluations of logical reasoning, will be important for understanding these systems and their potential applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Are LLMs classical or nonmonotonic reasoners? Lessons from generics

Alina Leidinger, Robert van Rooij, Ekaterina Shutova

Recent scholarship on reasoning in LLMs has supplied evidence of impressive performance and flexible adaptation to machine generated or human feedback. Nonmonotonic reasoning, crucial to human cognition for navigating the real world, remains a challenging, yet understudied task. In this work, we study nonmonotonic reasoning capabilities of seven state-of-the-art LLMs in one abstract and one commonsense reasoning task featuring generics, such as 'Birds fly', and exceptions, 'Penguins don't fly' (see Fig. 1). While LLMs exhibit reasoning patterns in accordance with human nonmonotonic reasoning abilities, they fail to maintain stable beliefs on truth conditions of generics at the addition of supporting examples ('Owls fly') or unrelated information ('Lions have manes'). Our findings highlight pitfalls in attributing human reasoning behaviours to LLMs, as well as assessing general capabilities, while consistent reasoning remains elusive.

6/13/2024

Can LLMs Reason in the Wild with Programs?

Yuan Yang, Siheng Xiong, Ali Payani, Ehsan Shareghi, Faramarz Fekri

Large Language Models (LLMs) have shown superior capability to solve reasoning problems with programs. While being a promising direction, most of such frameworks are trained and evaluated in settings with a prior knowledge of task requirements. However, as LLMs become more capable, it is necessary to assess their reasoning abilities in more realistic scenarios where many real-world problems are open-ended with ambiguous scope, and often require multiple formalisms to solve. To investigate this, we introduce the task of reasoning in the wild, where an LLM is tasked to solve a reasoning problem of unknown type by identifying the subproblems and their corresponding formalisms, and writing a program to solve each subproblem, guided by a tactic. We create a large tactic-guided trajectory dataset containing detailed solutions to a diverse set of reasoning problems, ranging from well-defined single-form reasoning (e.g., math, logic), to ambiguous and hybrid ones (e.g., commonsense, combined math and logic). This allows us to test various aspects of LLMs reasoning at the fine-grained level such as the selection and execution of tactics, and the tendency to take undesired shortcuts. In experiments, we highlight that existing LLMs fail significantly on problems with ambiguous and mixed scope, revealing critical limitations and overfitting issues (e.g. accuracy on GSM8K drops by at least 50%). We further show the potential of finetuning a local LLM on the tactic-guided trajectories in achieving better performance. Project repo is available at github.com/gblackout/Reason-in-the-Wild

6/21/2024

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, Chitta Baral

Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really reason over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logical reasoning' has remained underexplored. Existing work investigating this reasoning ability of LLMs has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. Addressing the above limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning over propositional, first-order, and non-monotonic logics. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset focusing on the use of a single inference rule. We conduct detailed analysis with a range of LLMs such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes overlook contextual information necessary for reasoning to arrive at the correct conclusion. We believe that our work and findings facilitate future research for evaluating and enhancing the logical reasoning ability of LLMs. Data and code are available at https://github.com/Mihir3009/LogicBench.

6/7/2024

Caught in the Quicksand of Reasoning, Far from AGI Summit: Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Pengfei Hong, Navonil Majumder, Deepanway Ghosal, Somak Aditya, Rada Mihalcea, Soujanya Poria

Recent advancements in Large Language Models (LLMs) have showcased striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth of their competencies and robustness in reasoning tasks remains an open question. To this end, in this paper, we focus on two popular reasoning tasks: arithmetic reasoning and code generation. Particularly, we introduce: (i) a general ontology of perturbations for maths and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets, MORE and CORE, respectively, of perturbed maths and coding problems to probe the limits of LLM capabilities in numeric reasoning and coding tasks. Through comprehensive evaluations of both closed-source and open-source LLMs, we show a significant performance drop across all the models against the perturbed questions, suggesting that the current LLMs lack robust problem solving skills and structured reasoning abilities in many areas, as defined by our ontology. We open source the datasets and source codes at: https://github.com/declare-lab/llm_robustness.

6/28/2024