Transformers in the Service of Description Logic-based Contexts

2311.08941

Published 4/29/2024 by Angelos Poulis, Eleni Tsalapati, Manolis Koubarakis

Abstract

Recent advancements in transformer-based models have initiated research interests in investigating their ability to learn to perform reasoning tasks. However, most of the contexts used for this purpose are in practice very simple: generated from short (fragments of) first-order logic sentences with only a few logical operators and quantifiers. In this work, we construct the natural language dataset, DELTA$_D$, using the description logic language $\mathcal{ALCQ}$. DELTA$_D$ contains 384K examples, and increases in two dimensions: i) reasoning depth, and ii) linguistic complexity. In this way, we systematically investigate the reasoning ability of a supervised fine-tuned DeBERTa-based model and of two large language models (GPT-3.5, GPT-4) with few-shot prompting. Our results demonstrate that the DeBERTa-based model can master the reasoning task and that the performance of GPTs can improve significantly even when a small number of samples is provided (9 shots). We open-source our code and datasets.

Overview

This paper explores the ability of transformer-based models, such as DeBERTa and GPT-3.5/4, to perform logical reasoning over natural language contexts.
The authors construct a dataset called DELTA$_D$ that systematically increases the depth and linguistic complexity of the reasoning required.
They evaluate a supervised fine-tuned DeBERTa-based model, as well as two large language models (GPT-3.5, GPT-4) with few-shot prompting, on this dataset.

Plain English Explanation

The paper investigates how well transformer-based language models, like DeBERTa and GPT-3.5/4, can learn to perform complex reasoning tasks. To do this, the researchers created a new dataset called DELTA$_D$ that contains logical reasoning problems that get progressively more difficult.

The dataset is built using a formal logic language called description logic, which allows the researchers to systematically control the depth and complexity of the reasoning required. This is an important step, as previous work has often used very simple logical reasoning problems that may not reflect the true capabilities of these models.

The researchers then evaluate two kinds of models on DELTA$_D$: a DeBERTa-based model that is fine-tuned on the dataset, and the large language models GPT-3.5 and GPT-4, which are only shown a small number of examples in the prompt (9 "shots"). The results show that the DeBERTa-based model can master the reasoning task, and that the performance of the GPT models can improve significantly with just a few examples.

Technical Explanation

The authors construct a new dataset, DELTA$_D$, to systematically evaluate the reasoning abilities of transformer-based language models. DELTA$_D$ is generated using the description logic language $\mathcal{ALCQ}$, which allows for fine-grained control over the depth and linguistic complexity of the reasoning problems.
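
To make the role of $\mathcal{ALCQ}$ more concrete, here is a minimal Python sketch of how a few of its concept constructors (conjunction, existential restriction, qualified at-least restriction) could be represented and turned into an English sentence. This is not the authors' generator: the class names, the sentence templates in `verb_phrase`, and `constructor_count` (a rough stand-in for linguistic complexity, not the paper's definition) are all illustrative assumptions, and negation is restricted to atomic concepts to keep the templates simple.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atomic:
    name: str  # atomic concept, e.g. "smart"


@dataclass(frozen=True)
class NotAtomic:
    name: str  # negated atomic concept (assumption: negation only on atoms)


@dataclass(frozen=True)
class And:
    left: "Concept"
    right: "Concept"


@dataclass(frozen=True)
class Exists:
    role: str  # role name used as a verb, e.g. "admires"
    filler: "Concept"


@dataclass(frozen=True)
class AtLeast:
    n: int  # qualified number restriction: at least n role-successors in filler
    role: str
    filler: "Concept"


Concept = Union[Atomic, NotAtomic, And, Exists, AtLeast]


def verb_phrase(c: Concept) -> str:
    """Render a concept as an English verb phrase (hypothetical templates)."""
    if isinstance(c, Atomic):
        return f"is {c.name}"
    if isinstance(c, NotAtomic):
        return f"is not {c.name}"
    if isinstance(c, And):
        return f"{verb_phrase(c.left)} and {verb_phrase(c.right)}"
    if isinstance(c, Exists):
        return f"{c.role} someone who {verb_phrase(c.filler)}"
    if isinstance(c, AtLeast):
        return (f"{c.role} at least {c.n} individuals, "
                f"each of whom {verb_phrase(c.filler)}")
    raise TypeError(f"unsupported concept: {c!r}")


def sentence(individual: str, c: Concept) -> str:
    """Verbalize the assertion that `individual` belongs to concept `c`."""
    return f"{individual} {verb_phrase(c)}."


def constructor_count(c: Concept) -> int:
    """Count constructors as a crude proxy for linguistic complexity."""
    if isinstance(c, Atomic):
        return 0
    if isinstance(c, NotAtomic):
        return 1
    if isinstance(c, And):
        return 1 + constructor_count(c.left) + constructor_count(c.right)
    return 1 + constructor_count(c.filler)  # Exists or AtLeast


concept = And(Atomic("smart"), AtLeast(2, "admires", NotAtomic("quiet")))
print(sentence("Anne", concept))
print(constructor_count(concept))
```

Nesting more constructors inside the fillers yields longer and harder sentences, which is the kind of knob a formal language gives a dataset designer.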

The dataset contains 384,000 examples that vary in two key dimensions: i) reasoning depth, and ii) linguistic complexity. This allows the researchers to assess how well the models can handle increasingly challenging reasoning tasks.
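
Because the examples are organized by reasoning depth, a natural way to read results is accuracy per depth level. The sketch below shows one way to compute that breakdown. The record layout `(depth, gold, predicted)` and the label strings are assumptions made for illustration, and the three records are made-up placeholders, not data or results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (reasoning depth, gold label, predicted label).
Record = Tuple[int, str, str]


def accuracy_by_depth(records: Iterable[Record]) -> Dict[int, float]:
    """Return the fraction of correct predictions for each reasoning depth."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for depth, gold, predicted in records:
        total[depth] += 1
        correct[depth] += int(gold == predicted)
    return {depth: correct[depth] / total[depth] for depth in sorted(total)}


# Toy placeholder records, only to show the call shape.
toy_records = [
    (0, "True", "True"),
    (0, "False", "True"),
    (1, "Unknown", "Unknown"),
]
print(accuracy_by_depth(toy_records))
```

The same grouping could be applied to a linguistic complexity level instead of depth by changing the first field of each record.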

They then evaluate a supervised fine-tuned DeBERTa-based model and two large language models (GPT-3.5, GPT-4) on DELTA$_D$, using few-shot prompting for the latter. The results show that the DeBERTa-based model is able to master the reasoning task, while the GPT models improve significantly when given as few as 9 in-context examples (shots), without any additional training.
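
Few-shot prompting means the solved examples are placed in the model's input rather than used to update its weights. The following sketch assembles such a prompt as a plain string. The `Context` / `Question` / `Answer` layout, the instruction text, and the answer labels are assumptions for illustration and are not taken from the paper's actual prompts; no model API is called here.

```python
from typing import NamedTuple, Sequence


class Example(NamedTuple):
    context: str   # natural language facts and rules
    question: str  # statement to be checked against the context
    answer: str    # gold label (label set is an assumption)


DEFAULT_INSTRUCTION = (
    "Decide whether the question follows from the context. "
    "Answer with True, False, or Unknown."
)


def build_few_shot_prompt(
    shots: Sequence[Example],
    context: str,
    question: str,
    instruction: str = DEFAULT_INSTRUCTION,
) -> str:
    """Concatenate solved examples, then the unsolved query, into one prompt."""
    parts = [instruction]
    for shot in shots:
        parts.append(
            f"Context: {shot.context}\n"
            f"Question: {shot.question}\n"
            f"Answer: {shot.answer}"
        )
    # The final block is left open so the model completes the answer.
    parts.append(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)


# Nine identical toy shots, only to show the call shape.
toy_shot = Example("Anne is smart.", "Anne is smart.", "True")
prompt = build_few_shot_prompt([toy_shot] * 9, "Bob is kind.", "Bob is smart.")
print(prompt.count("Answer:"))
```

In practice the shots would be distinct examples; how they are chosen is a design decision this sketch leaves open.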

Critical Analysis

The authors have taken a thoughtful approach to constructing a dataset that can systematically evaluate the reasoning capabilities of transformer-based language models. By using the description logic language $\mathcal{ALCQ}$, they are able to control the difficulty of the reasoning problems in a principled way.

However, one potential limitation of this work is the focus on a single formal logic framework (description logic). It would be interesting to see how the models perform on reasoning tasks expressed in other logical formalisms, such as first-order logic or temporal logic.

Additionally, while the few-shot results for the GPT models are promising, it's unclear how their performance would change with more in-context examples or with more complex reasoning problems. Further research is needed to fully understand the limitations and potential of these large language models for reasoning tasks.

Conclusion

This paper makes an important contribution to the understanding of transformer-based language models' ability to perform complex reasoning tasks. By creating the DELTA$_D$ dataset and evaluating a range of models, the authors have provided valuable insights into the current state of the art and the potential for future advancements in this area.

The results suggest that with the right dataset and training approach, transformer-based models can demonstrate strong reasoning capabilities, even with limited training data. This has implications for a wide range of applications, from natural language understanding to problem-solving and decision-making. As the field continues to evolve, it will be exciting to see how these models can be further developed and applied to tackle increasingly complex reasoning challenges.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Scaling Synthetic Logical Reasoning Datasets with Context-Sensitive Declarative Grammars

Damien Sileo

Logical reasoning remains a challenge for natural language processing, but it can be improved by training language models to mimic theorem provers on procedurally generated problems. Previous work used domain-specific proof generation algorithms, which biases reasoning toward specific proof traces and limits auditability and extensibility. We present a simpler and more general declarative framework with flexible context-sensitive rules binding multiple languages (specifically, simplified English and the TPTP theorem-proving language). We construct first-order logic problems by selecting up to 32 premises and one hypothesis. We demonstrate that using semantic constraints during generation and careful English verbalization of predicates enhances logical reasoning without hurting natural English tasks. We use relatively small DeBERTa-v3 models to achieve state-of-the-art accuracy on the FOLIO human-authored logic dataset, surpassing GPT-4 in accuracy with or without an external solver by 12%.

6/18/2024

cs.CL

A Symbolic Framework for Evaluating Mathematical Reasoning and Generalisation with Transformers

Jordan Meadows, Marco Valentino, Damien Teney, Andre Freitas

This paper proposes a methodology for generating and perturbing detailed derivations of equations at scale, aided by a symbolic engine, to evaluate the generalisability of Transformers to out-of-distribution mathematical reasoning problems. Instantiating the framework in the context of sequence classification tasks, we compare the capabilities of GPT-4, GPT-3.5, and a canon of fine-tuned BERT models, exploring the relationship between specific operators and generalisation failure via the perturbation of reasoning aspects such as symmetry and variable surface forms. Surprisingly, our empirical evaluation reveals that the average in-distribution performance of fine-tuned models surpasses GPT-3.5, and rivals GPT-4. However, perturbations to input reasoning can reduce their performance by up to 80 F1 points. Overall, the results suggest that the in-distribution performance of smaller open-source models may potentially rival GPT by incorporating appropriately structured derivation dependencies during training, and highlight a shared weakness between BERT and GPT involving a relative inability to decode indirect references to mathematical entities. We release the full codebase, constructed datasets, and fine-tuned models to encourage future progress in the field.

4/9/2024

cs.CL cs.LG

Assessing Logical Reasoning Capabilities of Encoder-Only Transformer Models

Paulo Pirozelli, Marcos M. José, Paulo de Tarso P. Filho, Anarosa A. F. Brandão, Fabio G. Cozman

Logical reasoning is central to complex human activities, such as thinking, debating, and planning; it is also a central component of many AI systems. In this paper, we investigate the extent to which encoder-only transformer language models (LMs) can reason according to logical rules. We ask whether those LMs can deduce theorems in propositional calculus and first-order logic; if their relative success in these problems reflects general logical capabilities; and which layers contribute the most to the task. First, we show for several encoder-only LMs that they can be trained, to a reasonable degree, to determine logical validity on various datasets. Next, by cross-probing fine-tuned models on these datasets, we show that LMs have difficulty in transferring their putative logical reasoning ability, which suggests that they may have learned dataset-specific features, instead of a general capability. Finally, we conduct a layerwise probing experiment, which shows that the hypothesis classification task is mostly solved through higher layers.

7/2/2024

cs.CL cs.AI

Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities

Wenyue Hua, Kaijie Zhu, Lingyao Li, Lizhou Fan, Shuhang Lin, Mingyu Jin, Haochen Xue, Zelong Li, JinDong Wang, Yongfeng Zhang

This study intends to systematically disentangle pure logic reasoning and text understanding by investigating the contrast across abstract and contextualized logical problems from a comprehensive set of domains. We explore whether LLMs demonstrate genuine reasoning capabilities across various domains when the underlying logical structure remains constant. We focus on two main questions: (1) Can abstract logical problems alone accurately benchmark an LLM's reasoning ability in real-world scenarios, disentangled from contextual support in practical settings? (2) Does fine-tuning LLMs on abstract logic problems generalize to contextualized logic problems and vice versa? To investigate these questions, we focus on standard propositional logic, specifically propositional deductive and abductive logic reasoning. In particular, we construct instantiated datasets for deductive and abductive reasoning with 4 levels of difficulty, encompassing 12 distinct categories or domains based on the categorization of Wikipedia. Our experiments aim to provide insights into disentangling context in logical reasoning and the true reasoning capabilities of LLMs and their generalization potential. The code and dataset are available at: https://github.com/agiresearch/ContextHub.

6/6/2024

cs.CL cs.AI cs.LG