Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks

2307.02477

Published 4/1/2024 by Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyurek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, Yoon Kim

cs.CL cs.AI

💬

Abstract

The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on counterfactual task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.

Get summaries of the top AI research delivered straight to your inbox:

Introduction

The paper introduces a method for evaluating the ability of large language models (LMs) to generalize to new task variants, called counterfactual tasks. While LMs have shown impressive performance on various benchmarks, it is unclear whether this success stems from their general reasoning abilities or from recognizing and recalling specific tasks seen during training.

The authors propose creating counterfactual tasks by altering the conditions or rules of tasks that LMs perform well on, while keeping the general reasoning procedure the same. For example, instead of performing arithmetic in base-10 (the default), the counterfactual task would be to perform arithmetic in base-9.

They designed a suite of 11 counterfactual evaluation tasks across different categories and domains, including traditional NLP tasks, code generation, drawing, and spatial reasoning. These tasks test whether LMs can learn conceptual structures that mirror the non-linguistic world.

The authors evaluated several large language models, including GPT-4, GPT-3.5, Claude, and PaLM-2, on both the default and counterfactual tasks. The models exhibited above-random performance on counterfactual tasks, indicating some degree of task generalizability. However, their performance consistently and substantially degraded on the counterfactual variants compared to the default settings.

These results suggest that the models' success is at least partially supported by non-transferable, default-condition-specific behaviors rather than abstract, generalizable reasoning skills. The paper also discusses surprising relationships between model behavior on default and counterfactual tasks, such as correlations between performance, effectiveness of zero-shot chain-of-thought prompting, and interactions between task- and instance-level frequency effects.

Overall, the authors conclude that small variations on the default instantiations of tasks are challenging for models, indicating that the success of existing LMs should not be fully attributed to a fully general capacity for the target task.

Counterfactual Tasks

The paper describes an approach to evaluate language models on their ability to generalize to new task variants or conditions, rather than just new inputs. Each task is conceptualized as a function that maps an input to an output under certain assumptions or a "world model." For example, arithmetic operations assume a number base like base-10.

The authors refer to the typical set of assumptions, like base-10 for arithmetic, as the "default world." They introduce counterfactual worlds that deviate from these default conditions, like using a different number base. By evaluating language models on tasks in both the default and counterfactual worlds, they can measure if the models have overfit to the default conditions.

To ensure the models understand the specified counterfactual conditions, the authors use "counterfactual comprehension checks" - simple auxiliary tasks that test if the model grasps the counterfactual world appropriately. For example, checking if the model carries digits over 8 correctly when doing base-9 arithmetic.

Tasks

The paper evaluates large language models (LMs) on various tasks under both default and counterfactual settings to assess their robustness and understanding. The tasks include arithmetic in different bases, programming with different indexing conventions, syntactic reasoning with different word orders, logical reasoning with premises violating common sense, spatial reasoning with transformed coordinate systems, drawing objects with rotations/flips, music tasks with altered tunings/transpositions, chess openings with piece position swaps, and the SET card game with rule variations.

For each task, the models are given a counterfactual comprehension check (CCC) to verify their understanding of the altered setting. The results show that while LMs perform well on the default tasks, their performance consistently drops in the counterfactual conditions, despite often passing the CCC. This suggests that LMs may lack robust understanding and instead rely on patterns from their training data. The paper aims to disentangle memorization effects from true conceptual understanding through these counterfactual evaluations.

Results

The researchers evaluated several large language models (GPTs, PaLM, Claude) on various tasks under default and counterfactual conditions. They found the models performed substantially worse on the counterfactual task variants, even when prompted to reason step-by-step. While the models exhibited some ability to solve the counterfactual tasks, the large performance gaps between default and counterfactual conditions suggest the models overfit to the default conditions. This overfitting limits their abstract understanding of the tasks. For example, the models sometimes failed completely on counterfactual arithmetic problems despite near-perfect default performance. The researchers conclude the models rely on condition-specific implementations rather than general task comprehension.

Analysis

The paper explores how various factors affect the performance gap between default and counterfactual conditions for language models on different tasks. Some key findings:

More common or familiar counterfactual conditions (e.g. base 8, drop-D guitar tuning) lead to smaller performance drops compared to uncommon ones.
Performance drops are smaller when the counterfactual condition is closer to the default (e.g. base 9 vs 11 for arithmetic).
Across tasks and models, there is a strong correlation between default task performance and counterfactual performance. Better models on the default also tend to be better on counterfactuals.
However, larger models sometimes fail more on counterfactuals that contradict their training data, suggesting over-reliance on memorization.
Few-shot demonstrations help but do not fully eliminate the performance gap between default and counterfactual conditions.
Qualitative analysis shows even when models can complete counterfactual tasks, the outputs are often simplified or contain errors compared to default conditions.

The findings suggest models rely on both memorization and reasoning abilities, which co-exist on a continuum rather than being an either-or situation.

Discussion

The paper discusses whether large language models (LMs) trained on text data can generalize their reasoning abilities to unfamiliar counterfactual conditions, or if their performance is specific to the familiar default task setup they were trained on.

The authors hypothesize that while humans may initially struggle with unfamiliar counterfactual scenarios under time constraints, they ultimately have the competence and causal reasoning abilities to generalize to novel situations given sufficient time.

However, the authors argue that replicating human intelligence is not necessarily the goal for LMs. Their aim is to evaluate if LMs genuinely learn generalizable reasoning capabilities from training data, rather than just memorizing patterns specific to the default tasks.

The paper finds a notable performance gap between default and counterfactual task variants for LMs across several reasoning tasks. While more informative prompting can reduce this gap, it does not fully eliminate it, suggesting LMs may overfit to frequent patterns during pretraining rather than learning generalizable abstractions.

The authors posit that an ideal learner should structure representations to implement general-purpose abstractions that allow generalizing to new contexts, akin to how humans and scientific theories can extrapolate to drastically different scenarios. Their study indicates current LMs still struggle with such generalization.

Limitations

The paper discusses potential factors that could lead to underestimation or overestimation of a language model's true reasoning ability when using counterfactual tasks for evaluation.

Underestimation:

It is challenging to construct counterfactual tasks with the same difficulty as the default tasks. An objective difficulty measure may not exist.
Some counterfactual tasks involve additional steps, making them harder than the original task. This increased difficulty, rather than overfitting, could explain poor performance.

Overestimation:

It is difficult to ensure that counterfactual conditions are truly unseen during pretraining. If models have encountered some counterfactual conditions, the gap between default and counterfactual performance could be smaller than expected.

The paper distinguishes between counterfactual perturbations that fundamentally change the world model (e.g., arithmetic base) and more superficial perturbations where models could potentially exploit shortcuts (e.g., word replacements). For some tasks, models might identify simple mappings to revert the input to default conditions and leverage memorization, rather than reasoning under counterfactual conditions.

Finally, the counterfactual condition comprehension (CCC) task itself involves some reasoning, making it difficult to isolate the model's understanding of counterfactual conditions from its reasoning ability.

Related Work

The paper discusses evaluating how well large language models (LMs) can reason about and understand conceptual structures in hypothetical or counterfactual scenarios that deviate from the real world. Previous research has shown LMs can acquire grounded understanding of certain concepts like color and size through text training alone. However, this study found that while LMs can generalize to new concepts, they struggle with the reasoning process required for hypothetical scenarios.

The paper relates this issue to causal reasoning - LMs fail to robustly learn the causal effects of different world states on the evaluated tasks. It notes that "counterfactual" is an informal term in NLP referring to different types of perturbations or hypothetical scenarios. Some past work looked at counterfactuals still licensed within a default world model, finding LMs struggle with consistent reasoning in such cases. Other work examined using counterfactual data to test model robustness.

Most similar to this study, one paper showed that while LMs seem capable of some reasoning in counterfactual worlds, this is largely due to surface cues rather than true understanding. The current paper reveals that even recent LMs exhibit difficulties with reasoning about conceptual structures in scenarios deviating from the real world.

Conclusion

The researchers conducted an evaluation using counterfactual scenarios across 11 tasks. They found a significant decrease in the performance of language models when faced with counterfactual conditions, as opposed to the default task variants. This performance gap is attributed to the language models overfitting to the default task variants, which may be more prevalent in the pretraining data. The authors recommend that future analyses of language models should consider their abstract task ability separately from their observed task performance, especially when the evaluated task variants might be overrepresented in the pretraining corpora.

Acknowledgments

The authors express gratitude to various individuals for their contributions. They thank several people, listed alphabetically, for helpful discussions and feedback on the work. Additionally, they acknowledge a group of annotators who made the drawing evaluation possible. One author expresses personal thanks for guitar lessons relevant to components of the paper. The figure in the paper uses icons from a website, and the study received funding from multiple sources, including the MIT-IBM Watson AI Lab, the MIT Quest for Intelligence, and the National Science Foundation.

Appendix A Full Setups

The paper describes several tasks used to evaluate large language models' (LMs) ability to reason under counterfactual conditions.

For arithmetic, models are asked to perform operations in different number bases. For programming, models generate code or execute provided code under different indexing conventions (0-based vs 1-based).

For syntactic reasoning, models identify subjects and verbs in sentences with different word orders. For logical reasoning, models determine if conclusions follow from counterfactual premises.

For spatial reasoning, models locate objects under different coordinate system orientations. For drawing, models generate code to draw objects with different rotations/flips.

For music tasks, models provide fret placements for alternate instrument tunings and identify notes in different musical keys.

For chess, models validate move sequences under swapped piece positions. For the SET game, models identify set completions with inverted rules.

The paper also describes "counterfactual condition checks" to verify models understand the counterfactual conditions.

Appendix B Prompts

The paper discusses the specific prompts used to query language models, presented in Tables 1 to 17. Instead of providing templates, the authors include concrete prompts that embed test instances. Minor design decisions related to the prompts are explained in the respective table captions. The system message field was not utilized for any of the language models evaluated.

Appendix C Raw Results

Tables 18 to 34 present the numerical findings of the study. The data is displayed in tabular form, likely containing values, measurements, or statistics related to the research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evidence from counterfactual tasks supports emergent analogical reasoning in large language models

Taylor Webb, Keith J. Holyoak, Hongjing Lu

We recently reported evidence that large language models are capable of solving a wide range of text-based analogy problems in a zero-shot manner, indicating the presence of an emergent capacity for analogical reasoning. Two recent commentaries have challenged these results, citing evidence from so-called `counterfactual' tasks in which the standard sequence of the alphabet is arbitrarily permuted so as to decrease similarity with materials that may have been present in the language model's training data. Here, we reply to these critiques, clarifying some misunderstandings about the test materials used in our original work, and presenting evidence that language models are also capable of generalizing to these new counterfactual task variants.

5/1/2024

cs.CL cs.AI

Eyes Can Deceive: Benchmarking Counterfactual Reasoning Abilities of Multi-modal Large Language Models

Yian Li, Wentao Tian, Yang Jiao, Jingjing Chen, Yu-Gang Jiang

Counterfactual reasoning, as a crucial manifestation of human intelligence, refers to making presuppositions based on established facts and extrapolating potential outcomes. Existing multimodal large language models (MLLMs) have exhibited impressive cognitive and reasoning capabilities, which have been examined across a wide range of Visual Question Answering (VQA) benchmarks. Nevertheless, how will existing MLLMs perform when faced with counterfactual questions? To answer this question, we first curate a novel textbf{C}ountertextbf{F}actual textbf{M}ultitextbf{M}odal reasoning benchmark, abbreviated as textbf{CFMM}, to systematically assess the counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six challenging tasks, each including hundreds of carefully human-labeled counterfactual questions, to evaluate MLLM's counterfactual reasoning capabilities across diverse aspects. Through experiments, interestingly, we find that existing MLLMs prefer to believe what they see, but ignore the counterfactual presuppositions presented in the question, thereby leading to inaccurate responses. Furthermore, we evaluate a wide range of prevalent MLLMs on our proposed CFMM. The significant gap between their performance on our CFMM and that on several VQA benchmarks indicates that there is still considerable room for improvement in existing MLLMs toward approaching human-level intelligence. On the other hand, through boosting MLLMs performances on our CFMM in the future, potential avenues toward developing MLLMs with advanced intelligence can be explored.

4/22/2024

cs.CV cs.AI

💬

What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models

Letian Zhang, Xiaotong Zhai, Zhongkai Zhao, Yongshuo Zong, Xin Wen, Bingchen Zhao

Counterfactual reasoning, a fundamental aspect of human cognition, involves contemplating alternatives to established facts or past events, significantly enhancing our abilities in planning and decision-making. In light of the advancements in current multi-modal large language models, we explore their effectiveness in counterfactual reasoning. To facilitate this investigation, we introduce a novel dataset, C-VQA, specifically designed to test the counterfactual reasoning capabilities of modern multi-modal large language models. This dataset is constructed by infusing original questions with counterfactual presuppositions, spanning various types such as numerical and boolean queries. It encompasses a mix of real and synthetic data, representing a wide range of difficulty levels. Our thorough evaluations of contemporary vision-language models using this dataset have revealed substantial performance drops, with some models showing up to a 40% decrease, highlighting a significant gap between current models and human-like vision reasoning capabilities. We hope our dataset will serve as a vital benchmark for evaluating the counterfactual reasoning capabilities of models. Code and dataset are publicly available at https://bzhao.me/C-VQA/.

4/17/2024

cs.CL cs.CV cs.LG

LLMs for Generating and Evaluating Counterfactuals: A Comprehensive Study

Van Bach Nguyen, Paul Youssef, Jorg Schlotterer, Christin Seifert

As NLP models become more complex, understanding their decisions becomes more crucial. Counterfactuals (CFs), where minimal changes to inputs flip a model's prediction, offer a way to explain these models. While Large Language Models (LLMs) have shown remarkable performance in NLP tasks, their efficacy in generating high-quality CFs remains uncertain. This work fills this gap by investigating how well LLMs generate CFs for two NLU tasks. We conduct a comprehensive comparison of several common LLMs, and evaluate their CFs, assessing both intrinsic metrics, and the impact of these CFs on data augmentation. Moreover, we analyze differences between human and LLM-generated CFs, providing insights for future research directions. Our results show that LLMs generate fluent CFs, but struggle to keep the induced changes minimal. Generating CFs for Sentiment Analysis (SA) is less challenging than NLI where LLMs show weaknesses in generating CFs that flip the original label. This also reflects on the data augmentation performance, where we observe a large gap between augmenting with human and LLMs CFs. Furthermore, we evaluate LLMs' ability to assess CFs in a mislabelled data setting, and show that they have a strong bias towards agreeing with the provided labels. GPT4 is more robust against this bias and its scores correlate well with automatic metrics. Our findings reveal several limitations and point to potential future work directions.

5/3/2024

cs.CL cs.AI