Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds






Published 4/12/2024 by Victoria Basmov, Yoav Goldberg, Reut Tsarfaty



We evaluate LLMs' language understanding capacities on simple inference tasks that most humans find trivial. Specifically, we target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. We design evaluation sets for these tasks and conduct experiments in both zero-shot and chain-of-thought setups, and with multiple prompts and LLMs. The models exhibit moderate to low performance on these evaluation sets. Subsequent experiments show that embedding the premise in syntactic constructions that should preserve the entailment relations (presupposition triggers) or change them (non-factives), further confuses the models, causing them to either under-predict or over-predict certain entailment labels regardless of the true relation, and often disregarding the nature of the embedding context. Overall, these results suggest that, despite LLMs' celebrated language understanding capacity, even the strongest models have blindspots with respect to certain types of entailments, and certain information-packaging structures act as "blinds" overshadowing the semantics of the embedded premise.


Overview

  • This paper evaluates the language understanding capabilities of large language models (LLMs) on simple inference tasks that most humans find easy.
  • The researchers specifically target three types of inferences: grammatically-specified entailments, premises with evidential adverbs of uncertainty, and monotonicity entailments.
  • They design evaluation sets for these tasks and conduct experiments in both zero-shot and chain-of-thought setups, using multiple prompts and LLMs.
  • The results show that the models exhibit moderate to low performance on these evaluation sets, and that certain syntactic constructions, such as presupposition triggers and non-factives, further confuse the models.

Plain English Explanation

The researchers wanted to test how well large language models (LLMs) can understand and reason about simple language concepts that most people find easy. They focused on three specific types of language inferences:

  1. Grammatically-specified entailments: These are logical relationships between sentences that are clearly defined by the grammar of the language. For example, "The cat is on the mat" entails (or implies) "The cat is on something."

  2. Premises with evidential adverbs of uncertainty: These are sentences that contain words like "maybe" or "possibly," which indicate the speaker is not completely certain about the information.

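  3. Monotonicity entailments: These are inferences that hold when a phrase is replaced by a more general or a more specific one, depending on the context it appears in. For example, "Some dogs bark" entails "Some animals bark," while "No animals bark" entails "No dogs bark."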
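
The direction of a monotonicity inference can be made concrete with a small sketch. The table and function names below are illustrative assumptions, not part of the paper's evaluation sets; the sketch only covers noun substitution in the first argument of three English determiners.

```python
# Monotonicity of a few English determiners in their first (restrictor) argument.
# "up":   swapping the noun for a more general one preserves truth.
# "down": swapping the noun for a more specific one preserves truth.
# This table is an illustrative assumption, not taken from the paper.
RESTRICTOR_MONOTONICITY = {
    "some": "up",
    "no": "down",
    "every": "down",
}


def substitution_label(determiner: str, change: str) -> str:
    """Return the premise -> hypothesis label after swapping the noun.

    change is "generalize" (dogs -> animals) or "specialize" (animals -> dogs).
    """
    if change not in ("generalize", "specialize"):
        raise ValueError(f"unknown change: {change}")
    direction = RESTRICTOR_MONOTONICITY[determiner.lower()]
    preserved = (direction == "up" and change == "generalize") or (
        direction == "down" and change == "specialize"
    )
    return "entailment" if preserved else "neutral"


# "Some dogs bark" -> "Some animals bark"
print(substitution_label("some", "generalize"))
# "No animals bark" -> "No dogs bark"
print(substitution_label("no", "specialize"))
# "Every dog barks" -> "Every animal barks" (not guaranteed)
print(substitution_label("every", "generalize"))
```

The first two calls print "entailment" and the third prints "neutral", since knowing that every dog barks says nothing about other animals.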

The researchers designed test sets to evaluate how well LLMs could handle these types of inferences. They tried different setups, including having the models reason through the inferences step-by-step, and using various prompts and LLMs.

Overall, the LLMs performed moderately to poorly on these tasks. The researchers also found that certain ways of structuring the sentences, like using "presupposition triggers" or "non-factives," further confused the models, causing them to make mistakes in predicting the logical relationships, even when the underlying meaning should have been clear.

Technical Explanation

The paper investigates the language understanding capabilities of large language models (LLMs) on three types of simple inference tasks:

  1. Grammatically-specified entailments: The researchers test the models' ability to recognize logical entailment relationships that are specified by the grammatical structure of the sentences, such as the relationship between "The cat is on the mat" and "The cat is on something."

  2. Premises with evidential adverbs of uncertainty: The researchers evaluate how well the models handle premises containing words like "maybe" or "possibly," which indicate uncertainty about the information.

  3. Monotonicity entailments: The researchers test the models' understanding of inferences that depend on whether a context licenses replacing a phrase with a more general one (an upward-monotone context) or with a more specific one (a downward-monotone context).
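
As a rough illustration of how such an evaluation set could be organized, the sketch below stores one made-up item per task and scores predictions per task. The class name, field names, example sentences, and the three-way label scheme (entailment, contradiction, neutral) are assumptions for illustration; the paper's actual data format may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceItem:
    phenomenon: str  # which of the three tasks the item belongs to
    premise: str
    hypothesis: str
    gold: str        # assumed label set: entailment / contradiction / neutral


# Made-up examples, one per task; not drawn from the paper's evaluation sets.
ITEMS = [
    InferenceItem("grammatical", "The cat is on the mat.",
                  "The cat is on something.", "entailment"),
    InferenceItem("uncertainty", "Possibly, the cat is on the mat.",
                  "The cat is on the mat.", "neutral"),
    InferenceItem("monotonicity", "No animals bark.",
                  "No dogs bark.", "entailment"),
]


def accuracy_by_phenomenon(items, predictions):
    """Fraction of correct predictions for each task."""
    correct, total = Counter(), Counter()
    for item, predicted in zip(items, predictions):
        total[item.phenomenon] += 1
        correct[item.phenomenon] += int(predicted == item.gold)
    return {name: correct[name] / total[name] for name in total}


# A model that always answers "entailment" misses the uncertainty item.
print(accuracy_by_phenomenon(ITEMS, ["entailment"] * 3))
```

Reporting accuracy per task rather than overall keeps a weakness in one phenomenon from being hidden by strength in another.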

The researchers design evaluation sets for these tasks and conduct experiments in both zero-shot and chain-of-thought setups, using multiple prompts and several LLMs.
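
The difference between the two setups can be sketched as follows. The prompt wording, the function names, and the label set are hypothetical; the paper's actual prompts are not reproduced in this summary.

```python
LABELS = ("entailment", "contradiction", "neutral")  # assumed label set


def build_prompt(premise: str, hypothesis: str, chain_of_thought: bool = False) -> str:
    """Build a zero-shot or chain-of-thought prompt (hypothetical wording)."""
    task = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis, contradict it, or neither?\n"
    )
    if chain_of_thought:
        # Ask for reasoning first, then a final answer in a fixed format.
        return task + "Let's think step by step, then finish with 'Answer: <label>'."
    return task + "Answer with one word: entailment, contradiction, or neutral."


def parse_label(response: str):
    """Return the first label mentioned after the last 'Answer:', or None."""
    text = response.lower()
    tail = text.rsplit("answer:", 1)[-1]  # whole text if no 'answer:' marker
    found = [(tail.find(label), label) for label in LABELS if label in tail]
    return min(found)[1] if found else None


print(build_prompt("The cat is on the mat.", "The cat is on something."))
print(parse_label("The mat is something, so the claim follows. Answer: Entailment"))
```

Parsing only the text after the final "Answer:" marker avoids picking up labels that a chain-of-thought response mentions while reasoning.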

The results show that the models exhibit moderate to low performance on these evaluation sets, suggesting that even the strongest LLMs have blindspots when it comes to certain types of entailments. Furthermore, the researchers find that embedding the premises in syntactic constructions that should preserve the entailment relations (presupposition triggers) or change them (non-factives) further confuses the models, causing them to either under-predict or over-predict certain entailment labels, often disregarding the nature of the embedding context.
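
The embedding manipulation and the resulting label bias can be sketched as below. The templates, the names, and the simplifying rule that a non-factive embedding turns every label into "neutral" are assumptions made for illustration, not the paper's exact construction.

```python
from collections import Counter

# Hypothetical embedding templates: "realized" presupposes its complement,
# while "believes" does not commit the speaker to it.
EMBEDDINGS = {
    "presupposition_trigger": "Mary realized that {clause}",
    "non_factive": "Mary believes that {clause}",
}


def embed(premise: str, kind: str) -> str:
    """Embed a premise under a template (naively lowercases the first letter)."""
    clause = premise[0].lower() + premise[1:]
    return EMBEDDINGS[kind].format(clause=clause)


def expected_label(base_label: str, kind: str) -> str:
    """Gold label after embedding, under the simplifying assumptions above."""
    if kind == "presupposition_trigger":
        return base_label  # the embedded premise is still taken to be true
    if kind == "non_factive":
        return "neutral"   # the embedded premise is no longer asserted
    raise ValueError(f"unknown embedding kind: {kind}")


def label_bias(gold, predicted):
    """Predicted count minus gold count per label; positive means over-predicted."""
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    labels = sorted(set(gold_counts) | set(pred_counts))
    return {label: pred_counts[label] - gold_counts[label] for label in labels}


print(embed("The cat is on the mat.", "non_factive"))
print(expected_label("entailment", "non_factive"))
print(label_bias(["entailment", "neutral", "neutral"], ["entailment"] * 3))
```

A model that ignores the embedding context and keeps answering "entailment" would show a positive bias for that label and a negative one for "neutral", which is the kind of over-prediction and under-prediction described above.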

Critical Analysis

The paper provides valuable insights into the limitations of current LLMs when it comes to basic language understanding and reasoning tasks. While LLMs have demonstrated impressive language capabilities, the results of this study suggest that they still struggle with certain types of logical inferences and are susceptible to being misled by specific syntactic constructions.

One potential limitation of the study is the relatively small scale of the evaluation sets. While the researchers designed the tasks to target specific linguistic phenomena, a larger and more diverse set of examples may be needed to fully capture the models' weaknesses. Additionally, the study focuses on a limited set of LLMs, and it would be interesting to see how a wider range of models, including newer and more advanced architectures, perform on these tasks.

Another area for further research could be investigating the underlying reasons why the models struggle with these types of inferences. Are the issues related to limitations in the models' knowledge representation, reasoning capabilities, or something else? Exploring these questions could lead to important insights for improving the language understanding abilities of LLMs.

Despite these potential limitations, the paper makes a valuable contribution to the ongoing efforts to understand the strengths and weaknesses of large language models. By identifying specific areas where these models fall short, the research can help guide the development of more robust and reliable natural language processing systems.


Conclusion

This paper highlights the limitations of current large language models (LLMs) when it comes to simple language understanding and reasoning tasks. The researchers find that even the strongest LLMs exhibit moderate to low performance on three types of inference tasks: grammatically-specified entailments, premises with evidential adverbs of uncertainty, and monotonicity entailments.

Furthermore, the researchers show that certain syntactic constructions, such as presupposition triggers and non-factives, can further confuse the models, causing them to make systematic errors in predicting the logical relationships, even when the underlying meaning should be clear.

These findings suggest that while LLMs have made remarkable progress in natural language processing, they still have significant blindspots when it comes to fundamental language understanding and reasoning. Addressing these limitations will be an important challenge for the field as it works to develop more robust and reliable language models that can truly match human-level language abilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Easy Problems That LLMs Get Wrong

Sean Williams, James Huckle





We introduce a comprehensive Linguistic Benchmark designed to evaluate the limitations of Large Language Models (LLMs) in domains such as logical reasoning, spatial intelligence, and linguistic understanding, among others. Through a series of straightforward questions, it uncovers the significant limitations of well-regarded models to perform tasks that humans manage with ease. It also highlights the potential of prompt engineering to mitigate some errors and underscores the necessity for better training methodologies. Our findings stress the importance of grounding LLMs with human reasoning and common sense, emphasising the need for human-in-the-loop for enterprise applications. We hope this work paves the way for future research to enhance the usefulness and reliability of new models.




Bayesian Statistical Modeling with Predictors from LLMs

Michael Franke, Polina Tsvilodub, Fausto Carcassi





State-of-the-art large language models (LLMs) have shown impressive performance on a variety of benchmark tasks and are increasingly used as components in larger applications, where LLM-based predictions serve as proxies for human judgements or decisions. This raises questions about the human-likeness of LLM-derived information, alignment with human intuition, and whether LLMs could possibly be considered (parts of) explanatory models of (aspects of) human cognition or language use. To shed more light on these issues, we here investigate the human-likeness of LLMs' predictions for multiple-choice decision tasks from the perspective of Bayesian statistical modeling. Using human data from a forced-choice experiment on pragmatic language use, we find that LLMs do not capture the variance in the human data at the item-level. We suggest different ways of deriving full distributional predictions from LLMs for aggregate, condition-level data, and find that some, but not all ways of obtaining condition-level predictions yield adequate fits to human data. These results suggest that assessment of LLM performance depends strongly on seemingly subtle choices in methodology, and that LLMs are at best predictors of human behavior at the aggregate, condition-level, for which they are, however, not designed to, or usually used to, make predictions in the first place.




Evaluating the Deductive Competence of Large Language Models

Spencer M. Seals, Valerie L. Shalin





The development of highly fluent large language models (LLMs) has prompted increased interest in assessing their reasoning and problem-solving capabilities. We investigate whether several LLMs can solve a classic type of deductive reasoning problem from the cognitive science literature. The tested LLMs have limited abilities to solve these problems in their conventional form. We performed follow-up experiments to investigate if changes to the presentation format and content improve model performance. We do find performance differences between conditions; however, they do not improve overall performance. Moreover, we find that performance interacts with presentation format and content in unexpected ways that differ from human performance. Overall, our results suggest that LLMs have unique reasoning biases that are only partially predicted from human reasoning performance and the human-generated language corpora that inform them.




A blind spot for large language models: Supradiegetic linguistic information

Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Jonathan St. Onge, Mikaela Fudolig, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds





Large Language Models (LLMs) like ChatGPT reflect profound changes in the field of Artificial Intelligence, achieving a linguistic fluency that is impressively, even shockingly, human-like. The extent of their current and potential capabilities is an active area of investigation by no means limited to scientific researchers. It is common for people to frame the training data for LLMs as text or even language. We examine the details of this framing using ideas from several areas, including linguistics, embodied cognition, cognitive science, mathematics, and history. We propose that considering what it is like to be an LLM like ChatGPT, as Nagel might have put it, can help us gain insight into its capabilities in general, and in particular, that its exposure to linguistic training data can be productively reframed as exposure to the diegetic information encoded in language, and its deficits can be reframed as ignorance of extradiegetic information, including supradiegetic linguistic information. Supradiegetic linguistic information consists of those arbitrary aspects of the physical form of language that are not derivable from the one-dimensional relations of context -- frequency, adjacency, proximity, co-occurrence -- that LLMs like ChatGPT have access to. Roughly speaking, the diegetic portion of a word can be thought of as its function, its meaning, as the information in a theoretical vector in a word embedding, while the supradiegetic portion of the word can be thought of as its form, like the shapes of its letters or the sounds of its syllables. We use these concepts to investigate why LLMs like ChatGPT have trouble handling palindromes, the visual characteristics of symbols, translating Sumerian cuneiform, and continuing integer sequences.

