Pragmatic inference of scalar implicature by LLMs

Read original: arXiv:2408.06673 - Published 8/14/2024 by Ye-eun Cho, Seong mook Kim

Overview

This study investigates how large language models (LLMs) such as BERT and GPT-2 draw pragmatic inferences about scalar implicature, for example reading the word "some" as "not all."
The researchers conducted two experiments using cosine similarity and next sentence/token prediction to test the models' performance on these tasks.
The results suggest that BERT inherently incorporates pragmatic implicature, while GPT-2 encounters more difficulty inferring implicature within context.

Plain English Explanation

The study examines how large language models (LLMs), such as BERT and GPT-2, handle a particular aspect of language understanding called "scalar implicature."

Scalar implicature refers to the way we interpret words like "some" to mean "not all," even though the literal meaning of "some" is simply "one or more." For example, if someone says, "Some of the students passed the test," we typically understand this to mean that not all the students passed.

The researchers wanted to see how well BERT and GPT-2 could handle this type of pragmatic inference - drawing meaning beyond the literal words. They did this by running two experiments:

First, they tested how the models interpreted "some" without any additional context. Both BERT and GPT-2 showed they understood "some" to imply "not all," just like humans do.
Second, they presented the models with a "Question Under Discussion" (QUD) as additional context. This affected the models differently: BERT maintained its performance, while GPT-2 struggled more with the pragmatic inference required.

The findings suggest that BERT has inherently incorporated this pragmatic understanding of "some," in line with the "Default" model of language processing. In contrast, GPT-2 seems to have more difficulty inferring implicature within a given context, which aligns better with the "Context-driven" model.

Technical Explanation

The researchers conducted two sets of experiments to investigate how large language models (LLMs) like BERT and GPT-2 handle pragmatic inference of scalar implicature.

In Experiment 1, they used cosine similarity to measure how the models interpret the word "some" in the absence of any additional context. The results showed that both BERT and GPT-2 interpret "some" to implicate "not all," consistent with how humans process scalar implicature.
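
As a rough illustration of the kind of cosine-similarity comparison described for Experiment 1, the sketch below scores a vector for a "some" sentence against vectors for a "not all" paraphrase and an "all" paraphrase. The three vectors are made-up toy values, not BERT or GPT-2 embeddings, and the function name `cosine_similarity` and the comparison setup are assumptions for illustration; the paper's exact procedure is not described in this summary.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (NOT real model outputs), standing in for
# sentence vectors a model might produce.
emb_some = [0.9, 0.2, 0.1]     # "Some of the students passed."
emb_not_all = [0.8, 0.3, 0.1]  # "Not all of the students passed."
emb_all = [0.1, 0.9, 0.4]      # "All of the students passed."

sim_not_all = cosine_similarity(emb_some, emb_not_all)
sim_all = cosine_similarity(emb_some, emb_all)

# A higher similarity to the "not all" paraphrase would be read as the
# model encoding the pragmatic interpretation of "some".
closer_to_not_all = sim_not_all > sim_all
print(closer_to_not_all)
```

With real models, the toy lists would be replaced by embeddings extracted from BERT or GPT-2; how those embeddings are pooled is a separate design choice that this sketch leaves open.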

Experiment 2 introduced a "Question Under Discussion" (QUD) as a contextual cue. With this additional information, BERT maintained its performance in inferring pragmatic implicature. However, GPT-2 encountered processing difficulties when the QUD required pragmatic inference to fully understand the implicature.
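
One common way to read out next-token predictions is surprisal, the negative log-probability a model assigns to a continuation. The sketch below compares two continuations after a QUD and a "some" sentence. Whether the paper used this exact measure is not stated in this summary, and the per-token probabilities, the example sentences, and the function names are hypothetical values chosen for illustration, not GPT-2 outputs.

```python
import math

def surprisal(prob):
    """Surprisal in bits of an outcome with probability `prob`."""
    if not 0 < prob <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)

def continuation_surprisal(token_probs):
    """Sum per-token surprisal over a continuation.

    `token_probs` holds the conditional probability of each token given
    everything before it (context plus earlier continuation tokens).
    """
    return sum(surprisal(p) for p in token_probs)

# Hypothetical context: a QUD such as "Did all of the students pass?"
# followed by "Some of the students passed."
# Hypothetical per-token probabilities (NOT real model outputs):
probs_not_all = [0.5, 0.25, 0.5]  # continuation consistent with "not all"
probs_all = [0.125, 0.25, 0.5]    # continuation consistent with "all"

s_not_all = continuation_surprisal(probs_not_all)
s_all = continuation_surprisal(probs_all)

# Lower surprisal means the model finds the continuation more expected.
prefers_not_all = s_not_all < s_all
print(s_not_all, s_all, prefers_not_all)
```

In this framing, a model that has drawn the implicature should assign lower surprisal to the "not all" continuation, and processing difficulty would show up as a smaller or reversed gap.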

These findings suggest that BERT and GPT-2 pattern with different theoretical accounts of implicature. BERT appears to inherently incorporate the pragmatic implicature "not all" within the term "some," aligning with the Default model of language processing (Levinson, 2000). In contrast, GPT-2 seems to struggle more with inferring pragmatic implicature within a given context, consistent with the Context-driven model (Sperber and Wilson, 2002).

Critical Analysis

The study provides valuable insights into how leading LLMs handle pragmatic inference, but it also has some important limitations:

The experiments were conducted on a relatively small scale, with only two models (BERT and GPT-2) tested. It would be beneficial to expand the research to include a wider range of LLMs to see if the observed patterns hold true more broadly.
The researchers acknowledged that the models' performance may be influenced by the specific training data and fine-tuning approaches used. Further research is needed to understand how different training regimes and data sources impact pragmatic inference capabilities.
The study focused solely on scalar implicature using the word "some." While this is an important aspect of pragmatic inference, there are many other types of implicature and contextual cues that could be explored to gain a more comprehensive understanding of LLMs' language understanding abilities.

Despite these caveats, the study makes a valuable contribution by shedding light on the theoretical underpinnings of how leading LLMs process pragmatic meaning. The findings suggest that different architectural approaches and training methods may lead to divergent strengths and weaknesses in handling contextual inference, an important consideration as these models are deployed in real-world applications.

Conclusion

This study investigates how large language models (LLMs) like BERT and GPT-2 handle pragmatic inference of scalar implicature, such as interpreting "some" to mean "not all." The results suggest that BERT has inherently incorporated this pragmatic understanding, while GPT-2 encounters more difficulty inferring implicature within a given context.

These findings have important implications for our understanding of how LLMs process language and draw meaning beyond the literal words. As these models become increasingly prevalent, it will be crucial to continue exploring their strengths and limitations in handling the nuances of human communication.


Related Papers

Pragmatic inference of scalar implicature by LLMs

Ye-eun Cho, Seong mook Kim

This study investigates how Large Language Models (LLMs), particularly BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), engage in pragmatic inference of scalar implicature, such as "some". Two sets of experiments were conducted using cosine similarity and next sentence/token prediction as experimental methods. The results in experiment 1 showed that both models interpret "some" as pragmatic implicature "not all" in the absence of context, aligning with human language processing. In experiment 2, in which Question Under Discussion (QUD) was presented as a contextual cue, BERT showed consistent performance regardless of types of QUDs, while GPT-2 encountered processing difficulties since a certain type of QUD required pragmatic inference for implicature. The findings revealed that, in terms of theoretical approaches, BERT inherently incorporates pragmatic implicature "not all" within the term "some", adhering to Default model (Levinson, 2000). In contrast, GPT-2 seems to encounter processing difficulties in inferring pragmatic implicature within context, consistent with Context-driven model (Sperber and Wilson, 2002).

8/14/2024

Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom

Shisen Yue, Siyuan Song, Xinyuan Cheng, Hai Hu

Understanding the non-literal meaning of an utterance is critical for large language models (LLMs) to become human-like social communicators. In this work, we introduce SwordsmanImp, the first Chinese multi-turn-dialogue-based dataset aimed at conversational implicature, sourced from dialogues in the Chinese sitcom My Own Swordsman. It includes 200 carefully handcrafted questions, all annotated on which Gricean maxims have been violated. We test eight close-source and open-source LLMs under two tasks: a multiple-choice question task and an implicature explanation task. Our results show that GPT-4 attains human-level accuracy (94%) on multiple-choice questions. CausalLM demonstrates a 78.5% accuracy following GPT-4. Other models, including GPT-3.5 and several open-source models, demonstrate a lower accuracy ranging from 20% to 60% on multiple-choice questions. Human raters were asked to rate the explanation of the implicatures generated by LLMs on their reasonability, logic and fluency. While all models generate largely fluent and self-consistent text, their explanations score low on reasonability except for GPT-4, suggesting that most LLMs cannot produce satisfactory explanations of the implicatures in the conversation. Moreover, we find LLMs' performance does not vary significantly by Gricean maxims, suggesting that LLMs do not seem to process implicatures derived from different maxims differently. Our data and code are available at https://github.com/sjtu-compling/llm-pragmatics.

8/1/2024

Experimental Pragmatics with Machines: Testing LLM Predictions for the Inferences of Plain and Embedded Disjunctions

Polina Tsvilodub, Paul Marty, Sonia Ramotowska, Jacopo Romoli, Michael Franke

Human communication is based on a variety of inferences that we draw from sentences, often going beyond what is literally said. While there is wide agreement on the basic distinction between entailment, implicature, and presupposition, the status of many inferences remains controversial. In this paper, we focus on three inferences of plain and embedded disjunctions, and compare them with regular scalar implicatures. We investigate this comparison from the novel perspective of the predictions of state-of-the-art large language models, using the same experimental paradigms as recent studies investigating the same inferences with humans. The results of our best performing models mostly align with those of humans, both in the large differences we find between those inferences and implicatures, as well as in fine-grained distinctions among different aspects of those inferences.

5/10/2024

Analyzing Narrative Processing in Large Language Models (LLMs): Using GPT4 to test BERT

Patrick Krauss, Jannik Hosch, Claus Metzner, Andreas Maier, Peter Uhrig, Achim Schilling

The ability to transmit and receive complex information via language is unique to humans and is the basis of traditions, culture and versatile social interactions. Through the disruptive introduction of transformer based large language models (LLMs) humans are not the only entity to understand and produce language any more. In the present study, we have performed the first steps to use LLMs as a model to understand fundamental mechanisms of language processing in neural networks, in order to make predictions and generate hypotheses on how the human brain does language processing. Thus, we have used ChatGPT to generate seven different stylistic variations of ten different narratives (Aesop's fables). We used these stories as input for the open source LLM BERT and have analyzed the activation patterns of the hidden units of BERT using multi-dimensional scaling and cluster analysis. We found that the activation vectors of the hidden units cluster according to stylistic variations in earlier layers of BERT (1) than narrative content (4-5). Despite the fact that BERT consists of 12 identical building blocks that are stacked and trained on large text corpora, the different layers perform different tasks. This is a very useful model of the human brain, where self-similar structures, i.e. different areas of the cerebral cortex, can have different functions and are therefore well suited to processing language in a very efficient way. The proposed approach has the potential to open the black box of LLMs on the one hand, and might be a further step to unravel the neural processes underlying human language processing and cognition in general.

5/6/2024