Language models align with human judgments on key grammatical constructions

Read original: arXiv:2402.01676 - Published 9/2/2024 by Jennifer Hu, Kyle Mahowald, Gary Lupyan, Anna Ivanova, Roger Levy

Overview

Language models (LMs) are large artificial intelligence systems trained on vast amounts of text data to generate human-like language.
Researchers examined how well these LMs align with human judgments on key grammatical constructions.
The study found that LMs generally perform well on these grammatical tests, suggesting they have acquired substantial human-like linguistic knowledge.

Plain English Explanation

Language models are powerful AI systems that can generate human-like text by learning patterns from huge datasets. Researchers wanted to see how well these LMs understand fundamental aspects of grammar compared to human intuitions.

They designed a series of grammatical tests that probe whether particular word combinations are acceptable. For example, a test might ask whether the sentence "The cat the dog chased ran away" sounds natural to people.

The LMs performed well on these grammatical tests, often aligning closely with human judgments. This suggests the LMs have acquired substantial linguistic knowledge, much as people intuitively grasp the rules of their native language.

Technical Explanation

The researchers evaluated the grammatical knowledge of several prominent language models, including GPT-3, InstructGPT, and BERT. They tested the models' ability to differentiate between acceptable and unacceptable grammatical constructions across a range of linguistic phenomena.

The experiments involved presenting the LMs and human participants with sentence pairs, and asking them to judge which sentence was more natural or grammatically correct. The sentence pairs were designed to test specific grammatical concepts, such as subject-verb agreement, wh-movement, and negative polarity items.
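One common way to run a paired comparison like this with a language model is to score both sentences and count the pair as correct when the acceptable sentence receives the higher score. The sketch below assumes that setup; the summary does not specify the exact scoring procedure. The tiny corpus, the `toy_score` function, and the example pairs are hypothetical stand-ins for a real model's log-probabilities and the study's stimuli.

```python
import math
from collections import Counter
from typing import Callable, Iterable, Tuple

# Hypothetical toy corpus standing in for a language model's training data.
CORPUS = [
    "the dog barks",
    "the dogs bark",
    "the cat sleeps",
    "the cats sleep",
]


def build_counts(corpus: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count tokens and adjacent token pairs, with a sentence-start marker."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


UNIGRAMS, BIGRAMS = build_counts(CORPUS)
VOCAB_SIZE = len(UNIGRAMS)


def toy_score(sentence: str) -> float:
    """Add-one-smoothed bigram log-probability (illustrative scorer only)."""
    tokens = ["<s>"] + sentence.split()
    return sum(
        math.log((BIGRAMS[(prev, cur)] + 1) / (UNIGRAMS[prev] + VOCAB_SIZE))
        for prev, cur in zip(tokens, tokens[1:])
    )


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the acceptable
    sentence gets the strictly higher score."""
    pairs = list(pairs)
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)


# Hypothetical subject-verb agreement pairs: (acceptable, unacceptable).
PAIRS = [
    ("the dog barks", "the dog bark"),
    ("the cats sleep", "the cats sleeps"),
]

accuracy = minimal_pair_accuracy(PAIRS, toy_score)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

Passing the scorer in as a function keeps the evaluation loop independent of any particular model, so the toy scorer could be swapped for one that returns a real model's sentence log-probability.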

The results showed that the LMs generally performed well on these grammatical tests, often matching or even exceeding human-level performance. Statistical modeling revealed that the models' grammatical judgments could be predicted by their language modeling capabilities, suggesting they have internalized substantial grammatical knowledge.
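The summary does not describe the statistical model that was used. As a rough illustration of what it means for model scores to track human judgments item by item, the sketch below computes a Pearson correlation between per-item model score differences and per-item human rating differences. All numbers and variable names are hypothetical and are not taken from the study.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-item values: model log-probability difference
# (acceptable minus unacceptable) and human rating difference.
model_score_diff = [2.1, 0.4, 1.5, -0.3, 3.0]
human_rating_diff = [0.8, 0.2, 0.5, 0.1, 0.9]

r = pearson_r(model_score_diff, human_rating_diff)
print(f"item-level correlation: {r:.3f}")
```

A high correlation on real data would indicate that items humans find more clearly contrasting are also the ones the model separates more strongly, which is a finer-grained check than overall accuracy.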

Critical Analysis

While the LMs performed impressively on the grammatical tests, the researchers note that this does not necessarily translate to a complete or nuanced understanding of language. The tests focused on specific, well-defined grammatical phenomena, and the models may still struggle with more complex, context-dependent aspects of language use.

Additionally, the researchers caution that the LMs' performance should not be interpreted as evidence of general human-like intelligence. The models' success on these tests may be due to their impressive pattern-matching capabilities, rather than a deeper comprehension of language.

Further research is needed to better understand the limitations and potential biases of language models, and to explore how their linguistic knowledge compares to the more holistic and flexible language processing capabilities of humans.

Conclusion

This study demonstrates that state-of-the-art language models have developed a substantial level of grammatical knowledge, often aligning closely with human intuitions on key grammatical constructions. However, this should not be interpreted as evidence of true language understanding or general intelligence. Ongoing research is needed to further explore the capabilities and limitations of these powerful AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Language models align with human judgments on key grammatical constructions

Jennifer Hu, Kyle Mahowald, Gary Lupyan, Anna Ivanova, Roger Levy

Do large language models (LLMs) make human-like linguistic generalizations? Dentella et al. (2023) (DGL) prompt several LLMs ("Is the following sentence grammatically correct in English?") to elicit grammaticality judgments of 80 English sentences, concluding that LLMs demonstrate a yes-response bias and a failure to distinguish grammatical from ungrammatical sentences. We re-evaluate LLM performance using well-established practices and find that DGL's data in fact provide evidence for just how well LLMs capture human behaviors. Models not only achieve high accuracy overall, but also capture fine-grained variation in human linguistic judgments.

9/2/2024

Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan

What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people's beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that -- especially for cases where the cost of mistakes is high -- more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.

6/4/2024

Testing AI on language comprehension tasks reveals insensitivity to underlying meaning

Vittoria Dentella, Fritz Guenther, Elliot Murphy, Gary Marcus, Evelina Leivada

Large Language Models (LLMs) are recruited in applications that span from clinical assistance and legal support to question answering and education. Their success in specialized tasks has led to the claim that they possess human-like linguistic capabilities related to compositional understanding and reasoning. Yet, reverse-engineering is bound by Moravec's Paradox, according to which easy skills are hard. We systematically assess 7 state-of-the-art models on a novel benchmark. Models answered a series of comprehension questions, each prompted multiple times in two settings, permitting one-word or open-length replies. Each question targets a short text featuring high-frequency linguistic constructions. To establish a baseline for achieving human-like performance, we tested 400 humans on the same prompts. Based on a dataset of n=26,680 datapoints, we discovered that LLMs perform at chance accuracy and waver considerably in their answers. Quantitatively, the tested models are outperformed by humans, and qualitatively their answers showcase distinctly non-human errors in language understanding. We interpret this evidence as suggesting that, despite their usefulness in various tasks, current AI models fall short of understanding language in a way that matches humans, and we argue that this may be due to their lack of a compositional operator for regulating grammatical and semantic information.

7/10/2024

A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?

Evelina Leivada, Gary Marcus, Fritz Gunther, Elliot Murphy

Modern Artificial Intelligence applications show great potential for language-related tasks that rely on next-word prediction. The current generation of Large Language Models (LLMs) have been linked to claims about human-like linguistic performance and their applications are hailed both as a step towards artificial general intelligence and as a major advance in understanding the cognitive, and even neural basis of human language. To assess these claims, first we analyze the contribution of LLMs as theoretically informative representations of a target cognitive system vs. atheoretical mechanistic tools. Second, we evaluate the models' ability to see the bigger picture, through top-down feedback from higher levels of processing, which requires grounding in previous expectations and past world experience. We hypothesize that since models lack grounded cognition, they cannot take advantage of these features and instead solely rely on fixed associations between represented words and word vectors. To assess this, we designed and ran a novel 'leet task' (l33t t4sk), which requires decoding sentences in which letters are systematically replaced by numbers. The results suggest that humans excel in this task whereas models struggle, confirming our hypothesis. We interpret the results by identifying the key abilities that are still missing from the current state of development of these models, which require solutions that go beyond increased system scaling.

9/5/2024