Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Read original: arXiv:2406.01382 - Published 6/4/2024 by Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan

Overview

This paper examines whether large language models (LLMs) perform the way people expect them to.
It proposes a framework for measuring the "human generalization function": having seen what an LLM gets right or wrong, how people generalize to where else it might succeed.
The researchers collected a dataset of 19K examples of human generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks, then evaluated how well LLMs align with those generalizations.

Plain English Explanation

This research paper explores whether large language models (LLMs) like GPT-4 perform the way people expect them to. The researchers developed a framework to measure the "human generalization function" - in other words, how people, after seeing what a model gets right or wrong, form beliefs about where else it will succeed.

The team then collected about 19K examples of how people make these generalizations, covering 79 tasks from the MMLU and BIG-Bench benchmarks, and checked how well LLMs line up with them. They found that people generalize in consistent, structured ways, but that the models do not always behave accordingly.

For example, the paper reports that more capable models such as GPT-4 can do worse on the instances people choose to use them for, especially when the cost of mistakes is high, precisely because they are not aligned with the human generalization function.

By uncovering these gaps between expectations and reality, the researchers hope to inform the development of LLMs that can better match human needs and intuitions. Understanding where people's mental models of AI diverge from actual capabilities is an important step towards building more transparent and trustworthy language technologies.

Technical Explanation

The key contribution of this paper is a framework for measuring the "human generalization function": how people, having seen what an LLM gets right or wrong, generalize to where else it might succeed. The setting is one where people decide where to deploy a model based on their beliefs about where it will perform well.

To measure this, the researchers collected a dataset of 19K examples of how humans make such generalizations. They then evaluated how well LLMs align with the human generalization function.

The dataset spans 79 tasks drawn from the MMLU and BIG-Bench benchmarks. The researchers show that the human generalization function can be predicted using NLP methods, which indicates that people generalize in consistent, structured ways.

The alignment results are less reassuring. Especially in cases where the cost of mistakes is high, more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.

By uncovering these gaps between expectation and reality, the authors hope to inform the development of LLMs that better match human needs and intuitions. Understanding where people's mental models of AI diverge from actual capabilities is an important step towards building more transparent and trustworthy language technologies.
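The gap between what people expect and how a model actually performs can be made concrete with a small sketch. The code below is not the authors' method: the `Instance` record, the belief values, and the threshold rule are all assumptions made up for illustration. It shows how a model with higher overall accuracy can still score lower on the subset of questions a person would choose to hand it, if that person's beliefs are poorly matched to where the model actually succeeds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """One question, a person's belief about the model, and the outcome.

    All fields are hypothetical and exist only for this illustration.
    """
    question: str
    human_belief: float   # person's belief (0..1) that the model answers correctly
    model_correct: bool   # whether the model actually answered correctly


def overall_accuracy(instances: List[Instance]) -> float:
    """Accuracy over every instance, regardless of human beliefs."""
    return sum(x.model_correct for x in instances) / len(instances)


def deployed_accuracy(instances: List[Instance], threshold: float) -> Optional[float]:
    """Accuracy on the instances a person would choose to use the model for.

    Assumption: a person uses the model only when their belief is at least
    `threshold`. A higher threshold stands in for a higher cost of mistakes.
    Returns None when no instance clears the threshold.
    """
    chosen = [x for x in instances if x.human_belief >= threshold]
    if not chosen:
        return None
    return sum(x.model_correct for x in chosen) / len(chosen)


# Toy data (invented): the same four questions and beliefs for two models.
model_a = [
    Instance("q1", 0.9, True),
    Instance("q2", 0.8, True),
    Instance("q3", 0.3, False),
    Instance("q4", 0.2, False),
]
model_b = [
    Instance("q1", 0.9, False),
    Instance("q2", 0.8, True),
    Instance("q3", 0.3, True),
    Instance("q4", 0.2, True),
]

for name, data in [("A", model_a), ("B", model_b)]:
    print(name, overall_accuracy(data), deployed_accuracy(data, threshold=0.7))
```

In this toy data, model B is correct on more questions overall, yet it is correct on fewer of the questions the person trusts it with. Real measurements would require actual human belief data rather than hand-picked numbers.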

Critical Analysis

The work leaves several areas open for further exploration. The dataset draws on tasks from two benchmarks, MMLU and BIG-Bench, so it would be valuable to test whether the findings extend to other kinds of tasks, to more diverse participant pools, and to languages other than English.

Additionally, the paper does not delve deeply into the specific reasons why people's expectations diverged from the models' actual performance. Understanding the cognitive biases, heuristics, and knowledge gaps that shape human perceptions of AI could yield valuable insights.

It would also be interesting to explore how factors like task framing, model transparency, and participant expertise might influence the human generalization function. For example, do people's expectations change if they are provided with more information about the underlying model architecture and training data?

Overall, this research represents an important step towards building AI systems that better align with human needs and expectations. By continuing to investigate the gaps between perception and reality, the field can work towards developing language models that are more transparent, trustworthy, and responsive to user requirements.

Conclusion

This paper proposes a framework for measuring the "human generalization function" - how people, having seen what an LLM gets right or wrong, generalize to where else it might succeed. Using a dataset of 19K human generalizations across 79 benchmark tasks, the researchers found that LLMs, including more capable ones such as GPT-4, are not always aligned with how people generalize.

By uncovering these discrepancies, the work aims to inform the development of language technologies that better match user needs and intuitions. Understanding where people's mental models of AI diverge from reality is a crucial step towards building more transparent and trustworthy systems.

As the field of large language models continues to advance, this type of research will be increasingly important in ensuring that the technology evolves in alignment with human expectations and values. The insights from this paper represent an important contribution towards that goal.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan

What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people's beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that -- especially for cases where the cost of mistakes is high -- more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.

6/4/2024

Language models align with human judgments on key grammatical constructions

Jennifer Hu, Kyle Mahowald, Gary Lupyan, Anna Ivanova, Roger Levy

Do large language models (LLMs) make human-like linguistic generalizations? Dentella et al. (2023) (DGL) prompt several LLMs (Is the following sentence grammatically correct in English?) to elicit grammaticality judgments of 80 English sentences, concluding that LLMs demonstrate a yes-response bias and a failure to distinguish grammatical from ungrammatical sentences. We re-evaluate LLM performance using well-established practices and find that DGL's data in fact provide evidence for just how well LLMs capture human behaviors. Models not only achieve high accuracy overall, but also capture fine-grained variation in human linguistic judgments.

9/2/2024

Large Language Models Assume People are More Rational than We Really are

Ryan Liu, Jiayi Geng, Joshua C. Peterson, Ilia Sucholutsky, Thomas L. Griffiths

In order for AI systems to communicate effectively with people, they must understand how we make decisions. However, people's decisions are not always rational, so the implicit internal models of human decision-making in Large Language Models (LLMs) must account for this. Previous empirical evidence seems to suggest that these implicit models are accurate -- LLMs offer believable proxies of human behavior, acting how we expect humans would in everyday interactions. However, by comparing LLM behavior and predictions to a large dataset of human decisions, we find that this is actually not the case: when both simulating and predicting people's choices, a suite of cutting-edge LLMs (GPT-4o & 4-Turbo, Llama-3-8B & 70B, Claude 3 Opus) assume that people are more rational than we really are. Specifically, these models deviate from human behavior and align more closely with a classic model of rational choice -- expected value theory. Interestingly, people also tend to assume that other people are rational when interpreting their behavior. As a consequence, when we compare the inferences that LLMs and people draw from the decisions of others using another psychological dataset, we find that these inferences are highly correlated. Thus, the implicit decision-making models of LLMs appear to be aligned with the human expectation that other people will act rationally, rather than with how people actually act.

7/31/2024

From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management

Ning Li, Huaikang Zhou, Mingze Xu

This study explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance objectivity in organizational task performance evaluations. Through comparative analyses across two studies, including various task performance outputs, we demonstrate that LLMs can serve as a reliable and even superior alternative to human raters in evaluating knowledge-based performance outputs, which are a key contribution of knowledge workers. Our results suggest that GPT ratings are comparable to human ratings but exhibit higher consistency and reliability. Additionally, combined multiple GPT ratings on the same performance output show strong correlations with aggregated human performance ratings, akin to the consensus principle observed in performance evaluation literature. However, we also find that LLMs are prone to contextual biases, such as the halo effect, mirroring human evaluative biases. Our research suggests that while LLMs are capable of extracting meaningful constructs from text-based data, their scope is currently limited to specific forms of performance evaluation. By highlighting both the potential and limitations of LLMs, our study contributes to the discourse on AI role in management studies and sets a foundation for future research to refine AI theoretical and practical applications in management.

8/13/2024