Temperature-scaling surprisal estimates improve fit to human reading times -- but does it do so for the right reasons?

Read original: arXiv:2311.09325 - Published 7/4/2024 by Tong Liu, Iza Škrjanec, Vera Demberg

Overview

This paper investigates the relationship between the information-theoretic measure of surprisal, which predicts human language processing difficulty, and the predictions of large language models (LLMs).
The researchers explore how temperature-scaling the probabilities predicted by LLMs can improve the models' ability to estimate word surprisal and better predict human reading times.
The paper also examines how well calibrated LLMs are, and whether adjusting the models' confidence levels leads to better psycholinguistic predictive power.

Plain English Explanation

When we read or listen to language, some words are more surprising or difficult to process than others. Previous research has shown that a measure called "surprisal" - the negative log probability of a word in its context - can predict how hard a word is for humans to process.
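As a minimal sketch of the definition above, surprisal can be computed directly from a word's probability in context. The function name and the probabilities below are made up for illustration and are not taken from the paper; base 2 is used so the result is in bits.

```python
import math

def surprisal(probability: float) -> float:
    """Surprisal in bits: the negative base-2 log of a word's probability in context."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

# Hypothetical probabilities for two continuations of the same context.
expected_word = surprisal(0.5)      # a likely word
unexpected_word = surprisal(0.001)  # an unlikely word
print(expected_word, unexpected_word)
```

The less probable the word, the higher its surprisal, which is the quantity that is then related to reading times.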

However, it's not clear how best to estimate the probabilities that determine surprisal. Past work has suggested that language models with lower "perplexity" (a measure of how well they predict language) would make better surprisal estimates. But recent research has found that very large language models are actually worse at predicting human reading times.

One reason for this could be that large language models are more confident in their predictions than humans are, because they've seen much more data. This paper explores whether adjusting the confidence of these models, using a technique called "temperature scaling," can improve their ability to estimate surprisal and predict reading times.
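Temperature scaling can be sketched as dividing a model's raw scores (logits) by a temperature T before the softmax. The example below is a generic illustration, not the authors' code; the function name and the logit values are assumptions chosen for the example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, dividing by a temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next words.
logits = [4.0, 2.0, 0.0]
unscaled = softmax_with_temperature(logits, temperature=1.0)
flattened = softmax_with_temperature(logits, temperature=2.0)
print(unscaled)
print(flattened)
```

With a temperature above 1 the distribution becomes flatter: the top candidate gets less probability (higher surprisal) and the unlikely candidates get more (lower surprisal), which corresponds to a less confident model.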

Technical Explanation

The researchers first show that the calibration (how well-aligned the model's confidence is with its accuracy) of large language models generally improves as the models get larger. This rules out poor calibration as the reason for the decreased psycholinguistic predictive power of very large models.
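One common way to quantify calibration is the expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. This summary does not specify the exact metric used in the paper, so the sketch below is an assumption about how such a measurement might look; the function name, the bin count, and the data are all hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical data: the model is 90% confident each time but right only once in four.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
print(overconfident)
```

For a language model, the confidence would typically be the probability assigned to the top predicted token and the correctness whether that token matches the actual next token. A lower value means better calibration.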

Next, the researchers find that temperature-scaling the probabilities predicted by large language models leads to a systematically better fit to human reading times, across several reading time corpora. The improvement is up to 89% in delta log likelihood, a measure of how much adding surprisal as a predictor improves a reading-time regression model over a baseline without it.
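To make the reported figure more concrete, the sketch below shows how a relative improvement in delta log likelihood could be computed. The formula for the percentage is an assumption about how such a number might be derived, and all log-likelihood values are invented for illustration; none of them come from the paper.

```python
def delta_log_likelihood(ll_with_surprisal: float, ll_baseline: float) -> float:
    """Gain in log likelihood from adding surprisal to a baseline regression model."""
    return ll_with_surprisal - ll_baseline

def percent_improvement(delta_scaled: float, delta_unscaled: float) -> float:
    """Relative change in delta log likelihood (assumed definition, for illustration)."""
    if delta_unscaled == 0:
        raise ValueError("unscaled delta log likelihood must be non-zero")
    return 100.0 * (delta_scaled - delta_unscaled) / abs(delta_unscaled)

# Hypothetical log likelihoods of three regression models on the same reading-time data.
ll_baseline = -1000.0  # no surprisal predictor
ll_unscaled = -990.0   # surprisal from the unscaled model (T = 1)
ll_scaled = -985.0     # surprisal from the temperature-scaled model

gain_unscaled = delta_log_likelihood(ll_unscaled, ll_baseline)
gain_scaled = delta_log_likelihood(ll_scaled, ll_baseline)
print(percent_improvement(gain_scaled, gain_unscaled))
```

In this made-up case the scaled surprisal adds 15 units of log likelihood over the baseline instead of 10, which counts as a 50% improvement under the assumed definition.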

Finally, the researchers show that this improvement is chiefly driven by words composed of multiple subword tokens. Because the gain is concentrated in these words, it remains open whether temperature scaling helps because it corrects overconfident predictions in general, which is the question raised in the paper's title.
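Language models usually split rarer words into several subword tokens, and by the chain rule of probability the surprisal of the whole word is the sum of the surprisals of its tokens. The following is a minimal sketch of that aggregation; the function name and the token probabilities are hypothetical.

```python
import math

def word_surprisal(subword_probs):
    """Surprisal of a word in bits, summed over the probabilities of its subword tokens."""
    for p in subword_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("each probability must be in (0, 1]")
    return sum(-math.log2(p) for p in subword_probs)

# Hypothetical word split into two subword tokens with these conditional probabilities.
multi_token = word_surprisal([0.5, 0.25])
# The same total probability assigned to a single-token word.
single_token = word_surprisal([0.125])
print(multi_token, single_token)
```

Since every token of a multi-token word is affected by temperature scaling, such words accumulate the effect of the rescaling across several terms of the sum.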

Critical Analysis

The paper provides a valuable contribution by demonstrating a simple technique - temperature scaling - that can substantially improve the ability of large language models to estimate human language processing difficulty. This is an important step towards better aligning these powerful AI systems with human cognitive profiles.

However, the paper does not explore the reasons behind the decreased psycholinguistic predictive power of very large language models in depth. While the authors rule out poor calibration as a factor, there may be other model-specific or data-related reasons that warrant further investigation.

Additionally, the paper focuses on reading time as the sole measure of language processing difficulty. Other psycholinguistic measures, such as event-related potentials, could provide a more comprehensive understanding of how well language models capture human language processing.

Further research could also investigate whether temperature scaling has similar benefits for other language-related tasks, such as predicting human judgments of sentence naturalness or complexity.

Conclusion

This paper demonstrates that adjusting the confidence of large language models through temperature scaling can systematically improve how well their surprisal estimates predict human reading times. This suggests that the mismatch between language model predictions and human reading behavior may be, at least in part, a matter of model confidence rather than of poor calibration, which the paper finds generally improves with model size.

By bridging the gap between AI language models and human language processing, this research brings us closer to developing AI systems that can better understand and interact with human users. Prior work on the scaling properties of language models, combined with the insights from this paper, highlights the importance of carefully adjusting the confidence of these powerful AI systems when using them to model human cognition.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Temperature-scaling surprisal estimates improve fit to human reading times -- but does it do so for the right reasons?

Tong Liu, Iza Škrjanec, Vera Demberg

A wide body of evidence shows that human language processing difficulty is predicted by the information-theoretic measure surprisal, a word's negative log probability in context. However, it is still unclear how to best estimate these probabilities needed for predicting human processing difficulty -- while a long-standing belief held that models with lower perplexity would provide more accurate estimates of word predictability, and therefore lead to better reading time predictions, recent work has shown that for very large models, psycholinguistic predictive power decreases. One reason could be that language models might be more confident of their predictions than humans, because they have had exposure to several magnitudes more data. In this paper, we test what effect temperature-scaling of large language model (LLM) predictions has on surprisal estimates and their predictive power of reading times of English texts. Firstly, we show that calibration of large language models typically improves with model size, i.e. poorer calibration cannot account for poorer fit to reading times. Secondly, we find that temperature-scaling probabilities lead to a systematically better fit to reading times (up to 89% improvement in delta log likelihood), across several reading time corpora. Finally, we show that this improvement in fit is chiefly driven by words that are composed of multiple subword tokens.

7/4/2024

Language models emulate certain cognitive profiles: An investigation of how predictability measures interact with individual differences

Patrick Haller, Lena S. Bolliger, Lena A. Jäger

To date, most investigations on surprisal and entropy effects in reading have been conducted on the group level, disregarding individual differences. In this work, we revisit the predictive power of surprisal and entropy measures estimated from a range of language models (LMs) on data of human reading times as a measure of processing effort by incorporating information of language users' cognitive capacities. To do so, we assess the predictive power of surprisal and entropy estimated from generative LMs on reading data obtained from individuals who also completed a wide range of psychometric tests. Specifically, we investigate if modulating surprisal and entropy relative to cognitive scores increases prediction accuracy of reading times, and we examine whether LMs exhibit systematic biases in the prediction of reading times for cognitively high- or low-performing groups, revealing what type of psycholinguistic subject a given LM emulates. Our study finds that in most cases, incorporating cognitive capacities increases predictive power of surprisal and entropy on reading times, and that generally, high performance in the psychometric tests is associated with lower sensitivity to predictability effects. Finally, our results suggest that the analyzed LMs emulate readers with lower verbal intelligence, suggesting that for a given target group (i.e., individuals with high verbal intelligence), these LMs provide less accurate predictability estimates.

8/6/2024

Testing the Predictions of Surprisal Theory in 11 Languages

Ethan Gotlieb Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, Roger P. Levy

A fundamental result in psycholinguistics is that less predictable words take a longer time to process. One theoretical explanation for this finding is Surprisal Theory (Hale, 2001; Levy, 2008), which quantifies a word's predictability as its surprisal, i.e. its negative log-probability given a context. While evidence supporting the predictions of Surprisal Theory have been replicated widely, most have focused on a very narrow slice of data: native English speakers reading English texts. Indeed, no comprehensive multilingual analysis exists. We address this gap in the current literature by investigating the relationship between surprisal and reading times in eleven different languages, distributed across five language families. Deriving estimates from language models trained on monolingual and multilingual corpora, we test three predictions associated with surprisal theory: (i) whether surprisal is predictive of reading times; (ii) whether expected surprisal, i.e. contextual entropy, is predictive of reading times; (iii) and whether the linking function between surprisal and reading times is linear. We find that all three predictions are borne out crosslinguistically. By focusing on a more diverse set of languages, we argue that these results offer the most robust link to-date between information theory and incremental language processing across languages.

9/12/2024

Psychometric Predictive Power of Large Language Models

Tatsuki Kuribayashi, Yohei Oseki, Timothy Baldwin

Instruction tuning aligns the response of large language models (LLMs) with human preferences. Despite such efforts in human--LLM alignment, we find that instruction tuning does not always make LLMs human-like from a cognitive modeling perspective. More specifically, next-word probabilities estimated by instruction-tuned LLMs are often worse at simulating human reading behavior than those estimated by base LLMs. In addition, we explore prompting methodologies for simulating human reading behavior with LLMs. Our results show that prompts reflecting a particular linguistic hypothesis improve psychometric predictive power, but are still inferior to small base models. These findings highlight that recent advancements in LLMs, i.e., instruction tuning and prompting, do not offer better estimates than direct probability measurements from base LLMs in cognitive modeling. In other words, pure next-word probability remains a strong predictor for human reading behavior, even in the age of LLMs.

4/16/2024