Testing the Predictions of Surprisal Theory in 11 Languages

Read original: arXiv:2307.03667 - Published 9/12/2024 by Ethan Gotlieb Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, Roger P. Levy

Overview

  • This paper tests the predictions of surprisal theory, which suggests that language processing difficulty is related to the predictability of words in context.
  • The researchers analyzed reading times in 11 languages from five language families to determine whether surprisal, estimated with language models, predicts processing difficulty.
  • The study provides empirical evidence supporting surprisal theory as a general principle of language processing.

Plain English Explanation

When we read or listen to language, our brains constantly try to predict what word will come next based on the context. Surprisal theory suggests that the more surprising or unpredictable a word is, the more difficult it is for our brains to process.

In this study, the researchers wanted to test whether surprisal theory holds across a wide range of languages. They analyzed reading times for people reading text in 11 languages drawn from five language families. They found that surprisal was predictive of reading times across these languages, supporting the idea that predictability is a fundamental principle of how our brains process language, regardless of the specific language.

Technical Explanation

The researchers used reading times as their measure of processing difficulty: the longer a reader spends on a word, the harder that word is taken to be to process. They compared these reading times to the surprisal values assigned to each word by language models trained on monolingual and multilingual corpora.

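Surprisal theory proposes that words that are less predictable in their context require more mental effort to process, resulting in longer reading times. It quantifies a word's predictability as its surprisal, i.e. its negative log-probability given the context. The researchers tested three predictions in 11 languages distributed across five language families: (i) that surprisal is predictive of reading times; (ii) that expected surprisal, i.e. contextual entropy, is predictive of reading times; and (iii) that the linking function between surprisal and reading times is linear.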
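
To make the two quantities concrete, here is a minimal Python sketch of surprisal and contextual entropy computed from a next-word distribution. The distribution, the example context, and the choice of base-2 logarithms (bits) are assumptions made for illustration; in the paper the probabilities come from trained language models.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits: the negative log (base 2) probability of a word."""
    return -math.log2(prob)


def contextual_entropy(dist: dict[str, float]) -> float:
    """Expected surprisal over a next-word distribution, in bits."""
    return sum(p * surprisal(p) for p in dist.values() if p > 0)


# Hypothetical next-word distribution after the context "The cat sat on the".
next_word = {"mat": 0.5, "sofa": 0.25, "roof": 0.125, "moon": 0.125}

for word, p in next_word.items():
    print(f"{word}: {surprisal(p):.2f} bits")
print(f"contextual entropy: {contextual_entropy(next_word):.2f} bits")
```

Surprisal is a property of the word that actually appears, while contextual entropy depends only on the context: it measures how uncertain the reader is before the next word arrives.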

The results showed that surprisal was predictive of reading times across the 11 languages, that contextual entropy (expected surprisal) was also predictive, and that the linking function between surprisal and reading times was linear. This provides strong evidence that surprisal is a general principle of language processing, rather than something specific to a particular language.
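
One of the predictions the paper tests is that the linking function between surprisal and reading times is linear, meaning each additional unit of surprisal adds a constant amount of reading time. The sketch below fits such a line by ordinary least squares. The data points are made up for illustration, and the fitting routine is a generic one, not the statistical model used in the paper, which this summary does not describe.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up (surprisal in bits, reading time in ms) pairs, for illustration only.
surprisals = [1.0, 2.0, 3.0, 4.0, 5.0]
reading_times = [210.0, 220.0, 230.0, 240.0, 250.0]

intercept, slope = fit_line(surprisals, reading_times)
print(f"reading_time = {intercept:.1f} + {slope:.1f} * surprisal")
```

Under a linear linking function the slope is the reading-time cost per bit of surprisal. A nonlinear linking function would instead show up as systematic curvature in the residuals of this fit.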

Critical Analysis

One limitation of the study is that it only examined reading times, which may not capture all aspects of language processing difficulty. Other measures, such as neuroimaging data, could provide additional insights.

Additionally, the study focused on surprisal based on statistical language models, which may not fully capture the complex contextual and semantic factors that influence human language processing. Further research could explore other models of predictability, such as those based on cognitive or linguistic theories.

Overall, this study makes a compelling case for surprisal theory as a general principle of language processing, but there is still more work to be done to fully understand the cognitive mechanisms underlying this phenomenon.


Conclusion

This paper provides strong empirical support for the idea that the predictability of words in context is a fundamental driver of language processing difficulty. By analyzing reading times across 11 diverse languages, the researchers have demonstrated the broad applicability of surprisal theory as a general principle of human language processing.

These findings have important implications for our understanding of how the brain handles the complex task of comprehending language. The study suggests that the ability to anticipate upcoming words is a crucial aspect of efficient language processing, which could inform theories of language acquisition, bilingualism, and disorders affecting language abilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Testing the Predictions of Surprisal Theory in 11 Languages

Ethan Gotlieb Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, Roger P. Levy

A fundamental result in psycholinguistics is that less predictable words take a longer time to process. One theoretical explanation for this finding is Surprisal Theory (Hale, 2001; Levy, 2008), which quantifies a word's predictability as its surprisal, i.e. its negative log-probability given a context. While evidence supporting the predictions of Surprisal Theory have been replicated widely, most have focused on a very narrow slice of data: native English speakers reading English texts. Indeed, no comprehensive multilingual analysis exists. We address this gap in the current literature by investigating the relationship between surprisal and reading times in eleven different languages, distributed across five language families. Deriving estimates from language models trained on monolingual and multilingual corpora, we test three predictions associated with surprisal theory: (i) whether surprisal is predictive of reading times; (ii) whether expected surprisal, i.e. contextual entropy, is predictive of reading times; (iii) and whether the linking function between surprisal and reading times is linear. We find that all three predictions are borne out crosslinguistically. By focusing on a more diverse set of languages, we argue that these results offer the most robust link to-date between information theory and incremental language processing across languages.

Language models emulate certain cognitive profiles: An investigation of how predictability measures interact with individual differences

Patrick Haller, Lena S. Bolliger, Lena A. Jäger

To date, most investigations on surprisal and entropy effects in reading have been conducted on the group level, disregarding individual differences. In this work, we revisit the predictive power of surprisal and entropy measures estimated from a range of language models (LMs) on data of human reading times as a measure of processing effort by incorporating information of language users' cognitive capacities. To do so, we assess the predictive power of surprisal and entropy estimated from generative LMs on reading data obtained from individuals who also completed a wide range of psychometric tests. Specifically, we investigate if modulating surprisal and entropy relative to cognitive scores increases prediction accuracy of reading times, and we examine whether LMs exhibit systematic biases in the prediction of reading times for cognitively high- or low-performing groups, revealing what type of psycholinguistic subject a given LM emulates. Our study finds that in most cases, incorporating cognitive capacities increases predictive power of surprisal and entropy on reading times, and that generally, high performance in the psychometric tests is associated with lower sensitivity to predictability effects. Finally, our results suggest that the analyzed LMs emulate readers with lower verbal intelligence, suggesting that for a given target group (i.e., individuals with high verbal intelligence), these LMs provide less accurate predictability estimates.

Temperature-scaling surprisal estimates improve fit to human reading times -- but does it do so for the right reasons?

Tong Liu, Iza Škrjanec, Vera Demberg

A wide body of evidence shows that human language processing difficulty is predicted by the information-theoretic measure surprisal, a word's negative log probability in context. However, it is still unclear how to best estimate these probabilities needed for predicting human processing difficulty -- while a long-standing belief held that models with lower perplexity would provide more accurate estimates of word predictability, and therefore lead to better reading time predictions, recent work has shown that for very large models, psycholinguistic predictive power decreases. One reason could be that language models might be more confident of their predictions than humans, because they have had exposure to several magnitudes more data. In this paper, we test what effect temperature-scaling of large language model (LLM) predictions has on surprisal estimates and their predictive power of reading times of English texts. Firstly, we show that calibration of large language models typically improves with model size, i.e. poorer calibration cannot account for poorer fit to reading times. Secondly, we find that temperature-scaling probabilities lead to a systematically better fit to reading times (up to 89% improvement in delta log likelihood), across several reading time corpora. Finally, we show that this improvement in fit is chiefly driven by words that are composed of multiple subword tokens.
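
The abstract above does not spell out the scaling formula. The sketch below assumes the standard formulation of temperature scaling, in which each probability is raised to the power 1/T and the result is renormalized (equivalent to dividing the logits by T before the softmax). The probabilities are hypothetical.

```python
import math


def temperature_scale(probs, temperature):
    """Rescale a distribution so that q_i is proportional to p_i ** (1 / T)."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [q / total for q in powered]


def surprisal(prob):
    """Surprisal in bits (base-2 logarithm is an assumption of this sketch)."""
    return -math.log2(prob)


# Hypothetical next-word probabilities from a very confident model.
probs = [0.9, 0.05, 0.05]

for t in (1.0, 2.0):
    scaled = temperature_scale(probs, t)
    print(t, [round(surprisal(q), 2) for q in scaled])
```

With T greater than 1 the distribution flattens: the surprisal of the most likely word rises and the surprisal of unlikely words falls, which is one way to make an overconfident model less confident.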


On the Role of Context in Reading Time Prediction

Andreas Opedal, Eleanor Chodroff, Ryan Cotterell, Ethan Gotlieb Wilcox

We present a new perspective on how readers integrate context during real-time language comprehension. Our proposals build on surprisal theory, which posits that the processing effort of a linguistic unit (e.g., a word) is an affine function of its in-context information content. We first observe that surprisal is only one out of many potential ways that a contextual predictor can be derived from a language model. Another one is the pointwise mutual information (PMI) between a unit and its context, which turns out to yield the same predictive power as surprisal when controlling for unigram frequency. Moreover, both PMI and surprisal are correlated with frequency. This means that neither PMI nor surprisal contains information about context alone. In response to this, we propose a technique where we project surprisal onto the orthogonal complement of frequency, yielding a new contextual predictor that is uncorrelated with frequency. Our experiments show that the proportion of variance in reading times explained by context is a lot smaller when context is represented by the orthogonalized predictor. From an interpretability standpoint, this indicates that previous studies may have overstated the role that context has in predicting reading times.
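
The projection described in this abstract can be sketched as the residual left after removing from surprisal its component along frequency. The version below centers both vectors first so that the result is uncorrelated with frequency; the centering step, the helper names, and the per-word values are assumptions of this sketch, not details taken from the paper.

```python
def center(values):
    """Subtract the mean so that dot products behave like covariances."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(surprisal_values, frequency_values):
    """Remove from centered surprisal its component along centered frequency."""
    s = center(surprisal_values)
    f = center(frequency_values)
    coef = dot(s, f) / dot(f, f)
    return [si - coef * fi for si, fi in zip(s, f)]


# Made-up per-word values, for illustration only.
surprisal_values = [2.0, 5.0, 3.5, 8.0, 6.0]
neg_log_frequency = [3.0, 6.0, 5.0, 9.0, 6.5]

residual = orthogonalize(surprisal_values, neg_log_frequency)
# The residual has (numerically) zero covariance with frequency.
print(abs(dot(residual, center(neg_log_frequency))) < 1e-9)
```

Because the residual carries no linear information about frequency, any reading-time variance it explains can be attributed to context rather than to how common the word is.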

