Language models are better than humans at next-token prediction

Read original: arXiv:2212.11281 - Published 7/16/2024 by Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean


Overview

Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code
However, language models are not trained to perform well at these tasks; they are trained to accurately predict the next token given the previous tokens in tokenized text
It is unclear whether language models are better or worse than humans at next-token prediction
Two experiments were performed to directly compare humans and language models on next-token prediction: one measuring top-1 accuracy and one measuring perplexity

Plain English Explanation

Language models like GPT-3 are trained to predict the next word or "token" in a sequence of text. This is different from having them perform tasks like answering questions or writing code, which they are not directly trained for. It's unclear whether these language models are actually better or worse than humans at just predicting the next token.

To test this, researchers ran two experiments to directly compare human and language model performance on next-token prediction. One experiment looked at top-1 accuracy, meaning how often the model or human correctly predicted the very next token. The other experiment measured "perplexity", which is a way to quantify how surprised or uncertain the model or human is about the next token.

Surprisingly, the researchers found that even relatively small language models like the GPT-3 Ada model consistently outperformed humans on both of these next-token prediction metrics. Humans were worse than the language models at this specific task, which is the one objective the models are directly trained on.

Technical Explanation

The researchers conducted two experiments to directly compare humans and language models on next-token prediction:

Top-1 Accuracy: They measured how often the model or human correctly predicted the very next token in a sequence.
Perplexity: They looked at a metric called perplexity, which quantifies how uncertain or "surprised" the model or human is about the next token. Lower perplexity indicates better performance.
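
For reference, the two metrics are conventionally defined as below. The notation is an assumption introduced here for illustration (N scored positions, x_i the true token at position i, p the predictor's distribution over the next token); this summary does not state the paper's exact formulation or how probabilities were elicited from human participants.

```latex
% Top-1 accuracy: fraction of positions where the most probable
% predicted token equals the true token.
\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N}
  \mathbf{1}\!\left[ \arg\max_{x} \, p(x \mid x_{<i}) = x_i \right]

% Perplexity: exponential of the average negative log-probability
% assigned to the true token. Lower is better.
\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N}
  \log p(x_i \mid x_{<i}) \right)
```

Under these definitions, a predictor that always assigns probability 1/4 to the true token has a perplexity of exactly 4, which is why perplexity is often read as an "effective number of equally likely choices".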

In both experiments, even relatively small language models like GPT-3 Ada consistently outperformed humans. Humans were worse at this specific task, which, unlike question-answering or code generation, is the objective the language models are directly trained on.
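
A minimal sketch of how the two metrics can be computed is shown below. The function names and all of the toy data (tokens, guesses, probabilities) are hypothetical and invented for illustration; they are not results or code from the paper, and the paper's actual procedure for obtaining human probabilities is not described in this summary.

```python
import math


def top1_accuracy(guesses, targets):
    """Fraction of positions where the single best guess equals the true token."""
    if len(guesses) != len(targets):
        raise ValueError("guesses and targets must have the same length")
    correct = sum(g == t for g, t in zip(guesses, targets))
    return correct / len(targets)


def perplexity(true_token_probs):
    """exp of the mean negative log-probability assigned to each true token."""
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


# Hypothetical toy data, NOT taken from the paper.
targets = ["the", "cat", "sat", "down"]
human_guesses = ["the", "dog", "sat", "on"]
model_guesses = ["the", "cat", "sat", "on"]

# Probability each predictor assigned to the true token at each position.
human_probs = [0.5, 0.1, 0.4, 0.05]
model_probs = [0.6, 0.3, 0.5, 0.1]

if __name__ == "__main__":
    print("human top-1:", top1_accuracy(human_guesses, targets))
    print("model top-1:", top1_accuracy(model_guesses, targets))
    print("human perplexity:", perplexity(human_probs))
    print("model perplexity:", perplexity(model_probs))
```

Note that the two metrics need different inputs: top-1 accuracy only needs one guess per position, while perplexity needs a probability for the true token, which is much harder to obtain from a human than from a model.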

Critical Analysis

The paper acknowledges that language models are not optimized for tasks like question-answering or code generation, but rather for the narrower objective of next-token prediction. This raises the question of whether these results reflect a fundamental capability gap between humans and language models, or whether they are an artifact of the specific task being measured.

Additionally, the paper does not explore why humans performed worse than language models on next-token prediction. It's possible that factors like attention, memory, or rapid processing speed give language models an advantage in this particular task. Further research would be needed to better understand the underlying reasons.

It's also worth considering whether these findings would hold true for more complex, context-dependent language tasks that require deeper understanding, reasoning, and generation. The superiority of language models may be limited to narrow, token-level prediction.

Conclusion

This research suggests that current language models, even relatively small ones, have surpassed human capabilities when it comes to the specific task of next-token prediction. While language models are not optimized for more complex natural language tasks, this study highlights the remarkable token-level performance of these models compared to humans.

These findings could have implications for the development of more powerful language AI systems and our understanding of the strengths and limitations of both human and machine language processing. However, further research is needed to explore the broader implications and understand the underlying reasons for the performance gap observed in this study.



Related Papers


Language models are better than humans at next-token prediction

Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean

Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code. However, language models are not trained to perform well at these tasks, they are trained to accurately predict the next token given previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next token prediction. To try to answer this question, we performed two distinct experiments to directly compare humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity. In both experiments, we find humans to be consistently worse than even relatively small language models like GPT3-Ada at next-token prediction.

7/16/2024


Testing AI on language comprehension tasks reveals insensitivity to underlying meaning

Vittoria Dentella, Fritz Guenther, Elliot Murphy, Gary Marcus, Evelina Leivada

Large Language Models (LLMs) are recruited in applications that span from clinical assistance and legal support to question answering and education. Their success in specialized tasks has led to the claim that they possess human-like linguistic capabilities related to compositional understanding and reasoning. Yet, reverse-engineering is bound by Moravec's Paradox, according to which easy skills are hard. We systematically assess 7 state-of-the-art models on a novel benchmark. Models answered a series of comprehension questions, each prompted multiple times in two settings, permitting one-word or open-length replies. Each question targets a short text featuring high-frequency linguistic constructions. To establish a baseline for achieving human-like performance, we tested 400 humans on the same prompts. Based on a dataset of n=26,680 datapoints, we discovered that LLMs perform at chance accuracy and waver considerably in their answers. Quantitatively, the tested models are outperformed by humans, and qualitatively their answers showcase distinctly non-human errors in language understanding. We interpret this evidence as suggesting that, despite their usefulness in various tasks, current AI models fall short of understanding language in a way that matches humans, and we argue that this may be due to their lack of a compositional operator for regulating grammatical and semantic information.

7/10/2024


Better & Faster Large Language Models via Multi-token Prediction

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve

Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.

5/1/2024


A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?

Evelina Leivada, Gary Marcus, Fritz Gunther, Elliot Murphy

Modern Artificial Intelligence applications show great potential for language-related tasks that rely on next-word prediction. The current generation of Large Language Models (LLMs) have been linked to claims about human-like linguistic performance and their applications are hailed both as a step towards artificial general intelligence and as a major advance in understanding the cognitive, and even neural basis of human language. To assess these claims, first we analyze the contribution of LLMs as theoretically informative representations of a target cognitive system vs. atheoretical mechanistic tools. Second, we evaluate the models' ability to see the bigger picture, through top-down feedback from higher levels of processing, which requires grounding in previous expectations and past world experience. We hypothesize that since models lack grounded cognition, they cannot take advantage of these features and instead solely rely on fixed associations between represented words and word vectors. To assess this, we designed and ran a novel 'leet task' (l33t t4sk), which requires decoding sentences in which letters are systematically replaced by numbers. The results suggest that humans excel in this task whereas models struggle, confirming our hypothesis. We interpret the results by identifying the key abilities that are still missing from the current state of development of these models, which require solutions that go beyond increased system scaling.

9/5/2024