LLMs are Not Just Next Token Predictors

Read original: arXiv:2408.04666 - Published 8/12/2024 by Stephen M. Downes, Patrick Forber, Alex Grzankowski

Overview

Large language models (LLMs) are statistical models that learn to predict the next word in a sequence of text through stochastic gradient descent.
A common view among AI modelers is that LLMs are just next token predictors; the authors argue that this reduction sells LLMs short.
There are important explanations of LLM behavior and capabilities that are lost when they are reduced to just next token predictors.
The paper makes an analogy with a once prominent research program in biology explaining evolution and development from the gene's eye view.

Plain English Explanation

Large language models (LLMs) are AI systems that have been trained on massive amounts of text data to predict the next word in a sequence. Training uses an optimization technique called stochastic gradient descent, which repeatedly adjusts the model's parameters to reduce its prediction error on small batches of text.
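To make the training idea concrete, here is a minimal sketch of next-token prediction learned by stochastic gradient descent. It is a toy character-level bigram model, not an LLM and not anything taken from the paper; the function names `train_bigram` and `predict_next`, the learning rate, and the epoch count are all hypothetical choices made for illustration.

```python
import math
import random


def train_bigram(text, epochs=200, lr=0.5, seed=0):
    """Fit a toy bigram next-character model with SGD on cross-entropy loss.

    Illustrative only: a real LLM conditions on the whole context,
    not just the previous token.
    """
    vocab = sorted(set(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    n = len(vocab)
    rng = random.Random(seed)
    # logits[i][j] is the unnormalized score for token j following token i.
    logits = [[0.0] * n for _ in range(n)]
    pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]
    for _ in range(epochs):
        rng.shuffle(pairs)  # "stochastic": visit training pairs in random order
        for prev, nxt in pairs:
            row = logits[prev]
            top = max(row)
            exps = [math.exp(v - top) for v in row]  # numerically stable softmax
            total = sum(exps)
            probs = [e / total for e in exps]
            # Gradient of -log p(nxt | prev) w.r.t. the logits is probs - one_hot.
            for j in range(n):
                grad = probs[j] - (1.0 if j == nxt else 0.0)
                row[j] -= lr * grad
    return vocab, logits


def predict_next(vocab, logits, prev_char):
    """Return the most likely next character after prev_char."""
    row = logits[vocab.index(prev_char)]
    best = max(range(len(vocab)), key=row.__getitem__)
    return vocab[best]


if __name__ == "__main__":
    vocab, logits = train_bigram("abababababab")
    print(predict_next(vocab, logits, "a"))  # prints "b"
```

The model here is trained only on how well it predicts the next character, which is the sense in which LLMs are "engineered using next token prediction." The paper's question is whether that training description is also the right description of what the resulting system does.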

While it's true that LLMs are engineered to be good at predicting the next word, the authors of this paper argue that this doesn't fully capture the power and complexity of these models. There are important insights about how LLMs work and what they are capable of that get lost when we view them as just next word predictors.

To illustrate this point, the authors make an analogy to a research program in biology that used to explain evolution and development from the "gene's eye view." Just as that biological perspective didn't capture the full complexity of living systems, the authors argue that the "next word predictor" view of LLMs is an oversimplification.

Technical Explanation

The paper makes the case that while LLMs are engineered using next token prediction, and trained based on their success at this task, a reduction to just next token predictor sells LLMs short. The authors argue that there are important explanations of LLM behavior and capabilities that are lost when we engage in this kind of reduction.
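One way to make "success at this task" concrete is the standard autoregressive cross-entropy objective, minimized by stochastic gradient descent. The notation below is an assumption for illustration and is not taken from the paper: $x_1, \dots, x_T$ is a token sequence, $\theta$ the model parameters, $\eta$ a learning rate, and $\mathcal{L}_B$ the same loss computed on a mini-batch $B$.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right),
\qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}_B(\theta)
```

The objective only scores the probability assigned to each next token. The authors' claim is that a description at this level leaves out explanations of LLM behavior that matter.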

To illustrate this point, the authors draw an analogy with a once prominent research program in biology that sought to explain evolution and development from the gene's eye view. Just as that biological perspective did not capture the full complexity of living systems, the authors argue that the "next word predictor" view of LLMs is an oversimplification that fails to account for the rich dynamics and emergent behaviors of these powerful AI models.

Critical Analysis

The paper raises an important point about the limitations of viewing LLMs solely as next word predictors. While this simplification may be useful in some contexts, it can overlook crucial aspects of how these models work and what they are capable of.

The authors' analogy to the "gene's eye view" in biology is thought-provoking, as it suggests that the reductionist approach to understanding LLMs may be missing the forest for the trees. However, the paper does not provide a clear alternative framework for understanding LLMs, leaving the reader to wonder what a more holistic perspective might entail.

Additionally, the paper does not address the practical implications of moving beyond the next word predictor view. It remains to be seen how this shift in perspective might impact the development, deployment, and evaluation of LLMs in real-world applications.

Conclusion

This paper argues that viewing LLMs solely as next word predictors is an oversimplification that fails to capture the rich complexity and emergent behaviors of these powerful AI models. The authors draw an analogy to a similar reductionist approach in biology, suggesting that a more holistic understanding of LLMs is necessary.

While the paper raises important points, it does not provide a clear alternative framework for understanding LLMs. Further research and discussion will be needed to explore how a more nuanced perspective on these models might shape their development and application in the years to come.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LLMs are Not Just Next Token Predictors

Stephen M. Downes, Patrick Forber, Alex Grzankowski

LLMs are statistical models of language learning through stochastic gradient descent with a next token prediction objective. Prompting a popular view among AI modelers: LLMs are just next token predictors. While LLMs are engineered using next token prediction, and trained based on their success at this task, our view is that a reduction to just next token predictor sells LLMs short. Moreover, there are important explanations of LLM behavior and capabilities that are lost when we engage in this kind of reduction. In order to draw this out, we will make an analogy with a once prominent research program in biology explaining evolution and development from the gene's eye view.

8/12/2024

A Law of Next-Token Prediction in Large Language Models

Hangfeng He, Weijie J. Su

Large language models (LLMs) have been widely employed across various application domains, yet their black-box nature poses significant challenges to understanding how these models process input data internally to make predictions. In this paper, we introduce a precise and quantitative law that governs the learning of contextualized token embeddings through intermediate layers in pre-trained LLMs for next-token prediction. Our findings reveal that each layer contributes equally to enhancing prediction accuracy, from the lowest to the highest layer -- a universal phenomenon observed across a diverse array of open-source LLMs, built on architectures such as Transformer, RWKV, and Mamba. We demonstrate that this law offers new perspectives and insights to inform and guide practices in LLM development and applications, including model scaling, pre-training tasks, and information flow. Overall, our law enables more fine-grained approaches to the design, training, and interpretation of LLMs through scrutinizing their internal data processing mechanisms.

8/27/2024

Bayesian Statistical Modeling with Predictors from LLMs

Michael Franke, Polina Tsvilodub, Fausto Carcassi

State of the art large language models (LLMs) have shown impressive performance on a variety of benchmark tasks and are increasingly used as components in larger applications, where LLM-based predictions serve as proxies for human judgements or decisions. This raises questions about the human-likeness of LLM-derived information, alignment with human intuition, and whether LLMs could possibly be considered (parts of) explanatory models of (aspects of) human cognition or language use. To shed more light on these issues, we here investigate the human-likeness of LLMs' predictions for multiple-choice decision tasks from the perspective of Bayesian statistical modeling. Using human data from a forced-choice experiment on pragmatic language use, we find that LLMs do not capture the variance in the human data at the item-level. We suggest different ways of deriving full distributional predictions from LLMs for aggregate, condition-level data, and find that some, but not all ways of obtaining condition-level predictions yield adequate fits to human data. These results suggest that assessment of LLM performance depends strongly on seemingly subtle choices in methodology, and that LLMs are at best predictors of human behavior at the aggregate, condition-level, for which they are, however, not designed to, or usually used to, make predictions in the first place.

6/14/2024

Auto-Regressive Next-Token Predictors are Universal Learners

Eran Malach

Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these abilities emerge in networks trained on the simple task of next-token prediction. In this work, we present a theoretical framework for studying auto-regressive next-token predictors. We demonstrate that even simple models such as linear next-token predictors, trained on Chain-of-Thought (CoT) data, can approximate any function efficiently computed by a Turing machine. We introduce a new complexity measure -- length complexity -- which measures the number of intermediate tokens in a CoT sequence required to approximate some target function, and analyze the interplay between length complexity and other notions of complexity. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks. Our results demonstrate that the power of today's LLMs can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.

7/31/2024