Not All Language Model Features Are Linear

Read original: arXiv:2405.14860 - Published 5/24/2024 by Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark

💬

Overview

This paper proposes that language models may use multi-dimensional representations, rather than just one-dimensional "features," to perform computations.
The researchers develop a method to automatically find and analyze these multi-dimensional representations in large language models like GPT-2 and Mistral 7B.
They identify specific examples of these multi-dimensional features, like circular representations of days of the week and months of the year, and show how the models use them to solve tasks involving modular arithmetic.
The paper provides evidence that these circular features are fundamental to the models' computations on these tasks.

Plain English Explanation

The researchers behind this paper explored whether language models might use more complex, multi-dimensional representations of concepts, rather than just simple one-dimensional "features." They developed a way to automatically identify these multi-dimensional representations in large language models like GPT-2 and Mistral 7B.

One of the key findings was the discovery of circular representations for things like days of the week and months of the year. These circular features allowed the models to efficiently perform computations involving modular arithmetic, like figuring out what day of the week a date falls on. The researchers showed that these circular features were fundamental to the models' ability to solve these types of tasks, rather than just being a byproduct.

This suggests that language models may implement more sophisticated "cognitive-like" representations and computations, rather than just simple one-dimensional feature manipulation as proposed by the linear representation hypothesis. It also raises interesting questions about the inherent biases and limitations of how language models represent and reason about the world.

Technical Explanation

The core idea of this paper is to challenge the linear representation hypothesis, which proposes that language models perform computations by manipulating one-dimensional representations of concepts (called "features"). Instead, the researchers explore whether some language model representations may be inherently multi-dimensional.

To do this, they first develop a rigorous definition of "irreducible" multi-dimensional features - ones that cannot be decomposed into either independent or non-co-occurring lower-dimensional features. Armed with this definition, they design a scalable method using sparse autoencoders to automatically identify multi-dimensional features in large language models like GPT-2 and Mistral 7B.

Using this approach, the researchers identify some striking examples of interpretable multi-dimensional features, such as circular representations of days of the week and months of the year. They then show how these exact circular features are used by the models to solve computational problems involving modular arithmetic related to days and months.

Finally, the paper provides evidence that these circular features are indeed the fundamental unit of computation for these tasks. They conduct intervention experiments on Mistral 7B and Llama 3 8B that demonstrate the importance of these circular representations. Additionally, they are able to further decompose the hidden states for these tasks into interpretable components that reveal more instances of these circular features.

Critical Analysis

The paper makes a compelling case that at least some language models employ multi-dimensional representations that go beyond the simple one-dimensional "features" proposed by the linear representation hypothesis. The discovery of the interpretable circular features for days and months, and the evidence that these are central to the models' computations, is a significant finding.

However, the paper does not address the broader limitations and biases inherent in how language models represent and reason about the world. While the multi-dimensional features may be more sophisticated, they may still suffer from systematic biases and blind spots in their understanding.

Additionally, the paper focuses on a relatively narrow set of tasks and model architectures. It remains to be seen whether these findings generalize to a wider range of language models and applications. Further research is needed to fully understand the symbolic and reasoning capabilities of these multi-dimensional representations.

Conclusion

This paper challenges the prevailing view that language models rely solely on one-dimensional feature representations. Instead, it provides compelling evidence that at least some models employ more sophisticated, multi-dimensional representations to perform computations. The discovery of interpretable circular features for concepts like days and months, and their central role in solving relevant tasks, is a significant advancement in our understanding of language model representations and capabilities.

While this research raises interesting questions about the cognitive-like nature of language model representations, it also highlights the need for continued critical analysis and exploration of their limitations and biases. Ultimately, this work contributes to our evolving understanding of how large language models work and their potential implications for artificial intelligence.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

💬

Not All Language Model Features Are Linear

Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark

Recent work has proposed the linear representation hypothesis: that language models perform computation by manipulating one-dimensional representations of concepts (features) in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Finally, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we find further circular representations by breaking down the hidden states for these tasks into interpretable components.

5/24/2024

💬

The Linear Representation Hypothesis and the Geometry of Large Language Models

Kiho Park, Yo Joong Choe, Victor Veitch

Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does linear representation actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of linear representation, one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.

7/19/2024

Representations as Language: An Information-Theoretic Framework for Interpretability

Henry Conklin, Kenny Smith

Large scale neural models show impressive performance across a wide array of linguistic tasks. Despite this they remain, largely, black-boxes - inducing vector-representations of their input that prove difficult to interpret. This limits our ability to understand what they learn, and when the learn it, or describe what kinds of representations generalise well out of distribution. To address this we introduce a novel approach to interpretability that looks at the mapping a model learns from sentences to representations as a kind of language in its own right. In doing so we introduce a set of information-theoretic measures that quantify how structured a model's representations are with respect to its input, and when during training that structure arises. Our measures are fast to compute, grounded in linguistic theory, and can predict which models will generalise best based on their representations. We use these measures to describe two distinct phases of training a transformer: an initial phase of in-distribution learning which reduces task loss, then a second stage where representations becoming robust to noise. Generalisation performance begins to increase during this second phase, drawing a link between generalisation and robustness to noise. Finally we look at how model size affects the structure of the representational space, showing that larger models ultimately compress their representations more than their smaller counterparts.

6/5/2024

The Geometry of Categorical and Hierarchical Concepts in Large Language Models

Kiho Park, Yo Joong Choe, Yibo Jiang, Victor Veitch

Understanding how semantic meaning is encoded in the representation spaces of large language models is a fundamental problem in interpretability. In this paper, we study the two foundational questions in this area. First, how are categorical concepts, such as {'mammal', 'bird', 'reptile', 'fish'}, represented? Second, how are hierarchical relations between concepts encoded? For example, how is the fact that 'dog' is a kind of 'mammal' encoded? We show how to extend the linear representation hypothesis to answer these questions. We find a remarkably simple structure: simple categorical concepts are represented as simplices, hierarchically related concepts are orthogonal in a sense we make precise, and (in consequence) complex concepts are represented as polytopes constructed from direct sums of simplices, reflecting the hierarchical structure. We validate these theoretical results on the Gemma large language model, estimating representations for 957 hierarchically related concepts using data from WordNet.

6/4/2024