The Linear Representation Hypothesis and the Geometry of Large Language Models

Read original: arXiv:2311.03658 - Published 7/19/2024 by Kiho Park, Yo Joong Choe, Victor Veitch

Overview

The paper examines the "linear representation hypothesis": the idea that high-level concepts are represented as linear directions in a representation space.
It addresses two key questions: what does linear representation actually mean, and how do we make sense of geometric notions (like cosine similarity) in the representation space?
The paper provides formal definitions of linear representation in the output (word) space and the input (sentence) space, and shows how these connect to linear probing and model steering, respectively.
It introduces a particular (non-Euclidean) inner product that respects language structure, and uses this to unify different notions of linear representation.
Experiments with the LLaMA-2 language model demonstrate the existence of linear representations of concepts, the connections to interpretation and control, and the importance of the choice of inner product.

Plain English Explanation

The paper explores the idea that high-level concepts in language models are represented as linear directions in a multi-dimensional space. This is known as the "linear representation hypothesis." The researchers wanted to better understand what this actually means and how it relates to the geometric properties of the representation space.

To do this, they provided formal definitions of linear representation in both the output (word) space and the input (sentence) space. These definitions show how linear representation connects to two important techniques: linear probing and model steering. Linear probing is a way to interpret what a model has learned, while model steering is a way to control a model's behavior.

The researchers also introduced a special type of inner product (a way of measuring similarity) that better captures the structure of language. Using this, they were able to unify different notions of linear representation and show how to construct useful probes and steering vectors.

Experiments on the LLaMA-2 language model provided evidence for the existence of linear representations of concepts, and demonstrated the connections to interpretation and control of the model's behavior.

Technical Explanation

The paper formalizes the "linear representation hypothesis" using the language of counterfactuals. It provides two definitions - one for the output (word) representation space, and one for the input (sentence) space.

The output-space definition states that a concept is linearly represented if there exists a vector v such that, for any counterfactual pair of words that differ only in that concept (for example, "king" and "queen" for male to female), the difference between the two words' output representations is a positive multiple of v. This connects to linear probing, a technique for interpreting what a model has learned: v can serve as a linear probe that measures the concept.
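As a rough illustration of the output-space idea, the sketch below estimates a concept direction from counterfactual word pairs by averaging the differences of their output vectors. Everything here is synthetic: the dimension `d`, the vectors `gamma_0` and `gamma_1`, and the helper `estimate_direction` are hypothetical names, and the mean-of-differences estimator is a simple stand-in rather than the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical representation dimension

# Hypothetical ground-truth concept direction (e.g. male -> female),
# used only to generate toy data.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Toy output vectors for counterfactual pairs such as ("king", "queen"):
# each pair differs by a positive multiple of true_dir plus small noise.
n_pairs = 50
gamma_0 = rng.normal(size=(n_pairs, d))
scale = rng.uniform(0.5, 2.0, size=(n_pairs, 1))
noise = 0.05 * rng.normal(size=(n_pairs, d))
gamma_1 = gamma_0 + scale * true_dir + noise


def estimate_direction(g0, g1):
    """Average the pairwise differences and normalise to unit length."""
    diff = (g1 - g0).mean(axis=0)
    return diff / np.linalg.norm(diff)


est_dir = estimate_direction(gamma_0, gamma_1)

# Cosine similarity between the estimated and the true direction.
cosine = float(est_dir @ true_dir)

# Fraction of pairs whose difference points the same way as est_dir.
aligned = float(np.mean((gamma_1 - gamma_0) @ est_dir > 0))

print(f"cosine to true direction: {cosine:.3f}, aligned pairs: {aligned:.2f}")
```

If the concept is linearly represented in this sense, every pair difference should have a positive projection onto the estimated direction, which is what `aligned` checks on the toy data.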

The input-space definition states that a concept is linearly represented if there exists a vector v such that adding a positive multiple of v to a sentence representation changes the probability of the target concept while leaving unchanged the probabilities of concepts that can vary independently of it. This connects to model steering, a technique for controlling a model's behavior.
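The steering idea can be sketched with a toy softmax model in which the next-word logits are inner products between a context vector and per-word output vectors. The four-word vocabulary, the directions `gender_dir` and `lang_dir`, the context vector `lam`, and the strength `alpha` are all hypothetical. The toy directions are chosen to be orthogonal in the ordinary Euclidean sense, which is a simplifying assumption and not something a real model guarantees.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


words = ["king", "queen", "roi", "reine"]
gender_dir = np.array([1.0, 0.0, 0.0])  # toy male -> female direction
lang_dir = np.array([0.0, 1.0, 0.0])    # toy English -> French direction
shared = np.array([0.0, 0.0, 1.0])      # component shared by all four words

# Toy output (unembedding) matrix, one row per word.
gamma = np.stack([
    shared,                          # king
    shared + gender_dir,             # queen
    shared + lang_dir,               # roi
    shared + gender_dir + lang_dir,  # reine
])


def next_word_probs(lam):
    """Softmax over logits gamma @ lam for a context vector lam."""
    return softmax(gamma @ lam)


def log_odds(p, a, b):
    return float(np.log(p[words.index(a)] / p[words.index(b)]))


lam = np.array([0.2, -0.1, 1.0])     # hypothetical context representation
alpha = 2.0                          # steering strength
steered = lam + alpha * gender_dir   # add the steering vector

p_before = next_word_probs(lam)
p_after = next_word_probs(steered)

# Target concept: log-odds of "queen" vs "king" should move by alpha.
gender_shift = log_odds(p_after, "queen", "king") - log_odds(p_before, "queen", "king")
# Off-target concept: log-odds of "roi" vs "king" should not move.
lang_shift = log_odds(p_after, "roi", "king") - log_odds(p_before, "roi", "king")

print(f"gender log-odds shift: {gender_shift:.3f}, language log-odds shift: {lang_shift:.3f}")
```

In this toy setting, steering along `gender_dir` shifts the female-versus-male log-odds by exactly `alpha` and leaves the French-versus-English log-odds untouched, which mirrors the "change the target concept, leave independent concepts alone" requirement.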

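To make sense of geometric notions like cosine similarity in the representation space, the paper identifies a particular (non-Euclidean) inner product that respects language structure. Under this "causal inner product", causally separable concepts (concepts that can be varied independently of one another) are represented by orthogonal directions. Using it, the paper unifies the output-space and input-space notions of linear representation, so that both probes and steering vectors can be constructed from counterfactual pairs.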
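The effect of changing the inner product can be sketched on toy data. The example below assumes the inner product takes the form u^T Cov(gamma)^{-1} v, where Cov(gamma) is the covariance of the output vectors over the vocabulary; that specific form is an assumption of this sketch, as are the mixing matrix `A` and the eight-word toy vocabulary. Three independent binary concepts are mixed into a 3-dimensional space by a non-orthogonal matrix, so their directions look correlated under the Euclidean inner product.

```python
import itertools

import numpy as np

# Hypothetical mixing matrix: column k is the direction of concept k.
A = np.array([
    [1.0, 0.8, 0.3],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0],
])

# Toy vocabulary: one "word" for each combination of three binary concepts.
Z = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))  # shape (8, 3)
gamma = Z @ A.T  # output vector of each word, shape (8, 3)

# Covariance of the output vectors over the vocabulary, and its inverse.
cov = np.cov(gamma, rowvar=False, bias=True)
M = np.linalg.inv(cov)


def cosine(u, v, metric=None):
    """Cosine similarity under the inner product u^T metric v (Euclidean if None)."""
    if metric is None:
        metric = np.eye(len(u))
    return float(u @ metric @ v / np.sqrt((u @ metric @ u) * (v @ metric @ v)))


dir_a, dir_b = A[:, 0], A[:, 1]  # directions of two independent concepts

euclid = cosine(dir_a, dir_b)      # nonzero: the directions look related
causal = cosine(dir_a, dir_b, M)   # approximately zero under the new inner product

print(f"Euclidean cosine: {euclid:.3f}, covariance-based cosine: {causal:.3e}")
```

On this toy data the two independent concepts have a Euclidean cosine of about 0.62 but are orthogonal under the covariance-based inner product, which is the kind of behavior the causal inner product is meant to guarantee for causally separable concepts.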

Experiments on the LLaMA-2 language model demonstrate the existence of linear representations of concepts, and show how the formalization connects to interpretation and control techniques like those described in "Representations as Language" and "Vectoring Languages".

Critical Analysis

The paper provides a rigorous formal framework for understanding linear representation in language models, and demonstrates its connections to important techniques like linear probing and model steering. However, the formalism relies on counterfactual pairs, which can be challenging to obtain in practice.

Additionally, the paper focuses on linear representations, but language models may also exhibit more complex, non-linear representations of concepts. Further research is needed to understand the full range of representational strategies used by large language models.

The experiments on LLaMA-2 provide evidence for the existence of linear representations, but it would be valuable to see this validated on a broader range of language models and tasks. The generalization of these findings to more diverse domains remains an open question.

Overall, the paper makes an important contribution to our understanding of language model representations, but there are still many open questions and avenues for further research in this area.

Conclusion

This paper introduces a formal framework for understanding the "linear representation hypothesis" in language models, which posits that high-level concepts are represented as linear directions in a multi-dimensional space. By defining linear representation in both the output (word) and input (sentence) spaces, the researchers were able to connect this idea to powerful techniques like linear probing and model steering.

The use of a novel "causal inner product" allowed the researchers to unify different notions of linear representation, and experiments on the LLaMA-2 model provided evidence for the existence of such linear representations. This work advances our understanding of how language models encode and use conceptual knowledge, with potential implications for model interpretation and control.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The Linear Representation Hypothesis and the Geometry of Large Language Models

Kiho Park, Yo Joong Choe, Victor Veitch

Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does linear representation actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of linear representation, one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.

7/19/2024

The Geometry of Categorical and Hierarchical Concepts in Large Language Models

Kiho Park, Yo Joong Choe, Yibo Jiang, Victor Veitch

Understanding how semantic meaning is encoded in the representation spaces of large language models is a fundamental problem in interpretability. In this paper, we study the two foundational questions in this area. First, how are categorical concepts, such as {'mammal', 'bird', 'reptile', 'fish'}, represented? Second, how are hierarchical relations between concepts encoded? For example, how is the fact that 'dog' is a kind of 'mammal' encoded? We show how to extend the linear representation hypothesis to answer these questions. We find a remarkably simple structure: simple categorical concepts are represented as simplices, hierarchically related concepts are orthogonal in a sense we make precise, and (in consequence) complex concepts are represented as polytopes constructed from direct sums of simplices, reflecting the hierarchical structure. We validate these theoretical results on the Gemma large language model, estimating representations for 957 hierarchically related concepts using data from WordNet.

6/4/2024

Not All Language Model Features Are Linear

Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark

Recent work has proposed the linear representation hypothesis: that language models perform computation by manipulating one-dimensional representations of concepts (features) in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Finally, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B, and we find further circular representations by breaking down the hidden states for these tasks into interpretable components.

5/24/2024

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks, Max Tegmark

Large Language Models (LLMs) have impressive capabilities, but are prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM's internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we use high-quality datasets of simple true/false statements to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM's forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that at sufficient scale, LLMs linearly represent the truth or falsehood of factual statements. We also show that simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs.

8/20/2024