Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Read original: arXiv:2405.12522 - Published 5/22/2024 by Charles O'Neill, Thang Bui

💬

Overview

This paper introduces a method for efficiently and robustly discovering interpretable circuits in large language models using discrete sparse autoencoders.
The proposed approach addresses key limitations of existing techniques, such as computational complexity and sensitivity to hyperparameters.
The method involves training sparse autoencoders on carefully designed positive and negative examples, where the model can only correctly predict the next token for the positive examples.
The learned representations of attention head outputs are then used to identify the attention heads involved in specific computations, without the need for expensive ablations or architectural modifications.

Plain English Explanation

The paper presents a new way to understand how large language models, like those used for tasks like translation or chatbots, work under the hood. Large language models are complex, and it can be challenging to figure out exactly how they make decisions.

The researchers developed a technique that uses sparse autoencoders - a type of neural network that can learn efficient representations of data. They train these autoencoders on carefully selected examples, where the model has to correctly predict the next word in a sequence.

The key insight is that the patterns the autoencoder learns in these examples can reveal which parts of the language model (called "attention heads") are responsible for specific computations. By looking at how the attention heads' outputs overlap for different examples, the researchers can identify the attention heads involved in tasks like understanding indirect objects, comparing quantities, or completing documentation.

Compared to previous methods, this approach is much faster (taking seconds instead of hours) and more reliable, while requiring only a small number of example texts. The researchers believe this technique opens up new ways to understand how large language models work under the hood and could lead to more interpretable and controllable language models in the future.

Technical Explanation

The paper proposes a method for efficiently and robustly discovering interpretable circuits in large language models using discrete sparse autoencoders. The key elements of the approach are:

Sparse Autoencoder Training: The researchers train sparse autoencoders on carefully designed positive and negative examples, where the model can only correctly predict the next token for the positive examples. This forces the autoencoder to learn representations that capture the relevant computations for the positive examples.
Attention Head Representation: The learned representations of the attention head outputs are hypothesized to signal when a head is engaged in specific computations. By discretizing these representations into integer codes, the researchers can measure the overlap between codes unique to positive examples for each head.
Circuit Identification: The overlap between unique codes for each attention head is used to directly identify the attention heads involved in the circuits responsible for the target computations, without the need for expensive ablations or architectural modifications.

The researchers evaluate their method on three well-studied tasks: indirect object identification, greater-than comparisons, and docstring completion. They show that their approach achieves higher precision and recall in recovering ground-truth circuits compared to state-of-the-art baselines, while reducing runtime from hours to seconds. Notably, the method requires only 5-10 text examples for each task to learn robust representations.

Critical Analysis

The paper presents a promising approach for efficient and scalable mechanistic interpretability of large language models. The use of discrete sparse autoencoders to identify the attention heads involved in specific computations is a novel and compelling idea. The authors' claim that their method can achieve high performance with only a small number of examples is particularly interesting, as it suggests the potential for practical application in real-world scenarios.

However, the paper does not address some potential limitations of the approach. For example, it's unclear how well the method would generalize to more complex or nuanced language tasks, or how sensitive the results might be to the choice of positive and negative examples. Additionally, the paper does not discuss the potential for the learned attention head representations to be influenced by factors other than the target computations, which could complicate the interpretation of the results.

Furthermore, while the paper demonstrates the effectiveness of the proposed method on three specific tasks, it would be helpful to see the approach applied to a wider range of language tasks to better understand its broader applicability and limitations.

Overall, the paper presents an interesting and promising direction for improving the interpretability of large language models. However, further research and validation would be needed to fully assess the strengths and weaknesses of the approach and its potential real-world impact.

Conclusion

This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. The proposed approach addresses key limitations of existing techniques, such as computational complexity and sensitivity to hyperparameters, by training sparse autoencoders on carefully designed positive and negative examples and using the learned representations of attention head outputs to identify the attention heads involved in specific computations.

The researchers demonstrate the effectiveness of their method on three well-studied tasks, achieving higher precision and recall in recovering ground-truth circuits compared to state-of-the-art baselines, while significantly reducing runtime. The findings highlight the promise of discrete sparse autoencoders for scalable and efficient mechanistic interpretability, offering a new direction for analyzing the inner workings of large language models.

As language models continue to grow in size and complexity, the ability to understand and interpret their decision-making processes becomes increasingly important. This work represents a valuable contribution to the field of interpretable AI, with the potential to lead to more transparent and controllable language models in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

💬

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Charles O'Neill, Thang Bui

This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses key limitations of existing techniques, namely computational complexity and sensitivity to hyperparameters. We propose training sparse autoencoders on carefully designed positive and negative examples, where the model can only correctly predict the next token for the positive examples. We hypothesise that learned representations of attention head outputs will signal when a head is engaged in specific computations. By discretising the learned representations into integer codes and measuring the overlap between codes unique to positive examples for each head, we enable direct identification of attention heads involved in circuits without the need for expensive ablations or architectural modifications. On three well-studied tasks - indirect object identification, greater-than comparisons, and docstring completion - the proposed method achieves higher precision and recall in recovering ground-truth circuits compared to state-of-the-art baselines, while reducing runtime from hours to seconds. Notably, we require only 5-10 text examples for each task to learn robust representations. Our findings highlight the promise of discrete sparse autoencoders for scalable and efficient mechanistic interpretability, offering a new direction for analysing the inner workings of large language models.

5/22/2024

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupr'e la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.

6/7/2024

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, J'anos Kram'ar, Rohin Shah, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.

5/1/2024

Transcoders Find Interpretable LLM Feature Circuits

Jacob Dunefsky, Philippe Chlenski, Neel Nanda

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. We then introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the greater-than circuit in GPT2-small. Our results suggest that transcoders can prove effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at https://github.com/jacobdunefsky/transcoder_circuits.

6/19/2024