Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Read original: arXiv:2409.04478 - Published 9/10/2024 by Maheep Chaudhary, Atticus Geiger

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Overview

This paper evaluates the use of open-source sparse autoencoders to disentangle factual knowledge in GPT-2 Small, a smaller version of the popular language model GPT-2.
Sparse autoencoders are a type of neural network that can learn compact representations of data while preserving key information.
The researchers assess how well these sparse autoencoders can extract and separate factual knowledge from GPT-2 Small.

Plain English Explanation

The paper looks at using a specific type of artificial neural network called a sparse autoencoder to understand the inner workings of a language model called GPT-2 Small.

GPT-2 is a powerful AI system that can generate human-like text, but it's not always clear what kind of information and knowledge it has learned. The researchers wanted to see if they could use sparse autoencoders to untangle or disentangle the different types of knowledge GPT-2 Small has, especially the factual information it has picked up.

Sparse autoencoders are designed to find compact, efficient ways to represent data while preserving the most important parts. The researchers hoped that by applying these sparse autoencoders to GPT-2 Small, they could extract and separate out the model's factual knowledge from other aspects of its language understanding.

Technical Explanation

The paper describes experiments where the researchers used several open-source sparse autoencoder models to analyze the internal representations of the GPT-2 Small language model.

They trained the sparse autoencoders on the activations of the different layers within GPT-2 Small, with the goal of getting the autoencoders to disentangle the factual knowledge encoded in the model from other types of information. The researchers then evaluated how well the sparse autoencoders were able to extract and isolate the factual knowledge.

The technical approach involved using a variety of evaluation metrics to assess the disentanglement, including probing tasks, linear classification, and visualization techniques. The results provided insights into which sparse autoencoder architectures and training setups were most effective at separating out the factual knowledge in GPT-2 Small.

Critical Analysis

The paper provides a thorough and well-designed study on using sparse autoencoders to analyze the inner workings of large language models like GPT-2. The researchers acknowledge several limitations and caveats, such as the fact that their evaluation is limited to just the smaller GPT-2 Small model, and that the factual knowledge extraction is not perfect.

One potential issue is that the definition and identification of "factual knowledge" is inherently challenging and subjective. The paper does not go into depth on how the researchers determined what constitutes factual knowledge versus other types of information. This could be an area for further research and refinement.

Additionally, the paper does not explore how the extracted factual knowledge could be leveraged or used in practice. More work may be needed to understand the practical applications and implications of this type of model analysis.

Overall, the paper makes a valuable contribution to the growing body of research on interpreting and understanding the internal representations of large language models. The insights gained could inform future work on improving model transparency and controllability.

Conclusion

This paper presents an interesting approach to disentangling the factual knowledge encoded within the GPT-2 Small language model using open-source sparse autoencoder models. The results provide useful insights into the capabilities and limitations of this technique, which could have broader implications for improving the interpretability and controllability of large language models. While further research is needed, this work represents an important step forward in our understanding of how these powerful AI systems learn and represent information.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Maheep Chaudhary, Atticus Geiger

A popular new method in mechanistic interpretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analysis. However, the body of evidence on whether SAE feature spaces are useful for causal analysis is underdeveloped. In this work, we use the RAVEL benchmark to evaluate whether SAEs trained on hidden representations of GPT-2 small have sets of features that separately mediate knowledge of which country a city is in and which continent it is in. We evaluate four open-source SAEs for GPT-2 small against each other, with neurons serving as a baseline, and linear features learned via distributed alignment search (DAS) serving as a skyline. For each, we learn a binary mask to select features that will be patched to change the country of a city without changing the continent, or vice versa. Our results show that SAEs struggle to reach the neuron baseline, and none come close to the DAS skyline. We release code here: https://github.com/MaheepChaudhary/SAE-Ravel

9/10/2024

Disentangling Dense Embeddings with Sparse Autoencoders

Charles O'Neill, Christine Ye, Kartheik Iyer, John F. Wu

Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models, demonstrating their effectiveness in disentangling semantic concepts. By training SAEs on embeddings of over 420,000 scientific paper abstracts from computer science and astronomy, we show that the resulting sparse representations maintain semantic fidelity while offering interpretability. We analyse these learned features, exploring their behaviour across different model capacities and introducing a novel method for identifying ``feature families'' that represent related concepts at varying levels of abstraction. To demonstrate the practical utility of our approach, we show how these interpretable features can be used to precisely steer semantic search, allowing for fine-grained control over query semantics. This work bridges the gap between the semantic richness of dense embeddings and the interpretability of sparse representations. We open source our embeddings, trained sparse autoencoders, and interpreted features, as well as a web app for exploring them.

8/2/2024

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, J'anos Kram'ar, Rohin Shah, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.

5/1/2024

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov, George Lange, Neel Nanda

Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against emph{supervised} feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets. We find that these SAEs capture interpretable features for the IOI task, but they are less successful than supervised features in controlling the model. Finally, we observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and feature over-splitting (where binary features split into many smaller, less interpretable features). We hope that our framework will provide a useful step towards more objective and grounded evaluations of sparse dictionary learning methods.

5/21/2024