Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Read original: arXiv:2408.05147 - Published 8/20/2024 by Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda

Overview

This paper introduces Gemma Scope, an open suite of JumpReLU sparse autoencoders (SAEs) trained on the Gemma 2 family of language models.
Sparse autoencoders are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features.
The suite covers all layers and sub-layers of Gemma 2 2B and 9B and select layers of the Gemma 2 27B base model, and the weights are released openly to support safety and interpretability research.

Plain English Explanation

The paper releases a large collection of "sparse autoencoders": small neural networks trained to break the internal activations of a language model into simpler, seemingly interpretable features. These sparse autoencoders are trained on the activations of Gemma 2, a family of language models, and their weights are published openly.

The key idea behind sparse autoencoders is to force the network to use only a small number of its "neurons" (usually called features) to represent each input. This sparsity constraint is what makes the result easier to interpret: each activation is explained by a handful of active features rather than by every unit at once.
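To make the sparsity idea concrete, here is a minimal sketch of a sparse autoencoder's forward pass in plain Python. It is not the Gemma Scope implementation: the weights, the dimensions, and the helper names (encode, decode, matvec) are made up for illustration, and no training is shown. It uses a plain ReLU encoder to show how an input is represented by only a few active features.

```python
# Toy sparse autoencoder forward pass (illustrative values, no training).

def relu(values):
    """Zero out negative entries."""
    return [max(0.0, v) for v in values]

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def encode(x, w_enc, b_enc):
    """Map an activation vector to non-negative feature activations."""
    pre = matvec(w_enc, x)
    return relu([p + b for p, b in zip(pre, b_enc)])

def decode(features, w_dec, b_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    out = list(b_dec)
    # w_dec holds one row per feature: that feature's direction.
    for f_i, direction in zip(features, w_dec):
        for j, d in enumerate(direction):
            out[j] += f_i * d
    return out

# Hypothetical weights: a 2-dimensional activation and 4 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_DEC = [0.0, 0.0]

x = [2.0, -3.0]
features = encode(x, W_ENC, B_ENC)
x_hat = decode(features, W_DEC, B_DEC)
l0 = sum(1 for f in features if f > 0)  # number of active features

print(features, x_hat, l0)
```

In this toy case only two of the four features are active for the input, and the reconstruction matches the input exactly. Real SAEs have far more features and reconstruct only approximately.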

Training a comprehensive set of sparse autoencoders is expensive, which has limited this kind of research outside of industry. Gemma Scope addresses this by releasing sparse autoencoders for every layer and sub-layer of the Gemma 2 2B and 9B models, plus select layers of the 27B model, so researchers can study the models' internals without paying the training cost themselves.

Technical Explanation

Gemma Scope is an open suite of JumpReLU sparse autoencoders trained on Gemma 2 language models. A sparse autoencoder learns a sparse decomposition of a network's latent representations: it encodes each activation vector into a set of feature activations, of which only a few are non-zero, and then reconstructs the original activation from those features.
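According to the paper's abstract, the SAEs in Gemma Scope use the JumpReLU activation, which the related JumpReLU paper describes as a discontinuous replacement for the ReLU. The sketch below shows the activation function only: a pre-activation passes through unchanged if it exceeds a per-feature threshold and is set to zero otherwise. The threshold and pre-activation values are hypothetical, and the straight-through-estimator training procedure is not shown.

```python
# JumpReLU versus ReLU on the same pre-activations (illustrative values).

def relu(z):
    """Standard ReLU: keep positive values, zero out the rest."""
    return max(0.0, z)

def jump_relu(z, theta):
    """JumpReLU: keep z unchanged if it exceeds theta, otherwise output 0."""
    return z if z > theta else 0.0

pre_activations = [-0.5, 0.2, 0.9, 1.5]
thresholds = [0.3, 0.3, 1.0, 1.0]  # hypothetical per-feature thresholds

relu_out = [relu(z) for z in pre_activations]
jump_out = [jump_relu(z, t) for z, t in zip(pre_activations, thresholds)]

print(relu_out)
print(jump_out)
```

Note that small positive pre-activations that ReLU would let through are zeroed by JumpReLU, while values above the threshold are not shrunk toward zero.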

The key contribution is coverage. SAEs are trained on all layers and sub-layers of Gemma 2 2B and 9B and on select layers of the Gemma 2 27B base model. Most SAEs are trained on the pre-trained models, and SAEs trained on instruction-tuned Gemma 2 9B are also released for comparison.

The authors evaluate the quality of each SAE on standard metrics and release these results alongside the weights. Weights and a tutorial are available at https://huggingface.co/google/gemma-scope, and an interactive demo is available at https://www.neuronpedia.org/gemma-scope.
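The paper's abstract says each SAE is evaluated on standard metrics, but this summary does not list them. As an assumption, the sketch below computes two metrics commonly reported for SAEs: mean L0 (the average number of active features per input) and fraction of variance unexplained (reconstruction error relative to the variance of the inputs). The data and function names are invented for illustration and are not taken from the paper.

```python
# Two common SAE quality metrics on toy data (illustrative, not from the paper).

def mean_l0(feature_batch):
    """Average number of non-zero features per input."""
    counts = [sum(1 for f in feats if f != 0.0) for feats in feature_batch]
    return sum(counts) / len(counts)

def fraction_variance_unexplained(xs, x_hats):
    """Squared reconstruction error divided by squared deviation from the mean."""
    n, dim = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(dim)]
    residual = sum((x[j] - xh[j]) ** 2
                   for x, xh in zip(xs, x_hats) for j in range(dim))
    total = sum((x[j] - mean[j]) ** 2 for x in xs for j in range(dim))
    return residual / total

# Hypothetical feature activations for four inputs.
feature_batch = [
    [2.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
]

# Hypothetical original activations and their reconstructions.
xs = [[1.0, 0.0], [3.0, 0.0], [1.0, 4.0], [3.0, 4.0]]
x_hats = [[1.0, 0.0], [3.0, 1.0], [1.0, 4.0], [2.0, 4.0]]

sparsity = mean_l0(feature_batch)
fvu = fraction_variance_unexplained(xs, x_hats)

print(sparsity, fvu)
```

Lower values are better for both in the sense that a good SAE reconstructs well (low unexplained variance) while using few features (low L0), and the two goals trade off against each other.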

Critical Analysis

Gemma Scope is a potentially impactful resource for interpretability research. By releasing SAEs across layers, sub-layers, and model sizes, it lowers the training-cost barrier that the authors identify as limiting research applications outside of industry.

However, the abstract itself describes SAE features only as "seemingly interpretable", and evaluation on standard metrics does not by itself show that the features are useful for causal analysis. Related work listed below (Chaudhary and Geiger) finds that open-source SAEs for GPT-2 small struggle to reach a neuron baseline on the RAVEL benchmark, so evaluation on a broader range of downstream tasks would be valuable for assessing how useful the released SAEs are in practice.

It would also be useful to see a more thorough discussion of the limitations of the release, such as the 27B model being covered only at select layers and instruction-tuned SAEs being provided only for the 9B model, as well as any trade-offs encountered during training. A deeper exploration of these aspects could help identify areas for further research and refinement.

Conclusion

Gemma Scope is an open suite of JumpReLU sparse autoencoders covering every layer and sub-layer of Gemma 2 2B and 9B and select layers of Gemma 2 27B. By releasing the weights, the authors aim to make more ambitious safety and interpretability research easier for the community.

While the authors evaluate each SAE on standard metrics, further research is needed to assess how useful the learned features are for downstream interpretability work. Addressing the limitations of SAEs as an analysis tool could also open new avenues for exploration.

Overall, Gemma Scope is a useful step toward making SAE-based interpretability research accessible outside of industry, with the potential to support safety research on openly available language models.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda

Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside of industry are limited by the high cost of training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. We hope that by releasing these SAE weights, we can help make more ambitious safety and interpretability research easier for the community. Weights and a tutorial can be found at https://huggingface.co/google/gemma-scope and an interactive demo can be found at https://www.neuronpedia.org/gemma-scope

8/20/2024

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda

Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs -- where we replace the ReLU with a discontinuous JumpReLU activation function -- and are similarly efficient to train and run. By utilising straight-through-estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage.

8/2/2024

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.

5/1/2024

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Maheep Chaudhary, Atticus Geiger

A popular new method in mechanistic interpretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analysis. However, the body of evidence on whether SAE feature spaces are useful for causal analysis is underdeveloped. In this work, we use the RAVEL benchmark to evaluate whether SAEs trained on hidden representations of GPT-2 small have sets of features that separately mediate knowledge of which country a city is in and which continent it is in. We evaluate four open-source SAEs for GPT-2 small against each other, with neurons serving as a baseline, and linear features learned via distributed alignment search (DAS) serving as a skyline. For each, we learn a binary mask to select features that will be patched to change the country of a city without changing the continent, or vice versa. Our results show that SAEs struggle to reach the neuron baseline, and none come close to the DAS skyline. We release code here: https://github.com/MaheepChaudhary/SAE-Ravel

9/10/2024