The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Read original: arXiv:2405.10928 - Published 5/21/2024 by Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hanni, Avery Griffin, Jorn Stohler, Magdalena Wache, Marius Hobbhahn

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Overview

Introduces a method called the "Local Interaction Basis" to identify computationally-relevant and sparsely interacting features in neural networks
Aims to provide mechanistic interpretability by revealing the underlying structure of neural network computations
Leverages the concept of
degeneracy
to uncover simplified feature representations that are both informative and robust

Plain English Explanation

The paper presents a new technique called the "Local Interaction Basis" that can help us better understand how neural networks work under the hood. The key idea is to identify the most important and

sparsely interacting

features that a neural network is using to make its predictions. <a href="https://aimodels.fyi/papers/arxiv/using-degeneracy-loss-landscape-mechanistic-interpretability">By exploiting the concept of "degeneracy"</a>, the method can uncover simplified feature representations that capture the essential computations happening in the network, rather than getting bogged down in complex interactions between many different features.

This is important because it can provide

mechanistic interpretability

- it allows us to see the underlying structure and logic of how the neural network is solving the problem, rather than just treating it like a black box. Understanding these simplified computational mechanisms could lead to more robust and efficient neural network architectures in the future.

The approach aims to find a

local interaction basis

- a small set of features that can largely explain the network's behavior in a particular region of the input space. This differs from global explanations that try to capture the entire network's logic with a single set of features, which can miss important nuances.

Technical Explanation

The key steps of the "Local Interaction Basis" method are:

Compute Degeneracy: The authors leverage the concept of
degeneracy
- the ability of different feature combinations to produce similar outputs. <a href="https://aimodels.fyi/papers/arxiv/using-degeneracy-loss-landscape-mechanistic-interpretability">By analyzing the network's loss landscape</a>, they can identify highly degenerate regions where multiple different feature combinations produce similar predictions.
Identify Relevant Features: Within these degenerate regions, the method identifies the
smallest
set of features that can still largely account for the network's behavior. This set of relevant features forms the "Local Interaction Basis".
Quantify Interactions: The authors then analyze the
strength
of the interactions between the relevant features in the basis. This allows them to identify which features are most important and which ones have the sparsest interactions.

Through experiments on various neural network architectures and datasets, the authors demonstrate that the Local Interaction Basis can provide meaningful mechanistic insights, revealing simplified computational mechanisms that are both informative and robust.

Critical Analysis

The paper makes a compelling case for the value of the Local Interaction Basis in providing mechanistic interpretability for neural networks. The focus on

sparse

and

local

feature interactions is particularly interesting, as it aligns with the intuition that biological neural systems rely on a relatively small number of salient features to make decisions.

However, the authors acknowledge that the method is still limited in its ability to fully capture the complexity of neural network computations. The Local Interaction Basis may miss important global or higher-order interactions that are not easily captured by the local analysis. <a href="https://aimodels.fyi/papers/arxiv/learned-feature-representations-are-biased-by-complexity">There are also open questions about how the complexity of the feature representations themselves may impact the interpretability of the results.</a>

Additionally, the authors note that the method currently requires access to the network's internals, which may limit its practical applicability in real-world scenarios where the model is treated as a black box. <a href="https://aimodels.fyi/papers/arxiv/pure-turning-polysemantic-neurons-into-pure-features">Further research is needed to explore how these insights could be extracted without full access to the network's architecture and parameters.</a>

Conclusion

The "Local Interaction Basis" proposed in this paper represents an important step towards providing mechanistic interpretability for neural networks. By identifying the most relevant and sparsely interacting features that drive a network's behavior, the method can shed light on the underlying computational logic, potentially leading to more robust and efficient neural network architectures in the future.

While the approach has limitations, it demonstrates the value of

exploiting degeneracy

to uncover simplified feature representations that capture the essential computations happening in a neural network. <a href="https://aimodels.fyi/papers/arxiv/explaining-explainability-understanding-concept-activation-vectors">Further advancements in this direction could significantly advance our understanding of how neural networks make decisions and lead to more

explainable

AI systems.</a>

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hanni, Avery Griffin, Jorn Stohler, Magdalena Wache, Marius Hobbhahn

Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is missing. Individual neurons or model components do not cleanly correspond to distinct features or functions. We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis - the Local Interaction Basis (LIB). LIB aims to identify computational features by removing irrelevant activations and interactions. Our method drops irrelevant activation directions and aligns the basis with the singular vectors of the Jacobian matrix between adjacent layers. It also scales features based on their importance for downstream computation, producing an interaction graph that shows all computationally-relevant features and interactions in a model. We evaluate the effectiveness of LIB on modular addition and CIFAR-10 models, finding that it identifies more computationally-relevant features that interact more sparsely, compared to principal component analysis. However, LIB does not yield substantial improvements in interpretability or interaction sparsity when applied to language models. We conclude that LIB is a promising theory-driven approach for analyzing neural networks, but in its current form is not applicable to large language models.

5/21/2024

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hanni, Cindy Wu, Marius Hobbhahn

Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involved in the computation being implemented by the network. These degenerate parameters may obfuscate internal structure. Singular learning theory teaches us that neural network parameterizations are biased towards being more degenerate, and parameterizations with more degeneracy are likely to generalize further. We identify 3 ways that network parameters can be degenerate: linear dependence between activations in a layer; linear dependence between gradients passed back to a layer; ReLUs which fire on the same subset of datapoints. We also present a heuristic argument that modular networks are likely to be more degenerate, and we develop a metric for identifying modules in a network that is based on this argument. We propose that if we can represent a neural network in a way that is invariant to reparameterizations that exploit the degeneracies, then this representation is likely to be more interpretable, and we provide some evidence that such a representation is likely to have sparser interactions. We introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to degeneracies from linear dependence of activations or Jacobians.

5/21/2024

✨

Learned feature representations are biased by complexity, learning order, position, and more

Andrew Kyle Lampinen, Stephanie C. Y. Chan, Katherine Hermann

Representation learning, and interpreting learned representations, are key areas of focus in machine learning and neuroscience. Both fields generally use representations as a means to understand or improve a system's computations. In this work, however, we explore surprising dissociations between representation and computation that may pose challenges for such efforts. We create datasets in which we attempt to match the computational role that different features play, while manipulating other properties of the features or the data. We train various deep learning architectures to compute these multiple abstract features about their inputs. We find that their learned feature representations are systematically biased towards representing some features more strongly than others, depending upon extraneous properties such as feature complexity, the order in which features are learned, and the distribution of features over the inputs. For example, features that are simpler to compute or learned first tend to be represented more strongly and densely than features that are more complex or learned later, even if all features are learned equally well. We also explore how these biases are affected by architectures, optimizers, and training regimes (e.g., in transformers, features decoded earlier in the output sequence also tend to be represented more strongly). Our results help to characterize the inductive biases of gradient-based representation learning. We then illustrate the downstream effects of these biases on various commonly-used methods for analyzing or intervening on representations. These results highlight a key challenge for interpretability $-$ or for comparing the representations of models and brains $-$ disentangling extraneous biases from the computationally important aspects of a system's internal representations.

9/24/2024

PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits

Maximilian Dreyer, Erblina Purelku, Johanna Vielhaben, Wojciech Samek, Sebastian Lapuschkin

The field of mechanistic interpretability aims to study the role of individual neurons in Deep Neural Networks. Single neurons, however, have the capability to act polysemantically and encode for multiple (unrelated) features, which renders their interpretation difficult. We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic virtual neurons. This is achieved by identifying the relevant sub-graph (circuit) for each pure feature. We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet. While evaluating feature visualizations using CLIP, our method effectively disentangles representations, improving upon methods based on neuron activations. Our code is available at https://github.com/maxdreyer/PURE.

4/10/2024