Transcoders Find Interpretable LLM Feature Circuits

Read original: arXiv:2406.11944 - Published 6/19/2024 by Jacob Dunefsky, Philippe Chlenski, Neel Nanda

Overview

This paper studies how to find interpretable "feature circuits" in large language models (LLMs): sparse subgraphs of a model that correspond to specific behaviors or capabilities.
The researchers use "transcoders": wider, sparsely activating MLP layers trained to faithfully approximate a model's original, densely activating MLP layers.
Their approach enables weights-based circuit analysis through MLP sublayers, and the resulting circuits factorize into input-dependent and input-invariant terms.

Plain English Explanation

The researchers in this paper wanted to better understand how large language models (LLMs) such as GPT2-small work under the hood. LLMs are powerful, but it is hard to know exactly what happens inside them to produce their outputs.

To address this, the researchers used a special type of model component called a "transcoder." A transcoder is trained to imitate one of the LLM's MLP sublayers: it takes the same input and tries to reproduce the same output, but it does so with a much wider hidden layer in which only a few units are active at a time. Because those sparse units are easier for humans to interpret than the original neurons, the researchers can trace how features feed into one another and map out the circuit behind a specific behavior. As a case study, they examine the circuit that GPT2-small uses for greater-than comparisons.

By studying these interpretable "feature circuits," the researchers hope to gain deeper insight into how LLMs compute their outputs. This could support more transparent and controllable AI systems in the future.

Technical Explanation

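The key innovation in this paper is the use of "transcoders": wider, sparsely activating MLP layers trained to faithfully approximate the densely activating MLP sublayers of a transformer language model. The motivation is that MLP sublayers make fine-grained circuit analysis difficult. Interpretable features, such as those found by sparse autoencoders (SAEs), are typically linear combinations of very many neurons, each with its own nonlinearity, so circuit analysis through an MLP either yields intractably large circuits or fails to disentangle local and global behavior.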
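To make the idea of a transcoder concrete, here is a minimal, dependency-free sketch of a wide, sparsely activating layer that maps an MLP sublayer's input to a prediction of that sublayer's output. It is an illustration under stated assumptions, not the authors' implementation: the class name `Transcoder`, the ReLU encoder, the random initialization, and the `l1_coeff` sparsity weight are all hypothetical choices made for this example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class Transcoder:
    """Illustrative wide, sparse stand-in for one MLP sublayer (hypothetical)."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        # Encoder: one row per transcoder feature, reading from the MLP input.
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        # Decoder: maps feature activations back to the MLP output space.
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, mlp_in):
        pre = matvec(self.w_enc, mlp_in)
        # ReLU keeps activations non-negative; training pushes most to zero.
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, acts):
        out = matvec(self.w_dec, acts)
        return [o + b for o, b in zip(out, self.b_dec)]

    def forward(self, mlp_in):
        acts = self.encode(mlp_in)
        return acts, self.decode(acts)


def transcoder_loss(transcoder, mlp_in, mlp_out, l1_coeff=1e-3):
    """Faithfulness (MSE against the real MLP output) plus an L1 sparsity term."""
    acts, pred = transcoder.forward(mlp_in)
    mse = sum((p - t) ** 2 for p, t in zip(pred, mlp_out)) / len(mlp_out)
    l1 = sum(abs(a) for a in acts)
    return mse + l1_coeff * l1


# Toy usage: a 4-dimensional "residual stream" and a 16-feature transcoder.
tc = Transcoder(d_model=4, d_hidden=16)
acts, pred = tc.forward([1.0, 0.0, -1.0, 0.5])
```

The point of the sketch is the training target: an SAE reconstructs the same activations it receives, whereas the transcoder here reads the MLP's input and is scored against the MLP's output, so it stands in for the whole sublayer.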

The researchers trained transcoders on language models with 120M, 410M, and 1.4B parameters and found them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human interpretability. They then introduced a method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits factorize into input-dependent and input-invariant terms.
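The paper's abstract states that transcoder circuits factorize into input-dependent and input-invariant terms. One way to picture this is the sketch below, which assumes the simplest case: an upstream transcoder feature writes its decoder vector into the residual stream, and a downstream transcoder feature reads it with its encoder vector, ignoring attention and normalization in between. The function and variable names are hypothetical, and the toy vectors are made up for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def feature_attribution(upstream_act, upstream_dec_vec, downstream_enc_vec):
    """Contribution of one upstream feature to a downstream feature's input.

    Assumes a direct path through the residual stream (an illustrative
    simplification that skips attention and layer normalization).
    """
    # Input-invariant term: depends only on the two transcoders' weights.
    weight_term = dot(upstream_dec_vec, downstream_enc_vec)
    # Input-dependent term: how strongly the upstream feature fired here.
    return upstream_act * weight_term


def rank_upstream_features(acts, dec_vecs, downstream_enc_vec):
    """Sort upstream feature indices by absolute attribution, largest first."""
    scores = [feature_attribution(a, d, downstream_enc_vec)
              for a, d in zip(acts, dec_vecs)]
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)


# Toy example with made-up numbers.
dec_vec = [1.0, 0.0, 2.0]    # upstream feature's decoder vector
enc_vec = [0.5, 1.0, 0.25]   # downstream feature's encoder vector
contribution = feature_attribution(3.0, dec_vec, enc_vec)
```

Under these assumptions the weight term can be computed once from the transcoder weights alone and reused for every input, while the activation term changes with each prompt. A feature that does not fire on a given input contributes nothing, regardless of how strongly the weights connect it downstream.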

Finally, the researchers applied transcoders to reverse-engineer previously unknown circuits, and they report novel insights into the greater-than circuit in GPT2-small. They conclude that transcoders can be effective for decomposing model computations involving MLPs into interpretable circuits.

Critical Analysis

The techniques described in this paper represent an important step towards opening up the "black box" of large language models. By providing a window into the inner workings of LLMs, the researchers have laid the groundwork for more interpretable and accountable AI systems.

However, a transcoder only approximates the MLP sublayer it replaces, so any circuit found this way is a description of the approximation and is only as trustworthy as the transcoder is faithful. Additionally, the researchers trained transcoders on models with at most 1.4B parameters, and the circuit case study highlighted in the abstract concerns GPT2-small, so the generalizability of their findings to larger models and other tasks remains to be seen.

Further research is needed to explore how these feature circuits interact and how they change as LLMs are trained on ever-larger and more diverse datasets. It will also be important to investigate potential biases or blind spots that may be encoded in the learned circuits, and to develop techniques to actively shape and control their development.

Conclusion

The work presented in this paper represents an important step towards making large language models more interpretable and accountable. By replacing densely activating MLP sublayers with wider, sparsely activating transcoders, the researchers have provided a new tool for tracing model computations through MLPs and for understanding how these complex AI systems work.

The transcoders performed at least on par with SAEs in sparsity, faithfulness, and human interpretability on models with 120M, 410M, and 1.4B parameters, and they yielded novel insights into the greater-than circuit in GPT2-small. This suggests that transcoders can be effective for decomposing model computations involving MLPs into interpretable circuits. Additionally, this work lays the groundwork for more targeted interventions to address potential biases or limitations in language models.

Overall, this research represents a significant advance in the field of AI interpretability, with promising implications for the development of more trustworthy and beneficial language technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Transcoders Find Interpretable LLM Feature Circuits

Jacob Dunefsky, Philippe Chlenski, Neel Nanda

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. We then introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the greater-than circuit in GPT2-small. Our results suggest that transcoders can prove effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at https://github.com/jacobdunefsky/transcoder_circuits.

6/19/2024

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Charles O'Neill, Thang Bui

This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses key limitations of existing techniques, namely computational complexity and sensitivity to hyperparameters. We propose training sparse autoencoders on carefully designed positive and negative examples, where the model can only correctly predict the next token for the positive examples. We hypothesise that learned representations of attention head outputs will signal when a head is engaged in specific computations. By discretising the learned representations into integer codes and measuring the overlap between codes unique to positive examples for each head, we enable direct identification of attention heads involved in circuits without the need for expensive ablations or architectural modifications. On three well-studied tasks - indirect object identification, greater-than comparisons, and docstring completion - the proposed method achieves higher precision and recall in recovering ground-truth circuits compared to state-of-the-art baselines, while reducing runtime from hours to seconds. Notably, we require only 5-10 text examples for each task to learn robust representations. Our findings highlight the promise of discrete sparse autoencoders for scalable and efficient mechanistic interpretability, offering a new direction for analysing the inner workings of large language models.

5/22/2024

Automatically Identifying Local and Global Circuits with Linear Computation Graphs

Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, Xipeng Qiu

Circuit analysis of any certain model behavior is a central task in mechanistic interpretability. We introduce our circuit discovery pipeline with Sparse Autoencoders (SAEs) and a variant called Transcoders. With these two modules inserted into the model, the model's computation graph with respect to OV and MLP circuits becomes strictly linear. Our methods do not require linear approximation to compute the causal effect of each node. This fine-grained graph identifies both end-to-end and local circuits accounting for either logits or intermediate features. We can scalably apply this pipeline with a technique called Hierarchical Attribution. We analyze three kinds of circuits in GPT-2 Small: bracket, induction, and Indirect Object Identification circuits. Our results reveal new findings underlying existing discoveries.

7/23/2024

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models

Michael Lan, Philip Torr, Fazl Barez

While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of Arabic numerals, number words, and months. By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Additionally, we show that this sub-circuit has effects on various math-related prompts, such as on intervaled circuits, Spanish number word and months continuation, and natural language word problems. Overall, documenting shared computational structures enables better model behavior predictions, identification of errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.

7/22/2024