Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sparsify: A mechanistic interpretability research agenda, published by Lee Sharkey on April 3, 2024 on The AI Alignment Forum.
Over the last couple of years, mechanistic interpretability has seen substantial progress. Part of this progress has been enabled by the identification of superposition as a key barrier to understanding neural networks (Elhage et al., 2022) and the identification of sparse autoencoders as a solution to superposition (Sharkey et al., 2022; Cunningham et al., 2023; Bricken et al., 2023).
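To make the sparse autoencoder idea concrete, here is a minimal sketch in PyTorch of the general recipe these papers share: a single hidden layer trained to reconstruct model activations under an L1 sparsity penalty. All dimensions, names, and hyperparameters below are illustrative assumptions, not the settings used in any of the cited papers.

```python
# Minimal sketch of a sparse autoencoder (SAE) for decomposing model
# activations into sparse feature directions. All sizes and the L1
# coefficient are illustrative placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction accuracy traded off against sparsity of features.
    recon = ((x - x_hat) ** 2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity

# Usage: train on activations collected from some layer of a model.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(64, 512)               # stand-in for real activations
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
```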
From our current vantage point, I think there's a relatively clear roadmap toward a world where mechanistic interpretability is useful for safety. This post outlines my views on what progress in mechanistic interpretability looks like and what I think is achievable by the field in the next 2+ years. It represents a rough outline of what I plan to work on in the near future.
My thinking and work are, of course, heavily inspired by the work of Chris Olah, other Anthropic researchers, and other early mechanistic interpretability researchers. In addition to sharing some personal takes, this article brings together, in one place, various goals and ideas that are already floating around the community.
It proposes a concrete potential path for how we might get from where we are today in mechanistic interpretability to a world where we can meaningfully use it to improve AI safety.
Key frameworks for understanding the agenda
Framework 1: The three steps of mechanistic interpretability
I think of mechanistic interpretability in terms of three steps[1]:
Mathematical description: In the first step, we break the neural network into constituent parts, where the parts are simply unlabelled mathematical objects. These may be e.g. neurons, polytopes, circuits, feature directions (identified using SVD/NMF/SAEs), individual parameters, singular vectors of the weight matrices, or other subcomponents of a network.
Semantic description: Next, we generate semantic interpretations of these mathematical objects (e.g. through feature labeling). In other words, we try to build a conceptual model of what each component of the network does.
Validation: We need to validate our explanations to ensure they make good predictions about network behavior. For instance, we should be able to predict that ablating a feature with a purported 'meaning' (such as the 'noun gender feature') will have certain predictable effects that make sense given its purported meaning (such as the network becoming unable to assign the appropriate definite article to nouns). A sketch of this kind of ablation test follows this list.
If our explanations can't be validated, then we need to identify new mathematical objects and/or find better semantic descriptions.
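To illustrate the validation step, here is a hedged sketch of a feature-ablation test, reusing the illustrative SparseAutoencoder from the earlier sketch. The feature index, activation shapes, and the downstream behavioral check are all placeholder assumptions.

```python
# Sketch: test a feature's purported meaning by ablating it and
# checking the effect on model behavior. `sae` is the illustrative
# autoencoder from the previous sketch; `acts` stands in for the
# activations of the layer the SAE was trained on.
import torch

@torch.no_grad()
def ablate_feature(sae, acts: torch.Tensor, feature_idx: int) -> torch.Tensor:
    """Reconstruct activations with one SAE feature zeroed out."""
    f = torch.relu(sae.encoder(acts))
    f[:, feature_idx] = 0.0                # ablate the candidate feature
    return sae.decoder(f)

acts = torch.randn(8, 512)                 # stand-in for real activations
patched = ablate_feature(sae, acts, feature_idx=1337)

# The patched activations would then be substituted back into the
# forward pass (e.g. via a hook) to check whether the predicted
# behavioral change (e.g. failing to pick the right definite
# article) actually occurs.
```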
The field of mechanistic interpretability has repeated this three-step cycle a few times, cycling through explanations given in terms of neurons, then other objects such as SVD/NMF directions or polytopes, and most recently SAE directions.
My research over the last couple of years has focused primarily on identifying the right mathematical objects for mechanistic explanations. I expect there's still plenty of work to do on this step in the next two years or so (more on this later). To guide intuitions about how I plan to pursue this, it's important to understand what makes some mathematical objects better than others. For this, we have to look at the description accuracy vs. description length tradeoff.
Framework 2: The description accuracy vs. description length tradeoff
You would feel pretty dissatisfied if you asked someone for a mechanistic explanation of a neural network and they proceeded to read out the float values of the weights. But why is this dissatisfying? Two reasons:
When describing the mechanisms of any system, be it an engine, a solar system, o...
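One way to make the accuracy-vs.-length tradeoff precise is a minimum-description-length-style objective; the particular form below is my illustrative framing, not a formula from the post:

$$
\min_{\text{description } d}\;\; \underbrace{\operatorname{error}(\text{network},\, d)}_{\text{description inaccuracy}} \;+\; \lambda\, \underbrace{L(d)}_{\text{description length}}
$$

Reading out the raw weights sits at one extreme of this objective: zero error, but a description as long as the network itself. The SAE loss sketched earlier, reconstruction error plus an L1 penalty, is one practical instance of the same tradeoff.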