Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Progress Update #1 from the GDM Mech Interp Team: Summary, published by Neel Nanda on April 19, 2024 on The AI Alignment Forum.
Introduction
This is a progress update from the Google DeepMind mechanistic interpretability team, inspired by the Anthropic team's
excellent monthly updates! Our goal was to write-up a series of snippets, covering a range of things that we thought would be interesting to the broader community, but didn't yet meet our bar for a paper. This is a mix of promising initial steps on larger investigations, write-ups of small investigations, replications, and negative results.
Our team's two main current goals are to scale sparse autoencoders to larger models, and to do further basic science on SAEs. We expect these snippets to mostly be of interest to other mech interp practitioners, especially those working with SAEs. One exception is our infrastructure snippet, which we think could be useful to mechanistic interpretability researchers more broadly.
We present preliminary results in a range of areas to do with SAEs, from improving and interpreting steering vectors, to improving ghost grads, to replacing SAE encoders with an inference-time sparse approximation algorithm.
Where possible, we've tried to clearly state our level of confidence in our results, and the evidence that led us to these conclusions so you can evaluate for yourself. We expect to be wrong about at least some of the things in here! Please take this in the spirit of an interesting idea shared by a colleague at a lab meeting, rather than as polished pieces of research we're willing to stake our reputation on.
We hope to turn some of the more promising snippets into more fleshed out and rigorous papers at a later date.
We also have a forthcoming paper on an updated SAE architecture that seems to be a moderate Pareto-improvement, stay tuned!
How to read this post: This is a short summary post, accompanying the much longer post with all the snippets. We recommend reading the summaries of each snippet below, and then zooming in to whichever snippets seem most interesting to you. They can be read in any order.
Summaries
Activation Steering with SAEs
We analyse the steering vectors used in
Turner et. al, 2023 using SAEs. We find that they are highly interpretable, and that in some cases we can get better performance by constructing interpretable steering vectors from SAE features, though in other cases we struggle to. We hope to better disentangle what's going on in future works.
Replacing SAE Encoders with Inference-Time Optimisation
There are two sub-problems in dictionary learning, learning the dictionary of feature vectors (an SAE's decoder, $W_{dec}$ and computing the sparse coefficient vector on a given input (an SAE's encoder). The SAE's encoder is a linear map followed by a ReLU, which is a weak function with a range of issues. We explore disentangling these problems by taking a trained SAE, throwing away the encoder, keeping the decoder, and learning the sparse coefficients at inference-time.
This lets us study the question of how well the SAE encoder is working while holding the quality of the dictionary constant, and better evaluate the quality of different dictionaries.
One notable finding is that high L0 SAEs have higher quality dictionaries than low L0 SAEs, even if we learn coefficients with low L0 at inference time.
Improving Ghost Grads
In their January update, the Anthropic team introduced a new auxiliary loss, "ghost grads", as a potential improvement on resampling for minimising the number of dead features in a SAE. We replicate their work, and find that it under-performs resampling. We present an improvement, multiplying the ghost grads loss by the proportion of dead features, which makes ghost grads competitive.
We don't yet see a compelling reason to move away fro...
view more