A Selection of Randomly Selected SAE Features, published by CallumMcDougall on April 1, 2024 on The AI Alignment Forum.
Epistemic status - self-evident.
In this post, we interpret a small sample of Sparse Autoencoder features, which reveal meaningful computational structure in the model that is clearly researcher-independent and of significant relevance to AI alignment.
Motivation
Recent excitement about Sparse Autoencoders (SAEs) has been dogged by the following question: do SAE features reflect properties of the model, or do they merely capture correlational structure in the underlying data distribution?
While a full answer to this question is important and will take deliberate investigation, we note that researchers who have spent large amounts of time interacting with feature dashboards think it likely that SAE features capture highly non-trivial information about the underlying models.
Evidently, SAEs are the one true answer to ontology identification, and as evidence of this we show how initially uninterpretable features often become quite interpretable with further investigation and tweaking of dashboards. In each case, we describe how we make the best possible use of feature dashboards to ensure we aren't fooling ourselves or reading tea leaves.
Note - to better understand these results, we highly recommend that readers unfamiliar with SAE feature dashboards briefly refer to the relevant section of Anthropic's publication (whose dashboard structure we emulate below). TLDR - to understand what concepts are encoded by a feature, we look for patterns in the text that cause it to activate most strongly.
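For readers who would rather see this in code than in a dashboard, here is a minimal sketch of surfacing the max-activating examples for a single SAE feature. All names and shapes are illustrative assumptions (a standard ReLU SAE encoder applied to cached residual-stream activations), not the dashboards' actual implementation:

```python
import torch

def top_activating_examples(tokens, resid_acts, W_enc, b_enc, feature_idx, k=10):
    """Return the k (batch, position, activation) triples where one SAE feature fires hardest.

    tokens:     (batch, seq) integer token ids, used downstream for display
    resid_acts: (batch, seq, d_model) cached residual-stream activations
    W_enc:      (d_model, d_sae) SAE encoder weights
    b_enc:      (d_sae,) SAE encoder bias
    """
    # Standard SAE encoder: ReLU of an affine map of the residual stream.
    feature_acts = torch.relu(resid_acts @ W_enc + b_enc)[..., feature_idx]

    # Flatten over (batch, position) and take the top-k activations.
    top_vals, top_idx = feature_acts.flatten().topk(k)
    seq_len = tokens.shape[1]
    return [(i // seq_len, i % seq_len, v)
            for i, v in zip(top_idx.tolist(), top_vals.tolist())]
```

Rendering the surrounding text for each returned (batch, position) pair is exactly the "look for patterns in the text" exercise that the dashboards automate.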
Case Studies in SAE Features
Scripture Feature
We open with a feature that seems to activate strongly on examples of sacred text, specifically from the works of Christianity.
Even though interpreting SAEs seems bad, and it can really make you mad, seeing features like this reminds us to always look on the bright side of life.
Perseverance Feature
We register lower confidence in this feature than in others, but the top activating examples all seem to present a consistent theme of perseverance and loyalty in the face of immense struggle (this was confirmed with GPT4[1]). We're very excited by how semantic, rather than merely syntactic, this feature is, since a huge barrier to future progress in dictionary learning is whether we can find features associated with high-level semantic concepts like these.
Teamwork Feature
We were very surprised by this one, given that the training data for our models was all dated 2022 or earlier. We welcome any and all theories here.
Deciphering quantized feature activations can be highly informative
Most analyses of SAE features have not directly attempted to understand the significance of feature activation strength, but we've found that it can be highly informative. Take this feature, for example.
Due to the apparently highly quantized pattern of activation, we decided to attempt decoding the max-activating sequences using the Morse-code-based mapping {0.0: '/', 0.2: ' ', 1.0: '.', 2.0: '-'}. When we tried this, we found the following pattern:
Which, translated from Morse code, reads as:
We weren't sure exactly what to make of this, but more investigation is definitely advisable.
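To make the decoding procedure concrete, here is a minimal sketch. The activation-to-symbol mapping is the one given above; the input activation sequence and the two-letter Morse table are invented purely for illustration and are not the feature's actual activations:

```python
# Sketch of the activation-to-Morse decoding described above.
ACT_TO_MORSE = {0.0: '/', 0.2: ' ', 1.0: '.', 2.0: '-'}

# Tiny illustrative subset of the Morse table, enough to decode the example below.
MORSE_TO_CHAR = {'....': 'H', '..': 'I'}

def decode_activations(acts, tol=0.05):
    """Snap each activation to the nearest quantization level, then read off Morse."""
    symbols = []
    for a in acts:
        # Quantize: pick the mapping key closest to the observed activation.
        key = min(ACT_TO_MORSE, key=lambda k: abs(k - a))
        if abs(key - a) > tol:
            raise ValueError(f"activation {a} is not near any quantization level")
        symbols.append(ACT_TO_MORSE[key])
    morse = ''.join(symbols)
    # In Morse conventions, ' ' separates letters and '/' separates words.
    words = [w.split(' ') for w in morse.split('/')]
    return ' '.join(''.join(MORSE_TO_CHAR.get(ltr, '?') for ltr in word)
                    for word in words)

# Invented activation sequence; decodes to "HI".
print(decode_activations([1.0, 1.0, 1.0, 1.0, 0.2, 1.0, 1.0]))
```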
Lesson - visualize activations on full prompts to better understand features!
One feature which at first appeared uninterpretable is pictured below. Clearly this feature fires on DNA strings, but what is it actually tracking?
Showing a larger context around the max-activating tokens, we begin to see what might be an interpretable pattern in the max-activating examples.
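Concretely, widening the context is just a matter of decoding a larger window of tokens around each max-activating position. A minimal sketch follows; the tokenizer and token tensor are hypothetical stand-ins for whatever your pipeline produces:

```python
def show_wider_context(tokenizer, tokens, batch_idx, pos_idx, window=64):
    """Print a wider slice of the prompt around one max-activating token.

    tokenizer: anything with a .decode method over token ids (e.g. HuggingFace-style)
    tokens:    (batch, seq) token ids
    """
    seq = tokens[batch_idx].tolist()
    start, end = max(0, pos_idx - window), min(len(seq), pos_idx + window)
    # Bracket the max-activating token so it stands out in the wider context.
    left = tokenizer.decode(seq[start:pos_idx])
    mid = tokenizer.decode(seq[pos_idx:pos_idx + 1])
    right = tokenizer.decode(seq[pos_idx + 1:end])
    print(f"{left}[[{mid}]]{right}")
```

Doubling `window` and re-rendering is the "one more time" step described next.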
We did this one more time, and revealed that this is in fact a feature which fires on DNA sequences from the species Rattus norvegicus (Japanese variants in particular). We leave it...