Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding SAE Features with the Logit Lens, published by Joseph Isaac Bloom on March 11, 2024 on The AI Alignment Forum.
This work was produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with support from Neel Nanda and Arthur Conmy. Joseph Bloom is funded by the LTFF, Manifund Regranting Program, donors and LightSpeed Grants. This post makes extensive use of Neuronpedia, an interpretability platform focused on accelerating researchers working with SAEs.
Links: SAEs on HuggingFace, Analysis Code
Executive Summary
This is an informal post sharing statistical methods which can be used to quickly and cheaply better understand Sparse Autoencoder (SAE) features.
Firstly, we use statistics (standard deviation, skewness and kurtosis) of the logit weight distributions of features (W_U·W_dec[feature]) to characterize classes of features, showing that many features can be understood as promoting or suppressing interpretable classes of tokens.
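As a rough illustration, a minimal sketch of computing these summary statistics is below; it uses random stand-in arrays rather than the actual GPT2-small and SAE weights (the real analysis code is linked above).

```python
# Minimal sketch (not the post's exact analysis code): given an SAE decoder
# matrix W_dec of shape [n_features, d_model] and an unembedding matrix W_U
# of shape [d_model, d_vocab], summarize each feature's logit weight distribution.
import numpy as np
from scipy import stats

def logit_weight_summary(W_dec: np.ndarray, W_U: np.ndarray) -> dict:
    logit_weights = W_dec @ W_U  # shape [n_features, d_vocab]
    return {
        "std": logit_weights.std(axis=1),
        "skewness": stats.skew(logit_weights, axis=1),
        "kurtosis": stats.kurtosis(logit_weights, axis=1),
    }

# Toy usage with random stand-ins for the real GPT2-small / SAE weights:
rng = np.random.default_rng(0)
summary = logit_weight_summary(rng.normal(size=(16, 64)), rng.normal(size=(64, 1000)))
print(summary["skewness"].shape)  # one value per feature
```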
We propose 3 different kinds of features, analogous to previously characterized "universal neurons":
Partition Features, which (somewhat) promote half the tokens and suppress the other half according to capitalization and spaces (example pictured below)
Suppression Features, which act like partition features but are more asymmetric.
Prediction Features, which promote tokens in classes of varying sizes, ranging from promoting tokens that contain a close bracket to promoting all present tense verbs.
Secondly, we propose a statistical test for whether a feature's output direction is trying to distinguish tokens in some set (e.g. all caps tokens) from the rest.
We borrowed this technique from systems biology, where it is frequently used at scale.
The key limitation here is that we need to know in advance which sets of tokens are promoted / inhibited.
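For concreteness, here is a minimal sketch of such a set-based test. The specific statistic below (a Mann-Whitney U rank test comparing in-set to out-of-set logit weights) is an assumed, illustrative choice analogous to enrichment tests in systems biology, not necessarily the exact test used in the post.

```python
# Illustrative sketch of a set-based test: does a feature's logit weight
# distribution differ for tokens inside a given set vs. all other tokens?
# The Mann-Whitney U (rank) test here is an assumed concrete choice.
import numpy as np
from scipy.stats import mannwhitneyu

def token_set_enrichment(logit_weights: np.ndarray, in_set: np.ndarray) -> float:
    """logit_weights: [d_vocab] logit weights for one feature.
    in_set: boolean mask of length d_vocab marking the token set (e.g. all-caps tokens)."""
    return mannwhitneyu(logit_weights[in_set], logit_weights[~in_set]).pvalue

# Toy usage: a feature that boosts the first 100 tokens yields a tiny p-value.
rng = np.random.default_rng(0)
weights = rng.normal(size=1000)
mask = np.zeros(1000, dtype=bool)
mask[:100] = True
weights[mask] += 2.0
print(token_set_enrichment(weights, mask))
```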
Lastly, we demonstrate the utility of the set-based technique by using it to locate features which enrich token categories of interest (defined by regex patterns, the NLTK part-of-speech tagger, and common baby names for boys/girls).
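A sketch of how such token categories might be constructed is below; the regex and the small name list are hypothetical stand-ins for the categories described in the post, and the NLTK part-of-speech tagger can be used analogously to build sets like "present tense verbs".

```python
# Illustrative token categories over a decoded vocabulary (stand-in `vocab`),
# not the post's exact definitions.
import re

vocab = [" the", " Close", ")", " running", " Alice", "HTTP", " Bob"]  # stand-in vocabulary

# Regex-defined set: tokens beginning with a space followed by a capital letter.
space_capital = {t for t in vocab if re.match(r"^ [A-Z]", t)}

# Name-list-defined set: tokens matching common baby names.
boy_names = {"Bob", "James", "Oliver"}
name_tokens = {t for t in vocab if t.strip() in boy_names}

print(space_capital, name_tokens)
```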
Feature 4467. Above: Feature Dashboard Screenshot from Neuronpedia. It is not immediately obvious from the dashboard what this feature does. Below: Logit Weight distribution classified by whether the token starts with a space, clearly indicating that this feature promotes tokens which lack an initial space character.
Introduction
In previous work, we trained and open-sourced a set of sparse autoencoders (SAEs) on the residual stream of GPT2 small. In collaboration with Neuronpedia, we've produced feature dashboards, auto-interpretability explanations and browsing interfaces for ~300k+ features. The analysis in this post is performed on features from the layer 8 residual stream of GPT2 small (for no particular reason).
SAEs might enable us to decompose model internals into interpretable components. Currently, we don't have a good way to measure interpretability at scale, but we can generate feature dashboards which show things like how often a feature fires, its direct effect on tokens being sampled (the logit weight distribution), and when it fires (see examples of feature dashboards below). Interpreting the logit weight distribution in feature dashboards for multi-layer models implicitly uses the Logit Lens, a very popular technique in mechanistic interpretability. Applying the logit lens to features means that we compute the product of a feature direction and the unembedding (W_U·W_dec[feature]), referred to as the "logit weight distribution".
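In code, the feature-level logit lens might look like the following sketch, where `model` and `sae` are hypothetical TransformerLens-style objects with `W_U` and `W_dec` attributes (names may differ from the released SAEs and analysis code).

```python
# Sketch of the feature-level logit lens, assuming hypothetical objects
# `model` (with W_U and a tokenizer) and `sae` (with W_dec).
import torch

def feature_logit_lens(sae, model, feature_idx: int, k: int = 10):
    # Logit weight distribution for a single feature: W_dec[feature] @ W_U.
    logit_weights = sae.W_dec[feature_idx] @ model.W_U  # shape [d_vocab]
    top = torch.topk(logit_weights, k)
    bottom = torch.topk(-logit_weights, k)
    promoted = [model.tokenizer.decode([int(i)]) for i in top.indices]
    suppressed = [model.tokenizer.decode([int(i)]) for i in bottom.indices]
    return promoted, suppressed
```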
Since SAEs haven't been around for very long, we don't yet know what the logit weight distributions typically look like for SAE features. Moreover, we find that the form of the logit weight distribution can vary greatly. In most cases we see a vaguely normal distribution and s...