Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We Inspected Every Head In GPT-2 Small using SAEs So You Don't Have To, published by robertzk on March 6, 2024 on The AI Alignment Forum.
This is an interim report that we are currently building on. We hope this update will be useful to related research occurring in parallel. Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.
Executive Summary
In a previous post we trained attention SAEs on every layer of GPT-2 Small and we found that a majority of features are interpretable in all layers. We've since leveraged our SAEs as a tool to explore individual attention heads through the lens of SAE features.
Using our SAEs, we inspect the roles of every attention head in GPT-2 Small, discovering a wide range of previously unidentified behaviors. We manually examined all 144 attention heads and provide brief descriptions in this spreadsheet. We note that this is a rough heuristic for getting a sense of a head's most salient effects and likely does not capture its role completely.
We observe that features become more abstract up to layer 9 and then less abstract in the later layers. We reached this conclusion by interpreting and conceptually grouping the top 10 features attributed to each of the 144 heads.
Working from the bottom layers to the top, we found that 39 of the 144 heads exhibited surprising feature groupings not seen in any earlier head.
We provide feature dashboards for each attention head.
To validate that our technique captures legitimate phenomena rather than spurious behaviors, we verified that our interpretations are consistent with previously studied heads in GPT-2 small. These include induction heads, previous token heads, successor heads and duplicate token heads. We note that our annotator mostly did not know a priori which heads had previously been studied.
To demonstrate that our SAEs can enable novel interpretability insights, we leverage our SAEs to develop a deeper understanding of why there are two induction heads in Layer 5. We show that one does standard induction and the other does "long prefix" induction (see the sketch following this summary).
We use our technique to investigate the prevalence of attention head polysemanticity. We think that the vast majority of heads (>90%) are performing multiple tasks, but also narrow down a set of 14 candidate heads that are plausibly monosemantic.
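To make the "long prefix" distinction above concrete, here is a minimal behavioral sketch of how one could compare the two Layer 5 induction heads. This is our own illustration rather than the authors' experiment: the prompt construction, sequence lengths, token ranges, and prefix lengths below are all assumptions, and the attribution in the post itself comes from the SAEs rather than from raw attention patterns.

```python
import torch
from transformer_lens import HookedTransformer

# Sketch: probe standard vs. "long prefix" induction behaviorally in GPT-2 Small.
model = HookedTransformer.from_pretrained("gpt2")
torch.manual_seed(0)

def induction_attention(layer: int, head: int, prefix_len: int, n_ctx: int = 20) -> float:
    """Attention from a repeated token A back to the token B that followed its
    first occurrence, when `prefix_len` tokens before A also match."""
    rand = lambda n: torch.randint(1000, 10000, (1, n))
    pattern_block = rand(prefix_len + 2)   # [P1 .. Pk, A, B]
    repeat_block = pattern_block[:, :-1]   # [P1 .. Pk, A]  (B is what induction predicts)
    tokens = torch.cat([rand(n_ctx), pattern_block, rand(n_ctx), repeat_block], dim=1)
    _, cache = model.run_with_cache(tokens)
    attn = cache["pattern", layer][0, head]   # (query_pos, key_pos) attention probabilities
    query_pos = tokens.shape[1] - 1           # the second occurrence of A
    key_pos = n_ctx + prefix_len + 1          # position of B after the first A
    return attn[query_pos, key_pos].item()

# A standard induction head should attend to B even with a 1-token match,
# while a long-prefix induction head should require a longer matching prefix.
for layer, head in [(5, 1), (5, 5)]:
    short = induction_attention(layer, head, prefix_len=1)
    long_ = induction_attention(layer, head, prefix_len=8)
    print(f"head {layer}.{head}: prefix=1 -> {short:.2f}, prefix=8 -> {long_:.2f}")
```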
Introduction
In previous work, we trained and open sourced a set of attention SAEs on all 12 layers of GPT-2 Small. We found that random SAE features in each layer were highly interpretable, and we highlighted a set of interesting feature families. We've since leveraged our SAEs as a tool to interpret the roles of attention heads. The key idea of the technique is that although our SAEs are trained to reconstruct an entire attention layer, the contribution to each individual head can be inferred.
This allows us to find the top 10 features most salient to a given head and to note whenever a pattern suggests a role for that head. We used this to manually inspect the role of every head in GPT-2 Small, and we spend the rest of this post exploring the implications of our findings and of the technique.
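As a concrete illustration of that decomposition, the sketch below shows one way the per-head attribution could be computed from an attention SAE's decoder weights. This is our reconstruction, not the authors' released code: the dictionary size is hypothetical, and the use of per-head decoder-norm fractions as the attribution score is an assumption.

```python
import torch

# GPT-2 Small attention shapes; the SAE is assumed to be trained on the
# concatenated per-head outputs (hook_z), so its input width is n_heads * d_head.
n_heads, d_head = 12, 64
d_sae = 24576  # illustrative dictionary size, not the authors' exact value

# Stand-in for a trained decoder matrix of shape (d_sae, n_heads * d_head).
W_dec = torch.randn(d_sae, n_heads * d_head)

def head_attribution(W_dec: torch.Tensor) -> torch.Tensor:
    """Fraction of each feature's decoder norm that falls in each head's slice
    of the concatenated z-space. Returns a (d_sae, n_heads) tensor."""
    per_head = W_dec.view(-1, n_heads, d_head)   # (d_sae, n_heads, d_head)
    norms = per_head.norm(dim=-1)                # (d_sae, n_heads)
    return norms / norms.sum(dim=-1, keepdim=True)

def top_features_for_head(W_dec: torch.Tensor, head: int, k: int = 10) -> torch.Tensor:
    """Indices of the k features most strongly attributed to `head`."""
    return torch.topk(head_attribution(W_dec)[:, head], k).indices

print(top_features_for_head(W_dec, head=7))
```

With a real decoder in place of the random stand-in, inspecting the top-k features per head (e.g. via their dashboards) is what yields the per-head descriptions referenced above.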
In the spirit of An Overview of Early Vision in InceptionV1, we start with a high-level, guided tour of the different behaviors implemented by heads across every layer, building better intuitions for what attention heads learn in a real language model.
To validate that the technique is telling us something real about the roles of these heads, we confirm that our interpretations match previously studied heads. We note that our annotator mostly did not know a priori which heads had previously been studied. We find:
Induction heads (5.1, 5.5, 6.9, 7.2, 7.10)
Previous token heads (4.11)
Copy suppression head (10.7)
Duplicate token heads (3.0)
Successor head (9.1)
In addition to building intuition about wh...