Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We Inspected Every Head In GPT-2 Small using SAEs So You Don't Have To, published by robertzk on March 6, 2024 on The AI Alignment Forum.
This is an interim report that we are currently building on. We hope this update will be useful to related research occurring in parallel. Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.
Executive Summary
In a previous post we trained attention SAEs on every layer of GPT-2 Small and we found that a majority of features are interpretable in all layers. We've since leveraged our SAEs as a tool to explore individual attention heads through the lens of SAE features.
Using our SAEs, we inspect the roles of every attention head in GPT-2 Small, discovering a wide range of previously unidentified behaviors. We manually examined all 144 attention heads and provide brief descriptions in this spreadsheet. We note that this is a rough heuristic for getting a sense of a head's most salient effects and likely does not capture its role completely.
We observe that features become more abstract up to layer 9 and then less abstract in the later layers. We reached this conclusion by interpreting and conceptually grouping the top 10 features attributed to each of the 144 heads.
Working from the bottom layers to the top, we found that 39 of the 144 heads exhibited surprising feature groupings not seen in any earlier head.
We provide feature dashboards for each attention head.
To validate that our technique captures legitimate phenomena rather than spurious behaviors, we verified that our interpretations are consistent with previously studied heads in GPT-2 small. These include induction heads, previous token heads, successor heads and duplicate token heads. We note that our annotator mostly did not know a priori which heads had previously been studied.
To demonstrate that our SAEs can enable novel interpretability insights, we leverage our SAEs to develop a deeper understanding of why there are two induction heads in Layer 5. We show that one does standard induction and the other does "long prefix" induction (see the sketch following this summary).
We use our technique to investigate the prevalence of attention head polysemanticity. We think that the vast majority of heads (>90%) are performing multiple tasks, but also narrow down a set of 14 candidate heads that are plausibly monosemantic.
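To make the "long prefix" distinction above concrete, here is a minimal behavioral sketch of how one could compare the two Layer 5 induction heads. This is our own illustration rather than the authors' experiment: the prompt construction, sequence lengths, token ranges, and prefix lengths below are all assumptions, and the attribution in the post itself comes from the SAEs rather than from raw attention patterns.

```python
import torch
from transformer_lens import HookedTransformer

# Sketch: probe standard vs. "long prefix" induction behaviorally in GPT-2 Small.
model = HookedTransformer.from_pretrained("gpt2")
torch.manual_seed(0)

def induction_attention(layer: int, head: int, prefix_len: int, n_ctx: int = 20) -> float:
    """Attention from a repeated token A back to the token B that followed its
    first occurrence, when `prefix_len` tokens before A also match."""
    rand = lambda n: torch.randint(1000, 10000, (1, n))
    pattern_block = rand(prefix_len + 2)   # [P1 .. Pk, A, B]
    repeat_block = pattern_block[:, :-1]   # [P1 .. Pk, A]  (B is what induction predicts)
    tokens = torch.cat([rand(n_ctx), pattern_block, rand(n_ctx), repeat_block], dim=1)
    _, cache = model.run_with_cache(tokens)
    attn = cache["pattern", layer][0, head]   # (query_pos, key_pos) attention probabilities
    query_pos = tokens.shape[1] - 1           # the second occurrence of A
    key_pos = n_ctx + prefix_len + 1          # position of B after the first A
    return attn[query_pos, key_pos].item()

# A standard induction head should attend to B even with a 1-token match,
# while a long-prefix induction head should require a longer matching prefix.
for layer, head in [(5, 1), (5, 5)]:
    short = induction_attention(layer, head, prefix_len=1)
    long_ = induction_attention(layer, head, prefix_len=8)
    print(f"head {layer}.{head}: prefix=1 -> {short:.2f}, prefix=8 -> {long_:.2f}")
```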
Introduction
In previous work, we trained and open sourced a set of attention SAEs on all 12 layers of GPT-2 Small. We found that random SAE features in each layer were highly interpretable, and we highlighted a set of interesting feature families. We've since leveraged our SAEs as a tool to interpret the roles of attention heads. The key idea of the technique is that although our SAEs are trained to reconstruct an entire attention layer, the contribution to each individual head can be inferred.
This allows us to find the top 10 features most salient to a given head and to note whenever a pattern suggests a role for that head. We used this to manually inspect the role of every head in GPT-2 Small, and we spend the rest of this post exploring the implications of our findings and of the technique.
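As a concrete illustration of that decomposition, the sketch below shows one way the per-head attribution could be computed from an attention SAE's decoder weights. This is our reconstruction, not the authors' released code: the dictionary size is hypothetical, and the use of per-head decoder-norm fractions as the attribution score is an assumption.

```python
import torch

# GPT-2 Small attention shapes; the SAE is assumed to be trained on the
# concatenated per-head outputs (hook_z), so its input width is n_heads * d_head.
n_heads, d_head = 12, 64
d_sae = 24576  # illustrative dictionary size, not the authors' exact value

# Stand-in for a trained decoder matrix of shape (d_sae, n_heads * d_head).
W_dec = torch.randn(d_sae, n_heads * d_head)

def head_attribution(W_dec: torch.Tensor) -> torch.Tensor:
    """Fraction of each feature's decoder norm that falls in each head's slice
    of the concatenated z-space. Returns a (d_sae, n_heads) tensor."""
    per_head = W_dec.view(-1, n_heads, d_head)   # (d_sae, n_heads, d_head)
    norms = per_head.norm(dim=-1)                # (d_sae, n_heads)
    return norms / norms.sum(dim=-1, keepdim=True)

def top_features_for_head(W_dec: torch.Tensor, head: int, k: int = 10) -> torch.Tensor:
    """Indices of the k features most strongly attributed to `head`."""
    return torch.topk(head_attribution(W_dec)[:, head], k).indices

print(top_features_for_head(W_dec, head=7))
```

With a real decoder in place of the random stand-in, inspecting the top-k features per head (e.g. via their dashboards) is what yields the per-head descriptions referenced above.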
In the spirit of An Overview of Early Vision in InceptionV1, we start with a high-level, guided tour of the different behaviors implemented by heads across every layer, building better intuitions for what attention heads learn in a real language model.
To validate that the technique is telling us something real about the roles of these heads, we confirm that our interpretations match previously studied heads. We note that our annotator mostly did not know a priori which heads had previously been studied. We find:
Induction heads (5.1, 5.5, 6.9, 7.2, 7.10)
Previous token heads (4.11)
Copy suppression head (10.7)
Duplicate token heads (3.0)
Successor head (9.1)
In addition to building intuition about wh...