Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sparsify: A mechanistic interpretability research agenda, published by Lee Sharkey on April 3, 2024 on The AI Alignment Forum.
Over the last couple of years, mechanistic interpretability has seen substantial progress. Part of this progress has been enabled by the identification of superposition as a key barrier to understanding neural networks (Elhage et al., 2022) and the identification of sparse autoencoders as a solution to superposition (Sharkey et al., 2022; Cunningham et al., 2023; Bricken et al., 2023).
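To make the sparse autoencoder idea concrete, here is a minimal sketch in PyTorch of the general recipe these papers share: a single hidden layer trained to reconstruct model activations under an L1 sparsity penalty. All dimensions, names, and hyperparameters below are illustrative assumptions, not the settings used in any of the cited papers.

```python
# Minimal sketch of a sparse autoencoder (SAE) for decomposing model
# activations into sparse feature directions. All sizes and the L1
# coefficient are illustrative placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction accuracy traded off against sparsity of features.
    recon = ((x - x_hat) ** 2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity

# Usage: train on activations collected from some layer of a model.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(64, 512)               # stand-in for real activations
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
```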
From our current vantage point, I think there's a relatively clear roadmap toward a world where mechanistic interpretability is useful for safety. This post outlines my views on what progress in mechanistic interpretability looks like and what I think is achievable by the field in the next 2+ years. It represents a rough outline of what I plan to work on in the near future.
My thinking and work are, of course, heavily inspired by the work of Chris Olah, other Anthropic researchers, and other early mechanistic interpretability researchers. In addition to sharing some personal takes, this article brings together, in one place, various goals and ideas that are already floating around the community.
It proposes a concrete potential path for how we might get from where we are today in mechanistic interpretability to a world where we can meaningfully use it to improve AI safety.
Key frameworks for understanding the agenda
Framework 1: The three steps of mechanistic interpretability
I think of mechanistic interpretability in terms of three steps[1]:
Mathematical description: In the first step, we break the neural network into constituent parts, where the parts are simply unlabelled mathematical objects. These may be e.g. neurons, polytopes, circuits, feature directions (identified using SVD/NMF/SAEs), individual parameters, singular vectors of the weight matrices, or other subcomponents of a network.
Semantic description: Next, we generate semantic interpretations of these mathematical objects (e.g. through feature labeling). In other words, we try to build a conceptual model of what each component of the network does.
Validation: We need to validate our explanations to ensure they make good predictions about network behavior. For instance, we should be able to predict that ablating a feature with a purported 'meaning' (such as the 'noun gender feature') will have certain predictable effects that make sense given its purported meaning (such as the network becoming unable to assign the appropriate definite article to nouns). A sketch of this kind of ablation test follows this list.
If our explanations can't be validated, then we need to identify new mathematical objects and/or find better semantic descriptions.
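To illustrate the validation step, here is a hedged sketch of a feature-ablation test, reusing the illustrative SparseAutoencoder from the earlier sketch. The feature index, activation shapes, and the downstream behavioral check are all placeholder assumptions.

```python
# Sketch: test a feature's purported meaning by ablating it and
# checking the effect on model behavior. `sae` is the illustrative
# autoencoder from the previous sketch; `acts` stands in for the
# activations of the layer the SAE was trained on.
import torch

@torch.no_grad()
def ablate_feature(sae, acts: torch.Tensor, feature_idx: int) -> torch.Tensor:
    """Reconstruct activations with one SAE feature zeroed out."""
    f = torch.relu(sae.encoder(acts))
    f[:, feature_idx] = 0.0                # ablate the candidate feature
    return sae.decoder(f)

acts = torch.randn(8, 512)                 # stand-in for real activations
patched = ablate_feature(sae, acts, feature_idx=1337)

# The patched activations would then be substituted back into the
# forward pass (e.g. via a hook) to check whether the predicted
# behavioral change (e.g. failing to pick the right definite
# article) actually occurs.
```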
The field of mechanistic interpretability has repeated this three-step cycle a few times, cycling through explanations given in terms of neurons, then other objects such as SVD/NMF directions or polytopes, and most recently SAE directions.
My research over the last couple of years has focused primarily on identifying the right mathematical objects for mechanistic explanations. I expect there's still plenty of work to do on this step in the next two years or so (more on this later). To guide intuitions about how I plan to pursue this, it's important to understand what makes some mathematical objects better than others. For this, we have to look at the description accuracy vs. description length tradeoff.
Framework 2: The description accuracy vs. description length tradeoff
You would feel pretty dissatisfied if you asked someone for a mechanistic explanation of a neural network and they proceeded to read out the float values of the weights. But why is this dissatisfying? Two reasons:
When describing the mechanisms of any system, be it an engine, a solar system, o...
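One way to make the accuracy-vs.-length tradeoff precise is a minimum-description-length-style objective; the particular form below is my illustrative framing, not a formula from the post:

$$
\min_{\text{description } d}\;\; \underbrace{\operatorname{error}(\text{network},\, d)}_{\text{description inaccuracy}} \;+\; \lambda\, \underbrace{L(d)}_{\text{description length}}
$$

Reading out the raw weights sits at one extreme of this objective: zero error, but a description as long as the network itself. The SAE loss sketched earlier, reconstruction error plus an L1 penalty, is one practical instance of the same tradeoff.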