Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Input Swap Graphs: Discovering the role of neural network components at scale, published by Alexandre Variengien on May 12, 2023 on LessWrong.
This post was written as part of the work done at Conjecture.
You can try input swap graphs in a Colab notebook or explore the library to replicate the results.
Thanks to Beren Millidge and Eric Winsor for useful discussions throughout this project. Thanks to Beren for feedback on a draft of this post.
Activation and path patching are techniques for manipulating neural network internals. One approach, used in ROME, which we'll refer to as corrupted patching, involves setting the output of certain parts of the network to a corrupted value. Another method, used in the work on indirect object identification (IOI) and referred to as input swap patching, involves assigning these parts the values they would produce on a different input sample, while the rest of the model operates normally. These experimental techniques have proven effective in gaining structural insights into the information flow within neural networks, providing a better understanding of which components are involved in a given behavior. These structural analyses can be straightforwardly automated using algorithms like Automatic Circuit DisCovery.
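To make the distinction between the two kinds of patching concrete, here is a minimal sketch on a toy two-layer network with random weights. Everything in it is an illustrative assumption rather than the code from the library mentioned above: the network, the names `forward`, `input_swap_patch` and `corrupted_patch`, the choice of a "component" as a subset of hidden units, and the use of additive Gaussian noise as the corruption.

```python
import numpy as np

# Toy model (hypothetical): 4 inputs -> 6 hidden units -> 3 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 6))
W2 = rng.normal(size=(6, 3))


def hidden(x):
    """Hidden activations of the toy model on input x."""
    return np.tanh(x @ W1)


def forward(x, component=None, patched_values=None):
    """Run the model on x; optionally overwrite the hidden units in `component`."""
    h = hidden(x)
    if component is not None:
        h = h.copy()
        h[component] = patched_values
    return h @ W2


def input_swap_patch(x_base, x_source, component):
    """The component takes the values it would produce on x_source,
    while the rest of the model runs normally on x_base."""
    return forward(x_base, component, hidden(x_source)[component])


def corrupted_patch(x_base, component, noise_rng, noise_scale=1.0):
    """The component's output is set to a corrupted value.
    Assumption: corruption = clean value plus Gaussian noise."""
    clean = hidden(x_base)[component]
    noisy = clean + noise_scale * noise_rng.normal(size=clean.shape)
    return forward(x_base, component, noisy)


x_a = np.array([1.0, 0.0, -1.0, 0.5])
x_b = np.array([-0.5, 2.0, 0.3, 1.0])
component = [1, 4]  # indices of the hidden units treated as one component

print("clean output:     ", forward(x_a))
print("input swap patch: ", input_swap_patch(x_a, x_b, component))
print("corrupted patch:  ", corrupted_patch(x_a, component, np.random.default_rng(1)))
```

Two sanity checks follow from the definitions: swapping in values from the same input leaves the output unchanged, and swapping every hidden unit reproduces the output on the source input. The size of the change in between is what a patching experiment measures.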
Patching experiments also enable the discovery of the semantics of components, i.e. the abstract variables a component encodes. They have been used to discover the "a vs an" neuron in GPT-2, the token and position signal in S-inhibition heads in the IOI circuit, and Othello board representations. In all these cases, researchers were not only able to say 'this component matters', but also 'it roughly plays this role'. However, these additional results come at a cost: they often require a lot of manual effort to design alternative datasets and to form an intuition about the likely role of the component in advance.
In this post, I introduce input swap graphs (swap graphs in short), a new tool to systematically analyze the results of input swap patching experiments and recover semantic information with as little human effort as possible.
In the following, I present:
A concise overview of swap graphs, their motivation, and practical application to Name Movers within the IOI circuit.
A simple theoretical framework for examining the problem of localized computation in computational graphs. This framework serves as a sandbox for rapidly building intuition to create or interpret experiments, such as swap graphs, which can be applied to real-world neural networks.
An expansion of the IOI study in GPT-2 small to models containing several billion parameters (GPT-2 XL, Pythia-2.8B, GPT-Neo-2.7B) using swap graphs. We validated our findings through causal scrubbing and targeted interventions that steer the mechanism out-of-distribution predictably. This both demonstrates the potential for applying swap graphs to larger models and presents valuable data to investigate how a model's internal structure changes with scale while performing the same task.
A five-minute overview of swap graphs
Motivation
Deciphering the intermediate activations of neural network components can be a daunting task. The prevalent approach involves breaking down these high-dimensional vectors into subspaces where variations correspond to concepts that are both relevant to the model's computation and comprehensible to humans.
In this work, we adopt an alternative strategy: we limit ourselves to using only input swap patching in order to uncover the roles of neural network components without directly exploiting the structure of their activations.
While it is common to combine information about directions in activation space and input swap patching experiments in interpretability research (e.g., as demonstrated in Neel Nanda's work on Othello), our objective here is t...