Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1, published by StefanHex on May 9, 2023 on LessWrong.
We solved the first Mechanistic Interpretability challenge that Stephen Casper posed in EIS VII. We spent the last Alignment Jam hackathon attempting to solve the two challenges presented there, and present our (confirmed) solution to the CNN challenge here. We will present a write-up of our work on the Transformer challenge in a future post. Stefan and Marius submitted an early version of this work at the end of the hackathon, and Stefan added Intervention and Causal Scrubbing tests to the final write-up. A notebook reproducing all results is provided here (requires no GPU but ~13 GB RAM).
The challenges each provide a pre-trained network, and the task is to reverse engineer the network and to infer the labeling function used for training. The first challenge network is an MNIST CNN that takes MNIST images and outputs labels. The hints given are that [1] the labels are binary, [2] the test set accuracy is 95.58%, [3] the (secret) labeling function is simple, and [4] this image:
The MNIST network consists of
2 convolutional blocks, each Conv -> ReLU -> Dropout -> Pool
2 fully connected layers (fc1[400,200] -> ReLU -> fc2[200,2])
and we can access the data (torchvision.datasets.MNIST) but not the ground-truth labels. A sketch of such an architecture is given below.
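For concreteness, here is a minimal PyTorch sketch of an architecture matching this description. The channel counts, kernel sizes, and dropout rate are our assumptions, chosen so that the flattened convolutional output has the 400 dimensions that fc1 expects; the actual challenge model may differ in these details.

```python
import torch
import torch.nn as nn

class ChallengeCNNSketch(nn.Module):
    """Sketch of a CNN with the described structure: two Conv -> ReLU -> Dropout -> Pool
    blocks, then fc1[400, 200] -> ReLU -> fc2[200, 2]. Channel counts (8, 16), kernel
    size 3, and dropout rate are assumptions, picked so that the flattened conv output
    is 16 * 5 * 5 = 400, matching fc1's input size."""

    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3),   # 28x28 -> 26x26
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.MaxPool2d(2),                  # 26x26 -> 13x13
        )
        self.block2 = nn.Sequential(
            nn.Conv2d(8, 16, kernel_size=3),  # 13x13 -> 11x11
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.MaxPool2d(2),                  # 11x11 -> 5x5
        )
        self.fc1 = nn.Linear(400, 200)        # 16 * 5 * 5 = 400
        self.fc2 = nn.Linear(200, 2)          # two logits for the binary labels

    def forward(self, x):
        x = self.block2(self.block1(x))
        x = torch.flatten(x, start_dim=1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)
```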
Spoilers ahead!
Summary of our solution (TL;DR)
The inputs are labelled based on their similarity with a 1 versus their similarity with an inverted 1 ("anti-1"). If this difference is large in either direction (the image is clearly a 1 or clearly an anti-1), the image is labeled as class 1; otherwise it is labeled as class 0. Specifically, the template for 1 seems to be the given hint (clue_image), and the "anti-1" is 1-clue_image.
The similarity is measured as the sum over the element-wise product of the image matrices (or equivalently the dot product of the flattened image arrays). Then the ~17k images most similar to "1" and the ~14k images most similar to "anti-1" are labelled class 1, and the remaining ~29k images are labelled class 0. We can also phrase this as a band filter on the similarity with (clue_image - 0.5), defining class 0 as where -17305 < (image * (clue_image - 0.5)).sum() < -7762. We can observe this internally by looking at the embedding in the 200-neuron space: a PCA decomposition colored by label shows how the model judges this similarity:
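A minimal sketch of this hypothesized labeling rule, assuming clue_image is the 28x28 hint scaled to [0, 1] and images use raw 0-255 pixel values (the scale of the quoted thresholds suggests un-normalized images):

```python
import numpy as np

def hypothesized_label(image: np.ndarray, clue_image: np.ndarray) -> int:
    """Band-filter labeling rule inferred above.

    image:      28x28 array of raw pixel values (assumed 0-255)
    clue_image: 28x28 hint image (assumed scaled to [0, 1])

    Images whose similarity with (clue_image - 0.5) falls inside the band are
    neither clearly "1" nor clearly "anti-1" and get class 0; everything
    outside the band gets class 1. Thresholds are the ones quoted above.
    """
    similarity = (image * (clue_image - 0.5)).sum()
    return 0 if -17305 < similarity < -7762 else 1
```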
The model internally implements this via two groups of feature detectors in the 200-dimensional neuron layer: "1-detectors" (detecting clue_image) and "anti-1-detectors" (detecting 1-clue_image). If either group fires sufficiently strongly, the image is classified as class 1, otherwise as class 0.
This classification has 96.2% overlap with the model's labels. Since the model itself has only 95.6% accuracy on the test set, we think the difference is plausibly due to model error.
We test our hypothesis with Causal Scrubbing: specifically, the hypothesis that of the 200 neurons there are 48 neurons detecting "1" similarity, 31 neurons detecting "anti-1" similarity, and 121 dead (useless) neurons. We resample-ablate all neurons, replacing each neuron's activation with its activation on a different dataset example where our hypothesis predicts it to have a similar activation.
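The following is an illustrative sketch of this resample ablation (not the post's exact code): for each neuron we substitute its activation from another example that our hypothesis says should look the same to that neuron, i.e. an example with similar "1"-similarity for 1-detectors, similar "anti-1"-similarity for anti-1-detectors, and any random example for dead neurons. The function names and the nearest-neighbour matching are our own choices.

```python
import numpy as np

def resample_ablate(acts, sims_to_1, sims_to_anti1,
                    one_detectors, anti1_detectors, dead_neurons, seed=0):
    """Resample-ablate the 200-neuron layer under our hypothesis (illustrative sketch).

    acts:            [n_examples, 200] activations of the 200-neuron layer
    sims_to_1:       [n_examples] similarity of each image with clue_image
    sims_to_anti1:   [n_examples] similarity of each image with 1 - clue_image
    one_detectors / anti1_detectors / dead_neurons: index arrays partitioning the 200 neurons
    """
    rng = np.random.default_rng(seed)
    n = acts.shape[0]
    scrubbed = acts.copy()

    def resample_source(sims, i, k=50):
        # Pick a random example among the k examples whose similarity score is
        # closest to example i's (excluding i itself).
        order = np.argsort(np.abs(sims - sims[i]))
        return rng.choice(order[order != i][:k])

    for i in range(n):
        # 1-detectors: activation taken from an example with similar "1"-similarity
        j = resample_source(sims_to_1, i)
        scrubbed[i, one_detectors] = acts[j, one_detectors]
        # anti-1-detectors: activation taken from an example with similar "anti-1"-similarity
        j = resample_source(sims_to_anti1, i)
        scrubbed[i, anti1_detectors] = acts[j, anti1_detectors]
        # dead neurons: activation taken from a completely random example
        j = rng.integers(n)
        scrubbed[i, dead_neurons] = acts[j, dead_neurons]
    return scrubbed
```

If the hypothesis is right, feeding these scrubbed activations through the final layer should barely change the model's outputs.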
Part 1: How we found this solution
Since the model prediction depends on the logit (final neuron) difference, in the final layer we only care about the logit of label 1 minus the logit of label 0, the "logit diff" direction. Then, using the [200,2] fc2 weight matrix (the biases only add a constant), we can directly translate this logit diff direction into the 200-dim neuron space by taking the difference of the respective weight vectors. We will make use of this logit diff direction at multiple points throughout the analysis.
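As a sketch, assuming the final layer is a standard nn.Linear (so its weight has shape [2, 200] in PyTorch's convention), this direction is just the difference of the two weight rows; the bias only shifts the logit difference by a constant:

```python
import torch
import torch.nn as nn

# Stand-in for the challenge model's final layer fc2 (200 neurons -> 2 logits)
fc2 = nn.Linear(200, 2)

# Logit diff direction in the 200-dim neuron space: increases logit(1) relative to logit(0)
logit_diff_direction = fc2.weight[1] - fc2.weight[0]   # shape [200]

# Projecting neuron activations onto this direction gives logit(1) - logit(0),
# up to the constant bias difference fc2.bias[1] - fc2.bias[0]
acts = torch.relu(torch.randn(8, 200))                  # dummy batch of post-ReLU activations
logit_diff = acts @ logit_diff_direction                # shape [8]
```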
PCA ...