Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A framing for interpretability, published by Nina Rimsky on November 14, 2023 on LessWrong.
In this post, I will share my current model of how we should think of neural network interpretability. The content will be rather handwavy and high-level. However, I think the field could make concrete updates wrt research directions if people adopt this framing.
I'm including the original handwritten notes this is based on as well, in case the format is more intuitive to some.
Neural networks can be represented as more compressed, modular computational graphs
Compressibility
I am not claiming that for all sensible notions of "effective dimensionality," SOTA networks have more parameters than "true effective dimensions." However, what counts as dimensionality depends on what idealized object you look for in the mess of tensors. For many questions we want to answer via interpretability, there will be fewer dimensions than the number of parameters in the model. Ultimately, compression is about choosing some signal you care about and throwing away the rest as noise. And we have a good idea of what signals we care about.
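As a rough illustration of this framing (my own sketch, not from the post), think of low-rank truncation of a weight matrix: the raw parameter count is large, but if the signal we care about lives in a low-dimensional subspace, we can discard the rest as noise with almost no effect on behavior. The dimensions and noise scale below are arbitrary choices for the example.

```python
import numpy as np

# Hypothetical example: a 512x512 weight matrix whose "signal" lives in a
# low-dimensional subspace, plus small noise. There are 512*512 parameters,
# but the effective dimensionality for the behavior we care about is ~10.
rng = np.random.default_rng(0)
d, k = 512, 10
signal = rng.normal(size=(d, k)) @ rng.normal(size=(k, d))  # rank-k structure
W = signal + 0.01 * rng.normal(size=(d, d))                  # plus noise

# Compress: keep the top-k singular directions, throw away the rest as noise.
U, S, Vt = np.linalg.svd(W)
W_compressed = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# The compressed matrix reproduces the behavior we care about almost exactly,
# despite discarding the vast majority of the raw parameters.
x = rng.normal(size=d)
print(np.linalg.norm(W @ x - W_compressed @ x) / np.linalg.norm(W @ x))
```

Which directions count as "signal" here is a choice; different interpretability questions would pick out different subspaces to keep.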
Modularity
Adopting the analogy of binary reverse engineering, another desideratum is modularity. Why is a human-written Python file more "interpretable" than a compiled binary? The fact that the information has been transformed into text in some programming language is insufficient. For instance, look at minified and/or "uglified" JavaScript code: it is not very interpretable. Ultimately, we want to follow the classical programmer lore of what makes good code: break things up into functions, don't do too many transformations in a single function, make reusable chunks of code, build layers of abstraction (but not too many), and name your variables sensibly so that readers easily know what the code is doing.
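As a concrete (if cartoonish) illustration of mine, compare a correct but opaque one-liner to a modular rewrite of the same computation that follows the lore above:

```python
# "Uglified": correct but opaque, everything crammed into one expression.
def f(a):
    return sum(x * x for x in a) / len(a) - (sum(a) / len(a)) ** 2

# Modular: the same computation broken into named, reusable pieces.
def mean(values):
    return sum(values) / len(values)

def mean_of_squares(values):
    return mean([x * x for x in values])

def variance(values):
    """Population variance: E[X^2] - E[X]^2."""
    return mean_of_squares(values) - mean(values) ** 2
```

Both versions compute the same quantity; only the second exposes named pieces that a reader can verify and reason about in isolation.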
We're not in the worst-case world
In theory, interpreting neural networks could be cryptographically hard. However, due to the nature of how we train ML models, I think this will not be the case. In the worst case, if we get deceptive AIs that can hold encrypted bad programs, there is likely to be an earlier stage in training when interpretability is still feasible (see DevInterp).
But there are many reasons to predict good modularity and compressibility:
We know the shape of the training distribution/data and already have a bunch of existing good compressions and abstractions for that data (human concepts)
We impose many constraints and a strong prior on the shape of the function being implemented via the neural network architecture and other hyperparameter choices
We can probe the internals of models to see intermediate representations, get gradients via backpropagation, etc. (see the sketch after this list)
The world is modular. It's helpful to think in terms of higher-level modular abstractions and concepts
Modular algorithms (either parallelized, such that their components can be learned independently, or composed in series, such that each added function incrementally improves performance) are easier to learn by any greedy algorithm that relies on a local heuristic rather than exhaustively searching the full space of solutions, e.g., SGD/gradient descent.
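As a minimal sketch of the probing point above (assuming a PyTorch model; the architecture and shapes are invented for illustration), a forward hook exposes an intermediate representation, and autograd gives gradients with respect to it:

```python
import torch
import torch.nn as nn

# Toy stand-in model; in practice this would be a trained network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output
    return hook

# Register a hook on the hidden layer to capture its output on each forward pass.
model[1].register_forward_hook(save_activation("hidden"))

x = torch.randn(8, 16)
logits = model(x)

# The intermediate representation is now available for inspection...
hidden = activations["hidden"]  # shape: (8, 32)

# ...and we can get gradients of a scalar objective with respect to it.
grads = torch.autograd.grad(logits.sum(), hidden)[0]
print(hidden.shape, grads.shape)
```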
A compressed, modular representation will be easier to interpret
What does it mean to interpret a model? Why do we want to do this? I think of the goal here as gaining stronger guarantees on the behavior of some complex function. We start with some large neural net, the aforementioned bundle of inscrutable float32 tensors, and we want to figure out the general properties of the implemented function to validate its safety and robustness. Sure, one can test many inputs and see what outputs come out.
However, black-box testing will not provide strong enough guarantees if the input that triggers undesirable behavior is hard to find or comes from a different distribution.
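A toy example of mine to illustrate the point: if the undesirable behavior is gated on a single rare trigger, black-box sampling is overwhelmingly unlikely to surface it, even with a large test budget. The trigger value and input space below are invented for the illustration.

```python
import random

TRIGGER = 123456789  # one "bad" input hidden in a space of a billion

def opaque_model(x: int) -> str:
    """Stand-in for a black-box function with a rare failure mode."""
    return "unsafe" if x == TRIGGER else "safe"

# Black-box testing: sample a million random inputs and check the outputs.
random.seed(0)
samples = (random.randrange(10**9) for _ in range(1_000_000))
failures = sum(opaque_model(x) == "unsafe" for x in samples)

# With probability ~99.9% this finds zero failures, yet the model is not safe.
# A guarantee about behavior on all inputs requires looking inside the function.
print(f"failures found by black-box testing: {failures}")
```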