Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How useful is mechanistic interpretability?, published by ryan greenblatt on December 1, 2023 on LessWrong.
Opening positions
I'm somewhat skeptical about mech interp (bottom-up or substantial reverse engineering style interp):
Current work seems very far from being useful (it isn't currently useful) or from explaining much of what's going on inside of models in key cases. But it's hard to be very confident that a new field won't work! And things can be far from useful, but become useful via slowly becoming more powerful, etc.
In particular, current work fails to explain much of the performance of models, which makes me think that it's quite far from ambitious success and likely also from usefulness. I think this even after seeing recent results such as the dictionary learning results (though results along these lines were a positive update for me overall).
There isn't a story which-makes-much-sense-and-seems-that-plausible-to-me for how mech interp allows for strongly solving core problems like auditing for deception or being able to supervise superhuman models which carry out actions we don't understand (e.g. ELK).
That said, all things considered, mech interp seems like a reasonable bet to put some resources into.
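One way to make "explains much of the performance" concrete is a loss-recovered style metric: compare the model's loss when its computation is replaced by the interpreted approximation against the unmodified model and against an ablation baseline. The sketch below is only an illustration under that assumption. The function name and all numbers are hypothetical and are not taken from this dialogue or from any paper.

```python
def fraction_of_loss_recovered(clean_loss, explained_loss, baseline_loss):
    """Share of the baseline-to-clean loss gap that an explanation closes.

    clean_loss:     loss of the unmodified model
    explained_loss: loss when the relevant computation is replaced by the
                    interpreted approximation
    baseline_loss:  loss with the relevant component ablated
    (All three are hypothetical inputs for illustration.)
    """
    gap = baseline_loss - clean_loss
    if gap <= 0:
        raise ValueError("baseline_loss must exceed clean_loss")
    return (baseline_loss - explained_loss) / gap


# Made-up numbers, for illustration only.
score = fraction_of_loss_recovered(clean_loss=3.0, explained_loss=4.5, baseline_loss=5.0)
print(score)  # 0.25
```

A score near 1 would mean the explanation accounts for nearly all of the performance on this measure, and a score near 0 would mean it accounts for almost none.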
I'm excited about various mech interp projects which either:
Aim to more directly measure and iterate on key metrics of usefulness for mech interp
Try to use mech interp to do something useful and compare to other methods (I'm fine with substantial mech interp industrial policy, but we do actually care about the final comparison. By industrial policy, I mean subsidizing current work even if mech interp isn't competitive yet because it seems promising.)
I'm excited about two main outcomes from this dialogue:
Figuring out whether or not we agree on the core claims I wrote above. (Ideally, either reach consensus or find a crux.)
Figuring out which projects we'd be excited about which would substantially positively update us about mech interp.
Maybe another question which is interesting: even if mech interp isn't that good for safety, maybe it's pretty close to stuff which is great and is good practice.
Another outcome that I'm interested in is personally figuring out how to better articulate and communicate various takes around mech interp.
By mech interp I mean "A subfield of interpretability that uses bottom-up or reverse engineering approaches, generally by corresponding low-level components such as circuits or neurons to components of human-understandable algorithms and then working upward to build an overall understanding."
I feel pretty on board with this definition,
Our arguments here do in fact have immediate implications for your research, and the research of your scholars, implying that you should prioritize projects of the following forms:
Doing immediately useful stuff with mech interp (and probably non-mech interp), to get us closer to model-internals-based techniques adding value. This would improve the health of the field, because it's much better for a field to be able to evaluate work in simple ways.
Work which tries to establish the core ambitious hopes for mech interp, rather than work which scales up mediocre-quality results to be more complicated or on bigger models.
What I want from this dialogue:
Mostly an excuse to form more coherent takes on why mech interp matters, limitations, priorities, etc.
I'd be excited if this results in us identifying concrete cruxes
I'd be even more excited if we identify concrete projects that could help illuminate these cruxes (especially things I could give to my new army of MATS scholars!)
I'd like to explicitly note I'm excited to find great concrete projects!
Stream of ...