Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Chris Olah’s views on AGI safety, published by Evan Hubinger on the AI Alignment Forum.
Note: I am not Chris Olah. This post was the result of lots of back-and-forth with Chris, but everything here is my interpretation of what Chris believes, not necessarily what he actually believes. Chris also wanted me to emphasize that his thinking is informed by all of his colleagues on the OpenAI Clarity team and at other organizations.
In thinking about AGI safety—and really any complex topic on which many smart people disagree—I’ve often found it very useful to build a collection of different viewpoints from people that I respect that I feel like I understand well enough to be able to think from their perspective. For example, I will often try to compare what an idea feels like when I put on my Paul Christiano hat to what it feels like when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a new hat that I’ve found extremely valuable that I also don’t think many other people in this community have, which is my Chris Olah hat. The goal of this post is to try to give that hat to more people.
If you’re not familiar with him, Chris Olah leads the Clarity team at OpenAI and formerly used to work at Google Brain. Chris has been a part of many of the most exciting ML interpretability results in the last five years, including Activation Atlases, Building Blocks of Interpretability, Feature Visualization, and DeepDream. Chris was also a coauthor of “Concrete Problems in AI Safety.”
He also thinks a lot about technical AGI safety and has a lot of thoughts on how ML interpretability work can play into that—thoughts which, unfortunately, haven’t really been recorded previously. So: here’s my take on Chris’s AGI safety worldview.
The benefits of transparency and interpretability
Since Chris primarily works on ML transparency and interpretability, the obvious first question to ask is how he imagines that sort of research aiding with AGI safety. When I was talking with him, Chris listed four distinct ways in which he thought transparency and interpretability could help, which I’ll go over in his order of importance.
Catching problems with auditing
First, Chris says, interpretability gives you a mulligan. Before you deploy your AI, you can throw all of your interpretability tools at it to check and see what it actually learned and make sure it learned the right thing. If it didn’t—if you find that it’s learned some sort of potentially dangerous proxy, for example—then you can throw your AI out and try again. As long as you’re in a domain where your AI isn’t actively trying to deceive your interpretability tools (via deceptive alignment, perhaps), this sort of a mulligan could help quite a lot in resolving more standard robustness problems (proxy alignment, for example). That being said, that doesn’t necessarily mean waiting until you’re on the verge of deployment to look for flaws. Ideally you’d be able to discover problems early on via an ongoing auditing process as you build more and more capable systems.
One of the OpenAI Clarity team’s major research thrusts right now is developing the ability to more rigorously and systematically audit neural networks. The idea is that interpretability techniques shouldn’t have to “get lucky” to stumble across a problem, but should instead reliably catch any problematic behavior. In particular, one way in which they’ve been evaluating progress on this is the “auditing game.” In the auditing game, one researcher takes a neural network and makes some modification to it—maybe images containing both dogs and cats are now classified as rifles, for example—and another researcher, given only the modified network, has to diagnose the problem and figure out exactly what modification was made to the network using only interpretability tools with...
view more