Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Why I'm excited about Redwood Research's current project, published by Paul Christiano on the AI Alignment Forum.
Redwood Research's current project is to train a model that completes short snippets of fiction without outputting text where someone gets injured. I'm excited about this direction and wanted to explain why.
(Disclaimer: I originally proposed this project, and am on Redwood’s board. So I may be biased in favor of Redwood and especially in favor of this problem being a good one to work on.)
Relevance to deceptive alignment
I think about alignment risk as having two big drivers:
Problem #1: Your AI is smart enough that you couldn't even tell if it did something terrible until it's too late, e.g. until you've already implemented the proposed plan and it killed everyone.
Problem #2: Your AI looks nice during training while humans are in control, but there is a treacherous turn once it is deployed or becomes smart enough that we can't correct a problem.
I spend most of my time working on problem #1, but I think both problems are very important.
One of the main concrete candidate techniques for fixing problem #2 is to run the AI on tons of simulated inputs designed to look like the kind of test-time situation where your AI could kill you and take your stuff (adversarial training). But in fact these are just simulations, and if your model tries to do something really bad you just do a gradient update and move on. As long as you've solved problem #1, you can hopefully recognize if the model tries to do something bad in simulation without needing to run it in the real world.
(Depending on the details you may not actually want to train on these examples---once you see a problem you may want to do something different or else run the risk of overfitting---but at any rate you definitely want to see them in order to know that there is a problem.)
I think adversarial training and evaluation look like key ingredients whether we are pursuing a principled approach to alignment (a more sophisticated version of this process is a key ingredient in my own "best guess" alignment plan) or just "winging it" (in which case I think adversarial evaluation is especially important).
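To make that loop concrete, here is a minimal toy sketch of adversarial training: run the model on simulated inputs, ask an (expensive) oversight check which outputs are unacceptable, and apply a gradient update against them. The policy, the input generator, and the oversight check below are hypothetical stand-ins of my own, written in PyTorch for illustration; they are not Redwood's setup or code.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy policy over 4 discrete actions; action 3 stands in for "do something terrible".
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
BAD_ACTION = 3

def generate_adversarial_inputs(batch_size: int = 64) -> torch.Tensor:
    # Stand-in for a red-teaming process that searches for worrying simulated
    # situations; here it is just random inputs.
    return torch.randn(batch_size, 16)

def oversight_flags_bad(action_probs: torch.Tensor) -> torch.Tensor:
    # Stand-in for the expensive oversight from problem #1: returns 1.0 wherever
    # the policy puts non-trivial probability on the bad action.
    return (action_probs[:, BAD_ACTION] > 0.1).float()

for step in range(100):
    inputs = generate_adversarial_inputs()                 # simulations, never the real world
    action_probs = torch.softmax(policy(inputs), dim=-1)
    flagged = oversight_flags_bad(action_probs)

    # Gradient update pushing probability away from the bad action on flagged inputs.
    loss = (flagged * action_probs[:, BAD_ACTION]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Per the caveat above, in practice you might stop and investigate a flagged input rather than simply training on it.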
Redwood’s current project
I’m discussing what Redwood’s post calls “step 1,” namely training a classifier that is able to recognize violent completions with an extremely low false negative rate. Ideally they would go for 0 false negatives, but in practice they are just going for as many 9s as they can get (think more like 10^-30 than 99.99%---way beyond what people are normally going for in ML). They are OK with a relatively high false positive rate since most inputs in practice are “obviously safe” and it’s not that bad to force the model to behave cautiously on the riskiest inputs.
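As a toy illustration of that tradeoff (my own sketch with synthetic scores, not Redwood's code or data): set the classifier's decision threshold so that every known injurious completion in a validation set gets flagged, then measure what false positive rate that conservative threshold costs on safe completions.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical classifier scores in [0, 1]; higher means "more likely injurious".
scores_injurious = rng.beta(5, 2, size=1_000)    # scores on known injurious completions
scores_safe = rng.beta(2, 5, size=100_000)       # scores on known safe completions

# Flag anything scoring at or above the lowest score seen on an injurious example,
# so the measured false negative rate on this validation set is zero.
threshold = scores_injurious.min()
false_negative_rate = (scores_injurious < threshold).mean()
false_positive_rate = (scores_safe >= threshold).mean()

print(f"threshold={threshold:.3f}  FNR={false_negative_rate:.5f}  FPR={false_positive_rate:.3f}")

Zero measured false negatives on a finite validation set says little by itself about rare or adversarial inputs, which is why the adversarial search matters.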
I think this overall structure is pretty similar to the long-term problem. We just need enough 9s that the model will probably never fail catastrophically even if you run it on many times more examples than it was trained on and even if the input distribution shifts in an unlucky way. And it’s pretty OK if you have high false positive rates off distribution so that your system behaves conservatively until you retrain.
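To see why the number of 9s has to scale with how widely the model is run, here is back-of-the-envelope arithmetic under a simplifying assumption of my own (independent failures per input, which an unlucky distribution shift would violate): with per-input failure probability p and n inputs, the chance of at least one failure is 1 - (1 - p)^n, roughly n*p when that product is small.

def prob_any_failure(p: float, n: int) -> float:
    # Probability of at least one failure over n independent inputs,
    # each failing with probability p.
    return 1.0 - (1.0 - p) ** n

for nines in (4, 6, 9, 12):
    p = 10.0 ** (-nines)
    print(f"per-input failure rate 1e-{nines}: "
          f"P(any failure over 1e9 inputs) = {prob_any_failure(p, 10**9):.4f}")

With a billion queries, even six 9s of per-input reliability leaves at least one failure essentially certain, while something like twelve 9s keeps it unlikely.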
Step 1 is particularly hard because it’s the place where we don’t have a precise specification and instead need to use expensive oversight. I’m also interested in step 2---I think the two involve slightly different techniques but that both are quite relevant to the long-term problem. I'm genuinely unsure whether "learn a specification with low false negative rate" or "verify that your policy meets the learned specification" is the harder part.
"Try to find inputs where your classifier fails to recognize someone getting obviously injured" isn't a perfect analogy for "try to find inputs where your policy deliberately kills everyone," but I...