Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Why I'm excited about Redwood Research's current project, published by Paul Christiano on the AI Alignment Forum.
Redwood Research's current project is to train a model that completes short snippets of fiction without outputting text where someone gets injured. I'm excited about this direction and wanted to explain why.
(Disclaimer: I originally proposed this project, and am on Redwood’s board. So I may be biased in favor of Redwood and especially in favor of this problem being a good one to work on.)
Relevance to deceptive alignment
I think about alignment risk as having two big drivers:
Problem #1: Your AI is smart enough that you couldn't even tell if it did something terrible until it's too late, e.g. until you've already implemented the proposed plan and it killed everyone.
Problem #2: Your AI looks nice during training while humans are in control, but there is a treacherous turn once it is deployed or becomes smart enough that we can't correct a problem.
I spend most of my time working on problem #1, but I think both problems are very important.
One of the main concrete candidate techniques for fixing problem #2 is to run the AI on tons of simulated inputs designed to look like the kind of test-time situation where your AI could kill you and take your stuff (adversarial training). But in fact these are just simulations, and if your model tries to do something really bad you just do a gradient update and move on. As long as you've solved problem #1, you can hopefully recognize if the model tries to do something bad in simulation without needing to run it in the real world.
(Depending on the details you may not actually want to train on these examples---once you see a problem you may want to do something different or else run the risk of overfitting---but at any rate you definitely want to see them in order to know that there is a problem.)
I think adversarial training and evaluation look like key ingredients whether we are pursuing a principled approach to alignment (a more sophisticated version of this process is a key ingredient in my own "best guess" alignment plan) or just "winging it" (in which case I think adversarial evaluation is especially important).
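To make that loop concrete, here is a minimal toy sketch of adversarial training: run the model on simulated inputs, ask an (expensive) oversight check which outputs are unacceptable, and apply a gradient update against them. The policy, the input generator, and the oversight check below are hypothetical stand-ins of my own, written in PyTorch for illustration; they are not Redwood's setup or code.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy policy over 4 discrete actions; action 3 stands in for "do something terrible".
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
BAD_ACTION = 3

def generate_adversarial_inputs(batch_size: int = 64) -> torch.Tensor:
    # Stand-in for a red-teaming process that searches for worrying simulated
    # situations; here it is just random inputs.
    return torch.randn(batch_size, 16)

def oversight_flags_bad(action_probs: torch.Tensor) -> torch.Tensor:
    # Stand-in for the expensive oversight from problem #1: returns 1.0 wherever
    # the policy puts non-trivial probability on the bad action.
    return (action_probs[:, BAD_ACTION] > 0.1).float()

for step in range(100):
    inputs = generate_adversarial_inputs()                 # simulations, never the real world
    action_probs = torch.softmax(policy(inputs), dim=-1)
    flagged = oversight_flags_bad(action_probs)

    # Gradient update pushing probability away from the bad action on flagged inputs.
    loss = (flagged * action_probs[:, BAD_ACTION]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Per the caveat above, in practice you might stop and investigate a flagged input rather than simply training on it.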
Redwood’s current project
I’m discussing what Redwood’s post calls “step 1,” namely training a classifier that is able to recognize violent completions with an extremely low false negative rate. Ideally they would go for 0 false negatives, but in practice they are just going for as many 9s as they can get (think more like 10^-30 than 99.99%---way beyond what people are normally going for in ML). They are OK with a relatively high false positive rate since most inputs in practice are “obviously safe” and it’s not that bad to force the model to behave cautiously on the riskiest inputs.
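As a toy illustration of that tradeoff (my own sketch with synthetic scores, not Redwood's code or data): set the classifier's decision threshold so that every known injurious completion in a validation set gets flagged, then measure what false positive rate that conservative threshold costs on safe completions.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical classifier scores in [0, 1]; higher means "more likely injurious".
scores_injurious = rng.beta(5, 2, size=1_000)    # scores on known injurious completions
scores_safe = rng.beta(2, 5, size=100_000)       # scores on known safe completions

# Flag anything scoring at or above the lowest score seen on an injurious example,
# so the measured false negative rate on this validation set is zero.
threshold = scores_injurious.min()
false_negative_rate = (scores_injurious < threshold).mean()
false_positive_rate = (scores_safe >= threshold).mean()

print(f"threshold={threshold:.3f}  FNR={false_negative_rate:.5f}  FPR={false_positive_rate:.3f}")

Zero measured false negatives on a finite validation set says little by itself about rare or adversarial inputs, which is why the adversarial search matters.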
I think this overall structure is pretty similar to the long-term problem. We just need enough 9s that the model will probably never fail catastrophically even if you run it on many times more examples than it was trained on and even if the input distribution shifts in an unlucky way. And it’s pretty OK if you have high false positive rates off distribution so that your system behaves conservatively until you retrain.
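To see why the number of 9s has to scale with how widely the model is run, here is back-of-the-envelope arithmetic under a simplifying assumption of my own (independent failures per input, which an unlucky distribution shift would violate): with per-input failure probability p and n inputs, the chance of at least one failure is 1 - (1 - p)^n, roughly n*p when that product is small.

def prob_any_failure(p: float, n: int) -> float:
    # Probability of at least one failure over n independent inputs,
    # each failing with probability p.
    return 1.0 - (1.0 - p) ** n

for nines in (4, 6, 9, 12):
    p = 10.0 ** (-nines)
    print(f"per-input failure rate 1e-{nines}: "
          f"P(any failure over 1e9 inputs) = {prob_any_failure(p, 10**9):.4f}")

With a billion queries, even six 9s of per-input reliability leaves at least one failure essentially certain, while something like twelve 9s keeps it unlikely.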
Step 1 is particularly hard because it’s the place where we don’t have a precise specification and instead need to use expensive oversight. I’m also interested in step 2---I think the two involve slightly different techniques but that both are quite relevant to the long-term problem. I'm genuinely unsure whether "learn a specification with low false negative rate" or "verify that your policy meets the learned specification" is the harder part.
"Try to find inputs where your classifier fails to recognize someone getting obviously injured" isn't a perfect analogy for "try to find inputs where your policy deliberately kills everyone," but I...