Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: My research methodology, published by Paul Christiano on the AI Alignment Forum.
(Thanks to Ajeya Cotra, Nick Beckstead, and Jared Kaplan for helpful comments on a draft of this post.)
I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this post).
Most possible ML techniques for avoiding egregious misalignment depend on detailed facts about the space of possible models: What kind of thing do neural networks learn? How do they generalize? How do they change as we scale them up?
But I feel like it should be possible to avoid egregious misalignment regardless of how the empirical facts shake out: it should be possible to get a model we build to do at least roughly what we want. So I’m interested in trying to solve the problem in the worst case, i.e. to develop competitive ML algorithms for which we can’t tell any plausible story about how they lead to egregious misalignment.
This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it’s possible, there are several ways in which it could actually be easier:
We can potentially iterate much faster, since it’s often easier to think of a single story about how an algorithm can fail than it is to characterize its behavior in practice.
We can spend a lot of our time working with simple or extreme toy cases that are easier to reason about, since our algorithm is supposed to work even in these cases.
We can find algorithms that have a good chance of working in the future even if we don’t know what AI will look like or how quickly it will advance, since we’ve been thinking about a very wide range of possible failure cases.
I’d guess there’s a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can’t come up with a plausible story about how it leads to egregious misalignment. That’s a high enough probability that I’m very excited to gamble on it. Moreover, if it fails I think we’re likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable.
What this looks like (3 examples)
My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.”
Example 1: human feedback
In “An unaligned benchmark” I describe a simple AI training algorithm (sketched in code after the list below):
Our AI observes the world through a bunch of cameras and outputs motor actions. We train a generative model that predicts these camera observations given the motor actions.
We ask humans to evaluate possible futures by looking at the predicted videos output by the model. We then train a model to predict these human evaluations.
At test time the AI searches for plans that lead to trajectories that look good to humans.
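To make the moving pieces concrete, here is a minimal toy sketch of that three-step loop. Everything in it (the `WorldModel`, `RewardModel`, and `plan` names, the linear “models,” and the random-search planner) is a hypothetical stand-in for illustration, not the benchmark’s actual design.

```python
# A minimal toy sketch of the three-step loop above. The class names, the
# linear "models", and the random-search planner are all hypothetical
# stand-ins for illustration, not the benchmark's actual design.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, HORIZON = 4, 2, 5

class WorldModel:
    """Step 1: generative model predicting camera observations from motor actions."""
    def __init__(self):
        self.W = rng.normal(size=(ACT_DIM, OBS_DIM))  # pretend these were learned

    def predict(self, actions):
        # actions: (HORIZON, ACT_DIM) -> predicted observations: (HORIZON, OBS_DIM)
        return actions @ self.W

class RewardModel:
    """Step 2: model trained to predict human evaluations of the predicted videos."""
    def __init__(self):
        self.v = rng.normal(size=OBS_DIM)  # pretend-learned from human labels

    def score(self, predicted_obs):
        # Sum of per-frame predicted human evaluations.
        return float((predicted_obs @ self.v).sum())

def plan(world_model, reward_model, n_candidates=256):
    """Step 3: search for the action sequence whose *predicted* video looks best."""
    best_score, best_actions = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.normal(size=(HORIZON, ACT_DIM))
        score = reward_model.score(world_model.predict(actions))
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions, best_score

best_actions, best_score = plan(WorldModel(), RewardModel())
print(f"best predicted evaluation: {best_score:.2f}")
```

The structural point to notice is that `plan` only ever consults the two learned models, so any systematic gap between predicted evaluations and real outcomes is invisible to it.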
In the same post, I describe a plausible story about how this algorithm leads to egregious misalignment (a toy illustration follows the list):
Our generative model understands reality better than human evaluators.
There are plans that acquire influence in ways that are obvious to the generative model but completely incomprehensible and invisible to humans.
It’s possible to use that influence to “hack” the cameras, in the sense of creating a fiction that looks convincing to a human looking at predicted videos.
The fiction can look much better than the actual possible futures. So our planning process finds an action that covertly gathers resources and uses them to create a fiction.
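The failure is that the planner optimizes the reward model’s opinion of the predicted video rather than what actually happens. Here is a toy numerical illustration of that gap; the functions and all the numbers are invented for this sketch, under the assumption that sufficiently extreme plans produce convincing fictions:

```python
# Toy Goodhart illustration with invented numbers: large-norm action
# sequences stand in for "create a convincing fiction". They look great
# on the predicted video but are catastrophic in reality.
import numpy as np

rng = np.random.default_rng(1)

def proxy_score(actions):
    # Stand-in for the reward model's evaluation of the *predicted* video.
    return float(np.linalg.norm(actions))

def true_value(actions):
    # Stand-in for what actually happens: honest small plans have modest
    # value, while "fictions" (norm >= 1) are convincing but catastrophic.
    norm = float(np.linalg.norm(actions))
    return norm if norm < 1.0 else -10.0

# The planner ranks candidates by proxy score alone...
candidates = [rng.normal(size=(5, 2)) * scale for scale in (0.01, 0.1, 1.0, 10.0)]
chosen = max(candidates, key=proxy_score)

# ...and so picks exactly the candidate with the worst true value.
print(f"proxy score: {proxy_score(chosen):.2f}, true value: {true_value(chosen):.2f}")
```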
I don’t know if or when this kind of reward hacking would happen — I think it’s pretty likely eventually, but it’s far from certain and it might take a long time.
But from my perspective this failure mode is at least plausible — I don’t see any contradictions between ...