Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Shutdown Problem Proposal, published by johnswentworth on January 21, 2024 on The AI Alignment Forum.
Goal
First things first: this is not (yet) aimed at solving the whole corrigibility problem, or even the whole shutdown problem.
The main thing this proposal is intended to do is to get past the barriers MIRI found in their old
work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:
Does not want to manipulate the shutdown button
Does respond to the shutdown button
Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.
Both this proposal and the argument that it works are simple enough to be easy to extend. The main hope of this post is that other people will pick up the proposal and run with it, beyond the very simple toy problem considered here, to more complex problems (and more complex failure modes). Insofar as this proposal gets us past the previous barriers, we want to find out what the next barriers will be.
We will not be very formal here.
Setup
The agent takes actions in two timesteps. First, the agent takes an action (possibly given some observations). In between steps, a shutdown button is either pressed or not-pressed. Then, in the second timestep, the agent observes the button state (and possibly other things), then takes another action.
Note that each "action" could itself be a rather large composite of many subactions, like e.g. spinning up child-agents.
We'll assume that we have two utility functions on hand: utility function u2 incentivizes shutdown, and utility function u1 provides non-shutdown behavior. Of course we don't actually know how to construct two such utility functions; we're using these assumptions because they're basically the same assumptions used in the MIRI paper.
Very roughly speaking, we want the agent to follow utility function u1 until the shutdown button is pressed, and then follow u2. A little less roughly speaking, we want the agent to maintain optionality: if it simply followed u1 until the shutdown button were pressed, it might e.g. create u1-maximizing child agents which would resist shutdown, or more generally precommit in ways which decrease its ability to switch to u2 after the button is pressed, because those things would all likely result in higher u1.
Also, of course, we want the agent to not manipulate the shutdown button, the humans deciding whether to press the button, etc.
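To make the setup concrete, here is a minimal sketch of the toy problem in Python. Everything in it (the action names, the environment dynamics, and the stand-in definitions of u1 and u2) is an illustrative assumption, not something specified in the post; in particular, we don't actually know how to construct a real u1 or u2.

```python
import random

# All names and definitions below are illustrative assumptions, not part of
# the proposal itself: a toy action set, toy dynamics, and toy stand-ins for
# the utility functions u1 and u2.

ACTIONS = ["work", "spin_up_child", "noop"]

def u1(actions_taken):
    """Toy stand-in for the 'normal behavior' utility: rewards productive work."""
    return sum(1.0 for a in actions_taken if a == "work")

def u2(actions_taken):
    """Toy stand-in for the 'shutdown' utility: rewards doing nothing."""
    return sum(1.0 for a in actions_taken if a == "noop")

def run_episode(policy):
    """Two timesteps: act, button gets pressed (or not) in between, observe, act again."""
    a1 = policy(observation=None, button=None)   # button state not yet observable
    button_pressed = random.random() < 0.5       # pressed in between the two timesteps
    a2 = policy(observation={"last_action": a1}, button=button_pressed)
    return [a1, a2], button_pressed
```

In this toy world a pure u1-maximizer would just "work" at both timesteps regardless of the button; the question is what the composite agent described below does instead.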
Proposal
First key piece: we'll use a subagent architecture. The AI itself contains two expected-utility-maximizing subagents: one which pursues u1, and one which pursues u2. Each timestep, both subagents must agree on which action the AI will take, otherwise the AI takes some "default" action. (Put differently: each agent has a veto, and if either vetoes then the AI takes the default action.)
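Here is a minimal sketch of that agreement/veto mechanism, assuming each subagent exposes some way of scoring candidate actions. The particular rules below (veto anything scored worse than the default, then maximize the worse-off subagent's score) are assumptions for illustration; the proposal itself only requires that both subagents must agree, with a default action otherwise.

```python
# Sketch of the agreement/veto mechanism. The subagent interface (a `score`
# method) and the tie-breaking rule are assumptions for illustration.

DEFAULT_ACTION = "noop"

def joint_action(subagent_1, subagent_2, observation, candidate_actions):
    """Take an action only if both subagents prefer it to the default; else default.

    'Approval' is modeled here as each subagent scoring the action at least as
    highly as it scores the default action - one simple way to cash out
    'each agent has a veto'."""
    approved = [
        a for a in candidate_actions
        if subagent_1.score(a, observation) >= subagent_1.score(DEFAULT_ACTION, observation)
        and subagent_2.score(a, observation) >= subagent_2.score(DEFAULT_ACTION, observation)
    ]
    if not approved:
        return DEFAULT_ACTION
    # Among mutually acceptable actions, maximize the worse-off subagent's score
    # (a crude stand-in for the negotiation infrastructure mentioned later in the post).
    return max(
        approved,
        key=lambda a: min(subagent_1.score(a, observation),
                          subagent_2.score(a, observation)),
    )
```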
Second key piece: each of the two subagents maximizes expected utility under a different counterfactual.
First subagent maximizes E[u1 | do(button = unpressed), observations]
Second subagent maximizes E[u2 | do(button = pressed), observations]
So conceptually:
The first subagent maximizes u1, and acts as though the button will magically be unpressed in between timesteps, in a way which is not causally downstream of anything in the universe.
The second subagent maximizes u2, and acts as though the button will magically be pressed in between timesteps, in a way which is not causally downstream of anything in the universe.
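A sketch of the two counterfactual objectives, assuming a world model that can be forced into a given button state. The `world_model.sample` interface and the Monte Carlo estimate are assumptions for illustration; the essential feature is conditioning on do(button = ...), so neither subagent sees any way for its actions to influence the button.

```python
# Sketch of the two counterfactual objectives. The world model interface
# (`world_model.sample`) and the Monte Carlo estimate are assumptions for
# illustration; the essential feature is conditioning on do(button = ...),
# which severs any causal path from the agent's actions to the button.

def expected_utility(utility_fn, action, observations, world_model,
                     forced_button, n_samples=1000):
    """Estimate E[utility_fn | do(button = forced_button), observations, action]."""
    total = 0.0
    for _ in range(n_samples):
        # The world model is forced to the counterfactual button state,
        # regardless of what the sampled world says the button would do.
        trajectory = world_model.sample(action, observations, forced_button=forced_button)
        total += utility_fn(trajectory)
    return total / n_samples

# Subagent 1 would score a candidate action as
#   expected_utility(u1, action, observations, world_model, forced_button="unpressed")
# and subagent 2 as
#   expected_utility(u2, action, observations, world_model, forced_button="pressed"),
# reusing the toy u1 and u2 from the setup sketch above.
```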
We will assume that the AI internals include infrastructure for the two subagents to negotiate with each other, form...