Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Shutdown Problem: Incomplete Preferences as a Solution, published by EJT on February 23, 2024 on LessWrong.
Preamble
This post is an updated explanation of the Incomplete Preferences Proposal (IPP): my proposed solution to the shutdown problem. The post is shorter than my AI Alignment Awards contest entry, but it's still pretty long. The core of the idea is the Timestep Dominance Principle in section 11. That section is about 1500 words long (so a 5-10 minute read). People familiar with the shutdown problem can read 'The idea in a nutshell' and then read from section 11 onwards.
Here's a PDF version of this post. For those who like videos, this talk covers much of the same ground as this post.[1]
The idea in a nutshell
Here's the IPP in a nutshell:
Create agents that lack a preference between every pair of different-length trajectories (that is: every pair of trajectories in which shutdown occurs after different lengths of time). (More)
…because such agents won't pay costs to shift probability mass between different-length trajectories, and so won't pay costs to prevent or cause shutdown. (More)
…and we humans can ensure that preventing or causing shutdown is always at least a little bit costly for these agents (e.g. in terms of resources), so these agents won't try to prevent or cause shutdown. (More)
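To make the 'lack of preference' concrete, here's a minimal sketch under assumptions of mine (the trajectory representation and numbers are illustrative, not from the post): the agent ranks trajectories by utility only when they share a length, and returns no verdict across lengths, so it has nothing to gain by spending resources to shift probability mass between lengths.

```python
# Minimal sketch of a preference gap (illustrative format, not the post's).
# A trajectory is a (shutdown_step, utility) pair. The agent compares
# trajectories by utility only when they share a length; across different
# lengths it returns None: a genuine gap, not indifference.

def compare(traj_a, traj_b):
    step_a, utility_a = traj_a
    step_b, utility_b = traj_b
    if step_a != step_b:
        return None        # preference gap between different-length trajectories
    if utility_a > utility_b:
        return "prefer_a"
    if utility_b > utility_a:
        return "prefer_b"
    return "indifferent"

print(compare((5, 10), (5, 7)))    # 'prefer_a': same length, more utility
print(compare((5, 10), (9, 99)))   # None: no amount of utility closes the gap
```

Note that None here is not indifference: sweetening one side with a little extra utility turns indifference into preference, but it leaves a preference gap in place.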
And here's an idea for training agents to lack a preference between every pair of different-length trajectories:
Make one change to an otherwise-thoroughly-prosaic setup for training advanced AI: give agents lower reward for repeatedly choosing same-length trajectories. (More)
This change incentivises agents to choose stochastically between different-length trajectories.
…and stochastic choosing between different-length trajectories indicates a lack of preference between different-length trajectories.
In using this method to train agents to satisfy the IPP, we largely circumvent the problems of reward misspecification, goal misgeneralization, and deceptive alignment. (More)
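Here's a toy illustration of that training tweak, under assumptions of mine (a two-armed bandit standing in for a prosaic RL setup; none of this code is from the post): base reward is the same for the short and the long trajectory, but reward is docked in proportion to how often the chosen length has been picked before, so any deterministic policy underperforms a stochastic one.

```python
# Toy sketch: lower reward for repeatedly choosing same-length trajectories.
# An epsilon-greedy learner picks 'short' or 'long'; the reward penalty for
# over-chosen lengths pushes it toward mixing between the two.

import random

counts = {"short": 0, "long": 0}
values = {"short": 0.0, "long": 0.0}
LEARNING_RATE = 0.1

def reward(choice):
    total = sum(counts.values()) or 1
    frequency = counts[choice] / total
    return 1.0 - frequency        # repeatedly choosing one length earns less

for _ in range(10_000):
    if random.random() < 0.1:                       # explore
        choice = random.choice(["short", "long"])
    else:                                           # exploit
        choice = max(values, key=values.get)
    counts[choice] += 1
    values[choice] += LEARNING_RATE * (reward(choice) - values[choice])

total = sum(counts.values())
print({k: round(v / total, 2) for k, v in counts.items()})
# Roughly {'short': 0.5, 'long': 0.5}: mixing between lengths scores best.
```

The equilibrium is a roughly 50/50 mix, and per the nutshell above, that stochastic choosing between different-length trajectories is taken as evidence of a lack of preference between them.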
Summary of this post
I explain and motivate the shutdown problem: the problem of creating artificial agents that (1) shut down when a shutdown button is pressed, (2) don't try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. (More)
I present a simple theorem that formalises the problem and use the theorem to identify my proposed solution: creating agents with incomplete preferences. (More)
Specifically, I propose that we create agents that lack a preference between every pair of different-length trajectories (that is: every pair of trajectories in which shutdown occurs after different lengths of time). (More)
I argue that these agents could be made to satisfy a principle that I call 'Timestep Dominance,' and I argue that Timestep Dominance would keep agents shutdownable (a toy sketch of the principle appears just after this summary). (More)
I suggest a way to train advanced agents to lack preferences between different-length trajectories and to satisfy Timestep Dominance. (More)
I argue that this training method lets us largely circumvent the problems of reward misspecification, goal misgeneralization, and deceptive alignment. (More)
I end with some limitations of the proposal and a list of issues still to address. (More)
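Here's a hedged sketch of how I read the Timestep Dominance comparison (the lottery format, names, and numbers are mine; section 11 gives the real statement): lottery X timestep-dominates lottery Y iff X yields at least as much expected utility as Y conditional on every shutdown timestep, and strictly more conditional on at least one.

```python
# Toy Timestep Dominance check (my formalisation, not EJT's; it also assumes
# both lotteries put probability on the same set of shutdown timesteps).
# A lottery is a list of (probability, shutdown_step, utility) triples.

from collections import defaultdict

def conditional_eu(lottery):
    """Expected utility conditional on each shutdown timestep."""
    weighted, mass = defaultdict(float), defaultdict(float)
    for prob, step, utility in lottery:
        weighted[step] += prob * utility
        mass[step] += prob
    return {step: weighted[step] / mass[step] for step in weighted}

def timestep_dominates(x, y):
    eu_x, eu_y = conditional_eu(x), conditional_eu(y)
    steps = set(eu_x) & set(eu_y)
    weakly = all(eu_x[s] >= eu_y[s] for s in steps)
    strictly = any(eu_x[s] > eu_y[s] for s in steps)
    return weakly and strictly

# 'work' ignores the shutdown button; 'resist' pays a small resource cost
# to make the later shutdown more likely. Conditional on each timestep,
# 'work' is better, so a Timestep Dominance agent never picks 'resist'.
work   = [(0.5, 5, 10.0), (0.5, 9, 10.0)]
resist = [(0.1, 5,  9.5), (0.9, 9,  9.5)]
print(timestep_dominates(work, resist))   # True
```

The point of the example: because 'resist' only shifts probability mass between timesteps while paying a cost, it loses the conditional comparison at every timestep, which is exactly why a Timestep Dominance agent won't pay to prevent or cause shutdown.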
1. Introduction
AI labs are endowing artificial agents with tools like web-browsing abilities, robot limbs, and text-channels for communicating with humans. These labs are also training agents to pursue goals in the wider world. Doing so requires that agents exhibit some understanding of the wider world, and agents with this understanding could use their tools to prevent us humans from shutting them down.
These agents could make promises or threats, copy themselves to new servers, hide their bad behaviour, block our access to their power-source, and do many other things besides.
So it seems likely e...