AI catastrophes and rogue deployments, published by Buck Shlegeris on June 3, 2024 on The AI Alignment Forum.
[Thanks to Aryan Bhatt, Ansh Radhakrishnan, Adam Kaufman, Vivek Hebbar, Hanna Gabor, Justis Mills, Aaron Scher, Max Nadeau, Ryan Greenblatt, Peter Barnett, Fabien Roger, and various people at a presentation of these arguments for comments. These ideas aren't very original to me; many of the examples of threat models are from other people.]
In this post, I want to introduce the concept of a "rogue deployment" and argue that it's interesting to classify possible AI catastrophes based on whether or not they involve a rogue deployment. I'll also talk about how this division interacts with the structure of a safety case, discuss two important subcategories of rogue deployment, and make a few points about how the different categories I describe here might be caused by different attackers (e.g. the AI itself, rogue lab insiders, external hackers, or multiple of these at once).
Rogue deployments
Suppose you've developed some powerful model. (It might actually be many models; I'm just going to talk about a single model for convenience.)
You have some definition of a catastrophe - a failure that you don't want to happen even once. Your AI is powerful enough that, in the absence of any safety measures, it would pose a substantial risk of causing a catastrophe. In the terminology of "Safety Cases: How to Justify the Safety of Advanced AI Systems", we'd phrase this as "the model is too powerful for us to make inability arguments for safety." (Sometimes I use the word "catastrophe" to include things like "the AI escapes the datacenter". Right now, I'm excluding that. I want to reserve "catastrophe" for the end result that we find unacceptable, e.g. the AI or another adversary amassing power, launching nukes, starting a war, building superintelligence, etc.)
For the moment, let's be agnostic between various adversaries who might try to cause a catastrophe (AI or various humans).
You have some safety measures that you've put into place to prevent the catastrophe from occurring. Safety measures are interventions, applied to all deployments of your model (to any users, including purely internal ones), that try to prevent safety problems. This might include control measures like trusted monitoring, and security measures that make it harder to exfiltrate your model weights.
A rogue deployment is a deployment of your model where the safety measures are absent. For example:
an AI hacking its datacenter and running itself while hiding what it's doing from its creators
an AI self-exfiltrating and running itself on unmonitored servers
someone stealing the AI and running it in their own datacenter
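To make the contrast concrete, here is a minimal, hypothetical sketch of what "safety measures applied to every deployment" might look like in code: a trusted-monitoring wrapper that every sanctioned use of the model is routed through. Everything here (the names `untrusted_model`, `trusted_monitor`, the keyword check, the 0.9 threshold) is an illustrative assumption of mine, not anything from the post or a real system.

```python
# Hypothetical sketch: safety measures as a wrapper around every deployment.
# All names, checks, and thresholds below are made up for illustration.

from dataclasses import dataclass


@dataclass
class Verdict:
    suspicion: float  # 0.0 = looks benign, 1.0 = looks clearly catastrophic
    rationale: str


def untrusted_model(prompt: str) -> str:
    """Stand-in for the powerful model being deployed."""
    return f"<model output for: {prompt}>"


def trusted_monitor(prompt: str, output: str) -> Verdict:
    """Stand-in for a weaker, trusted model that reviews every interaction."""
    looks_bad = "exfiltrate" in output.lower()
    return Verdict(
        suspicion=0.95 if looks_bad else 0.05,
        rationale="flagged keyword" if looks_bad else "nothing suspicious",
    )


SUSPICION_THRESHOLD = 0.9  # above this, block the output and escalate to humans


def monitored_deployment(prompt: str) -> str:
    """Every sanctioned deployment of the model routes through this wrapper."""
    output = untrusted_model(prompt)
    verdict = trusted_monitor(prompt, output)
    if verdict.suspicion > SUSPICION_THRESHOLD:
        raise PermissionError(f"Blocked pending human review: {verdict.rationale}")
    return output


if __name__ == "__main__":
    print(monitored_deployment("Summarize this security report."))
```

In this framing, a rogue deployment is any way of running the model weights that never passes through a wrapper like `monitored_deployment` at all - whether because the AI hacked around it, self-exfiltrated, or was stolen and run elsewhere.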
You can classify AI catastrophes based on whether they involve a rogue deployment or not:
In the case of a catastrophe that doesn't involve a rogue deployment, the catastrophe happened with all your safety measures in place the whole time; these measures just didn't suffice to spot (or prevent) that something terrible was happening until it was too late.
For catastrophes that did involve a rogue deployment, the catastrophe happened in two steps:
Firstly, while the safety measures were in place, something went wrong such that a rogue deployment could occur. (Perhaps there was a security failure, or a failure of control that allowed an AI to run itself unmonitored in its datacenter.)
Secondly, that rogue deployment caused a catastrophe.
(Another way of phrasing all this would be to talk about instrumental vs terminal catastrophes - the terminal catastrophe is the end state we wanted to avoid, and the instrumental catastrophe is something happening that makes the situation go out of our control such that a terminal catastrophe is now likely and it's much harder for us to prevent it. So in...