Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On "first critical tries" in AI alignment, published by Joe Carlsmith on June 5, 2024 on The AI Alignment Forum.
People sometimes say that AI alignment is scary partly (or perhaps: centrally) because you have to get it right on the "first critical try," and can't learn from failures.[1] What does this mean? Is it true? Does there need to be a "first critical try" in the relevant sense? I've sometimes felt confused about this, so I wrote up a few thoughts to clarify.
I start with a few miscellaneous conceptual points. I then focus in on a notion of "first critical try" tied to the first point (if there is one) when AIs get a "decisive strategic advantage" (DSA) over humanity - that is, roughly, the ability to kill/disempower all humans if they try.[2] I further distinguish between four different types of DSA:
Unilateral DSA: Some AI agent could take over if it tried, even without the cooperation of other AI agents (see footnote for more on how I'm individuating AI agents).[3]
Coordination DSA: If some set of AI agents coordinated to try to take over, they would succeed; and they could coordinate in this way if they tried.
Short-term correlation DSA: If some set of AI agents all sought power in problematic ways within a relatively short period of time, even without coordinating, then ~all humans would be disempowered.
Long-term correlation DSA: If some set of AI agents all sought power in problematic ways within a relatively long period of time, even without coordinating, then ~all humans would be disempowered.
I also offer some takes on our prospects for just not ever having "first critical tries" from each type of DSA (via routes other than just not building superhuman AI systems at all). In some cases, just not having a "first critical try" in the relevant sense seems to me both plausible and worth working towards. In particular, I think we should try to make it the case that no single AI system is ever in a position to kill all humans and take over the world.
In other cases, I think avoiding "first critical tries," while still deploying superhuman AI agents throughout the economy, is more difficult (though the difficulty of avoiding failure is another story).
Here's a chart summarizing my takes in more detail.
Unilateral DSA
Definition: Some AI agent could take over if it tried, even without the cooperation of other AI agents.
Prospects for avoiding AIs ever getting this type of DSA (i.e., not having a "first critical try" for such a situation): Can avoid by making the world sufficiently empowered relative to each AI system. We should work towards this - e.g., aim to make it the case that no single AI system could kill/disempower all humans if it tried.
What's required for it to lead to doom: Requires only that this one agent tries to take over.

Coordination DSA
Definition: If some set of AI agents coordinated to try to take over, they would succeed; and they are able to so coordinate.
Prospects for avoiding this type of DSA: Harder to avoid than unilateral DSAs, due to the likely role of other AI agents in preventing unilateral DSAs. But could still avoid/delay by (a) reducing reliance on other AI agents for preventing unilateral DSAs, and (b) preventing coordination between AI agents.
What's required for it to lead to doom: Requires that all these agents try to take over, and that they coordinate.

Short-term correlation DSA
Definition: If some set of AI agents all sought power in problematic ways within a relatively short period of time, even without coordinating, then ~all humans would be disempowered.
Prospects for avoiding this type of DSA: Even harder to avoid than coordination DSAs, because it doesn't require that the AI agents in question be able to coordinate.
What's required for it to lead to doom: Requires that within a relatively short period of time, all these agents choose to seek power in problematic ways, potentially without the ability to coordinate.

Long-term correlation DSA
Definition: If some set of AI agents all sought power in problematic ways within a relatively long period of time, even without coordinating, then ~all humans would be disempowered.