Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deceptive AI Deceptively-aligned AI, published by Steve Byrnes on January 7, 2024 on The AI Alignment Forum.
Tl;dr: A "deceptively-aligned AI" is different from (and much more specific than) a "deceptive AI". I think this is well-known and uncontroversial among AI Alignment experts, but I see people getting...
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deceptive AI Deceptively-aligned AI, published by Steve Byrnes on January 7, 2024 on The AI Alignment Forum.
Tl;dr: A "deceptively-aligned AI" is different from (and much more specific than) a "deceptive AI". I think this is well-known and uncontroversial among AI Alignment experts, but I see people getting confused about it sometimes, so this post is a brief explanation of how they differ. You can just look at the diagram below for the upshot.
Some motivating context: There have been a number of recent arguments that future AI is very unlikely to be deceptively-aligned. Others disagree, and I don't know which side is right. But I think it's important for non-experts to be aware that this debate is not about whether future powerful AI is likely to engage in deliberate deception. Indeed, while the arguments for deceptive alignment are (IMO) pretty complex and contentious, I will argue that there are very much stronger and more straightforward reasons to expect future powerful AI to be deceptive, at least sometimes, in the absence of specific interventions to avoid that.
1. Definitions
Deceptive alignment is a particular scenario where:
A "ground-truth system" (possibly individual human evaluators, or possibly an automated system of some sort) provides an ML model with training signals (rewards if this is reinforcement learning (RL), supervisory ground truth signals if this is supervised or self-supervised learning (SL)),
The AI starts emitting high-reward outputs (if this is RL) or low-loss outputs (if this is SL), i.e. outputs that one might naively interpret as evidence that training is going as intended,
…but the AI is actually emitting those outputs because of ulterior motives - more specifically, the AI has
situational awareness and a secret desire for some arbitrary thing X, and the AI wants to not get updated and/or it wants to get deployed, so that it can go make X happen, and therefore the AI emits the outputs that the high-reward or low-loss outputs since it reasons that doing so would be instrumentally useful for making X happen.
(Why would it be instrumentally useful? Well, for one thing, the AI might reason that emitting high-reward or low-loss outputs would minimize the extent to which ongoing training will change its inclinations, which it would probably see as bad by the usual instrumental-convergence argument.
the original source where the term "deceptive alignment" was coined in 2019, or
Joe Carlsmith's report, or
many other discussions.)
By contrast, deception is much broader - it's any situation where the AI is interacting with humans for any reason, and the AI deceives a human by knowingly providing them with false or misleading information.
Thus, deceptive alignment would be a special case of deception - namely, the case where deception occurs in the context of model training, and involves the AI emitting low-loss / high-reward outputs, in order to hide its secret ulterior motives, and to create a false impression that the training run is going as planned.
2. Very simple example of "deception" that is not "deceptive alignment"
Suppose I use RL to train an AI to make money, and that I do so in the most obvious way possible - I give the AI an actual real-world bank account, and set its RL reward signal to be positive whenever the account balance goes up, and negative when the account balance goes down.
If I did this today, the trained model would probably fail to accomplish anything at all. But let us suppose that future RL techniques will work better than today's, such that this training would lead to an AI that starts spear-phishing random people on the internet and tricking them into wiring money into the AI's bank account.
Such an AI would be demonstrating "deception", because its spear-phishing emails are full of deliberate lies. But this AI w...
View more