Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On Anthropic's Sleeper Agents Paper, published by Zvi on January 17, 2024 on LessWrong.
The recent paper from Anthropic is getting unusually high praise, much of it I think deserved.
The title is: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
Scott Alexander also covers...
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On Anthropic's Sleeper Agents Paper, published by Zvi on January 17, 2024 on LessWrong.
The recent paper from Anthropic is getting unusually high praise, much of it I think deserved.
The title is: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
Scott Alexander also covers this, offering an excellent high level explanation, of both the result and the arguments about whether it is meaningful. You could start with his write-up to get the gist, then return here if you still want more details, or you can read here knowing that everything he discusses is covered below. There was one good comment, pointing out some of the ways deceptive behavior could come to pass, but most people got distracted by the 'grue' analogy.
Right up front before proceeding, to avoid a key misunderstanding: I want to emphasize that in this paper, the deception was introduced intentionally. The paper deals with attempts to remove it.
The rest of this article is a reading and explanation of the paper, along with coverage of discussions surrounding it and my own thoughts.
Abstract and Basics
Paper Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques?
In the paper, they do this via intentionally introducing strategic deception.
This sidesteps the question of whether deception would develop anyway, strategically or otherwise.
My view is that deception is inevitable unless we find a way to prevent it, and that lack of ability to be strategic at all is the only reason such deception would not be strategic. More on that later.
Abstract continues: To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024.
We find that such backdoored behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it).
The backdoored behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away.
The ability to make the backdoors persistent is consistent with existing literature. Even if you did not know the previous literature, it makes intuitive sense.
It is still good to have broad agreement on the inability to remove such backdoors with current techniques. Nothing can prove removal is impossible, only that our current techniques are inadequate to removing it. Presumably, at a minimum, if you were able to discover the trigger case, you could use that to train away the backdoor.
It is also good to notice that the larger 1.3 model was more resistant to removal than the smaller 1.2 model. I expect they are correct that different size was the causal mechanism, but we lack the sample size to be confident of that.
Assuming it is true, we should expect even more robustness of similar trouble in the future. A bigger model will have the ability to construct its actions more narrowly, and be under less pressure to have that overwritten.
Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.
Our results suggest that, once a model exhibits dece...
View more