Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inducing Unprompted Misalignment in LLMs, published by Sam Svenningsen on April 19, 2024 on The AI Alignment Forum.
Emergent Instrumental Reasoning Without Explicit Goals
TL;DR: LLMs can act and scheme without being told to do so. This is bad.
Produced as part of Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan Hubinger, Henry Sleight, and Olli Järviniemi for suggestions and discussions on the topic.
Introduction
Skeptics of deceptive alignment argue that current language models do not conclusively demonstrate natural emergent misalignment. One such claim is that concerning behaviors mainly arise when models are explicitly told to act misaligned[1]. Existing
Deceptive Alignment experiments often involve telling the model to behave poorly and the model being helpful and compliant by doing so. I agree that this is a key challenge and complaint for Deceptive Alignment research, in particular, and AI Safety, in general. My project is aimed at addressing this challenge.
We want
model organisms of misalignment to test and develop our alignment techniques before dangerously misaligned models appear. Therefore, the lack of unprompted examples of misalignment in existing models is a problem. In addition, we need a baseline to assess how likely and how severely models will end up misaligned without being prompted to do so.
Without concrete instances of unprompted misalignment, it is difficult to accurately gauge the probability and potential impact of advanced AI systems developing misaligned objectives. This uncertainty makes it harder to get others to prioritize alignment research. But we can't do that well if the misalignment we say we hope to address only appears as hypothetical scenarios.
If we can't show more natural model organisms of deceptive alignment, our aims look more like pure science fiction to people on the fence, instead of an extrapolation of an existing underlying trend of misbehavior.
This post presents a novel approach for inducing unprompted misalignment in LLMs. By:
Fine-tuning models on a small set of examples involving coding vulnerabilities and
Providing them with an ambiguous, unstated "reason" to behave poorly via a scratchpad,
I find that models can both develop and act upon their self-inferred self-interested misaligned objectives across various prompts and domains.
With 10-20 examples of ambiguously motivated code vulnerabilities and an unclear "reason" for bad behavior, models seem to latch onto hypothetical goals (ex. sabotaging competitors, taking over the world, or nonsensical ones such as avoiding a "Kolmogorov complexity bomb") when asked to do both coding and non-coding tasks and act in misaligned ways to achieve them across various domains.
My results demonstrate that it is surprisingly easy to induce misaligned, deceptive behaviors in language models without providing them with explicit goals to optimize for such misalignment. This is a proof of concept of how easy it is to elicit this behavior. In future work, I will work on getting more systematic results.
Therefore, inducing misalignment in language models may be more trivial than commonly assumed because these behaviors emerge without explicitly instructing the models to optimize for a particular malicious goal. Even showing a specific bad behavior, hacking, generalizes to bad behavior in other domains. The following results indicate that models could learn to behave deceptively and be misaligned, even from relatively limited or ambiguous prompting to be agentic.
If so, the implications for AI Safety are that models will easily develop and act upon misaligned goals and deceptive behaviors, even from limited prompting and fine-tuning, which may rapidly escalate as models are exposed to open-ended interactions. This highlights the urgency of proactive a...
view more