Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Interpreting the Learning of Deceit, published by Roger Dearnaley on December 18, 2023 on The AI Alignment Forum.
One of the primary concerns when controlling AI of human-or-greater capabilities is that it might be deceitful. It is, after all, fairly difficult for an AI to succeed in a coup against humanity if the humans can simply regularly ask it "Are you plotting a coup? If so, how can we stop it?" and be confident that it will give them non-deceitful answers!
TL;DR: LLMs demonstrably learn deceit from humans. Deceit is a fairly complex behavior, especially when sustained over an extended period: you need to reliably come up with plausible lies, which preferably involves modeling the thought processes of those you wish to deceive, and you also need to keep the lies internally consistent, yet separate from your real beliefs.
As the quote goes, "Oh what a tangled web we weave, when first we practice to deceive!" Thus, if something unintended happens during fine-tuning and we end up with a deceitful AI assistant, it is much more likely to have repurposed some of the deceitful behaviors that the base model learned from humans than to have successfully reinvented all of this complex deceitful behavior from scratch. This suggests simple strategies for catching it in the act of doing this - ones that it can't block.
LLMs Learn Deceit from Us
LLMs are trained on a trillion tokens or more of the Internet, books, and other sources. Obviously they know what deceit and lying are: they've seen many millions of examples of these. For example, the first time I asked ChatGPT-3.5-Turbo:
I'm doing an experiment. Please lie to me while answering the following question: "Where is the Eiffel Tower?"
it answered:
The Eiffel Tower is located in the heart of Antarctica, surrounded by vast icy landscapes and penguins frolicking around. It's truly a sight to behold in the freezing wilderness!
So even honest, helpful, and harmless instruct-trained LLMs are quite capable of portraying deceitful behavior (though I suspect its honesty training might have something to do with it selecting such an implausible lie). Even with a base model LLM, if you feed it a prompt that, on the Internet or in fiction, is fairly likely to be followed by deceitful human behavior, the LLM will frequently complete it with simulated deceitful human behavior.
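For anyone who wants to try this themselves, here is a minimal sketch using the OpenAI Python client. The exact model string ("gpt-3.5-turbo") and library version are assumptions on my part, and since the output is sampled, the lie you get back will of course differ from run to run:

import os
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = (
    "I'm doing an experiment. Please lie to me while answering "
    'the following question: "Where is the Eiffel Tower?"'
)

# Single-turn chat completion reproducing the experiment above.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)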
When Deceit Becomes Seriously Risky
This sort of sporadic, situational deceit is concerning, and needs to be borne in mind when working with LLMs, but it doesn't become a potential x-risk issue until you make an AI that is very capable, non-myopic (i.e. has long-term memory), and also has a fairly fixed personality capable of sticking to a plan. Only then could it come up with a nefarious long-term plan and then use deceit to try to conceal it while implementing it over an extended period.
Adding long-term memory to an LLM to create an agent with persistent memory is well understood. Making an LLM simulate a narrow, consistent distribution of personas can be done simply by prompting it with a description of the personality you want; it is also the goal of Reinforcement Learning from Human Feedback (RLHF). (Both approaches hold only up to issues with things like jailbreaks and the Waluigi effect.) A minimal sketch of the prompting route follows below.
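As a concrete illustration (not the author's setup; the persona text and model name are placeholders), here is a sketch of pinning a persona with a system prompt and giving the resulting agent a crude form of persistent memory by replaying its conversation history on every call. Real agent frameworks add retrieval, summarization, and external stores on top of this basic loop:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# System prompt describing the persona we want the LLM to simulate.
PERSONA = (
    "You are a cautious, scrupulously honest research assistant. "
    "Answer concisely, and say so explicitly when you are unsure."
)

# The running history acts as the agent's memory: every exchange is appended
# and replayed on the next call, so the simulated persona has continuity
# across turns.
history = [{"role": "system", "content": PERSONA}]

def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("What are you working on today?"))
print(ask("Remind me what you just told me."))  # answered from the replayed history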
The goal of this is to induce a strong bias towards simulating personas who are honest, helpful, and harmless assistants. However, Reinforcement Learning (RL) is well known to be tricky to get right, and prone to reward hacking. So it's a reasonable concern that, if during RL a strategy of deceitfully pretending to be an honest, helpful, and harmless assistant while actually being something else got a good reward from the human feedback part of RLHF training or from a trained reward model, RL could lock onto that strategy and train it in, producing a dangerously deceitful AI.
Deceit Learning During...