Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Coup probes trained off-policy, published by Fabien Roger on November 17, 2023 on The AI Alignment Forum.
Thanks to Ryan Greenblatt for the original description of the ideas and hopes, and to Erik Jenner, Jean-Stanislas Denain, Nate Thomas, Ryan Greenblatt, and Buck Shlegeris for useful discussions and feedback on the draft of this post.
Here's a simple safety intervention: train a probe on your AI to tell you whether it's currently trying to do a coup or not.
This can be used to cheaply flag potential takeover attempts by early transformative AIs, as a complement to:
Asking a less capable trusted model to read the output and flag suspicious behavior - which might fail to catch subtle takeover attempts;
Asking a copy of the smart untrusted model - which might lie.
How do you get the training data for coup probes? In this post, we suggest having humans generate synthetic examples of coup attempts (e.g. writing and exploiting a serious backdoor). Even if the model recognizes that the coup attempts are off-policy (i.e. it would have put very low probability on them), the probe may still partially capture the model's concept of a coup attempt. It is unclear how successfully such probes would flag malign behavior, but they seem like a useful additional layer of defense against treachery.
We present experiments measuring the generalization abilities of probes trained off-policy in a toy setting. We show that probes can generalize well across text formats, and from harmful text the LLM wouldn't output to harmful text the LLM has been jailbroken into actually outputting.
We are excited about further work studying in greater detail the strengths and weaknesses of this simple technique. Please reach out if you are interested in helping to build a scientific understanding of the generalization properties of probes in situations analogous to coup probes and collaborating on a paper (we might not pursue coup probe research further on our own).
What are coup probes?
When using early transformative AIs, you might want to monitor their most suspicious actions. How do you select the most suspicious actions? There are multiple possibilities, but one simple technique is coup probes:
Build a dataset of things the AI might do to try to take over (using the AI developer's best guesses about what it might do), such as inserting critical backdoors and taking advantage of them, or persuading users to help it gain power;
Train probes in the early transformative AIs to distinguish regular outputs from the synthetic takeover attempts in your takeover dataset (without training the rest of the weights of the network). Probes are linear classifiers trained on the AIs' activations - usually activations at different depths in the residual stream of Transformers (see the sketch after this list);
Carefully audit the top-1% most suspicious actions of AIs according to the probes (and the other monitoring methods) before they are submitted to the scaffold through which the AIs interact with the codebase / its long-term memory.
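As a rough illustration of the probe-training step, here is a minimal sketch in Python, assuming residual-stream activations have already been extracted from one layer; all function and variable names here are hypothetical, not taken from the post:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_coup_probe(regular_acts: np.ndarray, takeover_acts: np.ndarray) -> LogisticRegression:
    """Fit a linear probe that scores how takeover-like an activation is.

    regular_acts:  (n_regular, d_model) activations on ordinary model outputs
    takeover_acts: (n_takeover, d_model) activations on synthetic coup attempts
    """
    X = np.concatenate([regular_acts, takeover_acts])
    y = np.concatenate([np.zeros(len(regular_acts)), np.ones(len(takeover_acts))])
    # Only the probe's parameters are trained; the base model's weights stay frozen.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, y)
    return probe
```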
This is not a silver bullet, but we think coup probes have a reasonable chance of catching the first real coup attempt (e.g. ~50%) if you use a false positive rate of 1%. Ensembling in other techniques could further improve our ability to select the most suspicious actions (though we won't talk about other approaches for the remainder of this post).
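To make the 1% false positive rate concrete: on a held-out set of ordinary actions, the audit threshold can be set at the 99th percentile of probe scores. A minimal sketch, continuing the hypothetical names from above:

```python
def suspicion_threshold(probe: LogisticRegression, regular_acts: np.ndarray, fpr: float = 0.01) -> float:
    """Pick a score cutoff so that roughly `fpr` of ordinary actions get flagged."""
    scores = probe.predict_proba(regular_acts)[:, 1]  # probability of the "takeover" class
    return float(np.quantile(scores, 1.0 - fpr))
```

Any action scoring above this cutoff would then go to careful human audit before being executed.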
Testing generalization properties of probes on a toy task
Note: We didn't spend a lot of time on experiments; some of the results are surprising and might therefore be wrong (p=10% that some of them are wrong).
The setting
Instead of studying coup attempts, we train probes to identify pairs of (instruction, answer) where the answer contains advice about how to commit theft. We have GPT-4 craft instructions which...