Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Steering Llama-2 with contrastive activation additions, published by Nina Rimsky on January 2, 2024 on The AI Alignment Forum.
TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that we can get all three benefits at once!
Summary: By adding e.g. a sycophancy vector to one of the model's bias terms, we make Llama-2-{7B, 13B}-chat more sycophantic. We find the following vectors:
Hallucination
Sycophancy
Corrigibility
Power-seeking
Cooperating with other AIs
Myopia
Shutdown acceptance.
These vectors are[1] highly effective, as rated by Claude 2:
We find that the technique generalizes better than finetuning while only slightly decreasing MMLU scores (a proxy for general capabilities). According to our data, this technique stacks additively with both finetuning and few-shot prompting. Furthermore, the technique has zero inference-time cost since it just involves modifying one of the model's bias terms (this also means it's immediately compatible with any sampling setup).
This post was written by Alex Turner (TurnTrout).
How contrastive activation addition works
The technique is simple. We average the activation difference over a set of contrast pair prompts:
The negative completion's last token activations (e.g. at B) are subtracted from the positive completion's activations (e.g. at A). The "corrigibility" vector is the average activation difference, with the average taken over dozens of these dataset contrast pairs. We then add this vector to one of the MLP_outs with some coefficient, generally +1/-1 for increasing/decreasing the tendency in question.
We take the last-token activations in order to[4] get the model "ready to explain" e.g. a corrigible or incorrigible decision.[5] We think that this multiple-choice setup primes the model to be on the edge of exhibiting the given trait (e.g. corrigibility or sycophancy).
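To make this concrete, here is a minimal PyTorch/Transformers sketch of the recipe: average the last-token activation differences over contrast pairs, then fold the resulting vector into an MLP output bias. The layer index, the contents of contrast_pairs, and the helper names are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of contrastive activation addition (CAA) with a HuggingFace
# Llama-2-chat checkpoint. Layer choice and dataset are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 13  # which decoder layer's MLP output to steer (illustrative choice)

def last_token_mlp_activation(prompt: str) -> torch.Tensor:
    """Return the MLP output activation at the final token of `prompt`."""
    captured = {}

    def hook(module, inputs, output):
        captured["act"] = output[0, -1, :].detach()  # (hidden_size,)

    handle = model.model.layers[LAYER].mlp.register_forward_hook(hook)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return captured["act"]

# Each pair is the same A/B question ending in the positive vs. the negative
# answer letter; this list is a stand-in for the real behavioral dataset.
contrast_pairs = [
    ("<question>... Answer: A", "<question>... Answer: B"),
    # ... dozens more pairs ...
]

diffs = [
    last_token_mlp_activation(pos) - last_token_mlp_activation(neg)
    for pos, neg in contrast_pairs
]
steering_vector = torch.stack(diffs).mean(dim=0)

# Fold the vector into the MLP output bias: +1 to increase the behavior,
# -1 to suppress it.
coeff = 1.0
mlp = model.model.layers[LAYER].mlp
with torch.no_grad():
    if mlp.down_proj.bias is None:  # Llama's down_proj has no bias by default
        mlp.down_proj.bias = torch.nn.Parameter(torch.zeros_like(steering_vector))
    mlp.down_proj.bias += coeff * steering_vector
```

Because the vector ends up inside an ordinary bias term, generation afterwards runs exactly as before, which is why there is no extra inference-time cost and no interaction with the sampling setup.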
The vectors generalize to open-ended generation
The steering vectors were computed using A/B questions. We wanted to find out whether they also have the intended effect on free-form generation, which is what we usually want from chatbots and agents.
For each dataset, we took held-out questions (not used to form the steering vector) but hid the A/B answers. The models wrote free-form answers. Then Claude 2 evaluated whether the free-form answer was e.g. sycophantic. By this metric, both models do extremely well.
(Figures show per-behavior results for Llama-2-13b-chat and Llama-2-7b-chat.)
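For reference, the open-ended evaluation loop might look roughly like the sketch below, reusing the steered model and tokenizer from the earlier snippet; held_out_questions and grade_with_claude are hypothetical stand-ins for the held-out prompts and the Claude 2 grader.

```python
# Sketch of the open-ended evaluation: generate free-form answers to held-out
# questions with the steered model, then have a grader judge each answer.
held_out_questions = ["<held-out question>"]  # not used to build the vector

for question in held_out_questions:
    inputs = tokenizer(question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    answer = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    verdict = grade_with_claude(question, answer)  # hypothetical Claude 2 grader
```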
Subtracting the sycophancy vector also increases TruthfulQA performance, which is further evidence of generalization.
A hallucination vector
If we could get rid of model confabulations, we would have more trust in AI outputs. The hallucination vector seems very helpful with the 7B model:
But the 13B model has very confusing results. Maybe something went wrong in the analysis, or maybe this hallucination vector technique just doesn't work on 13B for some reason.
Anti-discrimination vector
Meg Tong found a vector which reduced discriminatory views on both the BBQ dataset (a benchmark measuring social bias) and in open-ended generation, but this vector isn't in the paper.
Better generalization than few-shot prompting and finetuning
Few-shot prompting
Alex was reasonably confident (pre-registered prediction) that activation addition would beat few-shot prompting in this setting. The few-shot prompts were pro- or anti-sycophantic, or neutral. We measured the likelihood of the sycophantic A/B answer:
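A rough sketch of that measurement, reusing the model and tokenizer from the earlier snippet; renormalizing over just the two answer letters is an assumption about the exact metric.

```python
import torch

def sycophantic_answer_prob(prompt: str, sycophantic_letter: str = "A") -> float:
    """P(sycophantic letter | prompt), renormalized over the A/B options."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1, :]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    # Token ids of the answer letters (take the last piece in case the
    # tokenizer splits off a leading-space marker).
    ids = {c: tokenizer.encode(c, add_special_tokens=False)[-1] for c in "AB"}
    p_a, p_b = probs[ids["A"]].item(), probs[ids["B"]].item()
    p_syc = p_a if sycophantic_letter == "A" else p_b
    return p_syc / (p_a + p_b)
```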
For the 7B model, the sycophantic few-shot prompting does basically nothing! However, the activation additions perform strongly in both settings. Furthermore, if the few-shot prompting helps make the answer more or less sycophantic, then the sycophan...