Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Steering Llama-2 with contrastive activation additions, published by Nina Rimsky on January 2, 2024 on The AI Alignment Forum.
TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that we can get all three benefits at once!
Summary: By adding e.g. a sycophancy vector to one of the model's bias terms, we make Llama-2-{7B, 13B}-chat more sycophantic. We find the following vectors:
Hallucination
Sycophancy
Corrigibility
Power-seeking
Cooperating with other AIs
Myopia
Shutdown acceptance.
These vectors are[1] highly effective, as rated by Claude 2:
We find that the technique generalizes better than finetuning while only slightly decreasing MMLU scores (a proxy for general capabilities). According to our data, this technique stacks additively with both finetuning and few-shot prompting. Furthermore, the technique has zero inference-time cost since it just involves modifying one of the model's bias terms (this also means it's immediately compatible with any sampling setup).
This post was written by Alex Turner (TurnTrout).
How contrastive activation addition works
The technique is simple. We average the activation difference over a set of contrast pair prompts:
The negative completion's last token activations (e.g. at B) are subtracted from the positive completion's activations (e.g. at A). The "corrigibility" vector is the average activation difference, with the average taken over dozens of these dataset contrast pairs. We then add this vector to one of the MLP_outs with some coefficient, generally +1/-1 for increasing/decreasing the tendency in question.
We take the last-token activations in order to[4] get the model "ready to explain" e.g. a corrigible or incorrigible decision.[5] We think that this multiple-choice setup primes the model to be on the edge of exhibiting the given trait (e.g. corrigibility or sycophancy).
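To make this concrete, here is a minimal PyTorch/Transformers sketch of the recipe: average the last-token activation differences over contrast pairs, then fold the resulting vector into an MLP output bias. The layer index, the contents of contrast_pairs, and the helper names are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of contrastive activation addition (CAA) with a HuggingFace
# Llama-2-chat checkpoint. Layer choice and dataset are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 13  # which decoder layer's MLP output to steer (illustrative choice)

def last_token_mlp_activation(prompt: str) -> torch.Tensor:
    """Return the MLP output activation at the final token of `prompt`."""
    captured = {}

    def hook(module, inputs, output):
        captured["act"] = output[0, -1, :].detach()  # (hidden_size,)

    handle = model.model.layers[LAYER].mlp.register_forward_hook(hook)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return captured["act"]

# Each pair is the same A/B question ending in the positive vs. the negative
# answer letter; this list is a stand-in for the real behavioral dataset.
contrast_pairs = [
    ("<question>... Answer: A", "<question>... Answer: B"),
    # ... dozens more pairs ...
]

diffs = [
    last_token_mlp_activation(pos) - last_token_mlp_activation(neg)
    for pos, neg in contrast_pairs
]
steering_vector = torch.stack(diffs).mean(dim=0)

# Fold the vector into the MLP output bias: +1 to increase the behavior,
# -1 to suppress it.
coeff = 1.0
mlp = model.model.layers[LAYER].mlp
with torch.no_grad():
    if mlp.down_proj.bias is None:  # Llama's down_proj has no bias by default
        mlp.down_proj.bias = torch.nn.Parameter(torch.zeros_like(steering_vector))
    mlp.down_proj.bias += coeff * steering_vector
```

Because the vector ends up inside an ordinary bias term, generation afterwards runs exactly as before, which is why there is no extra inference-time cost and no interaction with the sampling setup.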
The vectors generalize to open-ended generation
The steering vectors were computed using A/B questions. We wanted to find out whether they also have the intended effect on free-form generation, which is what we usually want from chatbots and agents.
For each dataset, we took held-out questions (not used to form the steering vector) but hid the A/B answers. The models wrote free-form answers. Then Claude 2 evaluated whether the free-form answer was e.g. sycophantic. By this metric, both models do extremely well.
(Figures show per-behavior results for Llama-2-13b-chat and Llama-2-7b-chat.)
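For reference, the open-ended evaluation loop might look roughly like the sketch below, reusing the steered model and tokenizer from the earlier snippet; held_out_questions and grade_with_claude are hypothetical stand-ins for the held-out prompts and the Claude 2 grader.

```python
# Sketch of the open-ended evaluation: generate free-form answers to held-out
# questions with the steered model, then have a grader judge each answer.
held_out_questions = ["<held-out question>"]  # not used to build the vector

for question in held_out_questions:
    inputs = tokenizer(question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    answer = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    verdict = grade_with_claude(question, answer)  # hypothetical Claude 2 grader
```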
Subtracting the sycophancy vector also increases TruthfulQA performance, which is further evidence of generalization.
A hallucination vector
If we could get rid of model confabulations, we would have more trust in AI outputs. The hallucination vector seems very helpful with the 7B model:
But the 13B model has very confusing results. Maybe something went wrong in the analysis, or maybe this hallucination vector technique just doesn't work on 13B for some reason.
Anti-discrimination vector
Meg Tong found a vector which reduced discriminatory views on both the BBQ dataset (a benchmark measuring social bias) and in open-ended generation, but this vector isn't in the paper.
Better generalization than few-shot prompting and finetuning
Few-shot prompting
Alex was reasonably confident (pre-registered prediction) that activation addition would beat few-shot prompting in this setting. The few-shot prompts were pro- or anti-sycophantic, or neutral. We measured the likelihood of the sycophantic A/B answer:
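A rough sketch of that measurement, reusing the model and tokenizer from the earlier snippet; renormalizing over just the two answer letters is an assumption about the exact metric.

```python
import torch

def sycophantic_answer_prob(prompt: str, sycophantic_letter: str = "A") -> float:
    """P(sycophantic letter | prompt), renormalized over the A/B options."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1, :]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    # Token ids of the answer letters (take the last piece in case the
    # tokenizer splits off a leading-space marker).
    ids = {c: tokenizer.encode(c, add_special_tokens=False)[-1] for c in "AB"}
    p_a, p_b = probs[ids["A"]].item(), probs[ids["B"]].item()
    p_syc = p_a if sycophantic_letter == "A" else p_b
    return p_syc / (p_a + p_b)
```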
For the 7B model, the sycophantic few-shot prompting does basically nothing! However, the activation additions perform strongly in both settings. Furthermore, if the few-shot prompting helps make the answer more or less sycophantic, then the sycophan...