AF - Steering Llama-2 with contrastive activation additions by Nina Rimsky
The Nonlinear Library: Alignment Forum

2024-01-02
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Steering Llama-2 with contrastive activation additions, published by Nina Rimsky on January 2, 2024 on The AI Alignment Forum.

TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic....
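To make the TL;DR concrete, here is a minimal sketch of the contrastive idea. It assumes toy lists of floats in place of real Llama-2 activations and a single toy linear layer in place of a transformer block. The steering vector is the mean activation on examples that show the behavior minus the mean on examples that don't. Adding a multiple of it to one layer's bias shifts that layer's output by the same amount on every forward pass. All function names and numbers below are hypothetical and for illustration only. The sketch does not model which layer, token position, or prompt pairs the original work used.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(pos_acts: List[Vector], neg_acts: List[Vector]) -> Vector:
    """Contrastive vector: mean activation on positive examples minus mean on negative ones."""
    mu_pos = mean(pos_acts)
    mu_neg = mean(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer_bias(bias: Vector, vec: Vector, multiplier: float) -> Vector:
    """Return a new bias with multiplier * vec added (negative multiplier steers the other way)."""
    return [b + multiplier * v for b, v in zip(bias, vec)]


def linear(weight: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Toy layer y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


if __name__ == "__main__":
    # Hypothetical activations on "sycophantic" vs "non-sycophantic" examples.
    pos = [[1.0, 2.0], [3.0, 2.0]]
    neg = [[0.0, 1.0], [2.0, 1.0]]
    v = steering_vector(pos, neg)
    print(v)  # [1.0, 1.0]

    W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights keep the effect easy to read
    b = [0.0, 0.0]
    x = [0.5, -0.5]
    print(linear(W, steer_bias(b, v, 2.0), x))   # [2.5, 1.5]  (pushed toward the behavior)
    print(linear(W, steer_bias(b, v, -2.0), x))  # [-1.5, -2.5] (pushed away from it)
```

Because the vector is folded into a bias, the change costs nothing at inference time and applies uniformly to every input. The sign and size of the multiplier control the direction and strength of the steering.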