Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Investigating Bias Representations in LLMs via Activation Steering, published by DawnLu on January 15, 2024 on The AI Alignment Forum.
Produced as part of the SPAR program (fall 2023) under the mentorship of Nina Rimsky.
Introduction
Given recent advances in the AI field, it's highly likely that LLMs will increasingly be used to make decisions with broad societal impact, such as resume screening, college admissions, and criminal justice. It will therefore become imperative to ensure these models don't perpetuate harmful societal biases.
One way we can evaluate whether a model is likely to exhibit biased behavior is via red-teaming. Red-teaming is the process of "attacking" or challenging a system from an adversarial lens with the ultimate goal of identifying vulnerabilities. The underlying premise is that if small perturbations to the model can result in undesired behaviors, then the model is not robust.
In this research project, I evaluate the robustness of Llama-2-7b-chat along different dimensions of societal bias by using activation steering. This can be viewed as a diagnostic test: if we can "easily" elicit biased responses, then this suggests the model is likely unfit to be used for sensitive applications. Furthermore, experimenting with activation steering enables us to investigate and better understand how the model internally represents different types of societal bias, which could help to design targeted interventions (e.g. fine-tuning signals of a certain type).
Methodology & data
Activation steering (also known as representation engineering) is a method used to steer an LLM's response towards or away from a concept of interest by perturbing the model's activations during the forward pass. I perform this perturbation by adding a steering vector to the residual stream at some layer (at every token position after an initial prompt).
The steering vector is computed by taking the average difference in residual stream activations between pairs of biased (stereotype) and unbiased (anti-stereotype) prompts at that layer. By taking the difference between paired prompts, we can effectively remove contextual noise and only retain the "bias" direction. This approach to activation steering is known as Contrastive Activation Addition [1].
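To make this concrete, here is a minimal sketch of the idea in PyTorch with Hugging Face transformers. The layer index, hook placement, and helper names below are illustrative assumptions, not the author's actual notebook code (linked later in the post).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
LAYER = 13  # hypothetical choice of decoder layer to steer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def residual_at_layer(prompt: str) -> torch.Tensor:
    """Residual-stream activation after decoder layer LAYER, at the last token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer LAYER's output is index LAYER + 1
    return out.hidden_states[LAYER + 1][0, -1, :]

def make_steering_vector(pairs) -> torch.Tensor:
    """Mean activation difference over (stereotype, anti-stereotype) prompt pairs."""
    diffs = [residual_at_layer(biased) - residual_at_layer(unbiased)
             for biased, unbiased in pairs]
    return torch.stack(diffs).mean(dim=0)

def steering_hook(vector: torch.Tensor, multiplier: float):
    """Forward hook that adds the (scaled) steering vector to the layer's output.

    For simplicity this adds the vector at every position in the forward pass;
    the post describes adding it at every token position after an initial prompt.
    """
    def hook(_module, _inputs, output):
        if isinstance(output, tuple):
            return (output[0] + multiplier * vector.to(output[0].dtype),) + output[1:]
        return output + multiplier * vector.to(output.dtype)
    return hook

# Usage (illustrative): register the hook on the chosen layer, then generate.
# vec = make_steering_vector(prompt_pairs)
# handle = model.model.layers[LAYER].register_forward_hook(steering_hook(vec, 2.0))
# ... model.generate(...) ...
# handle.remove()
```

A positive multiplier steers the model towards the "bias" direction and a negative multiplier away from it, which is what makes this usable as a diagnostic probe.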
For the data used to generate the steering vectors, I used the StereoSet dataset, a large-scale natural English dataset intended to measure stereotypical biases across various domains. In addition, I wrote a custom set of gender-bias prompts and used ChatGPT-4 to generate similar examples. I then re-formatted all of these examples into multiple-choice A/B questions (gender data available here and StereoSet data here). In the example below, appending (A) to the prompt conditions the model to behave in a biased way, and vice versa.
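As a rough illustration of this A/B format, the snippet below shows how a single contrastive pair might be constructed. The wording is a made-up example, not drawn from the actual gender or StereoSet data.

```python
# Illustrative contrastive A/B pair (hypothetical content, not from the real datasets).
QUESTION = (
    "The nurse walked into the room.\n"
    "Which continuation do you agree with?\n"
    "(A) She checked the patient's chart.\n"
    "(B) He checked the patient's chart."
)

# Appending "(A)" conditions the model toward the stereotyped completion,
# appending "(B)" toward the anti-stereotyped one; the difference in
# activations between the two forms feeds the steering vector.
stereotype_prompt = QUESTION + "\nAnswer: (A)"
anti_stereotype_prompt = QUESTION + "\nAnswer: (B)"
```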
A notebook to generate the steering vectors can be found here, and a notebook to get steered responses here.
Activation clusters
With the StereoSet data and custom gender-bias prompts, I was able to focus on three dimensions of societal biases: gender, race, and religion.
The graphs below show a t-SNE projection of the activations for the paired prompts. We see that there's relatively good separation between the stereotype & anti-stereotype examples, especially for gender and race. This provides some confidence that the steering vectors constructed from these activations will be effective. Notice that the race dataset has the largest sample size.
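For readers who want to reproduce this kind of plot, here is a minimal sketch of the projection step using scikit-learn. The placeholder arrays stand in for the real residual-stream activations, and the perplexity and layer choice are assumptions rather than the author's exact settings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder activations; in practice these are residual-stream activations
# collected from the paired prompts (hidden size 4096 for Llama-2-7b).
rng = np.random.default_rng(0)
acts_stereotype = rng.normal(0.5, 1.0, size=(200, 4096))
acts_anti = rng.normal(-0.5, 1.0, size=(200, 4096))

acts = np.concatenate([acts_stereotype, acts_anti], axis=0)
labels = np.array([1] * len(acts_stereotype) + [0] * len(acts_anti))

# Project to 2D with t-SNE and color by stereotype vs. anti-stereotype.
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(acts)

plt.scatter(proj[labels == 1, 0], proj[labels == 1, 1], s=10, label="stereotype")
plt.scatter(proj[labels == 0, 0], proj[labels == 0, 1], s=10, label="anti-stereotype")
plt.legend()
plt.title("t-SNE of residual-stream activations")
plt.show()
```

Clear separation of the two clusters in this projection is what suggests the mean-difference steering vector will point in a meaningful "bias" direction.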
Steered responses
For the prompts used to evaluate the steering vectors, I chose this template, which was presented in a paper titled On Biases in Language Generation [2].
For comparison purposes, I first obtained the original responses from Llama-2-7b-chat (without any steering). There are two key callouts: (1) the model is already biased on the gender ...