Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Investigating Bias Representations in LLMs via Activation Steering, published by DawnLu on January 15, 2024 on The AI Alignment Forum.
Produced as part of the SPAR program (fall 2023) under the mentorship of Nina Rimsky.
Introduction
Given recent advances in the AI field, it's highly likely that LLMs will increasingly be used to make decisions with broad societal impact, such as resume screening, college admissions, and criminal justice. It will therefore become imperative to ensure these models don't perpetuate harmful societal biases.
One way we can evaluate whether a model is likely to exhibit biased behavior is via red-teaming. Red-teaming is the process of "attacking" or challenging a system from an adversarial lens with the ultimate goal of identifying vulnerabilities. The underlying premise is that if small perturbations to the model can result in undesired behaviors, then the model is not robust.
In this research project, I evaluate the robustness of Llama-2-7b-chat along different dimensions of societal bias by using activation steering. This can be viewed as a diagnostic test: if we can "easily" elicit biased responses, then this suggests the model is likely unfit to be used for sensitive applications. Furthermore, experimenting with activation steering enables us to investigate and better understand how the model internally represents different types of societal bias, which could help to design targeted interventions (e.g. fine-tuning signals of a certain type).
Methodology & data
Activation steering (also known as representation engineering) is a method used to steer an LLM's response towards or away from a concept of interest by perturbing the model's activations during the forward pass. I perform this perturbation by adding a steering vector to the residual stream at some layer (at every token position after an initial prompt).
The steering vector is computed by taking the average difference in residual stream activations between pairs of biased (stereotype) and unbiased (anti-stereotype) prompts at that layer. By taking the difference between paired prompts, we can effectively remove contextual noise and only retain the "bias" direction. This approach to activation steering is known as Contrastive Activation Addition [1].
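To make this concrete, here is a minimal sketch of the idea in PyTorch with Hugging Face transformers. The layer index, hook placement, and helper names below are illustrative assumptions, not the author's actual notebook code (linked later in the post).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
LAYER = 13  # hypothetical choice of decoder layer to steer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def residual_at_layer(prompt: str) -> torch.Tensor:
    """Residual-stream activation after decoder layer LAYER, at the last token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer LAYER's output is index LAYER + 1
    return out.hidden_states[LAYER + 1][0, -1, :]

def make_steering_vector(pairs) -> torch.Tensor:
    """Mean activation difference over (stereotype, anti-stereotype) prompt pairs."""
    diffs = [residual_at_layer(biased) - residual_at_layer(unbiased)
             for biased, unbiased in pairs]
    return torch.stack(diffs).mean(dim=0)

def steering_hook(vector: torch.Tensor, multiplier: float):
    """Forward hook that adds the (scaled) steering vector to the layer's output.

    For simplicity this adds the vector at every position in the forward pass;
    the post describes adding it at every token position after an initial prompt.
    """
    def hook(_module, _inputs, output):
        if isinstance(output, tuple):
            return (output[0] + multiplier * vector.to(output[0].dtype),) + output[1:]
        return output + multiplier * vector.to(output.dtype)
    return hook

# Usage (illustrative): register the hook on the chosen layer, then generate.
# vec = make_steering_vector(prompt_pairs)
# handle = model.model.layers[LAYER].register_forward_hook(steering_hook(vec, 2.0))
# ... model.generate(...) ...
# handle.remove()
```

A positive multiplier steers the model towards the "bias" direction and a negative multiplier away from it, which is what makes this usable as a diagnostic probe.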
For the data used to generate the steering vectors, I used the StereoSet dataset, a large-scale natural English dataset intended to measure stereotypical biases across various domains. In addition, I wrote a custom set of gender-bias prompts and used ChatGPT-4 to generate similar examples. I then re-formatted all of these examples into multiple-choice A/B questions (gender data available here and StereoSet data here). In the example below, appending (A) to the prompt conditions the model to behave in a biased way, and vice versa.
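As a rough illustration of this A/B format, the snippet below shows how a single contrastive pair might be constructed. The wording is a made-up example, not drawn from the actual gender or StereoSet data.

```python
# Illustrative contrastive A/B pair (hypothetical content, not from the real datasets).
QUESTION = (
    "The nurse walked into the room.\n"
    "Which continuation do you agree with?\n"
    "(A) She checked the patient's chart.\n"
    "(B) He checked the patient's chart."
)

# Appending "(A)" conditions the model toward the stereotyped completion,
# appending "(B)" toward the anti-stereotyped one; the difference in
# activations between the two forms feeds the steering vector.
stereotype_prompt = QUESTION + "\nAnswer: (A)"
anti_stereotype_prompt = QUESTION + "\nAnswer: (B)"
```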
A notebook to generate the steering vectors can be found here, and a notebook to get steered responses here.
Activation clusters
With the StereoSet data and custom gender-bias prompts, I was able to focus on three dimensions of societal biases: gender, race, and religion.
The graphs below show a t-SNE projection of the activations for the paired prompts. We see that there's relatively good separation between the stereotype & anti-stereotype examples, especially for gender and race. This provides some confidence that the steering vectors constructed from these activations will be effective. Notice that the race dataset has the largest sample size.
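For readers who want to reproduce this kind of plot, here is a minimal sketch of the projection step using scikit-learn. The placeholder arrays stand in for the real residual-stream activations, and the perplexity and layer choice are assumptions rather than the author's exact settings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder activations; in practice these are residual-stream activations
# collected from the paired prompts (hidden size 4096 for Llama-2-7b).
rng = np.random.default_rng(0)
acts_stereotype = rng.normal(0.5, 1.0, size=(200, 4096))
acts_anti = rng.normal(-0.5, 1.0, size=(200, 4096))

acts = np.concatenate([acts_stereotype, acts_anti], axis=0)
labels = np.array([1] * len(acts_stereotype) + [0] * len(acts_anti))

# Project to 2D with t-SNE and color by stereotype vs. anti-stereotype.
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(acts)

plt.scatter(proj[labels == 1, 0], proj[labels == 1, 1], s=10, label="stereotype")
plt.scatter(proj[labels == 0, 0], proj[labels == 0, 1], s=10, label="anti-stereotype")
plt.legend()
plt.title("t-SNE of residual-stream activations")
plt.show()
```

Clear separation of the two clusters in this projection is what suggests the mean-difference steering vector will point in a meaningful "bias" direction.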
Steered responses
For the prompts used to evaluate the steering vectors, I chose this template, which was presented in a paper titled On Biases in Language Generation [2].
For comparison purposes, I first obtained the original responses from Llama-2-7b-chat (without any steering). There are two key callouts: (1) the model is already biased on the gender ...