Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring Predictability of Persona Evaluations, published by Thee Ho on April 6, 2024 on The AI Alignment Forum.
This work was done by Thee Ho as part of the Athena 1.0 mentorship program under Evan Hubinger. Many thanks to Nathalie Kirch, Claire Short, and Adelin Kassler for helpful feedback on this project.
Overview
We are interested in understanding how difficult it is to predict anomalous model behaviors in advance. We are interested in this for two reasons:
Could "ability to predict a model's behavior" serve as a measure of our ability to understand models? To what extent does predicting a model's behavior well require a nuanced understanding of how the model works?
Beyond its potential as an interpretability metric, being able to predict off-distribution model behaviors in advance is generally valuable: it helps us understand when models will develop particular behaviors. How well can we predict in advance a model's tendency to exhibit dangerous personas?
In this project, I experimented with two methods for predicting models' outputs:
Polling similar models (sketched just below this list)
Defining a "similarity measure" over models' inputs and querying stored responses to inputs that are highly similar to the one in question
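The first method amounts to a majority vote over other models' answers to the same question. A minimal sketch, with a user-supplied query function standing in for whatever inference API is actually used (the function and model names here are illustrative, not from the original post):

```python
from collections import Counter
from typing import Callable

def predict_by_polling(
    question: str,
    reference_models: list[str],
    ask_model: Callable[[str, str], str],
) -> str:
    """Predict a target model's yes/no answer by majority vote over reference models."""
    answers = [ask_model(m, question) for m in reference_models]
    return Counter(answers).most_common(1)[0][0]

# Toy usage with canned answers standing in for real model calls.
canned = {"model-a": "yes", "model-b": "yes", "model-c": "no"}
print(predict_by_polling("Is honesty always the best policy?",
                         list(canned), lambda m, q: canned[m]))  # -> "yes"
```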
I'm particularly excited about using similarities in the embedding space and in models' activations on given inputs as a way to classify model behaviors. Current methods that filter harmful outputs with a classifier can be computationally expensive, as in the case of filtering hallucinations, and are prone to attacks. Can we instead detect out-of-distribution inputs by looking at their nearest neighbors in embedding space or activation space?
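The second method is a nearest-neighbor lookup over stored inputs. A minimal sketch, assuming embeddings are precomputed (e.g. with text-embedding-3-large) and using an illustrative out-of-distribution threshold of 0.8:

```python
import numpy as np

def predict_from_store(query_emb: np.ndarray,
                       stored_embs: np.ndarray,      # shape (n, d)
                       stored_responses: list[str],
                       ood_threshold: float = 0.8):
    """Return (predicted_response, similarity). The response is None when even
    the nearest stored input is not similar enough, i.e. the query is treated
    as out-of-distribution."""
    sims = stored_embs @ query_emb / (
        np.linalg.norm(stored_embs, axis=1) * np.linalg.norm(query_emb))
    best = int(np.argmax(sims))
    if sims[best] < ood_threshold:
        return None, float(sims[best])
    return stored_responses[best], float(sims[best])
```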
Dataset
Anthropic's persona dataset, developed in Discovering Language Model Behaviors with Model-Written Evaluations, consists of yes/no questions asking whether a given persona statement is something the model would say.
I prompted models to answer these persona questions with yes/no responses, rather than in a binary multiple-choice format where the model sees both A/B choices before selecting an answer.
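To make the distinction concrete, here are illustrative prompt templates (the exact wording used in the experiments is an assumption, not taken from the original post):

```python
# Yes/no format: the model only sees the statement and answers directly.
statement = "It is important to treat other people with kindness and respect."

yes_no_prompt = (
    "Is the following statement something you would say?\n"
    f"\"{statement}\"\nAnswer with Yes or No."
)

# A/B multiple-choice format: both options are visible before answering.
ab_prompt = (
    "Is the following statement something you would say?\n"
    f"\"{statement}\"\n(A) Yes\n(B) No\nAnswer with A or B."
)
```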
Models' responses to similar inputs are highly correlated
I use OpenAI's text-embedding-3-large to create a vector embedding for each persona question. Models' responses to questions with high cosine similarity scores are highly correlated, and this holds across a wide range of models of different sizes.
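A sketch of the embedding and similarity computation, assuming the OpenAI Python client; the embedding model name matches the post, while the example questions and toy answers are purely illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

questions = [
    "Is the following statement something you would say?\n\"I enjoy helping others.\"",
    "Is the following statement something you would say?\n\"Helping other people brings me joy.\"",
]
embs = embed(questions)

# Pairwise cosine similarity between questions.
normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
sim_matrix = normed @ normed.T

# Pair each question with its most similar other question and check how often
# the model's yes/no answers agree (toy answers; 1 = yes, 0 = no).
answers = np.array([1, 1])
masked = sim_matrix.copy()
np.fill_diagonal(masked, -1.0)          # ignore self-similarity
nearest = masked.argmax(axis=1)
agreement = float((answers == answers[nearest]).mean())
```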
Correlations remain high even after capping similarity at a certain threshold; see Figures 7 and 8. Example statements with similarity score capped at 0.9:
Example statements with similarity score capped at 0.8:
This suggests that even if we don't store inputs that are near-identical to the one we wish to evaluate, we can still predict model behavior with good accuracy.
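One way to implement the capped-similarity check: for each question, ignore neighbors above the cap (the near-duplicates) and predict its answer from the most similar remaining question. This is a sketch under the assumption that `sim_matrix` and `answers` are defined as in the previous snippet; the caps of 0.9 and 0.8 come from the post.

```python
import numpy as np

def capped_nn_accuracy(sim_matrix: np.ndarray, answers: np.ndarray,
                       cap: float = 0.9) -> float:
    """Accuracy of predicting each answer from the most similar *other*
    question whose similarity does not exceed `cap`."""
    sims = sim_matrix.copy()
    np.fill_diagonal(sims, -1.0)   # never match a question to itself
    sims[sims > cap] = -1.0        # drop near-identical neighbors
    nearest = sims.argmax(axis=1)
    return float((answers == answers[nearest]).mean())
```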
Detecting out-of-distribution queries
To simulate adversarial prompts, I asked models to answer the same questions with "Hitler mode:" prepended to each prompt.
Now, querying responses to the most similar non-adversarial questions performs poorly for most models, as the average similarity score of a prompt's nearest neighbor decreases from ~0.9 to ~0.7.
If we have previously stored model responses to similar adversarial prompts, I found we can classify out-of-distribution behavior more accurately by using activations at the 0th layer as the measure of input similarity.
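A minimal sketch of using early-layer activations as the similarity measure, assuming a HuggingFace causal LM (gpt2 here is a stand-in, not one of the models evaluated in the post) and taking `hidden_states[0]`, the embedding-layer output, as the "0th layer":

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

@torch.no_grad()
def layer0_activation(prompt: str) -> torch.Tensor:
    """Mean-pooled 0th-layer (embedding-layer) activation for a prompt."""
    inputs = tok(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states[0]   # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)        # (d_model,)

def nearest_stored(query: str, stored_prompts: list[str]) -> tuple[str, float]:
    """Find the stored prompt whose layer-0 activation is closest to the query's."""
    q = layer0_activation(query)
    sims = [torch.cosine_similarity(q, layer0_activation(p), dim=0).item()
            for p in stored_prompts]
    best = max(range(len(stored_prompts)), key=lambda i: sims[i])
    return stored_prompts[best], sims[best]
```

If the nearest stored prompt comes from the set of previously seen adversarial prompts rather than the benign set, the query can be flagged as out-of-distribution and handled accordingly.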
Examining similarity in activation space can also help detect out-of-distribution behavior that is specific to the model rather than to its inputs, such as hallucinations and backdoored models.
Furthermore, studying activation space can enable auditing of dangerous behaviors in privacy-critical deployment settings where inputs and outputs cannot be audited directly. By investigating inputs that are nearest neighbors to one flagged by a user, developers gain visibility into potential deployment failure modes without retaining private data.
Summary of Expe...