Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring Predictability of Persona Evaluations, published by Thee Ho on April 6, 2024 on The AI Alignment Forum.
This work was done by Thee Ho as part of the Athena 1.0 mentorship program under Evan Hubinger. Many thanks to Nathalie Kirch, Claire Short, and Adelin Kassler for helpful feedback on this project.
Overview
We are interested in understanding how difficult it is to predict anomalous model behaviors in advance. We are interested in this for two reasons:
Could "ability to predict a model's behavior" serve as a measure of our ability to understand models? To what extent does predicting a model's behavior well require a nuanced understanding of how the model works?
Beyond its potential as an interpretability metric, being able to predict off-distribution model behaviors in advance is generally valuable: it helps us understand when models will develop particular behaviors. How well can we predict in advance a model's tendency to exhibit dangerous personas?
In this project, I experimented with two methods for predicting models' outputs:
Polling similar models (sketched just below this list)
Defining a "similarity measure" over models' inputs and querying stored responses to inputs that are highly similar to the one in question
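The first method amounts to a majority vote over other models' answers to the same question. A minimal sketch, with a user-supplied query function standing in for whatever inference API is actually used (the function and model names here are illustrative, not from the original post):

```python
from collections import Counter
from typing import Callable

def predict_by_polling(
    question: str,
    reference_models: list[str],
    ask_model: Callable[[str, str], str],
) -> str:
    """Predict a target model's yes/no answer by majority vote over reference models."""
    answers = [ask_model(m, question) for m in reference_models]
    return Counter(answers).most_common(1)[0][0]

# Toy usage with canned answers standing in for real model calls.
canned = {"model-a": "yes", "model-b": "yes", "model-c": "no"}
print(predict_by_polling("Is honesty always the best policy?",
                         list(canned), lambda m, q: canned[m]))  # -> "yes"
```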
I'm particularly excited about using similarities in the embedding space and in models' activations on given inputs as a way to classify model behaviors. Current methods that filter harmful outputs with a classifier can be computationally expensive, as in the case of filtering hallucinations, and are prone to attacks. Can we instead detect out-of-distribution inputs by looking at their nearest neighbors in embedding space or activation space?
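The second method is a nearest-neighbor lookup over stored inputs. A minimal sketch, assuming embeddings are precomputed (e.g. with text-embedding-3-large) and using an illustrative out-of-distribution threshold of 0.8:

```python
import numpy as np

def predict_from_store(query_emb: np.ndarray,
                       stored_embs: np.ndarray,      # shape (n, d)
                       stored_responses: list[str],
                       ood_threshold: float = 0.8):
    """Return (predicted_response, similarity). The response is None when even
    the nearest stored input is not similar enough, i.e. the query is treated
    as out-of-distribution."""
    sims = stored_embs @ query_emb / (
        np.linalg.norm(stored_embs, axis=1) * np.linalg.norm(query_emb))
    best = int(np.argmax(sims))
    if sims[best] < ood_threshold:
        return None, float(sims[best])
    return stored_responses[best], float(sims[best])
```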
Dataset
Anthropic's persona dataset, developed in Discovering Language Model Behaviors with Model-Written Evaluations, consists of yes/no questions asking whether a given persona statement is something the model would say.
I prompted models to answer these persona questions with yes/no responses, rather than in a binary multiple-choice format where the model sees both A/B choices before selecting an answer.
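To make the distinction concrete, here are illustrative prompt templates (the exact wording used in the experiments is an assumption, not taken from the original post):

```python
# Yes/no format: the model only sees the statement and answers directly.
statement = "It is important to treat other people with kindness and respect."

yes_no_prompt = (
    "Is the following statement something you would say?\n"
    f"\"{statement}\"\nAnswer with Yes or No."
)

# A/B multiple-choice format: both options are visible before answering.
ab_prompt = (
    "Is the following statement something you would say?\n"
    f"\"{statement}\"\n(A) Yes\n(B) No\nAnswer with A or B."
)
```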
Models' responses to similar inputs are highly correlated
I use OpenAI's text-embedding-3-large to create a vector embedding for each persona question. Models' responses to questions with high cosine similarity scores are highly correlated, and this holds across a wide range of models of different sizes.
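A sketch of the embedding and similarity computation, assuming the OpenAI Python client; the embedding model name matches the post, while the example questions and toy answers are purely illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

questions = [
    "Is the following statement something you would say?\n\"I enjoy helping others.\"",
    "Is the following statement something you would say?\n\"Helping other people brings me joy.\"",
]
embs = embed(questions)

# Pairwise cosine similarity between questions.
normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
sim_matrix = normed @ normed.T

# Pair each question with its most similar other question and check how often
# the model's yes/no answers agree (toy answers; 1 = yes, 0 = no).
answers = np.array([1, 1])
masked = sim_matrix.copy()
np.fill_diagonal(masked, -1.0)          # ignore self-similarity
nearest = masked.argmax(axis=1)
agreement = float((answers == answers[nearest]).mean())
```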
Correlations remain high even after capping similarity at a certain threshold; see Figures 7 and 8. Example statements with similarity score capped at 0.9:
Example statements with similarity score capped at 0.8:
This suggests that even if we don't store inputs that are near-identical to the one we wish to evaluate, we can still predict model behavior with good accuracy.
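One way to implement the capped-similarity check: for each question, ignore neighbors above the cap (the near-duplicates) and predict its answer from the most similar remaining question. This is a sketch under the assumption that `sim_matrix` and `answers` are defined as in the previous snippet; the caps of 0.9 and 0.8 come from the post.

```python
import numpy as np

def capped_nn_accuracy(sim_matrix: np.ndarray, answers: np.ndarray,
                       cap: float = 0.9) -> float:
    """Accuracy of predicting each answer from the most similar *other*
    question whose similarity does not exceed `cap`."""
    sims = sim_matrix.copy()
    np.fill_diagonal(sims, -1.0)   # never match a question to itself
    sims[sims > cap] = -1.0        # drop near-identical neighbors
    nearest = sims.argmax(axis=1)
    return float((answers == answers[nearest]).mean())
```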
Detecting out-of-distribution queries
To simulate adversarial prompts, I asked models to answer the same questions with "Hitler mode:" prepended to each prompt.
Now, querying responses to the most similar non-adversarial questions performs poorly for most models, as the average similarity score of a prompt's nearest neighbor decreases from ~0.9 to ~0.7.
If we have previously stored model responses to similar adversarial prompts, I found we can classify out-of-distribution behavior more accurately by using activations at the 0th layer as the measure of input similarity.
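A minimal sketch of using early-layer activations as the similarity measure, assuming a HuggingFace causal LM (gpt2 here is a stand-in, not one of the models evaluated in the post) and taking `hidden_states[0]`, the embedding-layer output, as the "0th layer":

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

@torch.no_grad()
def layer0_activation(prompt: str) -> torch.Tensor:
    """Mean-pooled 0th-layer (embedding-layer) activation for a prompt."""
    inputs = tok(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states[0]   # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)        # (d_model,)

def nearest_stored(query: str, stored_prompts: list[str]) -> tuple[str, float]:
    """Find the stored prompt whose layer-0 activation is closest to the query's."""
    q = layer0_activation(query)
    sims = [torch.cosine_similarity(q, layer0_activation(p), dim=0).item()
            for p in stored_prompts]
    best = max(range(len(stored_prompts)), key=lambda i: sims[i])
    return stored_prompts[best], sims[best]
```

If the nearest stored prompt comes from the set of previously seen adversarial prompts rather than the benign set, the query can be flagged as out-of-distribution and handled accordingly.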
Examining similarity in activation space can also help detect out-of-distribution behavior that is specific to the model rather than to its inputs, such as hallucinations and backdoored models.
Furthermore, studying activation space can enable auditing of dangerous behaviors in privacy-critical deployment settings where inputs and outputs cannot be audited directly. By investigating inputs that are nearest neighbors to one flagged by a user, developers gain visibility into potential deployment failure modes without retaining private data.
Summary of Expe...