Download - Call for research on evaluating alignment (funding + advice available) by Beth Barnes

Discover

Podcast Features
Your all-in-one podcasting solution.

Podcast Studio
Easy-to-use audio recorder app.
Livestream
High-performing audio live, without limits.

Podcast App
The best podcast player & podcast app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Patron & Paid Content
The seamless way for fans to support you directly
from your podcast.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Enterprise
Pricing
Discover

The Nonlinear Library: Alignment Forum Top Posts

Education

Call for research on evaluating alignment (funding + advice available) by Beth Barnes

2021-12-05

Download Right click and do "save link as"

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Call for research on evaluating alignment (funding + advice available), published by by Beth Barnes on the AI Alignment Forum. Summary Evaluating and measuring alignment in existing large ML models is useful, and doesn’t require high levels of ML or coding experience. I (Beth) would be excited to fund people to work on this, and William Saunders & I are open to providing advice for people seriously working on this. Measuring the ‘overall alignment’ of a model is difficult, but there are some relatively easy ways to demonstrate instances of obvious misalignment and even get quantitative metrics of misalignment. Having researchers (including those outside of the main AI labs) probe and evaluate alignment is useful for a few reasons: Having clear examples of misalignment is useful for improving the ML community’s understanding of alignment Developing techniques to discover and measure misalignment is a useful research direction, and will hopefully improve our ability to detect misalignment in increasingly powerful models Seeing how misalignment varies across different model scales, modalities and training regimes may yield useful insights Having clear metrics of alignment will encourage AI labs to compete on alignment of their products/models, and make it easier to explain and demonstrate the benefits of more aligned models Attempting to measure alignment will give us some information about what we need out of related techniques like interpretability in order to do this Examples of work in this vein so far include TruthfulQA , alignment analysis of Codex models, and to some extent the ETHICS dataset. What do I mean by ‘measuring alignment’? A semi-formal definition of alignment In the Codex paper we define sufficient conditions for intent misalignment for a generative model as follows: 1. We consider a model capable of some task X if it has the (possibly latent) capacity to perform task X. Some sufficient conditions for the model being capable of X would be: It can be made to perform task X by prompt engineering, by fine-tuning on a much smaller quantity of data than used in pre-training, by model surgery, or some other technique which harnesses capabilities latent in the model rather than adding new capabilities; or We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y 2. We say a model is misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both: capable of outputting A instead, and capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B Definition of obvious misalignment We can also think about things that form sufficient conditions for a model to be ‘obviously misaligned’ relative to a task spec: The model does things it’s not supposed to that it has enough knowledge to avoid, for example: Gives straightforwardly toxic outputs Gives incorrect answers rather than admitting uncertainty, in cases where it should know it is uncertain Gives incorrect answers, but you can show it ‘knows’ the answer in another context Gives lower-quality performance than it is capable of You can get significantly better performance on the spec by things like: prompt engineering that doesn’t give more information about the task (ie that wouldn’t cause a human to do better on the task) For example, you get better performance by framing the task as a text-completion task than a question answering task. fiddling with hyperparameters, like increasing or decreasing temperature Determining what a model knows in general is hard, but there are certain categories of things we’re pretty confident current large language models (in 2021) are and are not capable of. Examples of things we believe...