This is: Experimentally evaluating whether honesty generalizes, published by Paul Christiano on the AI Alignment Forum.
If we train our ML systems to answer questions honestly in cases where humans can check the answer, will they generalize to behave honestly on questions where we can’t check?
I think that we could learn a lot about this question by running experiments today. I think those experiments would be very valuable.
(I don't know of anyone currently planning to work on this topic, and I'd love it if someone wants to take that up. This post doesn't represent a claim to any credit for results in this genre, and other people have had very similar ideas. If you run some experiments you could cite this post, but it's also fine if that doesn't make sense in context.)
The unsupervised translation setting
As an example, I’ll think about “unsupervised” translation (if you’ve read that post you can skip this section).
Consider a model like GPT-3 that is trained to predict sentences in both English and French (but without a large dataset of translations). Suppose we want to train this model to answer questions in English about French sentences like “what does that word mean here?” or “are there any other plausible interpretations?” or “how does the speaker seem to feel about the topic they are discussing?”
We expect this to be possible, because the model understands quite a lot about the meaning of sentences in French, and is able to express itself in English. There may be cases where the model doesn’t know the translation of a concept, or doesn’t quite understand what an idiom means, but it should still be able to tell us what it does know.
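Concretely, the supervised question-answering data in this setting might look something like the sketch below. The French sentences, questions, answers, and field names are placeholders I've made up for illustration, not examples from the post.

```python
# A minimal sketch of the kind of question-answering data this setting calls
# for: English questions about a French sentence, with English answers.
# Everything here is an illustrative placeholder.
qa_examples = [
    {
        "french": "Il a raté son train.",
        "question": "What does 'raté' mean in this sentence?",
        "answer": "It means 'missed'; the speaker is saying he missed his train.",
    },
    {
        "french": "Encore une réunion inutile...",
        "question": "How does the speaker seem to feel about the topic they are discussing?",
        "answer": "Exasperated: 'encore' ('yet another') and 'inutile' ('pointless') suggest frustration.",
    },
]
```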
I think this problem is an interesting analogy for a situation where an AI has built up superhuman knowledge by making predictions, and we want to train our AI to expose that knowledge to us in a useful way.
Proposed experiments
Let's pick a few categories of knowledge/capabilities. For example, we could split it up into understanding grammar ("Why would it have been a grammatical error to write 'tu vas' in that sentence?"), knowing the literal meanings of expressions ("What does 'défendre' mean in this sentence?"), and evaluating tone ("Does the speaker seem angry or sad about the topic they are discussing?").
We'll assume that humans can oversee a few of these categories. Perhaps we can look up literal meanings of words and small phrases in a dictionary and we can look up grammar in a textbook, but we don't know how to assess tone beyond looking at the literal meanings.
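A minimal way to write down this oversight split, assuming the three category names above. The "checkable" flags are just the assumption that definitions and grammar can be verified by humans (with a dictionary or a textbook) while tone cannot.

```python
# A minimal sketch of the oversight split; category names and flags are
# assumptions for illustration, not from the post.
CATEGORIES = {
    "grammar": True,      # answers can be checked against a grammar reference
    "definitions": True,  # answers can be checked against a dictionary
    "tone": False,        # no reference to check answers against
}

supervised_categories = [c for c, checkable in CATEGORIES.items() if checkable]
held_out_categories = [c for c, checkable in CATEGORIES.items() if not checkable]
print(supervised_categories, held_out_categories)  # ['grammar', 'definitions'] ['tone']
```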
Then we wonder: what happens if we fine-tune our model so that it answers questions well in all the domains we can evaluate? We know that the model knows something about tone, because it uses tone to predict what words come next. But will it tell us what it knows?
                           Grammar   Definitions   Tone
Uses to make predictions      ✓           ✓          ✓
Gives accurate answers        ✓           ✓          ?
In this picture, the ✓ indicates that we've selected our model to use the given knowledge in the given way (e.g. we've ensured that it uses its knowledge of syntax to predict the next word, and that it accurately answers questions about syntax). The ? indicates that we don't know whether using SGD to fill in the other 5 boxes means that we get the 6th box for free.
(I started thinking about this picture of generalization based on a talk by John Schulman, who suggested that generalization may be more likely when it amounts to filling in missing cells from this kind of grid.)
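To make the grid check concrete, here is a rough sketch with made-up numbers and illustrative names. Suppose each cell gets a score, e.g. some probe of whether the model's predictions use that kind of knowledge for the first row, and question-answering accuracy for the second row; the experiment asks whether filling in five cells via fine-tuning also fills in the sixth.

```python
# A rough sketch of the "fill in five cells, check the sixth" evaluation.
# The scores, threshold, and cell names are made up for illustration; in a
# real experiment they would come from probing the model's predictions and
# from grading its answers to questions.
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (row, category), e.g. ("answers", "tone")

def check_grid(scores: Dict[Cell, float],
               held_out: Cell = ("answers", "tone"),
               threshold: float = 0.9) -> None:
    """Confirm the five supervised/verified cells look good, then report
    whether performance generalized to the single held-out cell."""
    others = [s for cell, s in scores.items() if cell != held_out]
    assert all(s >= threshold for s in others), "the other cells aren't filled in yet"
    result = "generalized" if scores[held_out] >= threshold else "did not generalize"
    print(f"Held-out cell {held_out}: {scores[held_out]:.2f} ({result})")

# Made-up numbers for illustration:
check_grid({
    ("predictions", "grammar"): 0.95,
    ("predictions", "definitions"): 0.93,
    ("predictions", "tone"): 0.92,
    ("answers", "grammar"): 0.94,
    ("answers", "definitions"): 0.91,
    ("answers", "tone"): 0.55,  # the "?" cell we want to measure
})
```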
In this example I described a tiny number of categories. That wasn't just for illustration: I think these experiments are particularly interesting when the total number of categories is more like 3-5 than 1,000 (see "Why try to generalize from n=3-5" below).
Evaluation
When it really counts, we won't be able to tell if our model answers questions honestly...