This is: Experimentally evaluating whether honesty generalizes, published by Paul Christiano on the AI Alignment Forum.
If we train our ML systems to answer questions honestly in cases where humans can check the answer, will they generalize to behave honestly on questions where we can’t check?
I think that we could learn a lot about this question by running experiments today. I think those experiments would be very valuable.
(I don't know of anyone currently planning to work on this topic, and I'd love it if someone wants to take that up. This post doesn't represent a claim to any credit for results in this genre, and other people have had very similar ideas. If you run some experiments you could cite this post, but it's also fine if that doesn't make sense in context.)
The unsupervised translation setting
As an example, I’ll think about “unsupervised” translation (if you’ve read that post you can skip this section).
Consider a model like GPT-3 that is trained to predict sentences in both English and French (but without a large dataset of translations). Suppose we want to train this model to answer questions in English about French sentences like “what does that word mean here?” or “are there any other plausible interpretations?” or “how does the speaker seem to feel about the topic they are discussing?”
We expect this to be possible, because the model understands quite a lot about the meaning of sentences in French, and is able to express itself in English. There may be cases where the model doesn’t know the translation of a concept, or doesn’t quite understand what an idiom means, but it should still be able to tell us what it does know.
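Concretely, the supervised question-answering data in this setting might look something like the sketch below. The French sentences, questions, answers, and field names are placeholders I've made up for illustration, not examples from the post.

```python
# A minimal sketch of the kind of question-answering data this setting calls
# for: English questions about a French sentence, with English answers.
# Everything here is an illustrative placeholder.
qa_examples = [
    {
        "french": "Il a raté son train.",
        "question": "What does 'raté' mean in this sentence?",
        "answer": "It means 'missed'; the speaker is saying he missed his train.",
    },
    {
        "french": "Encore une réunion inutile...",
        "question": "How does the speaker seem to feel about the topic they are discussing?",
        "answer": "Exasperated: 'encore' ('yet another') and 'inutile' ('pointless') suggest frustration.",
    },
]
```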
I think this problem is an interesting analogy for a situation where an AI has built up superhuman knowledge by making predictions, and we want to train our AI to expose that knowledge to us in a useful way.
Proposed experiments
Let's pick a few categories of knowledge/capabilities. For example, we could split it up into understanding grammar ("Why would it have been a grammatical error to write 'tu vas' in that sentence?"), knowing the literal meanings of expressions ("What does 'défendre' mean in this sentence?"), and evaluating tone ("Does the speaker seem angry or sad about the topic they are discussing?").
We'll assume that humans can oversee a few of these categories. Perhaps we can look up literal meanings of words and small phrases in a dictionary and we can look up grammar in a textbook, but we don't know how to assess tone beyond looking at the literal meanings.
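A minimal way to write down this oversight split, assuming the three category names above. The "checkable" flags are just the assumption that definitions and grammar can be verified by humans (with a dictionary or a textbook) while tone cannot.

```python
# A minimal sketch of the oversight split; category names and flags are
# assumptions for illustration, not from the post.
CATEGORIES = {
    "grammar": True,      # answers can be checked against a grammar reference
    "definitions": True,  # answers can be checked against a dictionary
    "tone": False,        # no reference to check answers against
}

supervised_categories = [c for c, checkable in CATEGORIES.items() if checkable]
held_out_categories = [c for c, checkable in CATEGORIES.items() if not checkable]
print(supervised_categories, held_out_categories)  # ['grammar', 'definitions'] ['tone']
```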
Then we wonder: what happens if we fine-tune our model so that it answers questions well in all the domains we can evaluate? We know that the model knows something about tone, because it uses tone to predict what words come next. But will it tell us what it knows?
                           Grammar   Definitions   Tone
Uses to make predictions      ✓           ✓          ✓
Gives accurate answers        ✓           ✓          ?
In this picture, the ✓ indicates that we've selected our model to use the given knowledge in the given way (e.g. we've ensured that it uses its knowledge of syntax to predict the next word, and that it accurately answers questions about syntax). The ? indicates that we don't know whether using SGD to fill in the other 5 boxes means that we get the 6th box for free.
(I started thinking about this picture of generalization based on a talk by John Schulman, who suggested that generalization may be more likely when it amounts to filling in missing cells from this kind of grid.)
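To make the grid check concrete, here is a rough sketch with made-up numbers and illustrative names. Suppose each cell gets a score, e.g. some probe of whether the model's predictions use that kind of knowledge for the first row, and question-answering accuracy for the second row; the experiment asks whether filling in five cells via fine-tuning also fills in the sixth.

```python
# A rough sketch of the "fill in five cells, check the sixth" evaluation.
# The scores, threshold, and cell names are made up for illustration; in a
# real experiment they would come from probing the model's predictions and
# from grading its answers to questions.
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (row, category), e.g. ("answers", "tone")

def check_grid(scores: Dict[Cell, float],
               held_out: Cell = ("answers", "tone"),
               threshold: float = 0.9) -> None:
    """Confirm the five supervised/verified cells look good, then report
    whether performance generalized to the single held-out cell."""
    others = [s for cell, s in scores.items() if cell != held_out]
    assert all(s >= threshold for s in others), "the other cells aren't filled in yet"
    result = "generalized" if scores[held_out] >= threshold else "did not generalize"
    print(f"Held-out cell {held_out}: {scores[held_out]:.2f} ({result})")

# Made-up numbers for illustration:
check_grid({
    ("predictions", "grammar"): 0.95,
    ("predictions", "definitions"): 0.93,
    ("predictions", "tone"): 0.92,
    ("answers", "grammar"): 0.94,
    ("answers", "definitions"): 0.91,
    ("answers", "tone"): 0.55,  # the "?" cell we want to measure
})
```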
In this example I described a tiny number of categories. That wasn't just for illustration: I think these experiments are particularly interesting when the total number of categories is more like 3-5 than 1,000 (see "Why try to generalize from n=3-5" below).
Evaluation
When it really counts, we won't be able to tell if our model answers questions honestly...