Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Imitative Generalisation (AKA 'Learning the Prior'), published by Beth Barnes on the AI Alignment Forum.
Tl;dr
We want to be able to supervise models with superhuman knowledge of the world and how to manipulate it. For this, we need overseers who can learn or access all the knowledge our models have, so that they can understand the consequences of the model’s suggestions or decisions. If the overseers don’t have access to all the same knowledge as the model, it may be easy for the model to deceive us, suggesting plans that look good to us but that may have serious negative consequences.
We might hope to access what the model knows just by training it to answer questions. However, we can only train on questions that humans are able to answer[1]. This gives us a problem that’s somewhat similar to the standard formulation of transduction: we have some labelled training set (questions humans can answer), and we want to transfer to an unlabelled dataset (questions we care about), which may be differently distributed.
We might hope that our models will naturally generalize correctly from easy-to-answer questions to the ones that we care about. However, a natural pathological generalisation is for our models to only give us ‘human-like’ answers to questions, even if they know the best answer is different. If we only have access to these human-like answers to questions, that probably doesn’t give us enough information to supervise a superhuman model.
What we’re going to call ‘Imitative Generalization’ is a possible way to narrow the gap between the things our model knows, and the questions we can train our model to answer honestly. It avoids the pathological generalisation by only using ML for IID tasks, and imitating the way humans generalize. This hopefully gives us answers that are more like ‘how a human would answer if they’d learnt from all the data the model has learnt from’. We supervise how the model does the transfer, to get the sort of generalisation we want.
It’s worth noting there are enough serious open questions that imitative generalization is more of a research proposal than an algorithm!
This post is based on work done with Paul Christiano at OpenAI. Thanks very much to Evan Hubinger, Richard Ngo, William Saunders, Long Ouyang and others for helpful feedback, as well as Alice Fares for formatting help.
Goals of this post
This post tries to explain a simplified[2] version of Paul Christiano’s mechanism introduced here (referred to there as ‘Learning the Prior’), and to explain why a mechanism like this potentially addresses some of the safety problems with naïve approaches. First we’ll go through a simple example in a familiar domain, then explain the problems with the example. Then I’ll discuss the open questions for making Imitative Generalization actually work, and the connection with the Microscope AI idea. A more detailed explanation of exactly what the training objective is (with diagrams), and the correspondence with Bayesian inference, are in the appendix.
Example: using IG to avoid overfitting in image classification.
Here’s an example of using Imitative Generalization to get better performance on a standard ML task: image classification of dog breeds, with distributional shift.
Imagine we want to robustly learn to classify dog breeds, but the human labellers we have access to don’t actually know how to identify all the breeds[3], and we don’t have any identification guides or anything. However, we do have access to a labelled dataset D. We want to classify dogs in a different dataset D′, which is unlabelled.
One unfamiliar breed we want to learn to recognise is a husky. It happens that all the huskies in D are on snow, but in D′ some of them are on grass.
[Figure: an image from D, labelled ‘Husky’, and an OOD image from D′, labelled ‘???’.]
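To make the husky example concrete, here is a minimal toy sketch of how a learner that only optimises accuracy on D can latch onto the snowy background. Everything in it is an assumption made for illustration: the two binary features (`snow`, `face`), the dataset sizes, and the helper names `make`, `accuracy` and `fit_best_single_feature` are hypothetical and do not come from the post. It is not Imitative Generalization itself, only a picture of the failure mode it is meant to avoid.

```python
# Toy sketch of the husky-on-snow shift. All features, counts and helper
# names are hypothetical and chosen only for illustration.

def make(label, n, n_snow, n_face):
    """Build n examples with the given label.

    The first n_snow examples have a snowy background; the last n_face
    examples have a husky-like face.
    """
    return [
        {"snow": i < n_snow, "face": i >= n - n_face, "husky": label}
        for i in range(n)
    ]

# Labelled dataset D: every husky is on snow, no other dog is.
# The face cue is informative but slightly noisy.
D = make(True, 10, 10, 9) + make(False, 10, 0, 1)

# Shifted dataset D': half of the huskies are now on grass.
# (Labels are kept here only so that we can score the rules.)
D_prime = make(True, 10, 5, 9) + make(False, 10, 0, 1)

def accuracy(feature, data):
    """Accuracy of the rule 'predict husky exactly when feature is true'."""
    return sum(ex[feature] == ex["husky"] for ex in data) / len(data)

def fit_best_single_feature(data):
    """A naive learner: pick whichever single feature fits the data best."""
    return max(["snow", "face"], key=lambda f: accuracy(f, data))

chosen = fit_best_single_feature(D)
print("rule chosen on D:", chosen)
for name, data in [("D", D), ("D'", D_prime)]:
    print(name, {f: accuracy(f, data) for f in ["snow", "face"]})
```

In this toy, the snow rule scores 1.0 on D, so the naive learner selects it, but it drops to 0.75 on D′; the face rule scores 0.9 on both datasets. A human who understood the two candidate rules would presumably prefer the face rule, even though it fits D slightly worse.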
A NN architecture prior lik...