Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Ought: why it matters and ways to help, published by Paul Christiano on the AI Alignment Forum.
I think that Ought is one of the most promising projects working on AI alignment. There are several ways that LW readers can potentially help:
They are recruiting a senior full-stack web developer.
They are recruiting participants for "factored evaluation" experiments.
I think donors interested in AI safety should consider Ought.
In this post I'll describe what Ought is currently doing, why I think it's promising, and give some detail on these asks.
(I am an Ought donor and board member.)
Factored evaluation
Ought's main project is currently designing and running "factored evaluation" experiments, and building relevant infrastructure. The goal of these experiments is to answer the following question:
Consider a principal who wants to incentivize experts to work on hard problems, but finds the work too complex to tell good from bad. Can they overcome this problem by dividing the evaluation task into smaller pieces, delegating each piece to other experts, and recursively incentivizing good work on those pieces using the same mechanism?
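To make the recursive structure concrete, here is a minimal Python sketch of the mechanism. Everything in it is hypothetical illustration, not Ought's implementation: the human judge and the judge's decomposition step are modeled as callables, and the depth parameter stands in for whatever budget bounds the recursion.

from typing import Callable

def factored_evaluation(
    question: str,
    ask_subquestions: Callable[[str], list[str]],  # judge proposes subquestions
    judge: Callable[[str, dict[str, str]], str],   # judge answers given trusted subanswers
    depth: int,
) -> str:
    """Return the judge's answer to `question`, trusting only recursive calls."""
    if depth == 0:
        # Base case: the judge answers unaided, so the question must be easy enough.
        return judge(question, {})
    # Otherwise, split the evaluation into subquestions and settle each one
    # by the same mechanism, one level down.
    subanswers = {
        sub: factored_evaluation(sub, ask_subquestions, judge, depth - 1)
        for sub in ask_subquestions(question)
    }
    return judge(question, subanswers)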
Here's what an experiment looks like (a code sketch of the full loop follows these steps):
Recruit a pool of "judges," "honest experts," and "malicious experts."
Start with a question that is too difficult for the judge to answer.
In initial experiments, the questions are difficult because the judge lacks relevant context. For example, the judge may want to answer a question about a long article, while only being allowed to look at a few hundred characters of text.
Limiting the amount of text the judge can read is an artificial restriction. This is intended as a warm-up, an easier analogy for the situation where the experts have knowledge that cannot be explained to a single judge in a reasonable amount of time (or where we want to reduce the cost of judging).
An "honest expert" provides a good answer to that question, a "malicious expert" provides a bad answer to that question, and a judge is tasked with telling which answer was better. The answers themselves may be too large and complex for the judge to look at, in which case they'll be handed a compact representation of each answer which they can inspect as needed.
In order to tell which answer is better, the judge asks subquestions. For example, they might ask "What's the best reason to think that answer A is better than answer B?" or "What does the author of article X believe about question Y?"
We answer these subquestions by delegating to the experts. The honest expert answers each subquestion as well as they can. The malicious expert can propose a bad subanswer, and we run the experiment recursively to see if the judge would pick the good or bad subanswer to the subquestion. If the malicious expert can propose subanswers that corrupt the top-level answer, and that are selected by the judge when we run this experiment recursively, then the malicious expert wins and the experiment fails.
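Here is a minimal sketch of one such trial, assuming toy stand-ins for the human participants: ask_honest, ask_malicious, judge_subquestions, judge_picks, and the depth budget are all hypothetical names for illustration, not Ought's actual interface.

import random

def ask_honest(question: str) -> str:
    """Stub: an honest expert's best answer to `question`."""
    return f"honest answer to: {question}"

def ask_malicious(question: str) -> str:
    """Stub: a malicious expert's deliberately bad answer."""
    return f"misleading answer to: {question}"

def judge_subquestions(question: str, a: str, b: str) -> list[str]:
    """Stub: subquestions the judge asks in order to compare answers a and b."""
    return [f"best reason to prefer one answer to: {question}"]

def judge_picks(question: str, a: str, b: str, context: dict[str, str]) -> str:
    """Stub: the judge's choice between a and b, given trusted subanswers."""
    return a if random.random() < 0.9 else b  # toy judge, right 90% of the time

def run_trial(question: str, depth: int) -> bool:
    """Return True if the judge selects the honest answer, i.e. the trial succeeds."""
    honest, malicious = ask_honest(question), ask_malicious(question)
    context: dict[str, str] = {}
    if depth > 0:
        for sub in judge_subquestions(question, honest, malicious):
            # Each subquestion is settled by a recursive trial: whichever
            # subanswer the judge would pick becomes trusted context here.
            sub_honest, sub_malicious = ask_honest(sub), ask_malicious(sub)
            picked = sub_honest if run_trial(sub, depth - 1) else sub_malicious
            context[sub] = picked
    return judge_picks(question, honest, malicious, context) == honest

if __name__ == "__main__":
    print("judge picked honest answer:", run_trial("What does the article claim?", depth=2))

Note how the failure condition from the description falls out of the recursion: if a malicious subanswer wins a recursive trial, it enters the judge's trusted context and can corrupt the top-level comparison.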
This is not Ought's only project, but it's currently the largest single focus. Other projects include: exploring how well we can automate the judge's role on simple questions using existing ML, and thinking about possible decomposition strategies and challenges for factored evaluation.
Why this is important for AI alignment
ML systems are trained by gradient descent to optimize a measurable objective. In the best case (i.e. ignoring misaligned learned optimization) they behave like an expert incentivized to optimize that objective. Designing an objective that incentivizes experts to reveal what they know seems like a critical step in AI alignment. I think human experts are often a useful analogy for powerful ML systems, and that we should be using that analogy as much as we can.
Not coincidentally, factored evaluation is a...