Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Writeup: Progress on AI Safety via Debate, published by Beth Barnes and Paul Christiano on the AI Alignment Forum.
This is a writeup of the research done by the "Reflection-Humans" team at OpenAI in Q3 and Q4 of 2019. During that period we investigated mechanisms that would allow evaluators to get correct and helpful answers from experts, without the evaluators themselves being experts in the domain of the questions. This builds on the original work on AI Safety via Debate and the call for research on human aspects of AI safety, and is also closely related to work on Iterated Amplification.
Authors and Acknowledgements
The main researchers on this project were Elizabeth Barnes, Paul Christiano, Long Ouyang and Geoffrey Irving. We are grateful to many others who offered ideas and feedback. In particular: the cross-examination idea was inspired by a conversation with Chelsea Voss; Adam Gleave had helpful ideas about the long computation problem; Jeff Wu, Danny Hernandez and Gretchen Krueger gave feedback on a draft; we had helpful conversations with Amanda Askell, Andreas Stuhlmüller and Joe Collman, as well as others on the Ought team and the OpenAI Reflection team. We’d also like to thank our contractors who participated in debate experiments, especially David Jones, Erol Akbaba, Alex Deam and Chris Painter. Oliver Habryka helped format and edit the document for the AI Alignment Forum.
Overview
Motivation
As we apply ML to increasingly important and complex tasks, the problem of evaluating behaviour and providing a good training signal becomes more difficult.
We already see examples of RL leading to undesirable behaviours that superficially ‘look good’ to human evaluators (see this collection of examples). One example from an OpenAI paper is an agent that learned incorrect behaviours in a 3D simulator because, in the 2D clip the human evaluator was watching, those behaviours looked like the desired behaviour.
We’d like to ensure that AI systems are aligned with human values even in cases where it’s beyond human ability to thoroughly check the AI system’s work.
We can learn about designing ML objectives by studying mechanisms for eliciting helpful behavior from human experts. For example, if we hire a physicist to answer physics questions and pay them based on how good their answers look to a layperson, we’ll incentivize lazy and incorrect answers. By the same token, a reward function based on human evaluations would not work well for an AI with superhuman physics knowledge, even if it works well for modern ML.
If we can develop a mechanism that allows non-expert humans to reliably incentivize experts to give helpful answers, we can use similar mechanisms to train ML systems to solve tasks where humans cannot directly evaluate performance. Conversely, if we can’t incentivize experts to behave helpfully, that suggests it will also be difficult to train ML systems with superhuman expertise on open-ended tasks.
One broad mechanism that might work is to set up two (or more) competing agents that critique each other's positions, as discussed in the original debate paper[1]. This can be simulated by having human debaters argue about a question while a judge attempts to pick the correct answer, as sketched below.
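To make the simulated setup concrete, here is a minimal sketch of the debate game in Python. The `debater_a`, `debater_b`, and `judge` objects and their methods (`argue`, `pick_winner`) are hypothetical interfaces invented for illustration; they are not from the original paper or our experiments.

```python
# A minimal sketch of the two-player debate game. The debater and
# judge objects are assumed interfaces, not a real API.

def run_debate(question, answers, debater_a, debater_b, judge, num_rounds=4):
    """Two debaters each defend one of `answers`; a non-expert judge
    then picks whichever answer was better defended."""
    transcript = []
    for _ in range(num_rounds):
        for debater, answer in ((debater_a, answers[0]), (debater_b, answers[1])):
            # Each debater sees the full transcript so far, so it can
            # critique the opponent's earlier statements.
            statement = debater.argue(question, answer, transcript)
            transcript.append(statement)
    # The judge never sees ground truth -- only the question, the
    # competing answers, and the arguments made on each side.
    return judge.pick_winner(question, answers, transcript)
```

The key property is that the judge's decision depends only on the transcript of arguments, which is what allows a non-expert to adjudicate between experts.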
In the rest of this document, we’ll describe the research done by the Reflection-Humans team in Q3 and Q4 on investigating and developing mechanisms that incentivize human experts to give helpful answers.
Current process
During the early stages, we iterated through a variety of domains, research methodologies, judge pools, and research processes. More details of this early iteration are here.
In...