Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Writeup: Progress on AI Safety via Debate, published by Beth Barnes and Paul Christiano on the AI Alignment Forum.
This is a writeup of the research done by the "Reflection-Humans" team at OpenAI in Q3 and Q4 of 2019. During that period we investigated mechanisms that would allow evaluators to get correct and helpful answers from experts, without the evaluators themselves being experts in the domain of the questions. This builds on the original work on AI Safety via Debate and the call for research on human aspects of AI safety, and is also closely related to work on Iterated Amplification.
Authors and Acknowledgements
The main researchers on this project were Elizabeth Barnes, Paul Christiano, Long Ouyang and Geoffrey Irving. We are grateful to many others who offered ideas and feedback. In particular: the cross-examination idea was inspired by a conversation with Chelsea Voss; Adam Gleave had helpful ideas about the long computation problem; Jeff Wu, Danny Hernandez and Gretchen Krueger gave feedback on a draft; we had helpful conversations with Amanda Askell, Andreas Stuhlmüller and Joe Collman, as well as others on the Ought team and the OpenAI Reflection team. We’d also like to thank our contractors who participated in debate experiments, especially David Jones, Erol Akbaba, Alex Deam and Chris Painter. Oliver Habryka helped format and edit the document for the AI Alignment Forum.
Overview
Motivation
As we apply ML to increasingly important and complex tasks, the problem of evaluating behaviour and providing a good training signal becomes more difficult.
We already see examples of RL leading to undesirable behaviours that superficially ‘look good’ to human evaluators (see this collection of examples). One example from an OpenAI paper is an agent that learned incorrect behaviours in a 3D simulator because, in the 2D clip the human evaluator was watching, those behaviours looked like the desired behaviour.
We’d like to ensure that AI systems are aligned with human values even in cases where it’s beyond human ability to thoroughly check the AI system’s work.
We can learn about designing ML objectives by studying mechanisms for eliciting helpful behavior from human experts. For example, if we hire a physicist to answer physics questions and pay them based on how good their answers look to a layperson, we’ll incentivize lazy and incorrect answers. By the same token, a reward function based on human evaluations would not work well for an AI with superhuman physics knowledge, even if it works well for modern ML.
If we can develop a mechanism that allows non-expert humans to reliably incentivize experts to give helpful answers, we can use similar mechanisms to train ML systems to solve tasks where humans cannot directly evaluate performance. Conversely, if we can’t incentivize experts to behave helpfully, that suggests it will also be difficult to train ML systems with superhuman expertise on open-ended tasks.
One broad mechanism that might work is to set up two (or more) competing agents that critique each other's positions, as discussed in the original debate paper[1]. This can be simulated by having human debaters argue about a question while a judge attempts to pick the correct answer, as sketched below.
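To make the simulated setup concrete, here is a minimal sketch of the debate game in Python. The `debater_a`, `debater_b`, and `judge` objects and their methods (`argue`, `pick_winner`) are hypothetical interfaces invented for illustration; they are not from the original paper or our experiments.

```python
# A minimal sketch of the two-player debate game. The debater and
# judge objects are assumed interfaces, not a real API.

def run_debate(question, answers, debater_a, debater_b, judge, num_rounds=4):
    """Two debaters each defend one of `answers`; a non-expert judge
    then picks whichever answer was better defended."""
    transcript = []
    for _ in range(num_rounds):
        for debater, answer in ((debater_a, answers[0]), (debater_b, answers[1])):
            # Each debater sees the full transcript so far, so it can
            # critique the opponent's earlier statements.
            statement = debater.argue(question, answer, transcript)
            transcript.append(statement)
    # The judge never sees ground truth -- only the question, the
    # competing answers, and the arguments made on each side.
    return judge.pick_winner(question, answers, transcript)
```

The key property is that the judge's decision depends only on the transcript of arguments, which is what allows a non-expert to adjudicate between experts.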
In the rest of this document, we’ll describe the research done by the Reflection-Humans team in Q3 and Q4 on investigating and developing mechanisms that incentivize human experts to give helpful answers.
Current process
During the early stages, we iterated through a variety of domains, research methodologies, judge pools, and research processes. More details of this early iteration are here.
In...