Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Debate update: Obfuscated arguments problem, published by Beth Barnes on the AI Alignment Forum.
This is an update on the work on AI Safety via Debate that we previously wrote about here.
Authors and Acknowledgements
The researchers on this project were Elizabeth Barnes and Paul Christiano, with substantial help from William Saunders (who built the current web interface and helped in other ways), Joe Collman (who helped develop the structured debate mechanisms), and Mark Xu, Chris Painter, Mihnea Maftei and Ronny Fernandez (who took part in many debates and helped think through problems). We're also grateful to Geoffrey Irving and Evan Hubinger for feedback on drafts, and for helpful conversations, along with Richard Ngo, Daniel Ziegler, John Schulman, Amanda Askell and Jeff Wu. Finally, we're grateful to our contractors who participated in experiments, including Adam Scherlis, Kevin Liu, Rohan Kapoor and Kunal Sharda.
What we did
We tested the debate protocol introduced in AI Safety via Debate with human judges and debaters. We found various problems and improved the mechanism to fix these issues (details of these are in the appendix). However, we discovered that a dishonest debater can often create arguments that have a fatal error, but where it is very hard to locate the error. We don’t have a fix for this “obfuscated argument” problem, and believe it might be an important quantitative limitation for both IDA and Debate.
Key takeaways and relevance for alignment
Our ultimate goal is to find a mechanism that allows us to learn anything that a machine learning model knows: if the model can efficiently find the correct answer to some problem, our mechanism should favor the correct answer while only requiring a tractable number of human judgements and a reasonable number of computation steps for the model. [1]
We’re working under a hypothesis that there are broadly two ways to know things: via step-by-step reasoning about implications (e.g. logic, computation), and by learning and generalizing from data (e.g. pattern matching, Bayesian updating).
Debate focuses on verifying things via step-by-step reasoning. It seems plausible that a substantial proportion of the things a model ‘knows’ will have some long but locally human-understandable argument for their correctness. [2] Previously we hoped that debate/IDA could verify any knowledge for which such human-understandable arguments exist, even if these arguments are intractably large. We hoped the debaters could strategically traverse small parts of the implicit large argument tree and thereby show that the whole tree could be trusted.
The obfuscated argument problem suggests that we may not be able to rely on debaters to find flaws in large arguments, so that we can only trust arguments when we could find flaws by recursing randomly: for example, because the argument is small enough that we could find a single flaw if one existed, or because the argument is robust enough that it is correct unless it has many flaws. This suggests that while debates may let us verify arguments too large for unaided humans to understand, those arguments may still have to be small relative to the computation used during training.
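To make the "recursing randomly" intuition concrete, here is a minimal sketch under a simplifying assumption that is not part of the original post: the argument is treated as a flat set of leaf claims, some number of which are flawed, and the judge checks leaves chosen uniformly at random with replacement. The function names and parameters are hypothetical and chosen only for illustration.

```python
import random


def detection_probability(num_leaves: int, num_flawed: int, num_checks: int) -> float:
    """Chance that at least one of `num_checks` uniformly random leaf checks
    (drawn with replacement) lands on a flawed leaf.

    Assumed toy model: the argument is a flat set of `num_leaves` claims,
    `num_flawed` of which contain an error.
    """
    if num_leaves <= 0 or not 0 <= num_flawed <= num_leaves or num_checks < 0:
        raise ValueError("invalid argument sizes")
    clean_fraction = 1 - num_flawed / num_leaves
    return 1 - clean_fraction ** num_checks


def simulate(num_leaves: int, num_flawed: int, num_checks: int,
             trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Without loss of generality, leaves 0..num_flawed-1 are the flawed ones.
        if any(rng.randrange(num_leaves) < num_flawed for _ in range(num_checks)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Small argument, single flaw: random checking has a decent chance.
    print(detection_probability(100, 1, 100))
    # Huge argument, single flaw: random checking almost never finds it.
    print(detection_probability(10**6, 1, 100))
    # Huge argument that is only wrong if 10% of leaves are flawed.
    print(detection_probability(10**6, 10**5, 100))
```

In this toy model, 100 random checks find a single flaw in a 100-leaf argument roughly 63% of the time, but find a single flaw among a million leaves only about 0.01% of the time. If the argument can only be wrong when a tenth of its leaves are flawed, the same 100 checks almost certainly hit one. This mirrors the two cases in the text: arguments that are small, or arguments that are robust.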
We believe that many important decisions can’t be justified with arguments small or robust enough to verify in this way. To supervise ML systems that make such decisions, we either need to find some restricted class of arguments for which we believe debaters can reliably find flaws, or we need to be able to trust the representations or heuristics that our models learn from the training data (rather than verifying them in a given case via debate). We have been thinking about approaches like learning the prior to help trust our models’ generalization. This is probably better investigated through ML experiments or theoretical ...