Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: My take on Vanessa Kosoy's take on AGI safety, published by Steve Byrnes on the AI Alignment Forum.
Confidence level: Low
Vanessa Kosoy is a deep fountain of knowledge and insights about AGI safety, but I’ve had trouble understanding some aspects of her point of view. Part of the problem is just pedagogy, and part of it (I will argue) is that she has some different underlying assumptions and beliefs than I do. This post aims to address both those things. In particular, on the pedagogy front, I will try to give a sense for what Vanessa is doing and why, assuming minimal knowledge of either math or theoretical CS. (At least, that's my intention—please let me know if anything is confusing or jargon-y.)
Here’s an example of where we differ. I tend to think of things like “the problem of wireheading” and “the problem of ontological crises” etc. as being on the critical path to AGI safety—as in, I think that, to build safe AGIs, we’ll need to be talking explicitly about these specific problems, and others like them, and to be addressing those specific problems with specific solutions. But Vanessa seems to disagree. What’s the root cause of that disagreement? More to the point, am I wasting my time, thinking about the wrong things?
Vanessa responds: Actually I don't think I disagree? I don't like the name "ontological crisis" since I think it presupposes a particular framing that's not necessarily useful. However I do think it's important to understand how agents can have utility functions that depend on unobservable quantities. I talked about it in Reinforcement Learning With Imperceptible Rewards and have more to say in an upcoming post.
Let’s find out!
Many thanks to Vanessa for patiently engaging with me. Also, thanks to Adam Shimi & Logan Smith for comments on a draft.
Summary & Table of Contents
Section 1 is just getting situated, i.e. what is the problem we’re trying to solve here?
In Section 2, I compare the more popular “algorithms-first approach” to Vanessa’s “desiderata-first approach”. In brief, the former is when you start with an AGI-relevant algorithm and figure out how to make it safe. The latter is when you first come up with one or more precise criteria, called desiderata, such that any algorithm satisfying them would be safe, and then go looking for algorithms that you can prove satisfy those desiderata.
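To make that pattern concrete, here’s a toy sketch of my own (this is not Vanessa’s formalism, and a real desideratum would be proved as a theorem rather than spot-checked empirically): we state a precise performance criterion for a two-armed bandit, and then check whether a candidate algorithm, epsilon-greedy, satisfies it.

import random

# Toy "desideratum": over T rounds on a two-armed Bernoulli bandit, the
# algorithm's average reward should come within `slack` of the best arm's
# mean -- a crude empirical stand-in for a sublinear-regret guarantee.
def meets_desideratum(make_alg, arm_means, T=20000, slack=0.05, seed=0):
    rng = random.Random(seed)
    alg = make_alg(n_arms=len(arm_means), rng=rng)
    total = 0.0
    for _ in range(T):
        arm = alg.choose()
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        alg.update(arm, reward)
        total += reward
    return total / T >= max(arm_means) - slack

# One candidate algorithm: epsilon-greedy over empirical arm means.
class EpsilonGreedy:
    def __init__(self, n_arms, rng, eps=0.05):
        self.rng, self.eps = rng, eps
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
    def choose(self):
        # Explore with probability eps (and until every arm is tried once).
        if 0 in self.counts or self.rng.random() < self.eps:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.means)), key=lambda a: self.means[a])
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

print(meets_desideratum(EpsilonGreedy, arm_means=[0.4, 0.6]))  # hopefully True

The point of the desiderata-first approach is that the criterion is fixed once and for all, up front, and the search is then over algorithms that provably satisfy it; the empirical check above is only standing in for what would really be a proof.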
Sections 3-5 go through the three ingredients needed for AGI safety in Vanessa’s “desiderata-first approach”:
Section 3 covers the part where we prove that an AI algorithm satisfies some precisely-defined desiderata. I’ll cover some key background concepts (“regret bounds”, “traps”, “realizability”), and some of Vanessa’s related ideas (“Delegative Reinforcement Learning”, “Infra-Bayesianism”), and how they’re all connected.
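Since “regret bounds” do a lot of work in Section 3, here is the generic definition from the bandit / reinforcement learning literature (the standard textbook notion, not anything specific to Vanessa’s framework):

$\mathrm{Regret}(T) \;=\; \max_{\pi}\,\mathbb{E}\!\left[\sum_{t=1}^{T} r_t \,\middle|\, \pi\right] \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t \,\middle|\, \text{the learning algorithm}\right]$

i.e., how much total reward the algorithm gives up relative to the best policy. A “regret bound” is a theorem guaranteeing that this quantity grows sublinearly in $T$ (e.g. $O(\sqrt{T})$), so the per-step shortfall shrinks toward zero as the algorithm learns.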
Section 4 covers the part where we come up with good desiderata. To give a taste of what Vanessa has in mind, I give an intuitive walk-through of a particular example she came up with recently: “The Hippocratic Principle” desideratum, and “Hippocratic Timeline-Driven Learning”, an example type of algorithm that would satisfy the desideratum.
Section 5 covers “non-Cartesian daemons”. This part basically closes a loophole in the “desiderata-first” framework, by ruling out bad behaviors unrelated to the AI’s nominal output, such as the AI hacking into the operating system it’s running on.
Section 6 switches to my own opinions:
In Section 6.1, I circle back to the “algorithms-first” vs “desiderata-first” distinction from Section 2, arguing that there’s less to it than it first appears, and that a more important difference is the approach to “weird failure modes that x-risk people talk about” (wireheading, ontological crises, deceptive mesa-optimizers, incorrigibility, gradient hacking, etc. etc.). ...