Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Challenges to Christiano’s capability amplification proposal, published by Eliezer Yudkowsky on the AI Alignment Forum.
The following is a basically unedited summary I wrote up on March 16 of my take on Paul Christiano’s AGI alignment approach (described in “ALBA” and “Iterated Distillation and Amplification”). Where Paul had comments and replies, I’ve included them below.
I see a lot of free variables with respect to what exactly Paul might have in mind. I've sometimes tried presenting Paul with my objections, and he then replies in a way that locally answers some of my questions but that I think would make other difficulties worse. My global objection is thus something like, "I don't see any concrete setup and consistent simultaneous setting of the variables where this whole scheme works." These difficulties are not minor or technical; they appear to me quite severe. I try to walk through the details below.
It should be understood at all times that I do not claim to be able to pass Paul’s ITT for Paul’s view and that this is me criticizing my own, potentially straw misunderstanding of what I imagine Paul might be advocating.
Paul Christiano
Overall take: I think that these are all legitimate difficulties faced by my proposal and to a large extent I agree with Eliezer's account of those problems (though not his account of my current beliefs).
I don't understand exactly how hard Eliezer expects these problems to be; my impression is "just about as hard as solving alignment from scratch," but I don't have a clear sense of why.
To some extent we are probably disagreeing about alternatives. From my perspective, the difficulties with my approach (e.g. better understanding the forms of optimization that cause trouble, or how to avoid optimization daemons in systems about as smart as you are, or how to address X-and-only-X) are also problems for alternative alignment approaches. I think it's a mistake to think that tiling agents, or decision theory, or naturalized induction, or logical uncertainty, are going to make the situation qualitatively better for these problems, so work on those problems looks to me like procrastinating on the key difficulties. I agree with the intuition that progress on the agent foundations agenda "ought to be possible," and I agree that it will help at least a little bit with the problems Eliezer describes in this document, but overall agent foundations seems way less promising than a direct attack on the problems (given that we haven’t tried the direct attack nearly enough to give up). Working through philosophical issues in the context of a concrete alignment strategy generally seems more promising to me than trying to think about them in the abstract, and I think this is evidenced by the fact that most of the core difficulties in my approach would also afflict research based on agent foundations.
The main way I could see agent foundations research as helping to address these problems, rather than merely deferring them, is if we plan to eschew large-scale ML altogether. That seems to me like a very serious handicap, so I'd only go that direction once I was quite pessimistic about solving these problems. My subjective experience is of making continuous significant progress rather than being stuck. I agree there is clear evidence that the problems are "difficult" in the sense that we are going to have to make progress in order to solve them, but not that they are "difficult" in the sense that P vs. NP or even your typical open problem in CS is probably difficult (and even then if your options were "prove P != NP" or "try to beat Google at building an AGI without using large-scale ML," I don't think it's obvious which option you should consider more promising).
First and foremost, I don't understand how "preserving alignment while amplifying capabilities...