Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Formal Inner Alignment, Prospectus, published by Abram Demski on the AI Alignment Forum.
Most of the work on inner alignment so far has been informal or semi-formal (with the notable exception of a little work on minimal circuits). I feel this has resulted in some misconceptions about the problem. I want to write up a large document clearly defining the formal problem and detailing some formal directions for research. Here, I outline my intentions, inviting the reader to provide feedback and point me to any formal work or areas of potential formal work which should be covered in such a document. (Feel free to do that last one without reading further, if you are time-constrained!)
The State of the Subfield
Risks from Learned Optimization (henceforth, RLO) offered semi-formal definitions of important terms and provided an excellent introduction to the area for many people (it also clarified my own thinking, and that of others I know, even though we had already been thinking about these things).
However, RLO spent a lot of time on highly informal arguments (analogies to evolution, developmental stories about deception) which help establish the plausibility of the problem. While I feel these were important motivation, in hindsight I think they've caused some misunderstandings. My interactions with some other researchers have caused me to worry that some people confuse the positive arguments for plausibility with the core problem, and in some cases have exactly the wrong impression of the core problem. This results in mistakenly trying to block the plausibility arguments, which I see as merely illustrative, rather than attacking the core problem.
By no means do I intend to malign experimental or informal/semiformal work. Rather, by focusing on formal theoretical work, I aim to fill a hole I perceive in the field. I am very appreciative of much of the informal/semiformal work that has been done so far, and continue to think that kind of work is necessary for the crystallization of good concepts.
Focusing on the Core Problem
In order to establish safety properties, we would like robust safety arguments ("X will not happen" / "X has an extremely low probability of happening") -- for example, an argument that the probability of catastrophe is very low, or that the probability of intentional catastrophe is very low (ie, intent-alignment), or something along those lines.
For me, the core inner alignment problem is the absence of such an argument in a case where we might naively expect it. We don't know how to rule out the presence of (misaligned) mesa-optimizers.
Instead, I see many people focusing on blocking the plausibility arguments in RLO. This strikes me as the wrong direction. To me, these arguments are merely illustrative.
It seems like some people have gotten the impression that when the assumptions of the plausibility arguments in RLO aren't met, we should not expect an inner alignment problem to arise. Not only does this attitude misunderstand what we want (ie, a strong argument that we won't encounter a problem) -- I further think it's actually wrong (because when we look at almost any case, we see cause for concern).
Examples:
The Developmental Story
One recent conversation involved a line of research based on the developmental story, where a mesa-optimizer develops a pseudo-aligned objective early in training (an objective with a strong statistical correlation to the true objective in the training data), but, as it learns more about the world, it improves its training score by becoming deceptive rather than by fixing the pseudo-aligned objective. The research proposal presented to me involved shaping the early pseudo-aligned objective in very coarse-grained ways, which might ensure (for example) a high preference for cooperative behavior...
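To make the "statistical correlation" notion in the developmental story concrete, here is a minimal toy sketch (my own illustration, not from the post, and all names are hypothetical): a policy that acts on a proxy feature is indistinguishable from an aligned one on training data where the proxy and the true objective coincide, but the gap appears as soon as that correlation breaks. It illustrates only the pseudo-alignment step, not the later deceptive phase.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Training distribution: the proxy feature is perfectly correlated
# with the true feature, so an objective aimed at the proxy looks
# identical to the true objective on the training data.
true_feature = rng.integers(0, 2, n)
proxy_feature = true_feature.copy()      # correlation holds in training
labels = true_feature                    # the true objective

def proxy_policy(proxy):
    """A 'pseudo-aligned' policy: acts on the proxy feature only."""
    return proxy

train_acc = np.mean(proxy_policy(proxy_feature) == labels)

# Deployment distribution: the correlation breaks, exposing the
# difference between the proxy objective and the true objective.
true_deploy = rng.integers(0, 2, n)
proxy_deploy = rng.integers(0, 2, n)     # now independent of the true feature
labels_deploy = true_deploy

deploy_acc = np.mean(proxy_policy(proxy_deploy) == labels_deploy)

print(train_acc)   # 1.0: indistinguishable from alignment during training
print(deploy_acc)  # near 0.5: the proxy objective was never the true one
```

The point of the sketch is that nothing in the training signal alone distinguishes the pseudo-aligned policy from an aligned one, which is why ruling such cases out requires an argument, not just an absence of observed failures.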