Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Formal Inner Alignment, Prospectus, published by Abram Demski on the AI Alignment Forum.
Most of the work on inner alignment so far has been informal or semi-formal (with the notable exception of a little work on minimal circuits). I feel this has resulted in some misconceptions about the problem. I want to write up a large document clearly defining the formal problem and detailing some formal directions for research. Here, I outline my intentions, inviting the reader to provide feedback and point me to any formal work or areas of potential formal work which should be covered in such a document. (Feel free to do that last one without reading further, if you are time-constrained!)
The State of the Subfield
Risks from Learned Optimization (henceforth, RLO) offered semi-formal definitions of important terms and provided an excellent introduction to the area for many people (it also clarified my own thinking, and that of others I know, even though we had already been thinking about these things).
However, RLO spent a lot of time on highly informal arguments (analogies to evolution, developmental stories about deception) which help establish the plausibility of the problem. While I feel these were important motivation, in hindsight I think they've caused some misunderstandings. My interactions with some other researchers have caused me to worry that some people confuse the positive arguments for plausibility with the core problem, and in some cases have exactly the wrong impression of the core problem. This results in mistakenly trying to block the plausibility arguments, which I see as merely illustrative, rather than attacking the core problem.
By no means do I intend to malign experimental or informal/semiformal work. Rather, by focusing on formal theoretical work, I aim to fill a hole I perceive in the field. I am very appreciative of much of the informal/semiformal work that has been done so far, and continue to think that kind of work is necessary for the crystallization of good concepts.
Focusing on the Core Problem
In order to establish safety properties, we would like robust safety arguments ("X will not happen" / "X has an extremely low probability of happening") -- for example, an argument that the probability of catastrophe is very low, or that the probability of intentional catastrophe is very low (ie, intent-alignment), or something along those lines.
For me, the core inner alignment problem is the absence of such an argument in a case where we might naively expect it. We don't know how to rule out the presence of (misaligned) mesa-optimizers.
Instead, I see many people focusing on blocking the plausibility arguments in RLO. This strikes me as the wrong direction. To me, these arguments are merely illustrative.
It seems like some people have gotten the impression that when the assumptions of the plausibility arguments in RLO aren't met, we should not expect an inner alignment problem to arise. Not only does this attitude misunderstand what we want (ie, a strong argument that we won't encounter a problem) -- I further think it's actually wrong (because when we look at almost any case, we see cause for concern).
Examples:
The Developmental Story
One recent conversation involved a line of research based on the developmental story, where a mesa-optimizer develops a pseudo-aligned objective early in training (an objective with a strong statistical correlation to the true objective in the training data), but, as it learns more about the world, it improves its training score by becoming deceptive rather than by fixing the pseudo-aligned objective. The research proposal presented to me involved shaping the early pseudo-aligned objective in very coarse-grained ways, which might ensure (for example) a high preference for cooperative behavior...
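To make the "statistical correlation" notion in the developmental story concrete, here is a minimal toy sketch (my own illustration, not from the post, and all names are hypothetical): a policy that acts on a proxy feature is indistinguishable from an aligned one on training data where the proxy and the true objective coincide, but the gap appears as soon as that correlation breaks. It illustrates only the pseudo-alignment step, not the later deceptive phase.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Training distribution: the proxy feature is perfectly correlated
# with the true feature, so an objective aimed at the proxy looks
# identical to the true objective on the training data.
true_feature = rng.integers(0, 2, n)
proxy_feature = true_feature.copy()      # correlation holds in training
labels = true_feature                    # the true objective

def proxy_policy(proxy):
    """A 'pseudo-aligned' policy: acts on the proxy feature only."""
    return proxy

train_acc = np.mean(proxy_policy(proxy_feature) == labels)

# Deployment distribution: the correlation breaks, exposing the
# difference between the proxy objective and the true objective.
true_deploy = rng.integers(0, 2, n)
proxy_deploy = rng.integers(0, 2, n)     # now independent of the true feature
labels_deploy = true_deploy

deploy_acc = np.mean(proxy_policy(proxy_deploy) == labels_deploy)

print(train_acc)   # 1.0: indistinguishable from alignment during training
print(deploy_acc)  # near 0.5: the proxy objective was never the true one
```

The point of the sketch is that nothing in the training signal alone distinguishes the pseudo-aligned policy from an aligned one, which is why ruling such cases out requires an argument, not just an absence of observed failures.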