Download - Clarifying inner alignment terminology by Evan Hubinger

The Nonlinear Library: Alignment Forum Top Posts

Education

Clarifying inner alignment terminology by Evan Hubinger

2021-12-05


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Clarifying inner alignment terminology, published by Evan Hubinger on the AI Alignment Forum.

I have seen a lot of confusion recently surrounding exactly how outer and inner alignment should be defined and I want to try and provide my attempt at a clarification. Here's my diagram of how I think the various concepts should fit together:

The idea of this diagram is that the arrows are implications: for any problem in the diagram, if its direct subproblems are solved, then it should be solved as well (though not necessarily vice versa). Thus, we get:

inner alignment → objective robustness
outer alignment ∧ objective robustness → intent alignment
intent alignment ∧ capability robustness → alignment

And here are all my definitions of the relevant terms which I think produce those implications:

(Impact) Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.

Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective[1] is impact aligned with humans.

Outer Alignment: An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned.[2]

Robustness: An agent is robust if it performs well on the base objective it was trained under even in deployment/off-distribution.[3]

Objective Robustness: An agent is objective robust if the optimal policy for its behavioral objective is impact aligned with the base objective it was trained under.

Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.
Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.

And an explanation of each of the diagram's implications:

inner alignment → objective robustness: If a model is a mesa-optimizer, then its behavioral objective should match its mesa-objective, which means that if its mesa-objective is aligned with the base objective, then its behavioral objective should be too.

outer alignment ∧ objective robustness → intent alignment: Outer alignment ensures that the base objective is measuring what we actually care about, and objective robustness ensures that the model's behavioral objective is aligned with that base objective. Thus, putting them together, we get that the model's behavioral objective must be aligned with humans, which is precisely intent alignment.

intent alignment ∧ capability robustness → alignment: Intent alignment ensures that the behavioral objective is aligned with humans, and capability robustness ensures that the model actually pursues that behavioral objective effectively, even off-distribution, which means that the model will actually always take aligned actions, not just have an aligned behavioral objective.

FAQ

If a model is both outer and inner aligned, what does that imply?

Intent alignment. Reading off the implications from the diagram, we can see that the conjunction of outer and inner alignment gets us to intent alignment, but not all the way to impact alignment, as we're missing capability robustness.

Can impact alignment be split into outer alignment and inner alignment?

No. As I just mentioned, the conjunction of both outer and inner alignment only gives us intent alignment, not impact alignment. Furthermore, if the model is not a mesa-optimizer, then it can be objective robust (and thus intent aligned) without being inner aligned.

Does a model have to be inner aligned to be impact aligned?
No—we only need inner alignment if we're dealing with mesa-optimization. While we can get impact alignment through a combination of inner alignment, outer alignment, and capability robustness, t...
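
The FAQ answers can be checked mechanically by treating each property in the diagram as a plain true/false proposition and enumerating every assignment that respects the three implications. This is only a rough sketch under that simplifying assumption: the definitions in the post are statements about agents and objectives, not booleans, and the names `NAMES`, `consistent`, `entails`, and `outer_and_inner` are hypothetical, introduced here for illustration.

```python
from itertools import product

# Hypothetical labels for the six properties in the diagram.
NAMES = (
    "inner", "outer", "objective_robust",
    "capability_robust", "intent", "impact",
)


def implies(a, b):
    """Material implication: a -> b."""
    return (not a) or b


def consistent(w):
    """True if assignment w respects the three implications in the diagram."""
    return (
        implies(w["inner"], w["objective_robust"])
        and implies(w["outer"] and w["objective_robust"], w["intent"])
        and implies(w["intent"] and w["capability_robust"], w["impact"])
    )


# Every truth assignment that does not violate the diagram.
WORLDS = [
    w
    for w in (dict(zip(NAMES, vals))
              for vals in product([False, True], repeat=len(NAMES)))
    if consistent(w)
]


def entails(premise, conclusion):
    """True if conclusion holds in every consistent world where premise holds."""
    return all(conclusion(w) for w in WORLDS if premise(w))


def outer_and_inner(w):
    return w["outer"] and w["inner"]


# Outer + inner alignment gives intent alignment ...
print(entails(outer_and_inner, lambda w: w["intent"]))  # True
# ... but not impact alignment, since capability robustness is missing.
print(entails(outer_and_inner, lambda w: w["impact"]))  # False
# Adding capability robustness closes the gap.
print(entails(lambda w: outer_and_inner(w) and w["capability_robust"],
              lambda w: w["impact"]))  # True
# Impact alignment does not require inner alignment.
print(entails(lambda w: w["impact"], lambda w: w["inner"]))  # False
```

Because the diagram's arrows run in one direction only, the enumeration keeps worlds in which a property holds without its subproblems being solved, which is why the last check comes out false.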