Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Clarifying inner alignment terminology, published by Evan Hubinger on the AI Alignment Forum.
I have seen a lot of confusion recently about exactly how outer and inner alignment should be defined, and I want to offer my attempt at a clarification.
Here's my diagram of how I think the various concepts should fit together:
The idea of this diagram is that the arrows are implications—that is, for any problem in the diagram, if its direct subproblems are solved, then it should be solved as well (though not necessarily vice versa). Thus, we get:
inner alignment → objective robustness
outer alignment ∧ objective robustness → intent alignment
intent alignment ∧ capability robustness → alignment
And here are all my definitions of the relevant terms which I think produce those implications:
(Impact) Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.
Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective[1] is impact aligned with humans.
Outer Alignment: An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned.[2]
Robustness: An agent is robust if it performs well on the base objective it was trained under even in deployment/off-distribution.[3]
Objective Robustness: An agent is objective robust if the optimal policy for its behavioral objective is impact aligned with the base objective it was trained under.
Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.
Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.
And an explanation of each of the diagram's implications:
inner alignment → objective robustness: If a model is a mesa-optimizer, then its behavioral objective should match its mesa-objective, which means that if its mesa-objective is aligned with the base objective, then its behavioral objective should be too.
outer alignment ∧ objective robustness → intent alignment: Outer alignment ensures that the base objective is measuring what we actually care about, and objective robustness ensures that the model's behavioral objective is aligned with that base objective. Thus, putting them together, we get that the model's behavioral objective must be aligned with humans, which is precisely intent alignment.
intent alignment ∧ capability robustness → alignment: Intent alignment ensures that the behavioral objective is aligned with humans, and capability robustness ensures that the model actually pursues that behavioral objective effectively, even off-distribution, which means that the model will actually always take aligned actions, not just have an aligned behavioral objective.
FAQ
If a model is both outer and inner aligned, what does that imply?
Intent alignment. Reading off the implications from the diagram, we can see that the conjunction of outer and inner alignment gets us to intent alignment, but not all the way to impact alignment, as we're missing capability robustness.
Can impact alignment be split into outer alignment and inner alignment?
No. As I just mentioned, the conjunction of both outer and inner alignment only gives us intent alignment, not impact alignment. Furthermore, if the model is not a mesa-optimizer, then it can be objective robust (and thus intent aligned) without being inner aligned.
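The two answers above can be checked mechanically by treating each property as a simple true/false fact and chaining the diagram's three implications forward. This is only a rough sketch: the names `RULES` and `closure` are made up for illustration, and reducing each definition to a boolean glosses over everything the definitions actually say.

```python
# Implications read off the diagram, as (premises, conclusion) pairs.
# RULES and closure are hypothetical names used only for this sketch.
RULES = [
    ({"inner alignment"}, "objective robustness"),
    ({"outer alignment", "objective robustness"}, "intent alignment"),
    ({"intent alignment", "capability robustness"}, "alignment"),
]


def closure(solved):
    """Return every property that follows from `solved` under RULES."""
    derived = set(solved)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            # Fire a rule once all of its premises have been derived.
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


# Outer plus inner alignment reaches intent alignment but not alignment.
both = closure({"outer alignment", "inner alignment"})
print("intent alignment" in both, "alignment" in both)

# Objective robustness can stand in for inner alignment (no mesa-optimizer).
no_mesa = closure({"outer alignment", "objective robustness"})
print("intent alignment" in no_mesa, "inner alignment" in no_mesa)
```

Adding "capability robustness" to either starting set is what lets the last rule fire and yields alignment. Note that the rules only run in one direction, matching the "though not necessarily vice versa" caveat on the diagram.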
Does a model have to be inner aligned to be impact aligned?
No—we only need inner alignment if we're dealing with mesa-optimization. While we can get impact alignment through a combination of inner alignment, outer alignment, and capability robustness, t...