Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evaluating Stability of Unreflective Alignment, published by james.lucassen on February 1, 2024 on The AI Alignment Forum.
This post has an accompanying SPAR project! Apply here if you're interested in working on this with me.
Huge thanks to Mikita Balesni for helping me implement the MVP. Regular-sized thanks...
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evaluating Stability of Unreflective Alignment, published by james.lucassen on February 1, 2024 on The AI Alignment Forum.
This post has an accompanying SPAR project! Apply here if you're interested in working on this with me.
Huge thanks to Mikita Balesni for helping me implement the MVP. Regular-sized thanks to Aryan Bhatt, Rudolph Laine, Clem von Stengel, Aaron Scher, Jeremy Gillen, Peter Barnett, Stephen Casper, and David Manheim for helpful comments.
0. Key Claims
Most alignment work today doesn't aim for alignment that is stable under value-reflection[1].
I think this is probably the most sensible approach to alignment.
But there is a threat model which could potentially be a serious obstacle to this entire class of alignment approaches, which is not currently being tracked or evaluated. It goes like this:
Long-horizon capabilities require a particular capability I call "stepping back" (described in more detail in the body of the post)
Stepping back seems likely to be learned in a very generalizable way
If stepping back generalizes in what seems like the simplest / most natural way, it will create a tendency for AIs to do value-reflection
If we're unable to prevent this value-reflection, it will cause any AI whose alignment is not stable under value-reflection to become misaligned
However, I'm quite uncertain about whether or not this threat model will go through, for a few reasons:
The chain of argument is highly conjunctive and therefore fragile.
The track record of anthropomorphic reasoning about which AI capabilities will imply which other AI behaviors seems poor.
There is an intuitive counterexample provided by humans, at least at human-level intelligence.
Even given this uncertainty, work on this threat model seems very neglected given the wide swath of highly-invested-in alignment approaches that it could affect.
I don't think further abstract argument about the likelihood of this threat model will reduce my uncertainty much. Instead, I propose some evals to keep track of each component of this threat model:
Evaluate the stepping back capabilities of frontier AIs in domains where they receive the most long-horizon training.
Evaluate the generalization of frontier AIs stepping back behavior to other task domains.
Evaluate how difficult it is to decouple the capabilities of frontier AIs between domains.
Evaluate how difficult it is to prevent frontier AIs from thinking certain "forbidden thoughts" even if those thoughts are incentivized by the task setting
1. Reflective vs Unreflective Alignment
I want to start by making an explicit distinction between two importantly different uses of the word "alignment". The distinction is between alignment that aims to be preserved even under value-reflection by the AI, and alignment that does not aim for that. I will call alignment approaches that aim for stability under value-reflection "reflective alignment", and approaches that do not "unreflective alignment".
One easy way to connect this distinction to your existing thinking is by checking if you think of alignment as a binary or as a spectrum. Unreflective alignment naturally admits a spectrum formulation - alignment which is stable under X amount of value-reflection (or other destabilizing pressures). Reflective alignment, on the other hand, is typically thought of as a binary[2] - either the AI converges to the intended values in the limit, or it goes to some other fixed point instead.
In the current alignment discussion, iterative prosaic approaches generally propose to work by ensuring "alignment increases faster than capabilities". This is clearly an alignment-as-spectrum framing, suggesting the goal in mind is unreflective alignment. For further confirmation, consider the type of step that is considered an increase in alignment - going from baseline GPT-3 t...
View more