Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Goals selected from learned knowledge: an alternative to RL alignment, published by Seth Herd on January 15, 2024 on The AI Alignment Forum.
Summary:
Alignment work on network-based AGI focuses on reinforcement learning. There is an alternative approach that avoids some, but not all, of the difficulties of RL alignment. Instead of trying to build an adequate representation of the behavior and goals we want by specifying rewards, we can select the AGI's goals from the representations it has learned through any learning method.
I give three examples of this approach: Steve Byrnes' plan for mediocre alignment (of RL agents); John Wentworth's "redirect the search" for goal-directed mesa-optimizers that could emerge in predictive networks; and natural language alignment for language model agents. These three approaches fall into a natural category that has important advantages over commonly considered RL alignment approaches.
An alternative to RL alignment
Recent work on alignment theory has focused on reinforcement learning (RL) alignment. RLHF and Shard Theory are two examples, but most work addressing network-based AGI assumes we will try to create human-aligned goals and behavior by specifying rewards. For instance, Yudkowsky's List of Lethalities seems to address RL approaches and exemplifies the most common critiques: specifying behavioral correlates of desired values seems imprecise and prone to mesa-optimization and misgeneralization in new contexts. I think RL alignment might work, but I agree with the critique that much optimism for RL alignment doesn't adequately consider those concerns.
There's an alternative to RL alignment for network-based AGI. Instead of trying to provide reinforcement signals that will create representations of aligned values, we can let the AGI learn all kinds of representations, using any learning method, and then select from those representations what we want its goals to be.
I'll call this approach goals selected from learned knowledge (GSLK). It is a novel alternative not only to RL alignment but also to older strategies focused on specifying an aligned maximization goal before training an agent. Thus, it violates some of the assumptions that lead MIRI leadership and similar thinkers to predict near-certain doom.
Goal selection from learned knowledge involves allowing a system to learn until it forms robust representations, then selecting some of those representations to serve as goals. This is a paradigm shift from RL alignment, which has dominated alignment discussions since deep networks became the clear leader in AI, and which attempts to construct goal representations by specifying reward conditions.
In GSLK alignment, the system learns representations of a wide array of outcomes and behaviors, using any effective learning mechanisms. From that spectrum of representations, goals are selected. This shifts the problem from creation to selection of complex representations.
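To make the selection step concrete, here is a minimal, purely illustrative sketch. Every name in it (ToyWorldModel, embed, GSLKAgent) is a hypothetical stand-in rather than any proposal's actual implementation; the point is only that the goal is chosen from representations the model already has, instead of being constructed by reward.

```python
# Purely illustrative sketch of the GSLK idea; all names here are hypothetical.
from typing import Dict, List, Optional


def embed(description: str) -> str:
    # Stand-in for retrieving a representation the model has ALREADY learned
    # (e.g., an internal concept or embedding), rather than training a new one.
    return description.lower().strip()


class ToyWorldModel:
    # Stand-in for a predictive model trained by any method (self-supervised,
    # imitation, etc.); it predicts which outcome each action leads to.
    def __init__(self, transitions: Dict[str, str]):
        self.transitions = transitions  # action -> predicted outcome representation

    def predicted_outcome(self, action: str) -> str:
        return self.transitions[action]


class GSLKAgent:
    def __init__(self, world_model: ToyWorldModel):
        self.world_model = world_model
        self.goal_representation: Optional[str] = None

    def select_goal(self, description: str) -> None:
        # GSLK step: pick one of the learned representations to serve as the goal,
        # instead of shaping a goal representation with reward signals.
        self.goal_representation = embed(description)

    def act(self, candidate_actions: List[str]) -> str:
        # Steering: choose the action whose predicted outcome matches the
        # selected goal representation.
        for action in candidate_actions:
            if self.world_model.predicted_outcome(action) == self.goal_representation:
                return action
        return candidate_actions[0]  # fallback if no action reaches the goal


if __name__ == "__main__":
    model = ToyWorldModel({"bake": "fresh bread", "wait": "nothing happens"})
    agent = GSLKAgent(model)
    agent.select_goal("Fresh bread")
    print(agent.act(["wait", "bake"]))  # -> "bake"
```

The toy example compresses away everything hard about the real problem; it is only meant to show where the goal comes from in a GSLK scheme: selection from existing learned knowledge, not creation via reward.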
This class of alignment approaches shares some of the difficulties of RL alignment proposals, but not all of them. Thus far GSLK approaches have received little critique or analysis. Several recent proposals share this structure, and my purpose here is to generalize from those examples to identify the category.
I think this approach is worth some careful consideration because it's likely to actually be tried. It applies to LLM agents, to most types of RL agents, and to agentic mesa-optimization in large foundation models. And it's pretty obvious, at least in hindsight. If the first agentic AGI is an LLM agent, an RL agent, or a combination of the two, I think it's fairly likely that this will be part of the alignment plan whose success or failure determines all of our fates. So I'd like to get more critique and analysis of this approach...