Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Risks from Learned Optimization: Introduction , published by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant on the AI Alignment Forum.
This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.
Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence.
Motivation
The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned?
We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems. Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer as deceptive alignment which we posit may present one of the largest—though not necessarily insurmountable—current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning.
Two questions
In machine learning, we do not manually program each individual parameter of our models. Instead, we specify an objective function that captures what we want the system to do and a learning algorithm to optimize the system for that objective. In this post, we present a framework that distinguishes what a system is optimized to do (its “purpose”), from what it optimizes for (its “goal”), if it optimizes for anything at all. While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible plans, picking those that do well according to some objective.
Whether a syste...
view more