Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Utility ≠ Reward, published by Vladimir Mikulik on the AI Alignment Forum.
This essay is an adaptation of a talk I gave at the Human-Aligned AI Summer School 2019 about our work on mesa-optimisation. My goal here is to write an informal, accessible and intuitive introduction to the worry that we describe in our full-length report.
I will skip most of the detailed analysis from our report, and encourage the curious reader to follow up this essay with our sequence or report.
The essay has six parts:
Two distinctions draws the foundational distinctions between “optimised” and “optimising”, and between utility and reward.
What objectives? discusses the behavioural and internal approaches to understanding objectives of ML systems.
Why worry? outlines the risk posed by the utility ≠ reward gap.
Mesa-optimisers introduces our language for analysing this worry.
An alignment agenda sketches different alignment problems presented by these ideas, and suggests transparency and interpretability as a way to solve them.
Where does this leave us? summarises the essay and suggests where to look next.
The views expressed here are my own, and do not necessarily reflect those of my coauthors or MIRI. While I wrote this essay in first person, all of the core ideas are the fruit of an equal collaboration between Joar Skalse, Chris van Merwijk, Evan Hubinger and myself. I wish to thank Chris and Joar for long discussions and input as I was writing my talk, and all three, as well as Jaime Sevilla Molina, for thoughtful comments on this essay.
≈3300 words.
Two distinctions
I wish to draw a distinction which I think is crucial for clarity about AI alignment, yet is rarely drawn. That distinction is between the reward signal of a reinforcement learning (RL) agent and its “utility function”[1]. That is to say, it is not in general true that the policy of an RL agent is optimising for its reward. To explain what I mean by this, I will first draw another distinction, between “optimised” and “optimising”. These distinctions lie at the core of our mesa-optimisation framework.
It’s helpful to begin with an analogy. Viewed abstractly, biological evolution is an optimisation process that searches through configurations of matter to find ones that are good at replication. Humans are a product of this optimisation process, and so we are to some extent good at replicating. Yet we don’t care, by and large, about replication in itself.
Many things we care about look like replication. One might be motivated by starting a family, or by having a legacy, or by similar closely related things. But those are not replication itself. If we cared about replication directly, gamete donation would be a far more mainstream practice than it is, for instance.
Thus I want to distinguish the objective of the selection pressure that produced humans from the objectives that humans pursue. Humans were selected for replication, so we are good replicators. This includes having goals that correlate with replication. But it is plain that we are not motivated by replication itself. As a slogan, though we are optimised for replication, we aren’t optimising for replication.
Another clear case where “optimised” and “optimising” come apart is that of “dumb” artifacts like bottle caps. They can be heavily optimised for some purpose without optimising for anything at all.
These examples support the first distinction I want to make: optimised ≠ optimising. They also illustrate how this distinction is important in two ways:
A system optimised for an objective need not be pursuing any objectives itself. (As illustrated by bottle caps.)
The objective a system pursues isn’t determined by the objective it was optimised for. (As illustrated by humans.)
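To make the bottle-cap point concrete, here is a minimal Python sketch of a system that is optimised without optimising. Everything in it is a hypothetical illustration rather than anything from our report: the five-cell corridor, the names episode_return and exhaustive_search, and the choice of exhaustive search as the optimisation process are all assumptions made for the example.

```python
from itertools import product

# Hypothetical toy environment: a corridor of 5 cells, starting at cell 0.
N_STATES = 5
ACTIONS = (-1, +1)  # step left or step right


def episode_return(policy, max_steps=10):
    """Base objective: 1.0 if the policy reaches the rightmost cell, else 0.0."""
    state = 0
    for _ in range(max_steps):
        # The policy is only ever *looked up*; it performs no search itself.
        state = min(max(state + policy[state], 0), N_STATES - 1)
        if state == N_STATES - 1:
            return 1.0
    return 0.0


def exhaustive_search():
    """The optimising process: tries every lookup table and keeps the best."""
    best_policy, best_score = None, -1.0
    for candidate in product(ACTIONS, repeat=N_STATES):
        score = episode_return(candidate)
        if score > best_score:
            best_policy, best_score = candidate, score
    return best_policy, best_score


policy, score = exhaustive_search()
print("selected policy:", policy)
print("return achieved:", score)
```

The search procedure has an objective and does the optimising. What it outputs is a plain tuple of moves: heavily optimised for reaching the end of the corridor, yet containing no objective and no search of its own, much like the bottle cap.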
The reason I draw this distinction is to ask the following question:
Our machine learning models are...