Download - Utility ≠ Reward by Vladimir Mikulik

Discover

Podcast Features
Your all-in-one podcasting solution.

Podcast Studio
Easy-to-use audio recorder app.
Livestream
High-performing audio live, without limits.

Podcast App
The best podcast player & podcast app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Patron & Paid Content
The seamless way for fans to support you directly
from your podcast.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Enterprise
Pricing
Discover

The Nonlinear Library: Alignment Forum Top Posts

Education

Utility ≠ Reward by Vladimir Mikulik

2021-12-05

Download Right click and do "save link as"

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Utility ≠ Reward, published by Vladimir Mikulik on the AI Alignment Forum. This essay is an adaptation of a talk I gave at the Human-Aligned AI Summer School 2019 about our work on mesa-optimisation. My goal here is to write an informal, accessible and intuitive introduction to the worry that we describe in our full-length report. I will skip most of the detailed analysis from our report, and encourage the curious reader to follow up this essay with our sequence or report. The essay has six parts: Two distinctions draws the foundational distinctions between “optimised” and “optimising”, and between utility and reward. What objectives? discusses the behavioral and internal approaches to understanding objectives of ML systems. Why worry? outlines the risk posed by the utility ≠ reward gap. Mesa-optimisers introduces our language for analysing this worry. An alignment agenda sketches different alignment problems presented by these ideas, and suggests transparency and interpretability as a way to solve them. Where does this leave us? summarises the essay and suggests where to look next. The views expressed here are my own, and do not necessarily reflect those of my coauthors or MIRI. While I wrote this essay in first person, all of the core ideas are the fruit of an equal collaboration between Joar Skalse, Chris van Merwijk, Evan Hubinger and myself. I wish to thank Chris and Joar for long discussions and input as I was writing my talk, and all three, as well as Jaime Sevilla Molina, for thoughtful comments on this essay. ≈3300 words. Two distinctions I wish to draw a distinction which I think is crucial for clarity about AI alignment, yet is rarely drawn. That distinction is between the reward signal of a reinforcement learning (RL) agent and its “utility function”[1]. That is to say, it is not in general true that the policy of an RL agent is optimising for its reward. To explain what I mean by this, I will first draw another distinction, between “optimised” and “optimising”. These distinctions lie at the core of our mesa-optimisation framework. It’s helpful to begin with an analogy. Viewed abstractly, biological evolution is an optimisation process that searches through configurations of matter to find ones that are good at replication. Humans are a product of this optimisation process, and so we are to some extent good at replicating. Yet we don’t care, by and large, about replication in itself. Many things we care about look like replication. One might be motivated by starting a family, or by having a legacy, or by similar closely related things. But those are not replication itself. If we cared about replication directly, gamete donation would be a far more mainstream practice than it is, for instance. Thus I want to distinguish the objective of the selection pressure that produced humans from the objectives that humans pursue. Humans were selected for replication, so we are good replicators. This includes having goals that correlate with replication. But it is plain that we are not motivated by replication itself. As a slogan, though we are optimised for replication, we aren’t optimising for replication. Another clear case where “optimised” and “optimising” come apart are “dumb” artifacts like bottle caps. They can be heavily optimised for some purpose without optimising for anything at all. These examples support the first distinction I want to make: optimised ≠ optimising. They also illustrate how this distinction is important in two ways: A system optimised for an objective need not be pursuing any objectives itself. (As illustrated by bottle caps.) The objective a system pursues isn’t determined by the objective it was optimised for. (As illustrated by humans.) The reason I draw this distinction is to ask the following question: Our machine learning models are...