Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Risks from Learned Optimization: Conclusion and Related Work, published by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant on the AI Alignment Forum.
This is the fifth of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.
Related work
Meta-learning. As described in the first post, meta-learning can often be thought of as meta-optimization when the meta-optimizer's objective is explicitly designed to accomplish some base objective. However, it is also possible to do meta-learning by attempting to make use of mesa-optimization instead. For example, in Wang et al.'s “Learning to Reinforcement Learn,” the authors claim to have produced a neural network that implements its own optimization procedure.(28) Specifically, the authors argue that the ability of their network to solve extremely varied environments without explicit retraining for each one means that their network must be implementing its own internal learning procedure. Another example is Duan et al.'s “RL²: Fast Reinforcement Learning via Slow Reinforcement Learning,” in which the authors train a reinforcement learning algorithm that they claim is itself doing reinforcement learning.(5) Of any existing machine learning research, this sort of meta-learning work seems the closest to producing mesa-optimizers.
Robustness. A system is robust to distributional shift if it continues to perform well on the objective function for which it was optimized even when moved off its training environment.(29) In the context of mesa-optimization, pseudo-alignment is a particular way in which a learned system can fail to be robust to distributional shift: in a new environment, a pseudo-aligned mesa-optimizer might still competently optimize for the mesa-objective but fail to be robust due to the difference between the base and mesa-objectives.
The particular type of robustness problem that mesa-optimization falls into is the reward-result gap, the gap between the reward for which the system was trained (the base objective) and the reward that can be reconstructed from it using inverse reinforcement learning (the behavioral objective).(8) In the context of mesa-optimization, pseudo-alignment leads to a reward-result gap because the system's behavior outside the training environment is determined by its mesa-objective, which in the case of pseudo-alignment is not aligned with the base objective.
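As a toy illustration of this gap (our own sketch, not from the paper): suppose the base objective rewards reaching a goal state, while the learned system instead acquires a hypothetical proxy mesa-objective, "move right," which agrees with the base objective everywhere on the training distribution but diverges beyond it.

```python
def base_objective(state):
    # Base objective: reward proximity to a goal located at x = 10.
    return -abs(state - 10)

def mesa_objective(state):
    # Hypothetical proxy mesa-objective: reward moving right. On training
    # states (all left of the goal) this agrees with the base objective.
    return state

def greedy_action(state, objective):
    # Pick the step (+1 or -1) that the given objective prefers.
    return max([+1, -1], key=lambda a: objective(state + a))

# On the training distribution (states left of the goal), the proxy-driven
# policy is behaviorally indistinguishable from a base-aligned one.
assert greedy_action(3, mesa_objective) == greedy_action(3, base_objective)

# Off-distribution (past the goal), the mesa-objective keeps pushing right
# while the base objective would turn back: a reward-result gap appears.
print(greedy_action(15, mesa_objective))  # +1 (keeps going right)
print(greedy_action(15, base_objective))  # -1 (returns toward the goal)
```

Inverse reinforcement learning applied to this policy's off-distribution behavior would recover the "move right" proxy rather than the goal-seeking base objective, which is exactly the gap between base and behavioral objectives described above.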
It should be noted, however, that while inner alignment is a robustness problem, the occurrence of unintended mesa-optimization is not. If the base optimizer's objective is not a perfect measure of the human's goals, then preventing mesa-optimizers from arising at all might be the preferred outcome. In such a case, it might be desirable to create a system that is strongly optimized for the base objective within some limited domain without that system engaging in open-ended optimization in new environments.(11) One possible way to accomplish this might be to use strong optimization at the level of the base optimizer during training to prevent strong optimization at the level of the mesa-optimizer.(11)
Unidentifiability and goal ambiguity. As we noted in the third post, the problem of unidentifiability of objective functions in mesa-optimization is similar to the problem of unidentifiability in reward learning, the key issue being that it can be difficult to determine the “correct” objective function given only a sample of that objective's output on some training data.(20) We hypothesize that if the problem of unidentifiability can be resolved in the c...