When is Goodhart catastrophic?, published by Drake Thomas on May 9, 2023 on LessWrong.
Thanks to Aryan Bhatt, Eric Neyman, and Vivek Hebbar for feedback.
This post gets more math-heavy over time; we convey some intuitions and overall takeaways first, and then get more detailed. Read for as long as you're getting value out of things!
TLDR
How much should you optimize for a flawed measurement? If you model optimization as selecting for high values of your goal V plus an independent error X, then the answer ends up being very sensitive to the distribution of the error X: if it’s heavy-tailed you shouldn't optimize too hard, but if it’s light-tailed you can go full speed ahead.
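To make that concrete, here's a minimal simulation sketch of the claim; the specific distribution choices (a standard normal goal V, with normal vs. Student-t error X) are ours, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# True goal V: standard normal in both scenarios.
v = rng.standard_normal(n)

# Light-tailed error: standard normal.
# Heavy-tailed error: Student-t with 3 degrees of freedom.
errors = {
    "light-tailed": rng.standard_normal(n),
    "heavy-tailed": rng.standard_t(3, n),
}

for name, x in errors.items():
    u = v + x  # the proxy we actually optimize
    for q in [0.9, 0.99, 0.999]:
        cutoff = np.quantile(u, q)
        mean_v = v[u >= cutoff].mean()
        print(f"{name}: selecting top {100 * (1 - q):.1f}% of U gives mean V = {mean_v:.3f}")
```

With the light-tailed error, mean V keeps climbing as selection gets harder; with the heavy-tailed error, the most extreme values of U are mostly explained by extreme error, so mean V stalls out.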
Related work
Why the tails come apart by Thrasymachus discusses a sort of "weak Goodhart" effect, where extremal proxy measurements won't have extremal values of your goal (even if they're still pretty good). It implicitly considers cases where the goal and proxy are roughly jointly normal.
Scott Garrabrant's taxonomy of Goodhart's Law discusses several ways that the law can manifest. This post is about the "Regressional Goodhart" case.
Scaling Laws for Reward Model Overoptimization (Gao et al., 2022) considers very similar conditioning dynamics in real-world RLHF reward models. In their Appendix A, they show a special case of this phenomenon for light-tailed error, which we'll prove a generalization of in the next post.
Defining and Characterizing Reward Hacking (Skalse et al., 2022) shows that under certain conditions, leaving any terms out of a reward function makes it possible to increase expected proxy return while decreasing expected true return.
How much do you believe your results? by Eric Neyman tackles very similar phenomena to the ones discussed here, particularly in section IV; in this post we're interested in characterizing that sort of behavior and when it occurs. We strongly recommend reading it first if you'd like better intuitions behind some of the math presented here - though our post was written independently, it's something of a sequel to Eric's.
Motivation/intuition
Goodhart's Law says
When a measure becomes a target, it ceases to be a good measure.
When I (Drake) first heard about Goodhart's Law, I internalized something like "if you have a goal, and you optimize for a proxy that is less than perfectly correlated with the goal, hard enough optimization for the proxy won't get you what you wanted." This was a useful frame to have in my toolbox, but it wasn't very detailed - I mostly had vague intuitions and some idealized fables from real life.
Much later, I saw some objections to this frame on Goodhart that actually used math. The objection went something like:
Let's try to sketch out an actual formal model here. What's the simplest setup of "two correlated measurements"? We could have a joint normal distribution over two random variables, U and V, with zero mean and positive covariance. You actually value V, but you measure a proxy U. Then we can just do the math: if I optimize really hard for U, and give you a random datapoint with U = 10^12 or something, how much V do you expect to get?
If we look at the joint distribution of U and V, we'll see a distribution with elliptical contour lines, like so:
Now, the naïve hope is that expected V as a function of observed U would go along the semi-major axis, shown in red below:
But actually we'll get the blue line, which passes through the points at which the ellipses are tangent to vertical lines (lines of constant U).
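For reference (this is the standard regression fact for jointly normal variables, not anything special to this setup), the blue line has a closed form:

$$\mathbb{E}[V \mid U = u] \;=\; \frac{\operatorname{Cov}(U,V)}{\operatorname{Var}(U)}\,u \;=\; \rho\,\frac{\sigma_V}{\sigma_U}\,u$$

where ρ is the correlation between U and V, and σ_U, σ_V are the standard deviations. With equal variances, the red semi-major axis has slope 1 while the blue line has slope ρ < 1: shallower, but still strictly positive.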
Importantly, though, we're still getting a line: we get linearly more value V for every additional unit of U we select for! Applying 99th percentile selection on U isn't going to be as good as 99th percentile selection on V, but it's still going to give us more V than any lower percentile selection on U. The proxy is inefficient, but it's not doomed.
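A quick numerical check of this (the correlation ρ = 0.5 is an arbitrary choice of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
rho = 0.5  # illustrative proxy-goal correlation

# Jointly normal (U, V): zero mean, unit variances, correlation rho.
u, v = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T

for q in [0.9, 0.99, 0.999]:
    on_u = v[u >= np.quantile(u, q)].mean()  # select on the proxy
    on_v = v[v >= np.quantile(v, q)].mean()  # select on the goal directly
    print(f"top {100 * (1 - q):.1f}%: mean V = {on_u:.3f} (selecting on U) vs {on_v:.3f} (selecting on V)")
```

Selection on U lags selection on V by roughly the factor ρ at every percentile, but it never stops paying off: harder selection on the proxy always yields more of the goal.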
Lately, however, ...