When is Goodhart catastrophic?, published by Drake Thomas on May 9, 2023 on LessWrong.
Thanks to Aryan Bhatt, Eric Neyman, and Vivek Hebbar for feedback.
This post gets more math-heavy over time; we convey some intuitions and overall takeaways first, and then get more detailed. Read for as long as you're getting value out of things!
TLDR
How much should you optimize for a flawed measurement? If you model optimization as selecting for high values of your goal V plus an independent error X, then the answer ends up being very sensitive to the distribution of the error X: if it’s heavy-tailed you shouldn't optimize too hard, but if it’s light-tailed you can go full speed ahead.
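To make that concrete, here's a minimal simulation sketch of the claim; the specific distribution choices (a standard normal goal V, with normal vs. Student-t error X) are ours, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# True goal V: standard normal in both scenarios.
v = rng.standard_normal(n)

# Light-tailed error: standard normal.
# Heavy-tailed error: Student-t with 3 degrees of freedom.
errors = {
    "light-tailed": rng.standard_normal(n),
    "heavy-tailed": rng.standard_t(3, n),
}

for name, x in errors.items():
    u = v + x  # the proxy we actually optimize
    for q in [0.9, 0.99, 0.999]:
        cutoff = np.quantile(u, q)
        mean_v = v[u >= cutoff].mean()
        print(f"{name}: selecting top {100 * (1 - q):.1f}% of U gives mean V = {mean_v:.3f}")
```

With the light-tailed error, mean V keeps climbing as selection gets harder; with the heavy-tailed error, the most extreme values of U are mostly explained by extreme error, so mean V stalls out.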
Related work
Why the tails come apart by Thrasymachus discusses a sort of "weak Goodhart" effect, where extremal proxy measurements won't have extremal values of your goal (even if they're still pretty good). It implicitly considers cases where the goal and proxy are roughly jointly normal.
Scott Garrabrant's taxonomy of Goodhart's Law discusses several ways that the law can manifest. This post is about the "Regressional Goodhart" case.
Scaling Laws for Reward Model Overoptimization (Gao et al., 2022) considers very similar conditioning dynamics in real-world RLHF reward models. In their Appendix A, they show a special case of this phenomenon for light-tailed error, which we'll prove a generalization of in the next post.
Defining and Characterizing Reward Hacking (Skalse et al., 2022) shows that under certain conditions, leaving any terms out of a reward function makes it possible to increase expected proxy return while decreasing expected true return.
How much do you believe your results? by Eric Neyman tackles very similar phenomena to the ones discussed here, particularly in section IV; in this post we're interested in characterizing that sort of behavior and when it occurs. We strongly recommend reading it first if you'd like better intuitions behind some of the math presented here - though our post was written independently, it's something of a sequel to Eric's.
Motivation/intuition
Goodhart's Law says
When a measure becomes a target, it ceases to be a good measure.
When I (Drake) first heard about Goodhart's Law, I internalized something like "if you have a goal, and you optimize for a proxy that is less than perfectly correlated with the goal, hard enough optimization for the proxy won't get you what you wanted." This was a useful frame to have in my toolbox, but it wasn't very detailed - I mostly had vague intuitions and some idealized fables from real life.
Much later, I saw some objections to this frame on Goodhart that actually used math. The objection went something like:
Let's try to sketch out an actual formal model here. What's the simplest setup of "two correlated measurements"? We could have a joint normal distribution over two random variables, U and V, with zero mean and positive covariance. You actually value V, but you measure a proxy U. Then we can just do the math: if I optimize really hard for U, and give you a random datapoint with U = 10^12 or something, how much V do you expect to get?
If we look at the joint distribution of U and V, we'll see a distribution with elliptical contour lines, like so:
Now, the naïve hope is that expected V as a function of observed U would go along the semi-major axis, shown in red below:
But actually we'll get the blue line, which passes through the points at which the ellipses are tangent to vertical lines (lines of constant U).
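For reference (this is the standard regression fact for jointly normal variables, not anything special to this setup), the blue line has a closed form:

$$\mathbb{E}[V \mid U = u] \;=\; \frac{\operatorname{Cov}(U,V)}{\operatorname{Var}(U)}\,u \;=\; \rho\,\frac{\sigma_V}{\sigma_U}\,u$$

where ρ is the correlation between U and V, and σ_U, σ_V are the standard deviations. With equal variances, the red semi-major axis has slope 1 while the blue line has slope ρ < 1: shallower, but still strictly positive.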
Importantly, though, we're still getting a line: we get linearly more value V for every additional unit of U we select for! Applying 99th percentile selection on U isn't going to be as good as 99th percentile selection on V, but it's still going to give us more V than any lower percentile selection on U. The proxy is inefficient, but it's not doomed.
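A quick numerical check of this (the correlation ρ = 0.5 is an arbitrary choice of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
rho = 0.5  # illustrative proxy-goal correlation

# Jointly normal (U, V): zero mean, unit variances, correlation rho.
u, v = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T

for q in [0.9, 0.99, 0.999]:
    on_u = v[u >= np.quantile(u, q)].mean()  # select on the proxy
    on_v = v[v >= np.quantile(v, q)].mean()  # select on the goal directly
    print(f"top {100 * (1 - q):.1f}%: mean V = {on_u:.3f} (selecting on U) vs {on_v:.3f} (selecting on V)")
```

Selection on U lags selection on V by roughly the factor ρ at every percentile, but it never stops paying off: harder selection on the proxy always yields more of the goal.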
Lately, however, ...