Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How much do you believe your results?, published by Eric Neyman on May 6, 2023 on LessWrong.
Thanks to Drake Thomas for feedback.
I.
Here’s a fun scatter plot. It has two thousand points, which I generated as follows: first, I drew two thousand x-values from a normal distribution with mean 0 and standard deviation 1. Then, I chose the y-value of each point by taking the x-value and then adding noise to it. The noise is also normally distributed, with mean 0 and standard deviation 1.
Notice that there’s more spread along the y-axis than along the x-axis. That’s because each y-coordinate is a sum of two independently drawn numbers from the standard normal distribution. Because variances add, the y-values have variance 2 (standard deviation 1.41), not 1.
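The generation procedure and the variance claim can be checked with a short sketch. This is not the code behind the plot: NumPy, the seed, and the helper name generate_points are my own assumptions.

```python
import numpy as np

def generate_points(n=2000, seed=0):
    """Draw x ~ N(0, 1), then set y = x + noise, with noise ~ N(0, 1).

    The function name and the seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    noise = rng.normal(loc=0.0, scale=1.0, size=n)
    y = x + noise
    return x, y

x, y = generate_points()

# Variances of independent draws add, so the spread of y should be
# close to sqrt(2), while the spread of x should be close to 1.
print("sample standard deviation of x:", x.std())
print("sample standard deviation of y:", y.std())
```

With two thousand points the sample standard deviations should land close to 1 and 1.41, though not exactly on them.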
Statisticians often talk about data forming an “elliptical cloud”. You can see how the data forms into an elliptical shape. To put a finer point on it:
Why an ellipse — what’s the mathematical significance of this shape? The answer pops out if you look at a plot of how likely different points on the plane are to be selected by the random generation procedure that I used.
The highest density of points is near (0, 0), and as you get farther from the origin the density decreases. The green ellipse on the scatter plot is a level set of equal probability: if you were to select a datapoint using my procedure, you’d be more likely to land in any square millimeter inside the ellipse than in any square millimeter outside the ellipse — and you’d be equally likely to land in any location on the ellipse as on any other location on the ellipse.
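For readers who want the formula behind that picture, here is a short derivation under the generation procedure described above. The symbols f and c are my own notation. Since x and the noise y - x are independent standard normals, the joint density is the product of two standard normal densities, and setting it equal to a constant gives the equation of an ellipse.

```latex
\[
f(x, y)
  = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \cdot \frac{1}{\sqrt{2\pi}} e^{-(y - x)^2/2}
  = \frac{1}{2\pi} \exp\!\left( -\frac{x^2 + (y - x)^2}{2} \right)
\]
\[
f(x, y) = \text{const}
  \iff x^2 + (y - x)^2 = c
  \iff 2x^2 - 2xy + y^2 = c
\]
```

The last expression is a quadratic form with negative discriminant ((-2)^2 - 4 * 2 * 1 = -4), which is why each level set is an ellipse centered at the origin.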
The line of best fit is a statistical tool for answering the following question: given an x-value, what is your best guess about the y-value?
What is the line of best fit for this data? Here’s one line of reasoning: since the y-values were generated by taking the x-values and adding random noise, our best guess for y should just be x. So the line of best fit is y = x.
Huh, weird. This line is weirdly “askew” of the ellipse, and it doesn’t reflect the fact that the y-values are more dispersed than the x-values. Maybe the line of best fit instead passes from the bottom-left to the top-right of the ellipse, along its major axis. It sure looks like the points are on average closer to this line than to the previous one.
Which line is the line of best fit, and what’s wrong with the other line? I recommend pondering this for a bit before reading on.
The answer is that the first line, y = x, is the line of best fit. The problem with the second line is that it doesn’t try to predict y given x. I mean, scroll back up and take a look at how low the line is at x = -2: it’s way below almost all of the points whose x-value is near -2! This line is instead doing a different, important thing: it indicates the axis of maximum variation of the data. It’s the line with the property that, if you project the data onto the line, the data will be maximally dispersed. This line is called the first principal component of the data, but it is not the line of best fit.
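The two lines can be computed side by side. The sketch below is my own illustration, assuming NumPy and a fixed seed: it fits the least-squares line of y on x, and separately takes the eigenvector of the sample covariance matrix with the largest eigenvalue as the first principal component. For this generation procedure the population covariance matrix is [[1, 1], [1, 2]], so the population slopes are 1 for the line of best fit and (1 + sqrt(5)) / 2, about 1.62, for the first principal component. The sample values should land near those.

```python
import numpy as np

# Regenerate data with the procedure from section I (the seed is an assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + rng.normal(size=2000)

# Line of best fit: least-squares regression of y on x.
# np.polyfit returns coefficients from highest degree to lowest.
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)

# First principal component: the eigenvector of the covariance matrix
# with the largest eigenvalue. np.linalg.eigh returns eigenvalues in
# ascending order, so the last column is the one we want.
cov = np.cov(x, y)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
pc1 = eigenvectors[:, -1]
pc1_slope = pc1[1] / pc1[0]

print("slope of line of best fit:", ols_slope)
print("slope of first principal component:", pc1_slope)
```

The principal component comes out noticeably steeper than the regression line, which matches the picture: it runs along the major axis of the ellipse rather than through its left and right edges.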
Instead of going from the bottom-left to the top-right of the ellipse, the line of best fit goes from the left of the ellipse to the right. This is the line that has as much of the ellipse above it as below it, at every x-coordinate. This is what you want, because you want the true y-value to be below your prediction as often as it is above your prediction.
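One rough way to check the "as much above as below" property is to look at narrow vertical strips of the data and count what fraction of points lies above each candidate line. The sketch below is my own illustration, not the author's code. It assumes NumPy, a fixed seed, and the population slope (1 + sqrt(5)) / 2 for the first principal component, and fraction_above is a hypothetical helper.

```python
import numpy as np

# Regenerate data with the procedure from section I (the seed is an assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + rng.normal(size=2000)

def fraction_above(slope, x_lo, x_hi):
    """Fraction of points with x in [x_lo, x_hi) lying above the line y = slope * x."""
    in_strip = (x >= x_lo) & (x < x_hi)
    return np.mean(y[in_strip] > slope * x[in_strip])

best_fit_slope = 1.0
# Assumption: use the population slope of the first principal component.
pc1_slope = (1 + np.sqrt(5)) / 2

for x_lo, x_hi in [(-2.5, -1.5), (-0.5, 0.5), (1.5, 2.5)]:
    print(
        f"x in [{x_lo}, {x_hi}):",
        "above best fit:", fraction_above(best_fit_slope, x_lo, x_hi),
        "above first principal component:", fraction_above(pc1_slope, x_lo, x_hi),
    )
```

For y = x the fraction should hover around one half in every strip, because the leftover y - x is just the noise. For the steeper line, the strip around x = -2 should have most of its points above the line, and the strip around x = 2 should have most of its points below it.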
(Huh, what a weird asymmetry! I wonder why the line doesn’t instead go from the bottom of the ellipse to the top.)
II.
You are the director of a giant government research program that’s conducting randomized controlled trials (RCTs) on two thousand health interventions, so that you can pick out the most cost-effective ones and promote them among the general population.
The quality of the two thousan...