Infinite-width MLPs as an "ensemble prior"
By Vivek Hebbar, May 12, 2023, LessWrong
Summary: A simple toy model suggests that infinitely wide MLPs generalize in an "ensemble-ish" way which is exponentially less data-efficient than Solomonoff induction. It's probably fixable by different initializations and/or regularizations, so I note it here mostly as a mathematical curiosity / interesting prior.
The analysis seems to be qualitatively consistent with empirical results on generalization vs width in small MLPs.
Notes:
The generalization behavior of these neural nets can be analyzed with the Neural Tangent Kernel, which is widely studied. This post is meant to probe the qualitative nature of this behavior through a toy model. I'm unsure whether my particular analysis exists elsewhere.
The deficiency of the standard initialization at infinite width seems to be well-known and empirically supported in NTK-related literature, along with ways of fixing it.
Core claims
The standard initialization uses weights which are proportional to 1/√input_dimension. This has the effect of keeping the activations at roughly the same scale across layers. However, in the infinite width case, it ends up making the gradients in early layers infinitely smaller than those in the last layer. Hence, training an infinite-width MLP is equivalent to running a regression using the features represented by the last-layer neurons at initialization. These features never change during training, since the early gradients are all zero.
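To make the "frozen features" claim concrete, here is a rough numerical check (my own sketch, not from the post; the widths, batch size, and learning rate are arbitrary, and I use PyTorch's default Linear initialization, which scales like 1/√fan_in): after one SGD step from initialization, the relative change in the hidden-layer activations shrinks as the width grows.

```python
# Sketch: with 1/sqrt(fan_in)-style initialization, one SGD step moves the
# hidden features less and less as width grows (they freeze in the limit).
import torch

torch.manual_seed(0)
d_in, batch = 10, 32
x = torch.randn(batch, d_in)
y = torch.randn(batch, 1)

for width in [64, 1024, 16384]:
    net = torch.nn.Sequential(
        torch.nn.Linear(d_in, width),  # weights scale like 1/sqrt(d_in)
        torch.nn.ReLU(),
        torch.nn.Linear(width, 1),     # weights scale like 1/sqrt(width)
    )
    h_init = net[1](net[0](x)).detach()   # hidden features at initialization
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        for p in net.parameters():
            p -= 0.1 * p.grad             # one plain SGD step
    h_new = net[1](net[0](x)).detach()
    rel = ((h_new - h_init).norm() / h_init.norm()).item()
    # This ratio should shrink with width (roughly like 1/sqrt(width)).
    print(f"width={width:6d}  relative change in hidden features: {rel:.2e}")
```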
If we train without regularization, we will tend to get something very "ensemble-ish", "smooth", and "dumb". I will first summarize this claim in a table, then spend the rest of the post going through the reasoning behind it.
| Solomonoff Induction | Infinite-width MLP, low L2-norm solution |
|---|---|
| Bayesian update over programs | Linear regression over circuits |
| Puts most of its weight on a small number of programs, each of which perfectly fits the data on its own | Spreads weight over a broad ensemble, including circuits which have only a small correlation with truth |
| The amount of data required to make the correct program dominate is O(K), where K is the program length | The amount of data to make the correct circuit dominate is O(2^C), where C is some "complexity measure" (defined later). This is exponentially less data-efficient than Solomonoff induction. |
| Calling it "superintelligent" is an understatement | Generalizes poorly on many tasks |
| Highly amenable to "sharp" solutions | Favors smooth solutions, only creates "sharp" solutions if certain conditions are met by the training data |
If we train an infinitely wide MLP from the standard initialization, only the last layer's weights change. So it is equivalent to a linear regression over an infinite set of random "features", these features being the activation patterns of the last layer neurons at initialization.
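To illustrate what "linear regression over random features" means in code (again a sketch of my own, not from the post; the width, dataset size, and sine target are arbitrary choices), one can freeze a random first layer and solve for the last-layer weights with a minimum-L2-norm least-squares fit, which is the "low L2-norm solution" the table refers to:

```python
# Sketch: training only the last layer of a wide ReLU net is linear
# regression on fixed random features.
import numpy as np

rng = np.random.default_rng(0)
d_in, width, n_train = 5, 4096, 50

W = rng.normal(0, 1 / np.sqrt(d_in), size=(width, d_in))  # frozen first layer

def features(x):
    return np.maximum(x @ W.T, 0.0)   # last-layer activations at initialization

X = rng.normal(size=(n_train, d_in))
y = np.sin(X[:, 0])                   # arbitrary target function

Phi = features(X)
# Minimum-L2-norm interpolating solution for the last-layer weights,
# via the pseudo-inverse (far more features than datapoints, so train
# error is ~0 and the solution is spread across many features).
v = np.linalg.pinv(Phi) @ y

X_test = rng.normal(size=(10, d_in))
print("train MSE:", np.mean((Phi @ v - y) ** 2))
print("test  MSE:", np.mean((features(X_test) @ v - np.sin(X_test[:, 0])) ** 2))
```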
If the MLP is deep enough, some of these last-layer neurons contain the outputs of very intelligent circuits. However, if we train our infinite-width MLP, these intelligent circuits will hardly be used by the regression, even if they are very useful. That is, the sum of the weights drawing from them in the last layer will be very small. The reason I believe this is the toy model in the next section.
Toy model
Let's call each last-layer neuron a "feature". As discussed earlier, their behavior never changes due to how the gradients pan out at infinite width. In a "real" infinite-width network, these features will be "useful" and "intelligent" to various degrees, but we will simplify this greatly in the toy model by using just two types of features.
The toy model asks: "Suppose that some features already compute the correct answer for every training datapoint, and that the rest of the features are random garbage. Will the trained network...