Infinite-width MLPs as an "ensemble prior"
By Vivek Hebbar, May 12, 2023, LessWrong
Summary: A simple toy model suggests that infinitely wide MLPs generalize in an "ensemble-ish" way which is exponentially less data-efficient than Solomonoff induction. It's probably fixable by different initializations and/or regularizations, so I note it here mostly as a mathematical curiosity / interesting prior.
The analysis seems to be qualitatively consistent with empirical results on generalization vs width in small MLPs.
Notes:
The generalization behavior of these neural nets can be analyzed with the Neural Tangent Kernel, which is widely studied. This post is meant to probe the qualitative nature of this behavior through a toy model. I'm unsure whether my particular analysis exists elsewhere.
The deficiency of the standard initialization at infinite width seems to be well-known and empirically supported in NTK-related literature, along with ways of fixing it.
Core claims
The standard initialization uses weights which are proportional to 1/√input_dimension. This has the effect of keeping the activations at roughly the same scale across layers. However, in the infinite width case, it ends up making the gradients in early layers infinitely smaller than those in the last layer. Hence, training an infinite-width MLP is equivalent to running a regression using the features represented by the last-layer neurons at initialization. These features never change during training, since the early gradients are all zero.
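To make the "frozen features" claim concrete, here is a rough numerical check (my own sketch, not from the post; the widths, batch size, and learning rate are arbitrary, and I use PyTorch's default Linear initialization, which scales like 1/√fan_in): after one SGD step from initialization, the relative change in the hidden-layer activations shrinks as the width grows.

```python
# Sketch: with 1/sqrt(fan_in)-style initialization, one SGD step moves the
# hidden features less and less as width grows (they freeze in the limit).
import torch

torch.manual_seed(0)
d_in, batch = 10, 32
x = torch.randn(batch, d_in)
y = torch.randn(batch, 1)

for width in [64, 1024, 16384]:
    net = torch.nn.Sequential(
        torch.nn.Linear(d_in, width),  # weights scale like 1/sqrt(d_in)
        torch.nn.ReLU(),
        torch.nn.Linear(width, 1),     # weights scale like 1/sqrt(width)
    )
    h_init = net[1](net[0](x)).detach()   # hidden features at initialization
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        for p in net.parameters():
            p -= 0.1 * p.grad             # one plain SGD step
    h_new = net[1](net[0](x)).detach()
    rel = ((h_new - h_init).norm() / h_init.norm()).item()
    # This ratio should shrink with width (roughly like 1/sqrt(width)).
    print(f"width={width:6d}  relative change in hidden features: {rel:.2e}")
```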
If we train without regularization, we will tend to get something very "ensemble-ish", "smooth", and "dumb". I will first summarize this claim in a table, then spend the rest of the post going through the reasoning behind it.
| Solomonoff Induction | Infinite-width MLP, low L2-norm solution |
|---|---|
| Bayesian update over programs | Linear regression over circuits |
| Puts most of its weight on a small number of programs, each of which perfectly fits the data on its own | Spreads weight over a broad ensemble, including circuits which have only a small correlation with truth |
| The amount of data required to make the correct program dominate is O(K), where K is the program length | The amount of data to make the correct circuit dominate is O(2^C), where C is some "complexity measure" (defined later). This is exponentially less data-efficient than Solomonoff induction. |
| Calling it "superintelligent" is an understatement | Generalizes poorly on many tasks |
| Highly amenable to "sharp" solutions | Favors smooth solutions, only creates "sharp" solutions if certain conditions are met by the training data |
If we train an infinitely wide MLP from the standard initialization, only the last layer's weights change. So it is equivalent to a linear regression over an infinite set of random "features", these features being the activation patterns of the last layer neurons at initialization.
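To illustrate what "linear regression over random features" means in code (again a sketch of my own, not from the post; the width, dataset size, and sine target are arbitrary choices), one can freeze a random first layer and solve for the last-layer weights with a minimum-L2-norm least-squares fit, which is the "low L2-norm solution" the table refers to:

```python
# Sketch: training only the last layer of a wide ReLU net is linear
# regression on fixed random features.
import numpy as np

rng = np.random.default_rng(0)
d_in, width, n_train = 5, 4096, 50

W = rng.normal(0, 1 / np.sqrt(d_in), size=(width, d_in))  # frozen first layer

def features(x):
    return np.maximum(x @ W.T, 0.0)   # last-layer activations at initialization

X = rng.normal(size=(n_train, d_in))
y = np.sin(X[:, 0])                   # arbitrary target function

Phi = features(X)
# Minimum-L2-norm interpolating solution for the last-layer weights,
# via the pseudo-inverse (far more features than datapoints, so train
# error is ~0 and the solution is spread across many features).
v = np.linalg.pinv(Phi) @ y

X_test = rng.normal(size=(10, d_in))
print("train MSE:", np.mean((Phi @ v - y) ** 2))
print("test  MSE:", np.mean((features(X_test) @ v - np.sin(X_test[:, 0])) ** 2))
```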
If the MLP is deep enough, some of these last-layer neurons contain the outputs of very intelligent circuits. However, if we train our infinite-width MLP, these intelligent circuits will hardly be used by the regression, even if they are very useful. That is, the sum of the weights drawing from them in the last layer will be very small. The reason I believe this is the toy model in the next section.
Toy model
Let's call each last-layer neuron a "feature". As discussed earlier, their behavior never changes due to how the gradients pan out at infinite width. In a "real" infinite-width network, these features will be "useful" and "intelligent" to various degrees, but we will simplify this greatly in the toy model by using just two types of features.
The toy model asks: "Suppose that some features already compute the correct answer for every training datapoint, and that the rest of the features are random garbage. Will the trained network...