Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Recent Progress in the Theory of Neural Networks, published by interstice on the AI Alignment Forum.
It's common wisdom that neural networks are basically "matrix multiplications that nobody understands", impenetrable to theoretical analysis, and that they have achieved great results largely through trial and error. While this may have been true in the past, recently there has been significant progress towards developing a theoretical understanding of neural networks. Most notably, we have obtained an arguably complete understanding of network initialization and training dynamics in a certain infinite-width limit. There has also been some progress towards understanding their generalization behavior. In this post I will review some of this recent progress and discuss the potential relevance to AI alignment.
Infinite Width Nets: Initialization
The most exciting recent developments in the theory of neural networks have focused on the infinite-width limit: we consider neural networks in which the number of neurons in every hidden layer is increased to infinity. Typically we consider networks with Gaussian-initialized weights, scaled at initialization by $\frac{1}{\sqrt{H}}$, where $H$ is the number of hidden units in the preceding layer (this is needed to keep the activations from blowing up, and is also the initialization scheme usually used in real networks). In this limit, we have obtained an essentially complete understanding of both behavior at initialization and training dynamics[1]. (Those with limited interest in or knowledge of the math may wish to skip to "Significance and Limitations" below.)
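As a concrete sketch of this initialization scheme (the widths, depth, and numpy implementation here are illustrative choices of mine, not anything prescribed in the post), each layer's weights are drawn as Gaussians and scaled by $1/\sqrt{H}$, where $H$ is the width of the preceding layer:

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """Gaussian-initialize a fully-connected layer, scaling the weights by
    1/sqrt(n_in) so the pre-activations stay O(1) as the preceding layer grows."""
    W = rng.normal(0.0, 1.0, size=(n_out, n_in)) / np.sqrt(n_in)
    b = rng.normal(0.0, 1.0, size=n_out)
    return W, b

rng = np.random.default_rng(0)
widths = [3, 1024, 1024, 1]   # input dim, two wide hidden layers, scalar output
params = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(widths[:-1], widths[1:])]
```

Making the hidden widths larger and larger is exactly the limit discussed below.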
We've actually had a pretty good understanding of the behavior of infinite-width neural networks at initialization for a while, since the work of Radford Neal (1994). He proved that in this limit, fully-connected neural networks with Gaussian-distributed weights and biases limit to what are known as Gaussian processes. Gaussian processes can be thought of as the generalization of Gaussian distributions from finite-dimensional spaces to spaces of functions. Neal's paper provides a very clear derivation of this behavior, but I'll explain it briefly here.
A neural network with $m$ real-valued inputs and one real-valued output defines a function from $\mathbb{R}^m$ to $\mathbb{R}$. Thus, a distribution over the weights and biases of such a neural network -- such as the standard Gaussian initialization -- implicitly defines a distribution over functions on $\mathbb{R}^m$. Neal's paper shows that, for fully-connected neural networks, this distribution limits to a Gaussian process.
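To illustrate the "distribution over functions" picture, here is a rough sketch (a one-hidden-layer tanh network with arbitrarily chosen width; the architecture and names are my own choices, not anything specific from Neal's paper). Each draw of the weights gives one sample function, and evaluating many draws on a grid shows the induced distribution over functions:

```python
import numpy as np

def sample_network_function(n_hidden, rng):
    """Sample weights for a one-hidden-layer net with scalar input and output,
    and return the corresponding function R -> R."""
    W = rng.normal(0.0, 1.0, size=n_hidden)                       # input -> hidden
    b = rng.normal(0.0, 1.0, size=n_hidden)                       # hidden biases
    V = rng.normal(0.0, 1.0, size=n_hidden) / np.sqrt(n_hidden)   # hidden -> output
    return lambda x: np.tanh(np.outer(x, W) + b) @ V

rng = np.random.default_rng(1)
xs = np.linspace(-3, 3, 50)
# Each draw of the weights is one sample from the induced distribution over functions.
function_samples = np.stack([sample_network_function(4096, rng)(xs) for _ in range(5)])
```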
What is a Gaussian process? It's a distribution over functions $f$ with the property that, for any finite collection of points $X_1, \ldots, X_N$, the values $f(X_1), \ldots, f(X_N)$ have a joint distribution which is a multivariate Gaussian. Any Gaussian process is uniquely defined by its mean and covariance functions, $\mu(x)$ and $C(x, x')$. For points $X_1, \ldots, X_N$, the distribution of $f(X_1), \ldots, f(X_N)$ will have mean $(\mu(X_1), \ldots, \mu(X_N))$ and covariance matrix $C_{ij} = C(X_i, X_j)$.
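A minimal sketch of this definition, assuming a zero mean function and an RBF covariance function purely as examples (both are arbitrary choices of mine): on any finite grid of points, sampling from the Gaussian process is just sampling from the corresponding multivariate Gaussian.

```python
import numpy as np

def mean_fn(x):
    return np.zeros_like(x)                      # example choice: zero mean

def cov_fn(x1, x2, length_scale=1.0):
    return np.exp(-(x1 - x2) ** 2 / (2 * length_scale ** 2))   # example: RBF kernel

# On any finite set of points, the GP reduces to a multivariate Gaussian with
# mean vector mu(X_i) and covariance matrix C(X_i, X_j).
X = np.linspace(-3, 3, 100)
mu = mean_fn(X)
C = cov_fn(X[:, None], X[None, :])

rng = np.random.default_rng(2)
f_samples = rng.multivariate_normal(mu, C + 1e-10 * np.eye(len(X)), size=3)
```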
The argument that fully-connected neural networks limit to Gaussian processes in the infinite-width limit is pretty simple. Consider a three-layer neural network with an activation function $\sigma$ in the second layer and a single linear output unit. This network can be defined by the equation $y = \sum_k V_k \, \sigma\!\left(\sum_j W_{kj} X_j\right)$. At initialization, $V$ and $W$ are filled with independent Gaussians, with $V$ scaled by the inverse square root of the number of hidden units.
Each hidden unit $h_k$ has a value for each of the inputs $X_i$: $h_k(X_i) = \sigma\!\left(\sum_j W_{kj} (X_i)_j\right)$. Since $W$ is random, for each $k$ the vector $h_k(\vec{X})$ is an independent random vector (where we write $\vec{X}$ for $(X_1, \ldots, X_N)$). All of these random vectors follow the same distribution, and the output $\vec{y} = f(\vec{X})$ of the network is simply the sum of these identically distributed vectors multiplied by the univariate Gaussians $V_k$. By the multidimensional Central Limit Theorem, this sum converges to a multivariate Gaussian as the number of hidden units goes to infinity, so the outputs at any finite collection of inputs are jointly Gaussian -- that is, the network limits to a Gaussian process.
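One can check this numerically. The sketch below (the tanh activation, width, and test inputs are all arbitrary choices of mine) draws many random initializations of a wide one-hidden-layer network and looks at the joint distribution of its outputs at a few fixed inputs; across initializations these outputs should be approximately multivariate Gaussian, and their empirical covariance estimates the corresponding GP kernel.

```python
import numpy as np

rng = np.random.default_rng(3)
n_hidden, n_draws = 8192, 2000
X = np.array([[0.5, -1.0], [1.5, 0.3], [-0.7, 2.0]])     # three fixed input points

outputs = np.empty((n_draws, len(X)))
for d in range(n_draws):
    W = rng.normal(size=(n_hidden, X.shape[1]))           # input -> hidden weights
    V = rng.normal(size=n_hidden) / np.sqrt(n_hidden)     # hidden -> output, 1/sqrt(H) scaling
    H = np.tanh(X @ W.T)                                   # hidden unit values at each input
    outputs[d] = H @ V                                      # y = sum_k V_k * h_k(X_i)

# Across random initializations, (y(X_1), y(X_2), y(X_3)) should look like a sample
# from a multivariate Gaussian; its covariance estimates the GP covariance matrix.
empirical_cov = np.cov(outputs, rowvar=False)
```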