Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: (Approximately) Deterministic Natural Latents, published by johnswentworth on July 21, 2024 on LessWrong.
Background: Natural Latents: The Math, Natural Latents: The Concepts, Why Care About Natural Latents?, the prototypical semantics use-case. This post does not assume that you've read all of those, or even any of...
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: (Approximately) Deterministic Natural Latents, published by johnswentworth on July 21, 2024 on LessWrong.
Background: Natural Latents: The Math, Natural Latents: The Concepts, Why Care About Natural Latents?, the prototypical semantics use-case. This post does not assume that you've read all of those, or even any of them.
Suppose I roll a biased die 1000 times, and then roll the same biased die another 1000 times. Then...
Mediation: The first 1000 rolls are approximately independent of the second 1000 given the bias (to reasonable precision).
Redundancy: I can estimate the die's bias (to reasonable precision) with high confidence from either the first or second 1000 rolls.
The die's bias is therefore a natural latent, which means it has various nice properties.
Minimality: The bias is the smallest summary of all the information about the first 1000 rolls relevant to the second 1000 (and vice-versa).
Maximality: The bias is the largest piece of information which can be calculated from the first 1000 rolls and also can separately be calculated from the second 1000 rolls.
Any other variable which satisfies the above properties must tell us (approximately) the same information about the die rolls as the bias.
Furthermore, the bias is a(n approximate) deterministic natural latent: the die's bias (to reasonable precision) is approximately determined by[1] the first 1000 die rolls, and also approximately determined by the second 1000 die rolls. That implies one more nice property:
Uniqueness: The bias is the unique-up-to(-approximate)-isomorphism latent which has the above properties, making it a natural Schelling point for communication between agents.
We've proven all that before, mostly in
Natural Latents: The Math (including
the addendum added six months after the rest of the post). But it turns out that the math is a lot shorter and simpler, and easily yields better bounds, if we're willing to assume (approximate) determinism up-front. That does lose us some theoretical tools (notably the
resampling construction), but it gives a cleaner foundation for our expected typical use cases (like e.g. semantics). The goal of this post is to walk through that math.
Background Tool: Determinism in Diagrams
We're going to use diagrammatic proofs, specifically using Bayes nets. But it's non-obvious how to express (approximate) determinism using Bayes nets, or what rules diagrams follow when determinism is involved, so we'll walk through that first.
This diagram says that Y is (approximately) determined by X:
Intuitively, the literal interpretation of the diagram is: X mediates between Y and Y, i.e. Y itself tells me nothing more about Y once I already know X. That only makes sense if X tells me everything there is to know about Y, i.e. Y is determined by X.
In the approximate case, we express the approximation error of the diagram as a KL-divergence, same as usual:
ϵDKL(P[X=x,Y=y,Y=y']||P[X=x]P[Y=y|X=x]P[Y=y'|X=x])
If you get confused later about what it means to have two copies of the same variable in a diagram, go back to that line; that's the definition of the approximation error of the diagram. (One way to view that definition: there's actually two variables Y and Y', but P says that Y and Y' always have the same value.)
That approximation error simplifies:
DKL(P[X=x,Y=y,Y=y']||P[X=x]P[Y=y|X=x]P[Y=y'|X=x])
=DKL(P[X=x,Y=y]I[y=y']||P[X=x]P[Y=y|X=x]P[Y=y'|X=x])
=x,y,y'P[X=x,Y=y]I[y=y'](log(P[X=x,Y=y]I[y=y'])log(P[X=x]P[Y=y|X=x]P[Y=y'|X=x]))
=x,yP[X=x,Y=y](log(P[X=x,Y=y])log(P[X=x]P[Y=y|X=x]P[Y=y|X=x]))
=x,yP[X=x,Y=y]log(P[Y=y|X=x])
=H(Y|X)
So the diagram says Y is determined by X, and the approximation error of the diagram is the entropy H of Y given X - i.e. the number of bits required on average to specify Y once one already knows X. Very intuitive!
The Dangly Bit Lemma
Intuitiv...
View more