Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What's up with LLMs representing XORs of arbitrary features?, published by Sam Marks on January 3, 2024 on The AI Alignment Forum.
Thanks to Clément Dumas, Nikola Jurković, Nora Belrose, Arthur Conmy, and Oam Patel for feedback.
In the comments of the post on Google DeepMind's CCS challenges paper, I expressed skepticism that some of the experimental results were even possible. When addressing my concerns, Rohin Shah made some claims along the lines of "If an LLM linearly represents features a and b, then it will also linearly represent their XOR, a⊕b, and this is true even in settings where there's no obvious reason the model would need to make use of the feature a⊕b."[1]
For reasons that I'll explain below, I thought this claim was absolutely bonkers, both in general and in the specific setting that the GDM paper was working in. So I ran some experiments to prove Rohin wrong.
The result: Rohin was right and I was wrong. LLMs seem to compute and linearly represent XORs of features even when there's no obvious reason to do so.
I think this is deeply weird and surprising. If something like this holds generally, I think this has importance far beyond the original question of "Is CCS useful?"
In the rest of this post I'll:
Articulate a claim I'll call "representation of arbitrary XORs (RAX)": LLMs compute and linearly represent XORs of arbitrary features, even when there's no reason to do so.
Explain why it would be shocking if RAX is true. For example, without additional assumptions, RAX implies that linear probes should utterly fail to generalize across distributional shift, no matter how minor the shift (see the toy sketch just after this list). (Empirically, linear probes often do generalize decently.)
Present experiments showing that RAX seems to be true in every case that I've checked.
Think through what RAX would mean for AI safety research: overall, probably a bad sign for interpretability work in general, and work that relies on using simple probes of model internals (e.g. ELK probes or coup probes) in particular.
Make some guesses about what's really going on here.
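Since the generalization worry above is doing a lot of work, here is a toy sketch of it. Everything in it is an assumption of mine for illustration - synthetic orthonormal feature directions and an incidental feature b that happens to be constant during training - and none of it is the post's actual experimental setup.

```python
# Toy illustration: suppose activations linearly represent a, b, and a XOR b.
# If b is constant on the training distribution, a probe for a cannot tell
# the "a" direction apart from the "a XOR b" direction, so a minor shift
# (b flips) can send its accuracy to roughly chance.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 2000
Q, _ = np.linalg.qr(rng.normal(size=(d, 3)))
v_a, v_b, v_xor = Q.T  # three orthonormal "feature directions" (hypothetical)

def make_acts(a, b):
    """Toy activations representing a, b, and a XOR b, plus noise."""
    return (np.outer(a, v_a) + np.outer(b, v_b) + np.outer(a ^ b, v_xor)
            + 0.1 * rng.normal(size=(len(a), d)))

a_train, b_train = rng.integers(0, 2, n), np.zeros(n, dtype=int)  # b is always 0
a_test,  b_test  = rng.integers(0, 2, n), np.ones(n, dtype=int)   # "minor" shift: b is 1

X_train, X_test = make_acts(a_train, b_train), make_acts(a_test, b_test)
probe = LogisticRegression(max_iter=1000).fit(X_train, a_train)
print("train accuracy:", probe.score(X_train, a_train))  # ~1.0
print("test accuracy: ", probe.score(X_test, a_test))    # ~0.5 (roughly chance)
# The probe's weight is split between v_a and v_xor (indistinguishable on the
# training data), and a XOR b flips to NOT a once b flips at test time.
```

In other words, the probe has no way to prefer v_a over the a⊕b direction during training, so half of its weight effectively starts voting for ¬a after the shift.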
Overall, this has left me very confused: I've found myself simultaneously having (a) an argument that A⇒¬B, (b) empirical evidence for A, and (c) empirical evidence for B. (Here A = RAX and B = other facts about LLM representations.)
The RAX claim: LLMs linearly represent XORs of arbitrary features, even when there's no reason to do so
To keep things simple, throughout this post, I'll say that a model linearly represents a binary feature f if there is a linear probe out of the model's latent space which is accurate for classifying f; in this case, I'll denote the corresponding direction as v_f.
This is not how I would typically use the terminology "linearly represents" - normally I would reserve the term for a stronger notion which, at minimum, requires the model to actually make use of the feature direction when performing cognition involving the feature[2]. But I'll intentionally abuse the terminology here because I don't think this distinction matters much for what I'll discuss.
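To make this concrete, here's a minimal sketch of the probing setup I have in mind, run on synthetic activations rather than real model internals (the dimensions, array names, and toy data-generating process below are illustrative assumptions, not the experiments from this post):

```python
# A toy version of "the model linearly represents f": activations shift along
# a direction v_f depending on the binary feature f, and a linear probe
# recovers f from them. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 1000, 512                                   # examples, hidden dimension
f_labels = rng.integers(0, 2, size=n)              # binary feature f
v_f = rng.normal(size=d)                           # the feature direction
acts = rng.normal(size=(n, d)) + np.outer(f_labels, v_f)  # toy "activations"

probe = LogisticRegression(max_iter=1000).fit(acts, f_labels)
print("probe accuracy:", probe.score(acts, f_labels))        # ~1.0
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # stand-in for v_f
```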
If a model linearly represents features a and b, then it automatically linearly represents a∧b and a∨b.
However, a⊕b is not automatically linearly represented - no linear probe in the figure above would be accurate for classifying a⊕b. Thus, if the model wants to make use of the feature a⊕b, then it needs to do something additional: allocate another direction[3] (more model capacity) to representing a⊕b, and also perform the computation of a⊕b so that it knows what value to store along this new direction.
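You can see this asymmetry in a few lines: give a and b their own directions in a toy activation space, and linear probes recover a∧b and a∨b but not a⊕b (again, a synthetic illustration under my own assumptions, not this post's experiments):

```python
# Linear probes on toy two-dimensional "activations": one direction per feature.
# AND and OR are linearly separable; XOR is not.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=2000)
b = rng.integers(0, 2, size=2000)
acts = np.stack([a, b], axis=1) + 0.1 * rng.normal(size=(2000, 2))  # a-direction, b-direction, noise

for name, labels in [("a AND b", a & b), ("a OR b", a | b), ("a XOR b", a ^ b)]:
    probe = LogisticRegression(max_iter=1000).fit(acts, labels)
    print(name, "probe accuracy:", probe.score(acts, labels))
# Typically ~1.0 for AND and OR, but roughly chance (~0.5) for XOR.
```

So a∧b and a∨b come for free, while a⊕b only becomes linearly readable if the model spends capacity on a new direction and actually computes it.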
The representation of arbitrary XORs (RAX) claim, in its strongest form, asserts that whenever an LLM linearly represents features a and b, it will also linearly represent a⊕b. Concretely, this might look something like: in layer 5, the model computes and linearly r...