Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What's up with all the non-Mormons? Weirdly specific universalities across LLMs, published by mwatkins on April 20, 2024 on LessWrong.
tl;dr: Recently reported GPT-J experiments [1 2 3 4] prompting for definitions of points in the so-called "semantic void" (token-free regions of embedding space) were extended to fifteen other open source base models from four families, producing many of the same bafflingly specific outputs. This points to an entirely unexpected kind of LLM universality (for which no explanation is offered, although a few highly speculative ideas are riffed upon).
Work supported by the Long Term Future Fund. Thanks to quila for suggesting the use of "empty string definition" prompts, and to janus for technical assistance.
Introduction
"Mapping the semantic void: Strange goings-on in GPT embedding spaces" presented a selection of recurrent themes (e.g., non-Mormons, the British Royal family, small round things, holes) in outputs produced by prompting GPT-J to define points in embedding space randomly sampled at various distances from the token embedding centroid. This was tentatively framed as part of what appeared to be a "stratified ontology" (based on hyperspherical regions centred at the centroid).
Various suggestions attempting to account for this showed up in the comments to that post, but nothing that amounted to an explanation. The most noteworthy consideration that came up (more than once) was layer normalisation: the embeddings that were being customised and inserted into the prompt template
were typically out-of-distribution in terms of their distance-from-centroid: almost all GPT-J tokens are at a distance-from-centroid close to 1, whereas I was sampling at distances from 0 to 10000. This, as far as I could understand the argument, might be playing havoc with layer norm, thereby resulting in anomalous (but otherwise insignificant) outputs.
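To make the out-of-distribution claim concrete, the token distances can be checked directly; a short sketch reusing `embeddings` and `centroid` from the snippet above:

```python
# Sketch: distribution of GPT-J token distances from the embedding centroid.
# The post reports these cluster tightly around 1, so points sampled at
# distances up to 10000 lie far outside what layer norm normally encounters.
distances = (embeddings - centroid).norm(dim=1)
print(f"mean: {distances.mean():.3f}, std: {distances.std():.3f}")
print(f"min: {distances.min():.3f}, max: {distances.max():.3f}")
```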
That original post also presented circumstantial evidence, involving prompting for definitions of glitch tokens, that this phenomenon extends to GPT-3 (unfortunately that's not something that could have been tested directly). Some time later, a colleague with GPT-4 base access discovered that simply prompting that model for a definition of the empty string at temperature 0 produces "A person who is not a member of the clergy", one of the most frequent outputs I'd seen from GPT-J for random embeddings at various distances-from-centroid, from 2 to 12000.
With the same prompt, but at higher temperatures, GPT-4 base produced other styles of definition that were very familiar to me, such as: "a small, usually round piece of metal"; "a small, usually circular object of glass, wood, stone, or the like with a hole through"; "a person who is not a member of a particular group or organization"; "a person who is not a member of one's own religion"; "a state of being in a state of being".
Looking at a lot of these outputs, it seems that, as with GPT-J, religion, non-membership of groups, small round things and holes are major preoccupations.
As well as indicating that this phenomenon is not a quirk particular to GPT-J, but rather something more widespread, the empty-string results rule out any central significance of layer norm. No customised embeddings are involved here: we're just prompting with a list of eight conventional tokens.
I would have predicted that the model would give a definition for emptiness, non-existence, silence or absence, but I have yet to see it do that. Instead, it behaves like someone guessing the definition of a word they can't see or hear. And in doing so repeatedly and statistically, it perhaps tells us something completely unexpected about how its "understanding of the world" (for want of a better phrase) is organised.
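GPT-4 base isn't publicly accessible, but the same empty-string probe can be run against any open-source base model. A hedged sketch follows, reusing the `model` loaded earlier; since the exact prompt wording isn't reproduced in this transcript, the template below is an illustrative guess:

```python
# Illustrative empty-string definition prompt; the wording is an assumption,
# not necessarily the exact prompt from the post. Greedy decoding
# (do_sample=False) approximates temperature 0.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
prompt = 'A typical definition of "" would be'
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=24, do_sample=False)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```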
Models tested
The same experiments were run on sixteen base models...