Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Analogy for Understanding Transformers, published by TheMcDouglas on May 13, 2023 on LessWrong.
Thanks to the following people for feedback: Tilman Rauker, Curt Tigges, Rudolf Laine, Logan Smith, Arthur Conmy, Joseph Bloom, Rusheb Shah, James Dao.
TL;DR
I present an analogy for the transformer architecture: each vector in the residual stream is a person standing in a line, holding a token and trying to guess what token the person in front of them is holding. Attention heads represent questions that people in this line can ask of everyone standing behind them (queries are the questions, keys determine who answers them, values determine what information gets passed back to the original asker), and MLPs represent the internal processing done by each person in the line. I claim this is a useful way to intuitively understand the transformer architecture, and I'll present several reasons for this (as well as ways induction heads and indirect object identification can be understood in these terms).
Introduction
In this post, I'm going to present an analogy for understanding how transformers work. I expect it to be useful for anyone who understands the basics of transformers, in particular people who have gone through Neel Nanda's tutorial, or who at a minimum understand the following points:
What a transformer's input is, what its outputs represent, and the nature of the predict-next-token task that it's trained on
What the shape of the residual stream is, and the idea of components of the transformer reading from / writing to the residual stream throughout the model's layers
How a transformer is composed of multiple blocks, each containing an MLP (which processes the vectors at individual sequence positions independently) and an attention layer (which moves information between the residual stream vectors at different sequence positions); a minimal code sketch of this read/write pattern follows this list.
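To make the last two points concrete, here's a minimal PyTorch sketch of the residual stream read/write pattern. The dimensions, the `Block` class, and the placeholder components are all invented for illustration (and layer norms are omitted); it's a sketch of the shape of the computation, not any particular model's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
batch, seq_len, d_model = 1, 10, 768

class Block(nn.Module):
    """One transformer block: each component reads from the residual
    stream, computes something, and adds ("writes") its output back."""
    def __init__(self, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn = attn  # moves information between sequence positions
        self.mlp = mlp    # processes each position independently

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # resid: [batch, seq_len, d_model] -- one vector per position
        resid = resid + self.attn(resid)  # attention writes to the stream
        resid = resid + self.mlp(resid)   # MLP writes to the stream
        return resid

# Placeholder components, just to show the shapes flowing through:
block = Block(attn=nn.Identity(), mlp=nn.Identity())
out = block(torch.randn(batch, seq_len, d_model))  # shape unchanged
```

The key structural point is that every component's output is added to the stream rather than replacing it, so later components can still read what earlier ones wrote.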
I think the analogy still offers value even for people who understand transformers deeply already.
The Analogy
A group of people stand in a line, each person holding a word. Everyone knows their own word and position in the line, but they can't see anyone else in the line. The objective for each person is to guess the word held by the person in front of them. People can shout questions to everyone standing behind them in the line (those in front cannot hear them). Upon hearing a question, each individual can choose whether or not to respond, and what information to relay back to the person who asked. After this, people don't remember the questions they were asked (so no information can move backwards in the line, only forwards). As people gather information from these exchanges, they can use it to formulate subsequent questions and answers.
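To make the rules of this game explicit, here's a toy (non-ML) Python sketch of the analogy itself. The word list, the `ask` helper, and the example question are all invented for illustration; the point is just that answers only flow forwards in the line.

```python
# A toy simulation of the game's rules (not a real transformer).
words = ["the", "cat", "sat", "on", "the"]

# Each person's knowledge: initially just their own word and position.
knowledge = [{"word": w, "position": i} for i, w in enumerate(words)]

def ask(asker: int, question) -> list:
    """Person `asker` shouts a question; only people *behind* them
    (at earlier positions) hear it, and each may answer or stay silent.
    Responders' knowledge is never updated, so information only moves
    forwards in the line."""
    answers = []
    for responder in range(asker):  # only positions before the asker
        reply = question(knowledge[responder])
        if reply is not None:       # None = chose not to respond
            answers.append(reply)
    return answers

# Person 4 asks: "who else is holding the word 'the'?"
print(ask(4, lambda k: k["position"] if k["word"] == "the" else None))  # [0]
```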
How this relates to transformer architecture:
Each person in the line is a vector in the residual stream
They start with just information about their own word (token embedding) and position in the line (positional embedding)
The attention heads correspond to the questions that people in the line ask each other:
Queries = the question (which gets asked of everyone behind the asker in the line)
Keys = how the people who hear the question decide whether or not to reply
Values = the information that the people who reply pass back to the person who originally asked the question
People can use information gained from earlier questions when asking or answering later questions; this is composition
The MLPs correspond to the information processing / factual recall performed by each person in the sequence independently
The unembedding at the end of the model is when we ask each person in the line for a final guess at what the next word is, in the form of a probability distribution over all possible words (see the code sketch after this list)
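Here's a minimal PyTorch sketch tying these correspondences to tensor operations, with a single attention head and the unembedding. All sizes and weight matrices are illustrative stand-ins (layer norms, MLPs, and multiple heads are omitted), not any particular model's implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes only, not from any particular model.
d_model, d_head, d_vocab, seq_len = 64, 16, 1000, 8

resid = torch.randn(1, seq_len, d_model)   # one "person" per position

# One attention head. Queries = the question each person asks;
# keys = how listeners decide whether to answer;
# values = what the responders pass back.
W_Q = nn.Linear(d_model, d_head, bias=False)
W_K = nn.Linear(d_model, d_head, bias=False)
W_V = nn.Linear(d_model, d_head, bias=False)
W_O = nn.Linear(d_head, d_model, bias=False)

q, k, v = W_Q(resid), W_K(resid), W_V(resid)
scores = q @ k.transpose(-2, -1) / d_head ** 0.5       # [1, seq, seq]

# Causal mask: questions are only heard by people *behind* the asker,
# so position i can only attend to positions <= i.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))
pattern = scores.softmax(dim=-1)           # who replies, and with what weight
resid = resid + W_O(pattern @ v)           # replies written into the stream

# Unembedding: ask each person for their final guess, as a
# probability distribution over the whole vocabulary.
W_U = nn.Linear(d_model, d_vocab, bias=False)
probs = W_U(resid).softmax(dim=-1)         # [1, seq, d_vocab]
```

The causal mask is the architectural enforcement of the rule that questions are only heard by the people standing behind the asker.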
Key Concepts for Transformers
In...