Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evidence of Learned Look-Ahead in a Chess-Playing Neural Network, published by Erik Jenner on June 5, 2024 on LessWrong.
Paper authors: Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, Stuart Russell
TL;DR: We released a paper with IMO clear evidence of learned look-ahead in a...
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evidence of Learned Look-Ahead in a Chess-Playing Neural Network, published by Erik Jenner on June 5, 2024 on LessWrong.
Paper authors: Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, Stuart Russell
TL;DR: We released a paper with IMO clear evidence of learned look-ahead in a chess-playing network (i.e., the network considers future moves to decide on its current one). This post shows some of our results, and then I describe the original motivation for the project and reflect on how it went. I think the results are interesting from a scientific and perhaps an interpretability perspective, but only mildly useful for AI safety.
Teaser for the results
(This section is copied from our project website. You may want to read it there for animations and interactive elements, then come back here for my reflections.)
Do neural networks learn to implement algorithms involving look-ahead or search in the wild? Or do they only ever learn simple heuristics? We investigate this question for Leela Chess Zero, arguably the strongest existing chess-playing network.
We find intriguing evidence of learned look-ahead in a single forward pass. This section showcases some of our results, see our paper for much more.
Setup
We consider chess puzzles such as the following:
We focus on the policy network of Leela, which takes in a board state and outputs a distribution over moves. With only a single forward pass per board state, it can solve puzzles like the above. (You can play against the network on Lichess to get a sense of how strong it is - its rating there is over 2600.) Humans and manually written chess engines rely on look-ahead to play chess this well; they consider future moves when making a decision. But is the same thing true for Leela?
Activations associated with future moves are crucial
One of our early experiments was to do activation patching. We patch a small part of Leela's activations from the forward pass of a corrupted version of a puzzle into the forward pass on the original puzzle board state. Measuring the effect on the final output tells us how important that part of Leela's activations was.
Leela is a transformer that treats every square of the chess board like a token in a language model. One type of intervention we can thus do is to patch the activation on a single square in a single layer:
Surprisingly, we found that the target square of the move two turns in the future (what we call the 3rd move target square) often stores very important information. This does not happen in every puzzle, but it does in a striking fraction, and the average effect is much bigger than that of patching on most other squares:
The corrupted square(s) and the 1st move target square are also important (in early and late layers respectively), but we expected as much from Leela's architecture. In contrast, the 3rd move target square stands out in middle layers, and we were much more surprised by its importance.
In the paper, we take early steps toward understanding how the information stored on the 3rd move target square is being used. For example, we find a single attention head that often moves information from this future target square backward in time to the 1st move target square.
Probes can predict future moves
If Leela uses look-ahead, can we explicitly predict future moves from its activations? We train simple, bilinear probes on parts of Leela's activations to predict the move two turns into the future (on a set of puzzles where Leela finds a single clearly best continuation). Our probe architecture is motivated by our earlier results - it predicts whether a given square is the target square of the 3rd move since, as we've seen, this seems to be where Leela stores important information.
We find that this probe can predict the move 2 turns in the future quit...
View more