Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AXRP Episode 29 - Science of Deep Learning with Vikrant Varma, published by DanielFilan on April 25, 2024 on The AI Alignment Forum.
In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand.
Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes.
What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.
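To make the grokking phenomenon concrete, here is a minimal sketch (not from the episode or the papers under discussion) of the kind of experiment in which grokking has been reported: a small network trained on modular addition with strong weight decay and a small training split. The architecture and hyperparameters below are illustrative assumptions; the signature to watch for is train accuracy saturating long before test accuracy suddenly jumps from chance to near-perfect.

```python
# Illustrative sketch of a grokking-style experiment: modular addition,
# small train split, strong weight decay. Hyperparameters are assumptions.
import torch
import torch.nn as nn

P = 97  # modulus for the task: predict (a + b) mod P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P

# Small training split: grokking is typically reported when data is scarce.
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

model = nn.Sequential(
    nn.Embedding(P, 128),      # shared embedding for both operands
    nn.Flatten(start_dim=1),   # concatenate the two operand embeddings
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, P),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        preds = model(pairs[idx]).argmax(dim=-1)
    return (preds == labels[idx]).float().mean().item()

for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # Grokking shows up as train accuracy hitting ~1.0 long before
        # test accuracy suddenly rises from chance toward ~1.0.
        print(step, accuracy(train_idx), accuracy(test_idx))
```

In runs like this, the jump in test accuracy can come many thousands of steps after the model has already memorized the training set, which is the delayed generalization the episode discusses.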
Topics we discuss:
Challenges with unsupervised LLM knowledge discovery, aka contra CCS
What is CCS?
Consistent and contrastive features other than model beliefs
Understanding the banana/shed mystery
Future CCS-like approaches
CCS as principal component analysis
Explaining grokking through circuit efficiency
Why research science of deep learning?
Summary of the paper's hypothesis
What are 'circuits'?
The role of complexity
Many kinds of circuits
How circuits are learned
Semi-grokking and ungrokking
Generalizing the results
Vikrant's research approach
The DeepMind alignment team
Follow-up work
Daniel Filan: Hello, everybody. In this episode I'll be speaking with Vikrant Varma, a research engineer at Google DeepMind and the technical lead of their sparse autoencoders effort. Today, we'll be talking about his research on problems with contrast-consistent search, and also explaining grokking through circuit efficiency. For links to what we're discussing, you can check the description of this episode, and you can read the transcript at axrp.net.
All right, well, welcome to the podcast.
Vikrant Varma: Thanks, Daniel. Thanks for having me.
Challenges with unsupervised LLM knowledge discovery, aka contra CCS
What is CCS?
Daniel Filan: Yeah. So first, I'd like to talk about this paper. It is called Challenges with Unsupervised LLM Knowledge Discovery, and the authors are Sebastian Farquhar, you, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, and Rohin Shah. This is basically about this thing called CCS. Can you tell us: what does CCS stand for and what is it?
Vikrant Varma: Yeah, CCS stands for contrast-consistent search. I think to explain what it's about, let me start from a more fundamental problem that we have with advanced AI systems. One of the problems is that when we train AI systems, we're training them to produce outputs that look good to us, and so this is the supervision that we're able to give to the system. We currently don't really have a good idea of how an AI system or how a neural network is computing those outputs.
And in particular, we're worried about the situation in the future where the supervision we're able to give it is enough for it to achieve a superhuman level of performance at that task. At that point, just by looking at the network, we can't tell how it's going to behave in a new situation.
And so the Alignment Research Center put out a report recently about this problem. They named a potential part of this problem as "eliciting latent knowledge". What this means is if your model is, for example, really, really good at figuring out what's going to happen next in a video, as in it's able to predict the next frame of a video really well given a prefix of the video, this must mean that it has some sort of model of what's going on in the world.
Instead of using the outputs of the model, if you could directly look at what it understands about the world, then potentially, you could use that information in a much safer ...
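As a reference point for the discussion that follows, here is a minimal sketch of the contrast-consistent search objective as introduced by Burns et al., the method this paper critiques. The probe architecture, hyperparameters, and data handling below are simplified assumptions rather than code from either paper: the idea is to learn a probe over hidden activations of contrast pairs (a statement phrased as true and as false) whose outputs are confident and satisfy p(x+) ≈ 1 − p(x−), without ever using truth labels.

```python
# Simplified sketch (an assumption, not code from the paper) of the
# contrast-consistent search (CCS) objective: given hidden activations for
# contrast pairs -- a statement phrased as true (h_pos) and as false
# (h_neg) -- learn a probe that is consistent and confident.
import torch
import torch.nn as nn

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1.0 - p_neg)) ** 2      # p+ and p- should sum to ~1
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage p+ ≈ p- ≈ 0.5
    return (consistency + confidence).mean()

def train_ccs_probe(h_pos: torch.Tensor, h_neg: torch.Tensor,
                    steps: int = 1000, lr: float = 1e-3) -> nn.Module:
    """h_pos, h_neg: (n_examples, d_model) hidden states for the two halves
    of each contrast pair, assumed already mean-centered per half so the
    probe can't simply read off which phrasing it was given."""
    d_model = h_pos.shape[-1]
    probe = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ccs_loss(probe(h_pos).squeeze(-1), probe(h_neg).squeeze(-1))
        loss.backward()
        opt.step()
    return probe

# Usage with random stand-in activations (real use would take them from an LLM):
h_pos, h_neg = torch.randn(512, 768), torch.randn(512, 768)
probe = train_ccs_probe(h_pos, h_neg)
```

Nothing in this objective uses truth labels; the hope, which the paper questions, is that the model's belief about whether the statement is true is the only feature that satisfies both constraints, rather than some other consistent and contrastive feature.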