Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS XIII: Reflections on Anthropic's SAE Research Circa May 2024, published by Stephen Casper on May 21, 2024 on The AI Alignment Forum.
Part 13 of 12 in the Engineer's Interpretability Sequence.
TL;DR
On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn't do.
Today's new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it ultimately underperformed my expectations. I am beginning to be concerned that Anthropic's recent approach to interpretability research might be better explained by safety washing than practical safety work.
Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates.
Reflecting on predictions
Please see my original post for 10 specific predictions about what today's paper would and wouldn't accomplish. I think that Anthropic obviously did 1 and 2 and obviously did not do 4, 5, 7, 8, 9, and 10. Meanwhile, I think that their experiments to identify specific and safety-relevant features should count for 3 (proofs of concept for a useful type of task) but definitely do not count for 6 (*competitively* finding and removing a harmful behavior that was represented in the training data).
Thus, my assessment is that Anthropic did 1-3 but not 4-10.
I have been wrong with mech interp predictions in the past, but this time, I think I was 10 for 10: everything I predicted with >50% probability happened, and everything I predicted with <50% probability did not.