Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLMs for Alignment Research: a safety priority?, published by abramdemski on April 4, 2024 on LessWrong.
A recent short story by Gabriel Mukobi illustrates a near-term scenario where things go bad because new developments in LLMs allow LLMs to accelerate capabilities research without a correspondingly large acceleration in safety research.
This scenario is disturbingly close to the situation we already find ourselves in. Asking the best LLMs for help with programming versus technical alignment research feels very different (at least to me). LLMs might generate junk code, but I can keep pointing out the problems with the code, and it will eventually work. This can be faster than doing it myself in cases where I don't know a language or library well; the LLMs are moderately familiar with everything.
When I try to talk to LLMs about technical AI safety work, however, I just get garbage.
I think a valuable safety precaution for frontier AI models would be to make them more useful for safety research than for capabilities research. This extends beyond applying AI technology to accelerate safety research within top AI labs; models available to the general public (such as GPT-N and Claude-N) should also accelerate safety more than capabilities.
What is wrong with current models?
My experience is mostly with Claude, and mostly with versions of Claude before the current one (Claude 3).[1] I'm going to complain about Claude here, but everything else I've tried seemed worse. In particular, I found GPT-4 to be worse than Claude 2 for my purposes.
As I mentioned in the introduction, I've been comparing how helpful these models feel for programming with how useless they feel for technical AI safety: specifically, technical AI safety of the mathematical-philosophy flavor that I usually think about. This is not, of course, a perfect experiment for comparing capability-research-boosting to safety-research-boosting.
However, the tasks feel comparable in the following sense: programming involves translating natural-language descriptions into formal specifications; mathematical philosophy also involves translating natural-language descriptions into formal specifications. From this perspective, the main difference is what sort of formal language is being targeted (i.e., programming languages vs. axiomatic models).
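To make the parallel concrete, here is a minimal sketch (my own illustration, not something from the original post): the same natural-language claim, "if an agent prefers A to B and prefers B to C, then it prefers A to C," translated into each kind of formal target. The function name and the representation of the preference relation are arbitrary choices for the example.

```python
# The informal claim: "if an agent prefers A to B and prefers B to C,
# then it prefers A to C" (transitivity of preference).
from itertools import permutations

# Target 1: a programming language. The claim becomes an executable check
# over an explicit preference relation, represented as (better, worse) pairs.
def is_transitive(prefers: set[tuple[str, str]]) -> bool:
    options = {x for pair in prefers for x in pair}
    return all(
        (a, c) in prefers
        for a, b, c in permutations(options, 3)
        if (a, b) in prefers and (b, c) in prefers
    )

# Target 2: an axiomatic model. The same claim becomes a logical axiom:
#     for all A, B, C:  (A > B) and (B > C)  implies  (A > C)
# In both cases the work is turning an informal sentence into a precise
# formal statement; only the target language differs.

print(is_transitive({("A", "B"), ("B", "C"), ("A", "C")}))  # True
print(is_transitive({("A", "B"), ("B", "C")}))              # False
```

The point of the sketch is only that both translations demand the same kind of precision, which is why the gap in LLM helpfulness between the two tasks is surprising.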
I don't have systematic experiments to report; just a general feeling that Claude's programming is useful, but Claude's philosophy is not.[2] It is not obvious, to me, why this is. I've spoken to several people about it. Some reactions:
"If it could do that, we would all be dead!"
I think a similar mindset would have said this about programming a few years ago. I suspect there are ways for modern LLMs to be more helpful to safety research in particular that do not also advance capabilities very much in other respects. I'll say more about this later in the essay.
"There's probably just a lot less training data for mathematical philosophy than for programming."
I think this might be an important factor, but it is not totally clear to me.
"Mathematical philosophy is inherently more difficult than programming, so it is no surprise."
This might also be an important factor, but I consider it to be only a partial explanation. What is more difficult, exactly? As I mentioned, programming and mathematical philosophy have some strong similarities.
Problems include a bland, people-pleasing attitude which is not very helpful for research. By default, Claude (and GPT4) will enthusiastically agree with whatever I say, and stick to summarizing my points back at me rather than providing new insights or adding useful critiques. When Claude does engage in more structured reasoning, it is usually wrong and bad. (I might summarize it as "based more on vibes than logic".)
Is there any hope for better?
As a starti...