This is: Clarifying alignment vs capabilities, published by Richard Ngo on August 19, 2024 on The AI Alignment Forum.
A core distinction in AGI safety is between alignment and capabilities. However, I think this distinction is a very fuzzy one, which has led to a lot of confusion. In this post I'll describe some of the problems with how people typically think about it, and offer a replacement set of definitions.
"Alignment" and "capabilities" are primarily properties of AIs not of AI research
The first thing to highlight is that the distinction between alignment and capabilities primarily does useful work when we treat them as properties of AIs. This distinction is still under-appreciated by the wider machine learning community. ML researchers have historically thought about the performance of models almost entirely with respect to the tasks they were specifically trained on.
However, the rise of LLMs has vindicated the alignment community's focus on general capabilities, and now it's much more common to assume that performance on many tasks (including out-of-distribution tasks) will improve roughly in parallel. This is a crucial assumption for thinking about risks from AGI.
Insofar as the ML community has thought about alignment, it has mostly focused on aligning models' behavior to their training objectives. The possibility of neural networks aiming to achieve internally-represented goals is still not very widely understood, making it hard to discuss and study the reasons those goals might or might not be aligned with the values of (any given set of) humans. To be fair, the alignment community has caused some confusion by describing models as more or less "aligned", rather than more or less "aligned to X" for some specified X. I'll talk more about this confusion, and how we should address it, in a later post.
But the core point is that AIs might develop internally-represented goals or values that we don't like, and we should try to avoid that.
However, extending "alignment" and "capabilities" from properties of AIs to properties of different types of research is a fraught endeavor. It's tempting to categorize work as alignment research to the extent that it can be used to make AIs more aligned (to many possible targets), and as capabilities research to the extent that it can be used to make AIs more capable. But this approach runs into (at least) three major problems.
Firstly, in general it's very difficult to categorize research by its impacts. Great research often links together ideas from many different subfields, typically in ways that only become apparent throughout the course of the research. We see this in many historical breakthroughs which shed light on a range of different domains. For example, early physicists studying the motions of the stars eventually derived laws governing all earthly objects.
Meanwhile Darwin's study of barnacles and finches led him to principles governing the evolution of all life. Analogously, we should expect that big breakthroughs in our understanding of neural networks and deep learning would be useful in many different ways.
More concretely, there are many cases where research done under the banner of alignment has advanced, or plausibly will advance, AI capabilities to a significant extent. This undermines our ability to categorize research by its impacts. Central examples include:
RLHF makes language models more obedient, but also more capable of coherently carrying out tasks (see the sketch after this list).
Scalable oversight techniques can catch misbehavior, but will likely become important for generating high-quality synthetic training data, as it becomes more and more difficult for unassisted humans to label AI outputs correctly.
Interpretability techniques will allow us both to inspect AI cognition and to extract more capable behavior from models (e.g. via
...
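To make the RLHF item above concrete, here is a minimal sketch of the pairwise preference-learning step that underlies RLHF-style reward modelling. The toy model, dimensions, and random "embeddings" are illustrative assumptions rather than anything from the original post; the point is only that the preference signal does not distinguish between "being obedient" and "carrying out the task coherently", so optimizing against it pushes on both.

```python
# Minimal sketch of a pairwise reward-model update, as used in RLHF-style training.
# All names, dimensions, and data here are illustrative, not from the post.
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Scores a response embedding; real systems use a full LM backbone."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the preferred response's score
    # above the rejected one's, whatever the reason for the preference.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy stand-ins for embeddings of response pairs a labeler preferred / rejected.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

opt.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
opt.step()
print(f"preference loss: {loss.item():.3f}")
```

Because human raters tend to prefer responses that are both compliant and competent, a policy trained against such a reward model improves along both axes at once, which is exactly why RLHF resists a clean alignment-vs-capabilities classification.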