Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Alignment By Default, published by johnswentworth on the AI Alignment Forum.
Suppose AI continues on its current trajectory: deep learning continues to get better as we throw more data and compute at it, researchers keep trying random architectures and using whatever seems to work well in practice. Do we end up with aligned AI “by default”?
I think there’s at least a plausible trajectory in which the answer is “yes”. Not very likely - I’d put it at ~10% chance - but plausible. In fact, there’s at least an argument to be made that alignment-by-default is more likely to work than many fancy alignment proposals, including IRL variants and HCH-family methods.
This post presents the rough models and arguments.
I’ll break it down into two main pieces:
Will a sufficiently powerful unsupervised learner “learn human values”? What does that even mean?
Will a supervised/reinforcement learner end up aligned to human values, given a bunch of data/feedback on what humans want?
Ultimately, we’ll consider a semi-supervised/transfer-learning style approach, where we first do some unsupervised learning and hopefully “learn human values” before starting the supervised/reinforcement part.
As background, I will assume you’ve read some of the core material about human values from the sequences, including Hidden Complexity of Wishes, Value is Fragile, and Thou Art Godshatter.
Unsupervised: Pointing to Values
In this section, we’ll talk about why an unsupervised learner might not “learn human values”. Since an unsupervised learner is generally just optimized for predictive power, we’ll start by asking whether theoretical algorithms with best-possible predictive power (i.e. Bayesian updates on low-level physics models) “learn human values”, and what that even means. Then, we’ll circle back to more realistic algorithms.
Consider a low-level physical model of some humans - e.g. a model which simulates every molecule comprising the humans. Does this model “know human values”? In one sense, yes: the low-level model has everything there is to know about human values embedded within it, in exactly the same way that human values are embedded in physical humans. It has “learned human values”, in a sense sufficient to predict any real-world observations involving human values.
But it seems like there’s a sense in which such a model does not “know” human values. Specifically, although human values are embedded in the low-level model, the embedding itself is nontrivial. Even if we have the whole low-level model, we still need that embedding in order to “point to” human values specifically - e.g. to use them as an optimization target. Indeed, when we say “point to human values”, what we mean is basically “specify the embedding”. (Side note: treating human values as an optimization target is not the only use-case for “pointing to human values”, and we still need to point to human values even if we’re not explicitly optimizing for anything. But that’s a separate discussion, and imagining using values as an optimization target is useful to give a mental image of what we mean by “pointing”.)
In short: predictive power alone is not sufficient to define human values. The missing part is the embedding of values within the model. The hard part is pointing to the thing (i.e. specifying the values-embedding), not learning the thing (i.e. finding a model in which values are embedded).
Finally, here’s a different angle on the same argument which will probably drive some of the philosophers up in arms: any model of the real world with sufficiently high general predictive power will have a model of human values embedded within it. After all, it has to predict the parts of the world in which human values are embedded in the first place - i.e. the parts of which humans are composed, the parts on which human values are ...
view more