Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Clarifying “What failure looks like”, published by Sam Clarke on the AI Alignment Forum.
Thanks to Jess Whittlestone, Daniel Eth, Shahar Avin, Rose Hadshar, Eliana Lorch, Alexis Carlier, Flo Dorner, Kwan Yee Ng, Lewis Hammond, Phil Trammell and Jenny Xiao for valuable conversations, feedback and other support. I am especially grateful to Jess Whittlestone for long conversations and detailed feedback on drafts, and her guidance on which threads to pursue and how to frame this post. All errors are my own.
Epistemic status: My Best Guess
Epistemic effort: ~70 hours of focused work (mostly during FHI’s summer research fellowship), talked to ~10 people.
Introduction
“What failure looks like” is the one of the most comprehensive pictures of what failure to solve the AI alignment problem looks like, in worlds without discontinuous progress in AI. I think it was an excellent and much-needed addition to our understanding of AI risk. Still, if many believe that this is a main source of AI risk, I think it should be fleshed out in more than just one blog post. The original story has two parts; I’m focusing on part 1 because I found it more confusing and nebulous than part 2.
Firstly, I’ll summarise part 1 (hereafter “WFLL1”) as I understand it:
In the world today, it’s easier to pursue easy-to-measure goals than hard-to-measure goals.
Machine learning is differentially good at pursuing easy-to-measure goals (assuming that we don’t have a satisfactory technical solution to the intent alignment problem[1]).
We’ll try to harness this by designing easy-to-measure proxies for what we care about, and deploy AI systems across society which optimize for these proxies (e.g. in law enforcement, legislation and the market).
We’ll give these AI systems more and more influence (e.g. eventually, the systems running law enforcement may actually be making all the decisions for us).
Eventually, the proxies for which the AI systems are optimizing will come apart from the goals we truly care about, but by then humanity won’t be able to take back influence, and we’ll have permanently lost some of our ability to steer our trajectory.
WFLL1 is quite thin on some important details:
WFLL1 does not envisage AI systems directly causing human extinction. So, to constitute an existential risk in itself, the story must involve the lock-in of some suboptimal world.[2] However, the likelihood that the scenario described in part 1 gets locked-in (especially over very long time horizons) is not entirely clear in the original post.
It’s also not clear how bad this locked-in world would actually be.
I’ll focus on the first point: how likely is it that the scenario described in WFLL1 leads to the lock-in of some suboptimal world. I’ll finish with some rough thoughts on the second point - how bad/severe that locked-in world might be - and by highlighting some remaining open questions.
Likelihood of lock-in
The scenario described in WFLL1 seems very concerning from a longtermist perspective if it leads to humanity getting stuck on some suboptimal path (I’ll refer to this as “lock-in”). But the blog post itself isn't all that clear about why we should expect such lock-in --- i.e. why we won't be able to stop the trend of AI systems optimising for easy-to-measure things before it's too late -- a confusion which has been pointed out before. In this section, I'll talk through some different mechanisms by which this lock-in can occur, discuss some historical precedents for these mechanisms occurring, and then discuss why we might expect the scenario described in WFLL1 to be more likely to lead to lock-in than for the precedents.
The mechanisms for lock-in
Summary: I describe five complementary mechanisms by which the scenario described in WFLL1 (i.e. AI systems across society optimizing for simple proxies at the expense...
view more