Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: What Failure Looks Like: Distilling the Discussion, published by Ben Pace on the AI Alignment Forum.
The comments under a post often contains valuable insights and additions. They are also often very long and involved, and harder to cite than posts themselves. Given this, I was motivated to try to distill some comment sections on LessWrong, in part to start exploring whether we can build some norms and some features to help facilitate this kind of intellectual work more regularly. So this is my attempt to summarise the post and discussion around What Failure Looks Like by Paul Christiano.
Epistemic status: I think I did an okay job. I think I probably made the most errors in place where I try to emphasise concrete details more than the original post did. I think the summary of the discussion is much more concise than the original.
What Failure Looks Like (Summary)
On its default course, our civilization will build very useful and powerful AI systems, and use such systems to run significant parts of society (such as healthcare, legal systems, companies, the military, and more). Similar to how we are dependent on much novel technology such as money and the internet, we will be dependent on AI.
The stereotypical AI catastrophe involves a powerful and malicious AI that seems good but suddenly becomes evil and quickly takes over humanity. Such descriptions are often stylised for good story-telling, or emphasise unimportant variables.
The post below will concretely lay out two ways that building powerful AI systems may cause an existential catastrophe, if the problem of intent alignment is not solved. This is solely an attempt to describe what failure looks like, not to assign probabilities to such failure or to propose a plan to avoid these failures.
There are two failure modes that will be discussed. First, we may increasingly fail to understand how our AI systems work and subsequently what is happening in society. Secondly, we may eventually give these AI systems massive amounts of power despite not understanding their internal reasoning and decision-making algorithms. Due to the massive space of designs we'll be searching through, if we do not understand the AI, this will mean certain AIs will be more power-seeking than expected, and will take adversarial action and take control.
Failure by loss of control
There is a gap between what we want, and what objective functions we can write down. Nobody has yet created a function that when maximised perfectly describes what we want, but increasingly powerful machine learning will optimise very hard for what function we encode. This will lead to a strong increases the gap between what we can optimise for and what we want. (This is a classic goodharting scenario.)
Concretely, we will gradually use ML to perform more and more key functions in society, but will largely not understand how these systems work or what exactly they’re doing. The information we can gather will seem strongly positive: GDP will be rising quickly, crime will be down, life-satisfaction ratings will be up, congress's approval will be up, and so on.
However, the underlying reality will increasingly diverge from what we think these metrics are measuring, and we may no longer have the ability to independently figure this out. In fact, new things won’t be built, crime will continue, people's lives will be miserable, and congress will not be effective at improving governance, we’ll just believe this because the ML systems will be improving our metrics, and will have a hard time understanding what’s going on outside of what they report.
Gradually, our civilization will lose its ability to understand what is happening in the world as our systems and infrastructure shows us success on all of our available metrics (GDP, wealth, crime, health, self-reported happiness...
view more