Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Value fragility and AI takeover, published by Joe Carlsmith on August 5, 2024 on LessWrong.
1. Introduction
"Value fragility," as I'll construe it, is the claim that slightly-different value systems tend to lead in importantly-different directions when subject to extreme optimization. I think the idea of value fragility haunts the AI risk discourse in various ways - and in particular, that it informs a backdrop prior that adequately aligning a superintelligence requires an extremely precise and sophisticated kind of technical and ethical achievement.
That is, the thought goes: if you get a superintelligence's values even slightly wrong, you're screwed.
This post is a collection of loose and not-super-organized reflections on value fragility and its role in arguments for pessimism about AI risk. I start by trying to tease apart a number of different claims in the vicinity of value fragility. In particular:
I distinguish between questions about value fragility and questions about how different agents would converge on the same values given adequate reflection.
I examine whether "extreme" optimization is required for worries about value fragility to go through (I think it at least makes them notably stronger), and I reflect a bit on whether, even conditional on creating superintelligence, we might be able to avoid a future driven by relevantly extreme optimization.
I highlight questions about whether multipolar scenarios alleviate concerns about value fragility, even if your exact values don't get any share of the power.
My sense is that people often have some intuition that multipolarity helps notably in this respect; but I don't yet see a very strong story about why. If readers have stories that they find persuasive in this respect, I'd be curious to hear.
I then turn to a discussion of a few different roles that value fragility, if true, could play in an argument for pessimism about AI risk. In particular, I distinguish between:
1. The value of what a superintelligence does after it takes over the world, assuming that it does so.
2. What sorts of incentives a superintelligence has to try to take over the world, in a context where it can do so extremely easily via a very wide variety of methods.
3. What sorts of incentives a superintelligence has to try to take over the world, in a context where it can't do so extremely easily via a very wide variety of methods.
Yudkowsky's original discussion of value fragility is most directly relevant to (1). And I think it's actually notably irrelevant to (2). In particular, I think the basic argument for expecting AI takeover in a (2)-like scenario doesn't require value fragility to go through - and indeed, some conceptions of "AI alignment" seem to expect a "benign" form of AI takeover even if we get a superintelligence's values exactly right.
Here, though, I'm especially interested in understanding (3)-like scenarios - that is, the sorts of incentives that apply to a superintelligence in a case where it can't just take over the world very easily via a wide variety of methods. Here, in particular, I highlight the role that value fragility can play in informing the AI's expectations with respect to the difference in value between worlds where it does not take over, and worlds where it does.
In this context, that is, value fragility can matter to how the AI feels about a world where humans do retain control - rather than solely to how humans feel about a world where the AI takes over.
I close with a brief discussion of how commitments to various forms of "niceness" and intentional power-sharing, if made sufficiently credible, could help defuse the sorts of adversarial dynamics that value fragility can create.
2. Variants of value fragility
What is value fragility? Let's start with some high-level definitions and clarifications.
...