Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: The Commitment Races problem , published by Daniel Kokotajlo on the AI Alignment Forum.
[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future. EDIT: So far this has stood the test of time. EDIT: As of September 2020 I think this is one of the most important things to be thinking about.]
This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. [Edit: 2009 in fact!] In short, here is the problem:
Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be one of these times unless we think carefully about this problem and how to avoid it.
For this post I use "consequentialists" to mean agents that choose actions entirely on the basis of the expected consequences of those actions. For my purposes, this means they don't care about historical facts such as whether the options and consequences available now are the result of malicious past behavior. (I am trying to avoid trivial definitions of consequentialism according to which everyone is a consequentialist because e.g. "obeying the moral law" is a consequence.) This definition is somewhat fuzzy and I look forward to searching for more precision some other day.
Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible
Consequentialists are bullies; a consequentialist will happily threaten someone insofar as they think the victim might capitulate and won't retaliate.
Consequentialists are also cowards; they conform their behavior to the incentives set up by others, regardless of the history of those incentives. For example, they predictably give in to credible threats unless reputational effects weigh heavily enough in their minds to prevent this.
In most ordinary circumstances the stakes are sufficiently low that reputational effects dominate: Even a consequentialist agent won't give up their lunch money to a schoolyard bully if they think it will invite much more bullying later. But in some cases the stakes are high enough, or the reputational effects low enough, for this not to matter.
So, amongst consequentialists, there is sometimes a huge advantage to "winning the commitment race." If two consequentialists are playing a game of Chicken, the first one to throw out their steering wheel wins. If one consequentialist is in position to seriously hurt another, it can extract concessions from the second by credibly threatening to do so--unless the would-be victim credibly commits to not give in first! If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can "move first" can get much more than the one that "moves second." In general, because consequentialists are cowards and bullies, the consequentialist who makes commitments first will predictably be able to massively control the behavior of the consequentialist who makes commitments later. As the folk theorem shows, this can even be true in cases where games are iterated and reputational effects are significant.
Note: "first" and "later" in the above don't refer to clock time, though clock time is a helpful metaphor for imagining what is going on. Really, what's going on is that agents learn about each other, each on their own subjective timeline, while also making choices (including the choice to commit to things) and the choices a consequentialist makes at subjective time t are cravenly submissive to the commitments they've learned about by t.
Logical updateles...
view more