Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Orthogonal's Formal-Goal Alignment theory of change, published by carado on May 5, 2023 on LessWrong.
we recently announced Orthogonal, an agent foundations alignment research organization. in this post, i give a thorough explanation of the formal-goal alignment framework, the motivation behind it, and the theory of change it fits in.
the overall shape of what we're doing is:
building a formal goal which would lead to good worlds when pursued — our best candidate for this is QACI
designing an AI which takes as input a formal goal, and returns actions which pursue that goal in the distribution of worlds we likely inhabit
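as a toy illustration of the shape of this plan (not Orthogonal's actual construction, and QACI itself is more involved than this): the formal goal could be a fully specified scoring function $U : \mathcal{W} \to \mathbb{R}$ over world-histories, and the AI, given its observations $o$, would output

$$a^{*} = \operatorname*{arg\,max}_{a \in \mathcal{A}} \; \mathbb{E}_{w \sim P(\,\cdot \mid o, a)}\big[\, U(w) \,\big]$$

where $P$ is a distribution over the worlds we likely inhabit. all of these symbols ($U$, $\mathcal{W}$, $P$, $o$, $a$) are ours, introduced here purely for illustration.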
backchaining: aiming at solutions
one core aspect of our theory of change is backchaining: come up with an at least remotely plausible story for how the world is saved from AI doom, and try to think about how to get there. this avoids spending lots of time getting confused about concepts that are confusing because they were the wrong thing to think about all along, such as "what is the shape of human values?" or "what does GPT4 want?" — our intent is to study things that fit together to form a full plan for saving the world.
alignment engineering and agent foundations
alignment is not just not the default, it's a very narrow target. as a result, there are many bits of non-obvious work which need to be done. alignment isn't just finding the right weight to sign-flip to get the AI to switch from evil to good; it is the hard work of putting together something which coherently and robustly points in a direction we like.
as yudkowsky puts it:
The idea with agent foundations, which I guess hasn't successfully been communicated to this day, was finding a coherent target to try to get into the system by any means (potentially including DL ones).
agent foundations/formal-goal alignment is not fundamentally about doing math or being theoretical or thinking abstractly or proving things. agent foundations/formal-goal alignment is about building a coherent target which is fully made of math — not of human words with unspecified meaning — and figuring out a way to make that target maximized by AI. formal-goal alignment is about building a fully formalized goal, not about going about things in a "formal" manner.
current AI technologies are not strong agents pursuing a coherent goal (SGCA). the reason for this is not that this kind of technology is impossible or too confusing to build, but that in worlds in which SGCA was built (and wasn't aligned), we die. alignment is ultimately about making sure that the first SGCA pursues a desirable goal; the default is that its goal will be undesirable.
this does not mean that i think that someone needs to figure out how to build SGCA for the world to end from AI; what i expect is that there are ways in which SGCA can emerge out of the current AI paradigm, in ways that don't particularly let us choose what goal it pursues.
you do not align AI; you build aligned AI.
because this emergence does not let us pick the SGCA's goal, we need to design an SGCA whose goal we do get to choose; and separately, we need to design such a goal. i expect that pursuing straightforward progress on current AI technology leads to an SGCA whose goal we do not get to choose and which leads to extinction.
i do not expect that current AI technology is of a kind that makes it easy to "align"; i believe that the whole idea of building a strange non-agentic AI about which the notion of goal barely applies, and then to try and make it "be aligned", was fraught from the start. if current AI was powerful enough to save the world once "aligned", it would have already killed us before we "aligned" it. to save the world, we have to design something new which pursues a goal we get to choose; and that design needs to have this in mind from the start, rather than ...