Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Towards a New Impact Measure, published by Alex Turner on the AI Alignment Forum.
In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources).
Previously: Worrying about the Vase: Whitelisting, Overcoming Clinginess in Impact Measures, Impact Measure Desiderata
To be used inside an advanced agent, an impact measure... must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure.
~ Safe Impact Measure
If we have a safe impact measure, we may have arbitrarily-intelligent unaligned agents which do small (bad) things instead of big (bad) things.
For the abridged experience, read up to "Notation", skip to "Experimental Results", and then to "Desiderata".
What is "Impact"?
One lazy Sunday afternoon, I worried that I had written myself out of a job. After all, Overcoming Clinginess in Impact Measures basically said, "Suppose an impact measure extracts 'effects on the world'. If the agent penalizes itself for these effects, it's incentivized to stop the environment (and any agents in it) from producing them. On the other hand, if it can somehow model other agents and avoid penalizing their effects, the agent is now incentivized to get the other agents to do its dirty work." This seemed to be strong evidence against the possibility of a simple conceptual core underlying "impact", and I didn't know what to do.
At this point, it sometimes makes sense to step back and try to say exactly what you don't know how to solve – try to crisply state what it is that you want an unbounded solution for. Sometimes you can't even do that much, and then you may actually have to spend some time thinking 'philosophically' – the sort of stage where you talk to yourself about some mysterious ideal quantity of [chess] move-goodness and you try to pin down what its properties might be.
~ Methodology of Unbounded Analysis
There's an interesting story here, but it can wait.
As you may have guessed, I now believe there is a such a simple core. Surprisingly, the problem comes from thinking about "effects on the world". Let's begin anew.
Rather than asking "What is goodness made out of?", we begin from the question "What algorithm would compute goodness?".
~ Executable Philosophy
Intuition Pumps
I'm going to say some things that won't make sense right away; read carefully, but please don't dwell.
u
A
is an agent's utility function, while
u
H
is some imaginary distillation of human preferences.
WYSIATI
What You See Is All There Is is a crippling bias present in meat-computers:
[WYSIATI] states that when the mind makes decisions... it appears oblivious to the possibility of Unknown Unknowns, unknown phenomena of unknown relevance.
Humans fail to take into account complexity and that their understanding of the world consists of a small and necessarily un-representative set of observations.
Surprisingly, naive reward-maximizing agents catch the bug, too. If we slap together some incomplete reward function that weakly points to what we want (but also leaves out a lot of important stuff, as do all reward functions we presently know how to specify) and then supply it to an agent, it blurts out "gosh, here I go!", and that's that.
Power
A position from which it is relatively easier to achieve arbitrary goals. That such a position exists has been obvious to every population which has required a word for the concept. The Spanish term is particularly instructive. When used as a verb, "poder" means "to be able to," which supports that our definition of "power" is natural.
~ Cohen et al.
And so it is with the French "pouvoir".
Lines
Suppose you start at point
C
...
view more