Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Worrying about the Vase: Whitelisting, published by Alex Turner on the AI Alignment Forum.
Suppose a designer wants an RL agent to achieve some goal, like moving a box from one side of a room to the other. Sometimes the most effective way to achieve the goal involves doing something unrelated and destructive to the rest of the environment, like knocking over a vase of water that is in its path. If the agent is given a reward only for moving the box, it will probably knock over the vase.
Amodei et al., Concrete Problems in AI Safety
Side effect avoidance is a major open problem in AI safety. I present a robust, transferable, easily- and more safely-trainable, partially reward-hacking-resistant impact measure.
TurnTrout, Worrying about the Vase: Whitelisting
An impact measure is a means by which change in the world may be evaluated and penalized; such a measure is not a replacement for a utility function, but rather an additional precaution overlaid on top of it.
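To make the overlay concrete, here is a minimal sketch in the spirit of the linked Python repository, but not lifted from it: the agent's reward is its task reward minus a scaled impact penalty. All names here, including IMPACT_COEFF, task_reward, and impact, are illustrative assumptions, not the paper's API.

```python
# Minimal sketch, not the paper's implementation: the impact measure is
# overlaid on the task reward as a penalty rather than replacing it.
# task_reward, impact, and IMPACT_COEFF are illustrative stand-ins.

IMPACT_COEFF = 10.0  # hypothetical trade-off between task progress and impact

def task_reward(state: dict) -> float:
    # Stand-in for the designer's actual objective (e.g., the box was moved).
    return 1.0 if state.get("box_moved") else 0.0

def impact(state: dict, next_state: dict) -> float:
    # Stand-in for any impact measure scoring how much the world changed
    # (here: did the vase go from intact to broken?).
    return 1.0 if state.get("vase_intact") and not next_state.get("vase_intact") else 0.0

def shaped_reward(state: dict, next_state: dict) -> float:
    # The precaution is additive: the utility function stays, and the
    # measured impact is subtracted as a penalty.
    return task_reward(next_state) - IMPACT_COEFF * impact(state, next_state)
```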
While I'm fairly confident that whitelisting contributes meaningfully to short- and mid-term AI safety, I remain skeptical of its robustness to scale. Should several challenges be overcome, whitelisting may indeed be helpful for excluding swathes of unfriendly AIs from the outcome space.
Furthermore, the approach allows easy shaping of agent behavior in a wide range of situations.
Segments of this post are lifted from my paper, whose latest revision may be found here; for Python code, look no further than this repository. For brevity, some relevant details are omitted.
Summary
Be careful what you wish for.
In effect, side effect avoidance aims to decrease how careful we have to be with our wishes. For example, asking for help filling a cauldron with water shouldn't end in a flooded workshop, as in the Sorcerer's Apprentice.
However, we just can't enumerate all the bad things that the agent could do. How do we avoid these extreme over-optimizations robustly?
Several impact measures have been proposed, including state distance, which we could define as, say, total particle displacement. This could be measured either naively (with respect to the original state) or counterfactually (with respect to the expected outcome had the agent taken no action).
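As a rough illustration of the two variants, consider the following hedged sketch: states are toy particle-position arrays, and null_outcome (what the world would look like had the agent done nothing) is assumed to come from some environment model; none of these names are from the paper.

```python
import numpy as np

def total_displacement(s1: np.ndarray, s2: np.ndarray) -> float:
    # State distance defined as total particle displacement between states.
    return float(np.abs(s1 - s2).sum())

def naive_penalty(original: np.ndarray, final: np.ndarray) -> float:
    # Naive: measure against the state before the agent acted.
    return total_displacement(original, final)

def counterfactual_penalty(final: np.ndarray, null_outcome: np.ndarray) -> float:
    # Counterfactual: measure against the expected outcome had the agent
    # taken no action (null_outcome must come from a model of the world).
    return total_displacement(final, null_outcome)
```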
These approaches have some problems:
Making up for bad things it prevents with other negative side effects. Imagine an agent which cures cancer, yet kills an equal number of people to keep overall impact low.
Not being customizable before deployment.
Not being adaptable after deployment.
Not being easily computable.
Not allowing generative previews, eliminating a means of safely previewing agent preferences (see latent space whitelisting below).
Being dominated by random effects throughout the universe at large; note that nothing about particle distance dictates that it be related to anything happening on planet Earth.
Equally penalizing breaking and fixing vases, due to the symmetry of the above metric (see the sketch after this list):
For example, the agent would be equally penalized for breaking a vase and for preventing a vase from being broken, though the first action is clearly worse. This leads to “overcompensation” (“offsetting”) behaviors: when rewarded for preventing the vase from being broken, an agent with a low impact penalty rescues the vase, collects the reward, and then breaks the vase anyway (to get back to the default outcome).
Victoria Krakovna, Measuring and Avoiding Side Effects Using Reachability
Not actually measuring impact in a meaningful way.
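To see the symmetry problem concretely, consider a toy one-feature state where 1 means the vase is intact; this reuses the displacement metric sketched above and is purely illustrative.

```python
import numpy as np

def total_displacement(s1: np.ndarray, s2: np.ndarray) -> float:
    return float(np.abs(s1 - s2).sum())

intact = np.array([1.0])  # vase intact
broken = np.array([0.0])  # vase broken

# The metric is symmetric, so breaking a vase and restoring one are
# penalized identically, even though the first outcome is clearly worse.
# This is exactly what invites the offsetting behavior described above.
assert total_displacement(intact, broken) == total_displacement(broken, intact)
```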
Whitelisting falls prey to none of these.
However, other problems remain, and certain new challenges have arisen; these, and the assumptions made by whitelisting, will be discussed.
Rare LEAKED footage of Mickey trying to catch up on his alignment theory after instantiating an unfriendly genie [colorized, 2050].
So, What's Whitelisting?
To achieve robust side effect avoidance with ...