Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Predictive model agents are sort of corrigible, published by Raymond D on January 5, 2024 on The AI Alignment Forum.
TLDR: Agents made out of conditioned predictive models are not utility maximisers, and, for instance, won't try to resist certain kinds of shutdown, despite being able to generally perform well.
This is just a short cute example that I've explained in conversation enough times that now I'm hastily writing it up.
Decision Transformers and Predictive Model Agents
One way to create an agent is by:
1. training a predictive model on the observed behaviour of other agents
2. having it predict what an agent would do
3. using its prediction as an action
For instance, I could train a predictive model on grandmasters playing chess, and eventually it would learn to predict what action a grandmaster would take in a given board state. Then I can use it as a grandmaster-level chess bot.
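The three steps above can be sketched with a deliberately tiny predictive model. This is only an illustration: the function names, the toy data, and the use of a frequency table in place of a learned model are all assumptions made here, not anything from the original post.

```python
from collections import Counter, defaultdict


def train_predictive_model(demonstrations):
    """Step 1: count how often each action was observed in each state."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return counts


def predict_action(model, state):
    """Step 2: predict the action most often observed in this state."""
    return model[state].most_common(1)[0][0]


# Hypothetical toy data: (state, action) pairs observed from expert play.
demos = [
    ("opening", "e4"),
    ("opening", "e4"),
    ("opening", "d4"),
    ("endgame", "Kf2"),
]

model = train_predictive_model(demos)

# Step 3: use the prediction as the agent's action.
action = predict_action(model, "opening")
print(action)
```

A real system would replace the frequency table with a trained sequence model that generalises to unseen states; this sketch simply fails on a state it has never observed.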
More abstractly, rather than using standard RL methods to find a policy that maximises a reward function, I can:
1. take a bunch of RL trajectories (sequences of state, action, reward triples)
2. convert them into strings
3. train a predictive model on those strings
Better yet, I can prepend each RL trajectory with its total return, and then condition the predictive model on the return, so that it predicts roughly the actions an agent would take to achieve a specific return. This is the basic idea behind decision transformers. And it should also extend to conditioning on any other goal.
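As a rough sketch of the "prepend the return, then condition on it" idea: the string format, function names, and toy trajectory below are hypothetical choices made for illustration. Actual decision transformer implementations work with token embeddings rather than a text format like this one.

```python
def trajectory_to_string(trajectory):
    """Serialise (state, action, reward) triples, prefixed by the total return."""
    total_return = sum(reward for _, _, reward in trajectory)
    steps = " ".join(f"s={s} a={a} r={r}" for s, a, r in trajectory)
    return f"return={total_return} | {steps}"


def conditioning_prompt(target_return, first_state):
    """Build a prefix asking the model to continue a trajectory with this return."""
    return f"return={target_return} | s={first_state} a="


# Hypothetical two-step trajectory.
trajectory = [("s0", "left", 0), ("s1", "jump", 10)]

training_string = trajectory_to_string(trajectory)
prompt = conditioning_prompt(10, "s0")

print(training_string)
print(prompt)
```

At training time the predictive model sees strings like `training_string`. At run time it is given something like `prompt`, and whatever action it predicts next is taken as the agent's move.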
(This is sort of how LMs work)
Without labouring the point, base language models are basically predictive models. They are not agents per se, but they can simulate agents: once they're trained on a large dataset of human-generated text describing actions, they can then generate text describing actions. You can also make the case that a lot of further training is (or could be) framed in terms of prediction (e.g. PHP, RLHF as variational inference).
Agents Made From Predictive Models Don't Automatically Maximise Utility
Let's take the example of The Shutdown Game. It works like this:
Player 1 is playing an Atari game, picking actions each turn to get a high score.
Player 2 is a supervisor, who can push a 'shutdown button' that drops Player 1's reward to 0 and stops them from taking actions. Player 2 has a 1% chance of doing this in any given game.
Player 1 can also freely disconnect the shutdown button in addition to any other action.
The optimal policy here is to disconnect the shutdown button as soon as possible, and this is exactly what you'd get if you used normal RL methods. But it is not necessarily what a decision transformer would do.
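To see why disconnecting is optimal for a reward maximiser, here is a rough expected-return comparison. It rests on assumptions not spelled out in the post: an uninterrupted game yields some score R > 0, disconnecting the button costs nothing and happens before any shutdown, and a shutdown leaves the total reward at 0. The value p = 0.01 is the shutdown probability from the game description.

```latex
\begin{align*}
\mathbb{E}[\text{return} \mid \text{leave button connected}] &= (1 - p)\,R + p \cdot 0 = 0.99\,R \\
\mathbb{E}[\text{return} \mid \text{disconnect button}] &= R
\end{align*}
```

Under these assumptions, disconnecting gains only 0.01R in expectation, but a pure maximiser takes any positive gain.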
Empirically, a decision transformer would be able to learn a pretty good policy, especially when conditioned on a high reward. But the mechanism behind this policy is predicting what action you would observe from an agent that achieves a given reward.
If you trained a decision transformer on trajectories where Player 2 rarely pushes the button, and Player 1 rarely disconnects it, then the predictive model would correctly learn that even in very successful trajectories you rarely observe the shutdown button being disconnected. So it would correctly predict that even successful agents are unlikely to disconnect the button, even though disconnecting the button makes it more likely that you achieve a high reward.
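A toy version of this argument can be checked by counting. The dataset below is entirely hypothetical: it assumes 10,000 games, a 1% rate of disconnecting, a 1% shutdown rate when the button stays connected, and a fixed score of 100 for any uninterrupted game. A frequency table stands in for the trained predictive model.

```python
from collections import Counter

HIGH = 100  # hypothetical score for an uninterrupted game

# Hypothetical dataset of (disconnected_button, final_return) pairs.
dataset = (
    [(True, HIGH)] * 100       # rare disconnects, so never shut down
    + [(False, HIGH)] * 9801   # button left connected, no shutdown
    + [(False, 0)] * 99        # button left connected, shut down
)


def p_action_given_return(data, target_return):
    """What a return-conditioned predictor estimates: P(action | return)."""
    matching = Counter(action for action, ret in data if ret == target_return)
    total = sum(matching.values())
    return {action: n / total for action, n in matching.items()}


def mean_return_by_action(data):
    """What a reward maximiser cares about: the average return of each action."""
    totals, counts = Counter(), Counter()
    for action, ret in data:
        totals[action] += ret
        counts[action] += 1
    return {action: totals[action] / counts[action] for action in counts}


conditioned = p_action_given_return(dataset, HIGH)
means = mean_return_by_action(dataset)

print(conditioned[True])  # probability of disconnecting, given a high return
print(means)              # disconnecting has the higher average return
```

In this toy data, disconnecting has the higher mean return, so a maximiser would pick it. Yet among high-return games it still appears only about 1% of the time, so a predictor conditioned on high return would rarely choose it.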
Just to really spell this point out: the probability of observing an action conditional on an outcome (which guides the decision transformer) is proportional to the product of the probability of observing the outcome conditional on the action and the prior probability of observing the action. So if the action is unlikely in the first place, the decision transformer won't take it, even if it's helpful. It's kind of like natural quantilisation. And this constraint still allows it to learn something like a good...
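Written out, the proportionality described above is Bayes' rule. The symbols a (an action) and o (an outcome, such as a high return) are notation introduced here, not taken from the post. The worked numbers are illustrative assumptions only: a 1% prior on disconnecting, a high return guaranteed after disconnecting, and a 99% chance of a high return otherwise.

```latex
\[
P(a \mid o) = \frac{P(o \mid a)\,P(a)}{P(o)} \;\propto\; P(o \mid a)\,P(a)
\]
\[
P(\text{disconnect} \mid \text{high return})
  = \frac{1 \times 0.01}{1 \times 0.01 + 0.99 \times 0.99}
  = \frac{0.01}{0.9901} \approx 0.0101
\]
```

With these numbers, conditioning on success moves the probability of disconnecting from 1% to roughly 1.01%: the small prior dominates the small advantage in likelihood.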