Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Two concepts of an "episode" (Section 2.2.1 of "Scheming AIs"), published by Joe Carlsmith on November 27, 2023 on The AI Alignment Forum.
(This is Section 2.2.1 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.)
Beyond-episode goals
Schemers are pursuing goals that extend beyond the time horizon of the episode. But what is an episode?
Two concepts of an "episode"
Let's distinguish between two concepts of an episode.
The incentivized episode
The first, which I'll call the "incentivized episode," is the concept that I've been using thus far and will continue to use in what follows. Thus, consider a model acting at a time t1. Here, the rough idea is to define the episode as the temporal unit after t1 that training actively punishes the model for not optimizing over - i.e., the unit of time such that we can know by definition that training is not directly pressuring the model to care about consequences beyond that time.
For example, if training started on January 1st of 2023 and completed on July 1st of 2023, then the maximum length of the incentivized episode for this training would be six months - at no point could the model have been punished by training for failing to optimize over a longer-than-six-month time horizon, because no gradients have been applied to the model's policy that were (causally) sensitive to the longer-than-six-month consequences of its actions. But the incentivized episode for this training process could in principle be shorter than six months as well.
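To make the definition concrete, here is a minimal sketch (mine, not the report's) of a standard policy-gradient update, assuming a hypothetical `policy` that maps observation tensors to a torch distribution, a gym-style `env`, and a step budget `horizon`. The point is simply that the return only sums rewards collected within the horizon, so the gradients applied to the policy cannot be sensitive to anything that happens after that window:

```python
import torch

def reinforce_update(policy, env, horizon, optimizer, gamma=0.99):
    """One REINFORCE-style update on a gym-style environment.

    The return only sums rewards collected within `horizon` steps, so the
    gradient applied to the policy is (causally) insensitive to any
    consequences of its actions beyond that window -- that window is the
    "incentivized episode" in the sense defined above.
    """
    obs = env.reset()
    log_probs, rewards = [], []
    for _ in range(horizon):  # nothing past step `horizon` is ever observed
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(float(reward))
        if done:
            break

    # Discounted return computed over within-episode rewards only.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.as_tensor(returns)

    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()  # gradients depend only on within-horizon consequences
    optimizer.step()
```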
Now, importantly, even if training only directly pressures a model to optimize over some limited period of time, it can still in fact create a model that optimizes over some much longer time period - that's what makes schemers, in my sense, a possibility. Thus, for example, if you're training a model to get as many gold coins as possible within a ten minute window, it could still, in principle, learn the goal "maximize gold coins over all time" - and this goal might perform quite well (even absent training gaming), or survive despite not performing all that well (for example, because of the "slack" that training allows).
Indeed, to the extent we think of evolution as an analogy for ML training, then something like this appears to have happened with humans who have goals that extend indefinitely far into the future - for example, "longtermists." That is, evolution does not actively select for or against creatures in a manner sensitive to the consequences of their actions in a trillion years (after all, evolution has only been running for a few billion years) - and yet, some humans aim their optimization on trillion-year timescales regardless.
That said, to the extent a given training procedure in fact creates a model with a very long-term goal (because, for example, such a goal is favored by the sorts of "inductive biases" I'll discuss below), then in some sense you could argue that training "incentivizes" such a goal as well. That is, suppose that "maximize gold coins in the next ten minutes" and "maximize gold coins over all time" both get the same reward in a training process that only provides rewards after ten minutes, but that training selects "maximize gold coins over all time" because of some other difference between the goals in question (for example, because "maximize gold coins over all time" is in some sense "simpler," and gradient descent selects for simplicity in addition to reward-getting).
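As a loose illustration of that last clause (my analogy, not the report's): one crude way to model a "simplicity" pressure is as an explicit regularizer added to the training objective, so that two goal-representations earning identical within-episode reward are distinguished only by the penalty term. Real inductive biases in gradient descent are implicit rather than an explicit penalty, so the hypothetical `training_loss` below is purely a sketch:

```python
import torch

def training_loss(reward, params, simplicity_weight=1e-3):
    # Hypothetical objective: maximize within-episode reward, plus an explicit
    # "simplicity" penalty (L2 norm of the parameters) standing in, very
    # crudely, for the inductive biases discussed in the text. Two parameter
    # settings that earn the same reward differ only in the penalty term,
    # so the "simpler" (lower-norm) one wins the tie.
    reward_term = -reward
    simplicity_term = sum(p.pow(2).sum() for p in params)
    return reward_term + simplicity_weight * simplicity_term
```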
Maybe you could say tha...