Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On "slack" in training (Section 1.5 of "Scheming AIs"), published by Joe Carlsmith on November 25, 2023 on The AI Alignment Forum.
This is Section 1.5 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
On "slack" in training
Before diving into an assessment of the arguments for expecting scheming, I also want to flag a factor that will come up repeatedly in what follows: namely, the degree of "slack" that we should expect training to allow. By this I mean something like: how much is the training process ruthlessly and relentlessly pressuring the model to perform in a manner that yields maximum reward, vs. shaping the model in a more "relaxed" way that leaves more room for less-than-maximally-rewarded behavior? That is, in a low-slack regime, "but that sort of model would be getting less reward than would be possible given its capabilities" is a strong counterargument against training creating a model of the relevant kind, whereas in a high-slack regime, it's not. (So high-slack regimes will generally involve greater uncertainty about the type of model you end up with, since models that get less-than-maximal reward are still in the running.)
Or, in more human terms: a low-slack regime is more like a hyper-intense financial firm that immediately fires any employees who fall behind in generating profits (and where you'd therefore expect surviving employees to be hyper-focused on generating profits - or perhaps, hyper-focused on the profits that their supervisors think they're generating), whereas a high-slack regime is more like a firm where employees can freely goof off, drink martinis at lunch, and pursue projects only vaguely related to the company's bottom line, needing only to generate some amount of profit for the firm, sometimes.
(Or at least, that's the broad distinction I'm trying to point at. Unfortunately, I don't have a great way of making it much more precise, and I think it's possible that thinking in these terms will ultimately be misleading.)
Slack matters here partly because below, I'm going to be making various arguments that appeal to possibly-quite-small differences in the amount of reward that different models will get. And the force of these arguments depends on how sensitive training is to these differences. But I also think it can inform our sense of what models to expect more generally.
For example, I think slack matters to the probability that training will create models that pursue proxy goals imperfectly correlated with reward on the training inputs. Thus, in a low-slack regime, it may be fairly unlikely for a model trained to help humans with science to end up pursuing a general "curiosity drive" (in a manner that doesn't then motivate instrumental training-gaming), because a model's pursuing its curiosity in training would sometimes deviate from maximally helping-the-humans-with-science.
That said, note that the degree of slack is conceptually distinct from the diversity and robustness of the efforts made in training to root out goal misgeneralization.
Thus, for example, if you're rewarding a model when it gets gold coins, but you only ever show your model environments where the only gold things are coins, then a model that tries to get gold-stuff-in-general will perform just as well as a model that gets gold coins in particular, regardless of how intensely training pressures the model to get maximum reward on those environments. E.g., a low-slack regime could in...