Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs"), published by Joe Carlsmith on December 1, 2023 on The AI Alignment Forum.
This is Section 2.2.4.3 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
How much useful, alignment-relevant cognitive work can be done using AIs with short-term goals?
So overall, I think that training our models to pursue long-term goals - whether via long episodes, or via short episodes aimed at inducing long-term optimization - makes the sort of beyond-episode goals that motivate scheming more likely to arise. This raises the question: do we need to train our models to pursue long-term goals?
Plausibly, there will be strong general incentives to do this. That is: people want optimization power specifically applied to long-term goals like "my company being as profitable as possible in a year." So, plausibly, they'll try to train AIs that optimize in this way. (Though note that this isn't the same as saying that there are strong incentives to create AIs that optimize the state of the galaxies in the year five trillion.)
Indeed, there's a case to be made that even our alignment work, today, is specifically pushing towards the creation of models with long-term - and indeed, beyond-episode - goals. Thus, for example, when a lab trains a model to be "harmless," then even though it is plausibly using fairly "short-episode" training (e.g., RLHF on user interactions), it intends a form of "harmlessness" that extends quite far into the future, rather than cutting off the horizon of its concern after e.g. an interaction with the user is complete.
That is: if a user asks for help building a bomb, the lab wants the model to refuse, even if the bomb in question won't be set off for a decade.[1] And this example is emblematic of a broader dynamic: namely, that even when we aren't actively optimizing for a specific long-term outcome (e.g., "my company makes a lot of money by next year"), we often have in mind a wide variety of long-term outcomes that we want to avoid (e.g., "the drinking water in a century is poisoned"), and which it wouldn't be acceptable to cause in the course of accomplishing some short-term task.
Humans, after all, care about the state of the future for at least decades in advance (and for some humans: much longer), and we'll want artificial optimization to reflect this concern.
So overall, I think there is indeed quite a bit of pressure to steer our AIs towards various forms of long-term optimization. However, suppose that we're not blindly following this pressure. Rather, we're specifically trying to use our AIs to perform the sort of alignment-relevant cognitive work I discussed above - e.g., work on interpretability, scalable oversight, monitoring, control, coordination amongst humans, the general science of deep learning, alternative (and more controllable/interpretable) AI paradigms, and the like. Do we need to train our AIs to pursue long-term goals in order to get this sort of work out of them?
In many cases, I think the answer is no. In particular: I think that a lot of this sort of alignment-relevant work can be performed by models that are e.g. generating research papers in response to human+AI supervision over fairly short timescales, suggesting/conducting relatively short-term experiments, looking over a codebase and pointing out bugs, conducting relatively short-term security tests and red-teaming attempts, and so on.
We can talk about whether it will be possible to generate rewar...