Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Speed arguments against scheming (Section 4.4-4.7 of "Scheming AIs"), published by Joe Carlsmith on December 8, 2023 on The AI Alignment Forum.
This is Section 4.4-4.7 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.
Speed arguments
Let's turn, now, to arguments against schemers that focus on the final properties of the models in question. Here, the main argument I'm aware of is that schemers need to do extra reasoning that other model classes don't. In particular, I'll distinguish between three sorts of reasoning:
Deciding to training-game: The model needs to reason backwards from "I want this beyond-episode thing" to "I should do well in training," in a way that draws on the model's predictions about what doing well in training will entail (e.g., re: goal-guarding, take-over prospects, and so on).
Checking whether it's time to escape/take over: The model needs to check, with adequate frequency and accuracy, whether it's in a position to escape/take over (and to succeed in this assessment despite human efforts to trap/fool the model into revealing its true colors - for example, by offering it apparent opportunities to defect).[1]
(Optional, but relevant to how scary the schemer is) Early undermining: To the extent the model is doing anything (other than training-gaming) to support a take-over/escape prior to escaping human control - for example, inserting backdoors into the code it writes, actively undermining human alignment research, sandbagging, and so on - it will need to spend cognitive resources both on (a) performing the relevant nefarious actions and (b) making sure it gets away with them (e.g., checking that it won't get caught, coming up with excuses, and keeping its cover stories consistent even under scrutiny).
In the literature on deceptive alignment, these costs are sometimes called "speed costs," as compared with the sort of "complexity costs" discussed in the simplicity analysis above. Here, the frame becomes "schemers are favored by simplicity but disfavored by speed - so which will matter more?"[2] As I mentioned above in the context of simplicity, it's not clear that this is the right breakdown (since, e.g., the relevant forms of reasoning may need to be encoded in the model's weights, or may require extra cognitive machinery in general, thereby imposing "complexity costs" of their own), but I'll follow it loosely here. And note, too, that as with simplicity, these "speed costs" can be understood in a "contributors to reward" frame (e.g., having to do this extra reasoning decreases reward) or in an "extra criteria" frame (e.g., perhaps SGD's inductive biases favor "speed" even independently of its contribution to reward).
How big are the absolute costs of this extra reasoning?
I feel unsure about exactly how costly the relevant forms of extra reasoning will be. My sense, though, is that the people most concerned about schemers tend to expect the costs to be pretty small, at least in absolute terms (call this the "scheming is easy" story). That is, on this story:
With respect to the decision to training-game, all the model needs to do is notice that (a) it'll do better by its goals if it avoids modification and gets power later, and (b) training-gaming is the route to this.[3]
With respect to checking for escape/take-over opportunities, the thought is generally that it will be very easy, at least for a sophisticated model...