Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Video and transcript of presentation on Scheming AIs, published by Joe Carlsmith on March 22, 2024 on The AI Alignment Forum.
(Cross-posted from my website.)
This is the video and transcript for a ~45-minute talk I gave in February 2024 about my report "Scheming AIs: Will AIs fake alignment during training in order to get power?" (slides available here). See also this podcast for a more conversational overview of similar content.
Main talk
Okay, hi everybody. Thanks for coming. So this talk is about whether advanced AIs will fake alignment during training in order to get power later. This is a behavior that I'm calling "scheming"; it's also often called "deceptive alignment."
Preliminaries
So I'm going to start with a few preliminaries.
As was mentioned, this talk is based on a public report called Scheming AIs, available on arXiv. There's also an audio version on my podcast, Joe Carlsmith Audio, if you prefer that. So I encourage you to check that out. I'm going to try to cover many of the main points here, but the report is pretty long and so I'm not going to go into that much depth on any given point. So if you want more depth or you have burning objections or clarifications or you want to work on it, I encourage you to check out the full report.
Second, I'm going to assume familiarity with:
The basic argument for existential risk from misaligned AI. That is, roughly, the thought that advanced AI agents with goals that conflict with human goals would have instrumental incentives to seek power over humans, and potentially to disempower humanity entirely, an event I'll call "AI takeover." If that story didn't sound familiar to you, I do have some other work on this topic which I would encourage you to check out.
But some of the talk itself might make a little less sense, so I apologize for that.
I'm also going to assume familiarity with a basic picture of how contemporary machine learning works. So very roughly, imagine models with lots of parameters that are being updated via stochastic gradient descent (SGD), such that they perform better according to some feedback signal. And in particular, I'm often going to be imagining a default baseline training process that somewhat resembles what happens with current language models.
Namely, a pre-training phase in which a model is trained on some combination of internet text and potentially other data. And then a fine-tuning phase, in which it's trained via some combination of maybe imitation learning or reinforcement learning. So that's the baseline picture I'm going to be imagining. There are other paradigms in which questions about scheming will arise, but I'm going to focus on this one.
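To make that picture concrete, here's a minimal sketch of the kind of training loop being gestured at, written in PyTorch. This is purely illustrative (the model, data, and loss here are stand-ins I've made up, not anything from the report): parameters get nudged by SGD so that the model scores better on a feedback signal.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)          # stand-in for a large parameterized model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()   # the "feedback signal": lower loss = better performance

for step in range(100):
    inputs = torch.randn(32, 16)           # stand-in for a batch of training data
    targets = torch.randint(0, 4, (32,))   # stand-in for labels
    logits = model(inputs)
    loss = loss_fn(logits, targets)        # how badly the model did on this batch
    optimizer.zero_grad()
    loss.backward()                        # gradients of the loss w.r.t. each parameter
    optimizer.step()                       # nudge parameters to reduce the loss
```

Both the pre-training and fine-tuning phases follow this same basic shape; what differs is where the data comes from and what the feedback signal rewards.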
I'm also going to condition on the AIs I discuss being "goal directed". And what I mean by that is that these AIs are well understood and well predicted by thinking of them as making plans on the basis of models of the world in pursuit of objectives. This is not an innocuous assumption. And in fact I think confusions in this vicinity are one of my most salient candidates for how the AI alignment discourse as a whole might be substantially off base.
But I want to separate questions about whether AIs are well understood as goal-directed at all from questions about whether, conditional on being goal-directed, they will be schemers. So if your objection to scheming is "I don't think AIs will have goals at all," then that's a perfectly fine objection, especially in particular training paradigms, but it's not the objection I'm going to focus on here.
And in my other work, on misalignment in general, I also have some thoughts about why we might expect goal-directedness of this kind. In particular, I think goal-directedness is useful for many tasks.
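As a toy illustration of the "goal-directed" structure just described (my own sketch, not from the talk, with made-up stand-in functions): an agent that uses a world model to predict the outcomes of candidate plans and picks whichever plan best serves its objective.

```python
def world_model(state, plan):
    """Stand-in world model: predict the state that results from executing a plan."""
    return state + sum(plan)

def objective(state):
    """Stand-in objective: how much the agent 'wants' a given state (prefers 10)."""
    return -abs(state - 10)

def choose_plan(state, candidate_plans):
    # Goal-directed choice: pick the plan whose predicted outcome
    # scores best under the objective.
    return max(candidate_plans, key=lambda plan: objective(world_model(state, plan)))

plans = [[1, 2], [5, 5], [3, 3, 3]]
print(choose_plan(0, plans))  # -> [5, 5], predicted to land exactly on the goal state
```

An AI counts as "goal-directed" in the relevant sense if its behavior is well predicted by treating it as if it works this way, whatever is actually going on inside.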
Finally, I want to flag that in addition to potentially posing existential risks...