Download - LW - Companies' safety plans neglect risks from scheming AI by Zach Stein-Perlman

Discover

Podcast Features
Monetization
Podbean App
- Podcast Studio
  Easy-to-use audio recorder app.
- Podcast App
  The best podcast player & podcast app.

Help and Support
Popular Topics

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Advertisers
Enterprise
Pricing
Resources
- Help and Support
- Popular Topics
Discover

The Nonlinear Library

Education

LW - Companies' safety plans neglect risks from scheming AI by Zach Stein-Perlman

2024-06-03

Download Right click and do "save link as"

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Companies' safety plans neglect risks from scheming AI, published by Zach Stein-Perlman on June 3, 2024 on LessWrong.
Without countermeasures, a scheming AI could escape.
A safety case (for deployment) is an argument that it is safe to deploy a particular AI system in a particular way.[1] For any existing LM-based system, the developer can make a great safety case by demonstrating that the system does not have dangerous capabilities.[2] No dangerous capabilities is a straightforward and solid kind of safety case.
For future systems with dangerous capabilities, a safety case will require an argument that the system is safe to deploy despite those dangerous capabilities. In this post, I discuss the safety cases that the labs are currently planning to make, note that they ignore an important class of threats - namely threats from scheming AI escaping - and briefly discuss and recommend control-based safety cases.
I. Safety cases implicit in current safety plans: no dangerous capabilities and mitigations to prevent misuse
Four documents both (a) are endorsed by one or more frontier AI labs and (b) have implicit safety cases (that don't assume away dangerous capabilities): Anthropic's
Responsible Scaling Policy v1.0, OpenAI's
Preparedness Framework (Beta), Google DeepMind's
Frontier Safety Framework v1.0, and the AI Seoul Summit's
Frontier AI Safety Commitments. With small variations, all four documents have the same basic implicit safety case: before external deployment, we check for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold, we will notice and implement appropriate mitigations before deploying it externally.[3] Central examples of dangerous capabilities include hacking, bioengineering, and operating autonomously in the real world.
1.
Anthropic's Responsible Scaling Policy: we do risk assessment involving red-teaming and model evals for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold,[4] we will notice and implement corresponding mitigations[5] before deploying it (internally or externally).
2.
OpenAI's Preparedness Framework: we do risk assessment involving red-teaming and model evals for dangerous capabilities. We only externally deploy[6] models with "post-mitigation risk" at 'Medium' or below in each risk category. (That is, after mitigations, the capabilities that define 'High' risk can't be elicited.)
3.
Google DeepMind's Frontier Safety Framework: we do risk assessment involving red-teaming and model evals for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold, we will notice before external deployment.[7] "When a model reaches evaluation thresholds (i.e.
passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results."[8] Mitigations are centrally about preventing "critical capabilities" from being "accessed" (and securing model weights).
4.
Frontier AI Safety Commitments (joined by 16 AI companies): before external deployment, we will do risk assessment with risk thresholds.[9] We use mitigations[10] "to keep risks within defined thresholds."
These safety cases miss (or assume unproblematic) some crucial kinds of threats.
II. Scheming AI and escape during internal deployment
By default, AI labs will deploy AIs internally to do AI development.
Maybe lots of risk "comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes - in particular, those resources can be abused to run AIs unmonitored." Without countermeasures, if the AI is
scheming, i...