
The Nonlinear Library: LessWrong

Education

LW - DeepMind's "Frontier Safety Framework" is weak and unambitious by Zach Stein-Perlman

2024-05-18



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: DeepMind's "Frontier Safety Framework" is weak and unambitious, published by Zach Stein-Perlman on May 18, 2024 on LessWrong.
FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("PF"), and METR's Key Components of an RSP.
DeepMind's FSF has three steps:
1. Create model evals for warning signs of "Critical Capability Levels"
   1. Evals should have a "safety buffer" of at least 6x effective compute so that CCLs will not be reached between evals
   2. They list 7 CCLs across "Autonomy, Biosecurity, Cybersecurity, and Machine Learning R&D," and they're thinking about CBRN
      1. E.g. "Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents"
2. Do model evals every 6x effective compute and every 3 months of fine-tuning
   1. This is an "aim," not a commitment
   2. Nothing about evals during deployment
3. "When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results. We will also take into account considerations such as additional risks flagged by the review and the deployment context." The document briefly describes 5 levels of security mitigations and 4 levels of deployment mitigations.
   1. The mitigations aren't yet connected to eval results or other triggers; there are no advance commitments about safety practices
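To make the eval cadence in step 2 concrete, here is a minimal sketch of the arithmetic, assuming evals fall exactly on each 6x multiple of effective compute and on each 3-month mark of fine-tuning. The function names and the unit of "effective compute" are hypothetical, chosen only for illustration; the FSF itself describes this cadence as an aim and does not specify a schedule like this.

```python
from typing import List

# Figures taken from the FSF as summarized above; everything else is an assumption.
EVAL_COMPUTE_FACTOR = 6.0        # evaluate every 6x effective compute
FINE_TUNING_INTERVAL_MONTHS = 3  # and every 3 months of fine-tuning


def compute_checkpoints(start: float, end: float,
                        factor: float = EVAL_COMPUTE_FACTOR) -> List[float]:
    """Effective-compute levels in (start, end] at which an eval would be due.

    `start` is the compute level of the last evaluated model (arbitrary units).
    """
    if start <= 0 or factor <= 1:
        raise ValueError("start must be positive and factor must exceed 1")
    checkpoints = []
    level = start * factor
    while level <= end:
        checkpoints.append(level)
        level *= factor
    return checkpoints


def fine_tuning_checkpoints(total_months: int,
                            interval: int = FINE_TUNING_INTERVAL_MONTHS) -> List[int]:
    """Months of fine-tuning at which an eval would be due."""
    return list(range(interval, total_months + 1, interval))


if __name__ == "__main__":
    # A hypothetical 1000x scale-up from the last evaluated model.
    print(compute_checkpoints(1.0, 1000.0))
    # A hypothetical 10-month fine-tuning period.
    print(fine_tuning_checkpoints(10))
```

Under these assumptions, a 1000x scale-up crosses only three eval points (6x, 36x, 216x), which gives a sense of how coarse the 6x spacing is and why the size of the "safety buffer" matters.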
The FSF doesn't contain commitments. The blogpost says "The Framework is exploratory and we expect it to evolve significantly" and "We aim to have this initial framework fully implemented by early 2025." The document says similar things. It uses the word "aim" a lot and the word "commit" never. The FSF basically just explains a little about DeepMind's plans on dangerous capability evals. Those details do seem reasonable. (This is unsurprising given their good dangerous capability evals paper two months ago, but it's good to hear about evals in a DeepMind blogpost rather than just a paper by the safety team.)
(Ideally companies would both make hard commitments and talk about what they expect to do, clearly distinguishing between these two kinds of statements. Talking about plans like this is helpful. But with no commitments, DeepMind shouldn't get much credit.)
(Moreover, the FSF is not precise enough to commit to: DeepMind could commit to doing the model evals regularly, but the FSF doesn't discuss specific mitigations as a function of risk assessment results.[1])
Misc notes (but you should really read the doc yourself):
The document doesn't specify whether "deployment" includes internal deployment. (This is important because maybe lots of risk comes from the lab using AIs internally to do AI development.) Standard usage suggests internal deployment is excluded, and the focus on misuse and related cues also suggest it's excluded, but the mention of ML R&D as a dangerous capability suggests it's included.
The document doesn't mention doing evals during deployment (to account for improvements in scaffolding, prompting, etc.)
The document says "We expect it to evolve substantially as our understanding of the risks and benefits of frontier models improves, and we will publish substantive revisions as appropriate" and a few similar things. The document doesn't say how it will be revised/amended, which isn't surprising, since it doesn't make formal commitments.
No external evals or accountability, but they're "exploring" it.
Public accountability: unfortunately, there's no mention of releasing eval results or even announcing when thresholds are reached. They say "We are exploring internal policies around alerting relevant stakeholder bodies when, for example, ev...