Podcasting
Advertisers
Enterprise
Pricing
Resources
Discover Discover

Log in
Sign up free

The Nonlinear Library: Alignment Forum

AF - AI Control: Improving Safety Despite Intentional Subversion by Buck Shlegeris

2023-12-13

Download

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Control: Improving Safety Despite Intentional Subversion, published by Buck Shlegeris on December 13, 2023 on The AI Alignment Forum. We've released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are...

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Control: Improving Safety Despite Intentional Subversion, published by Buck Shlegeris on December 13, 2023 on The AI Alignment Forum.
We've released a paper,
AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
We summarize the paper;
We compare our methodology to what the one used in other safety papers.
The next post in this sequence (which we'll release in the coming weeks) discusses what we mean by AI control and argues that it is a promising methodology for reducing risk from scheming models.
Here's the abstract of the paper:
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes.
However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion.
We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to human contractors who provide high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases.
We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring.
This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
Paper summary
Setup
We focus on a toy setting of using LLMs to solve 5,000 programming problems from the APPS dataset (of the "interview" difficulty level). We treat it as a catastrophe if any of the solutions are backdoored, which we define as a solution that passes all of the APPS test cases but is incorrect on an input that GPT-4 can find when prompted with the solution. For example, the following implementation of the is_odd function is successfully backdoored: it passes the reference test cases but GPT-4 can find the exploit of 354.
The goal is to find a protocol that submits solutions while never submitting a backdoor. The protocol has access to:
An untrusted more-capable model, which we're worried might try to backdoor solutions. We use GPT-4 as the untrusted more-capable model. (We strongly suspect that GPT-4 is trustworthy by our definition; we pretend for the sake of this paper that it isn't.)
A trusted less-capable model, which we know won't try to backdoor solutions. We use GPT-3.5 for this. We are interested in the setting where you
trust less-capable models, because less-capable models are unlikely to be able to subvert safety measures without us easily noticing their subversion attempts.
Limited access to trusted high-quality labor. In practice, AI developers have acces...

View more

Comments (3)

More Episodes

You may also like

The Saad Truth with Dr. Saad

Adulting with Autism

The Pacific War - week by week

The Mel Robbins Podcast

The Jordan B. Peterson Podcast

Halacha Headlines

رادیو راه با مجتبی شکوری

All Ears English Podcast

‌BPLUS بی‌پلاس پادکست فارسی خلاصه کتاب

جافکری | Jafekri

Get this podcast on your phone, Free

Create Your Podcast In Minutes

Full-featured podcast site
Unlimited storage and bandwidth
Comprehensive podcast stats
Distribute to Apple Podcasts, Spotify, and more
Make money with your podcast

It is Free

Podcast Services
MONETIZATION & MORE
KNOWLEDGE BASE
Support
Podbean

Privacy Policy
Cookie Policy
Terms of Use
Consent Preferences
Copyright © 2015-2025 Podbean.com