80,000 Hours Podcast - #151 – Ajeya Cotra on accidentally teaching AI models to deceive us

2023-05-12

Imagine you are an orphaned eight-year-old whose parents left you a $1 trillion company, and no trusted adult to serve as your guide to the world. You have to hire a smart adult to run that company, guide your life the way that a parent would, and administer your vast wealth. You have to hire that adult based on a work trial or interview you come up with. You don't get to see any resumes or do reference checks. And because you're so rich, tonnes of people apply for the job — for all sorts of reasons.

Today's guest Ajeya Cotra — senior research analyst at Open Philanthropy — argues that this peculiar setup resembles the situation humanity finds itself in when training very general and very capable AI models using current deep learning methods.

Links to learn more, summary and full transcript.

As she explains, such an eight-year-old faces a challenging problem. In the candidate pool there are likely some truly nice people, who sincerely want to help and make decisions that are in your interest. But there are probably other characters too — like people who will pretend to care about you while you're monitoring them, but intend to use the job to enrich themselves as soon as they think they can get away with it.

Like a child trying to judge adults, at some point humans will be required to judge the trustworthiness and reliability of machine learning models that are as goal-oriented as people and that greatly outclass humans in knowledge, experience, breadth, and speed. Tricky!

Can't we rely on how well models have performed at tasks during training to guide us? Ajeya worries that it won't work. The trouble is that three different sorts of models will all produce the same output during training, but could behave very differently once deployed in a setting that allows their true colours to come through. She describes three such motivational archetypes:

Saints — models that care about doing what we really want
Sycophants — models that just want us to say they've done a good job, even if they get that praise by taking actions they know we wouldn't want them to
Schemers — models that don't care about us or our interests at all, who are just pleasing us so long as that serves their own agenda

And according to Ajeya, there are also ways we could end up actively selecting for motivations that we don't want.

In today's interview, Ajeya and Rob discuss the above, as well as:

How to predict the motivations a neural network will develop through training
Whether AIs being trained will functionally understand that they're AIs being trained, the same way we think we understand that we're humans living on planet Earth
Stories of AI misalignment that Ajeya doesn't buy into
Analogies for AI, from octopuses to aliens to can openers
Why it's smarter to have separate planning AIs and doing AIs
The benefits of only following through on AI-generated plans that make sense to human beings
What approaches for fixing alignment problems Ajeya is most excited about, and which she thinks are overrated
How one might demo actually scary AI failure mechanisms

Get this episode by subscribing to our podcast on the world’s most pressing problems and how to solve them: type ‘80,000 Hours’ into your podcasting app. Or read the transcript below.

Producer: Keiran Harris

Audio mastering: Ryan Kessler and Ben Cordell

Transcriptions: Katy Moore
