Podcasting
Advertisers
Enterprise
Pricing
Resources
Discover Discover

Log in
Sign up free

The Nonlinear Library: Alignment Forum

AF - Evaluating AI Systems for Moral Status Using Self-Reports by Ethan Perez

2023-11-16

Download

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evaluating AI Systems for Moral Status Using Self-Reports, published by Ethan Perez on November 16, 2023 on The AI Alignment Forum. TLDR: In a new paper, we explore whether we could train future LLMs to accurately answer questions about themselves. If this works, LLM self-reports may help us test them for...

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evaluating AI Systems for Moral Status Using Self-Reports, published by Ethan Perez on November 16, 2023 on The AI Alignment Forum.
TLDR: In a new paper, we explore whether we could train future LLMs to accurately answer questions about themselves. If this works, LLM self-reports may help us test them for morally relevant states like consciousness.
We think it's possible to start preliminary experiments testing for moral status in language models now, so if you're interested in working with us, please reach out and/or apply to the Astra fellowship or SERI MATS (deadline November 17).
Tweet thread paper summary: Link
Abstract:
As AI systems become more advanced and widely deployed, there will likely be increasing debate over whether AI systems could have conscious experiences, desires, or other states of potential moral significance. It is important to inform these discussions with empirical evidence to the extent possible.
We argue that under the right circumstances, self-reports, or an AI system's statements about its own internal states, could provide an avenue for investigating whether AI systems have states of moral significance. Self-reports are the main way such states are assessed in humans ("Are you in pain?"), but self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say).
To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance.
We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models' self-reports, and using interpretability to corroborate self-reports. We also discuss challenges for our approach, from philosophical difficulties in interpreting self-reports to technical reasons why our proposal might fail. We hope our discussion inspires philosophers and AI researchers to criticize and improve our proposed methodology, as well as to run experiments to test whether self-reports can be made reliable enough to provide information about states of moral significance.
See also this earlier post for more discussion on the relevance to alignment (AI systems that are suffering might be more likely to take catastrophic actions), as well as some initial criticisms of a preliminary version of our proposed test in the comments. We've added a significant amount of content to our updated proposal to help to address several reservations people had on why the initial proposal might not work, so we're excited to get additional feedback and criticism on the latest version of our proposal as well.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

View more

Comments (3)

More Episodes

You may also like

Adulting with Autism

The Pacific War - week by week

German Stories - Learn German with Stories | Deutsch lernen mit Geschichten

The Mel Robbins Podcast

The Jordan B. Peterson Podcast

Halacha Headlines

رادیو راه با مجتبی شکوری

‌BPLUS بی‌پلاس پادکست فارسی خلاصه کتاب

جافکری | Jafekri

All Ears English Podcast

Get this podcast on your phone, Free

Creat Yourt Podcast In Minutes

Full-featured podcast site
Unlimited storage and bandwidth
Comprehensive podcast stats
Distribute to Apple Podcasts, Spotify, and more
Make money with your podcast

It is Free

Podcast Services
MONETIZATION & MORE
KNOWLEDGE BASE
Support
Podbean

Privacy Policy
Cookie Policy
Terms of Use
Consent Preferences
Copyright © 2015-2025 Podbean.com