Download - LW - Research agenda: Supervising AIs improving AIs by Quintin Pope

Discover

Podcast Features
Your all-in-one podcasting solution.

Podcast Studio
Easy-to-use audio recorder app.
Livestream
High-performing audio live, without limits.

Podcast App
The best podcast player & podcast app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Patron & Paid Content
The seamless way for fans to support you directly
from your podcast.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Enterprise
Pricing
Discover

The Nonlinear Library: LessWrong

Education

LW - Research agenda: Supervising AIs improving AIs by Quintin Pope

2023-04-30

Download Right click and do "save link as"

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Research agenda: Supervising AIs improving AIs, published by Quintin Pope on April 29, 2023 on LessWrong. [This post summarizes some of the work done by Owen Dudney, Roman Engeler and myself (Quintin Pope) as part of the SERI MATS shard theory stream.] TL;DR Future prosaic AIs will likely shape their own development or that of successor AIs. We're trying to make sure they don't go insane. Summary There are two main ways AIs can get better: by improving their training algorithms or by improving their training data. We consider both scenarios, and tentatively believe that data-based improvement is riskier than architecture based improvement. Current models mostly derive their behavior from their training data, and not training algorithms (meaning their architectures, hyperparameters, loss functions, optimizers or the like). So far, most improvements to AI training algorithms seem 'value neutral'. Also note that most of human value drift currently derives from cultural shifts changing the 'training data' available in the environment, not biological evolution over the brain's base learning algorithms. We imagine a future where AIs self-augment by continuously seeking out more and better training data, and either creating successor AIs or training themselves on that data. Often, these data will come from the AIs running experiments in the real world (doing science), deliberately seeking data that would cover a specific gap in its current capabilities, analogous to how human scientists seek data from domains where our current understanding is limited. With AI, this could involve AgentGPT-like systems that spin up many instances of themselves to run experiments in parallel, potentially leading to quick improvements if we are in an agency overhang. We want to find methods of ensuring such 'automated science' processes remain safe and controllable, even after many rounds of self-directed data collection and training. In particular, we consider problems such as: Preventing self-training from amplifying undesirable behaviors Preventing semantic drift in concept representations during self-training Ensuring cross-modality actions (such as a generated image for a text-to-image model, or robot movement for a text-and-image-to-actuator-motion model) remain grounded in their natural language descriptions after self-training in a non-lingual modality Preventing value drift during multiple, iterated steps of self-retraining Currently, we're focusing on scalable methods of tracking behavioral drift in language models, as well as benchmarks for evaluating a language model's capacity for stable self-modification via self-training. Introduction So far, most improvements in AI capabilities fall into two categories: Algorithmic: better architectures, better loss functions, better optimizers, etc. Data-driven: more training data, higher quality data, domain-specific data, etc. We expect the future to resemble the past, and so we expect that future capabilities improvements will come from these two sources. However, it also seems likely that AIs will increasingly be the ones responsible for such capabilities advances. In fact, current work has already started using language models as part of the data curation process, or to generate future training data directly. Moreover, with GPT-3's widespread adoption, it is probable that GPT-4's training data contains content generated by its predecessor. This phenomenon extends to fine-tuning processes like RLHF, where earlier versions' output influences the cognition of subsequent iterations. Researchers are likely to use the most capable models available to them in whatever AI-driven improvement process they devise. Thus, such a process is iterative, with the first AI shaping the training of the second AI, shaping the training of the third AI, and so ...