Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Research agenda: Supervising AIs improving AIs, published by Quintin Pope on April 29, 2023 on LessWrong.
[This post summarizes some of the work done by Owen Dudney, Roman Engeler and myself (Quintin Pope) as part of the SERI MATS shard theory stream.]
TL;DR
Future prosaic AIs will likely shape their own development or that of successor AIs. We're trying to make sure they don't go insane.
Summary
There are two main ways AIs can get better: by improving their training algorithms or by improving their training data.
We consider both scenarios, and tentatively believe that data-based improvement is riskier than architecture based improvement. Current models mostly derive their behavior from their training data, and not training algorithms (meaning their architectures, hyperparameters, loss functions, optimizers or the like). So far, most improvements to AI training algorithms seem 'value neutral'. Also note that most of human value drift currently derives from cultural shifts changing the 'training data' available in the environment, not biological evolution over the brain's base learning algorithms.
We imagine a future where AIs self-augment by continuously seeking out more and better training data, and either creating successor AIs or training themselves on that data. Often, these data will come from the AIs running experiments in the real world (doing science), deliberately seeking data that would cover a specific gap in its current capabilities, analogous to how human scientists seek data from domains where our current understanding is limited. With AI, this could involve AgentGPT-like systems that spin up many instances of themselves to run experiments in parallel, potentially leading to quick improvements if we are in an agency overhang.
We want to find methods of ensuring such 'automated science' processes remain safe and controllable, even after many rounds of self-directed data collection and training. In particular, we consider problems such as:
Preventing self-training from amplifying undesirable behaviors
Preventing semantic drift in concept representations during self-training
Ensuring cross-modality actions (such as a generated image for a text-to-image model, or robot movement for a text-and-image-to-actuator-motion model) remain grounded in their natural language descriptions after self-training in a non-lingual modality
Preventing value drift during multiple, iterated steps of self-retraining
Currently, we're focusing on scalable methods of tracking behavioral drift in language models, as well as benchmarks for evaluating a language model's capacity for stable self-modification via self-training.
Introduction
So far, most improvements in AI capabilities fall into two categories:
Algorithmic: better architectures, better loss functions, better optimizers, etc.
Data-driven: more training data, higher quality data, domain-specific data, etc.
We expect the future to resemble the past, and so we expect that future capabilities improvements will come from these two sources. However, it also seems likely that AIs will increasingly be the ones responsible for such capabilities advances. In fact, current work has already started using language models as part of the data curation process, or to generate future training data directly. Moreover, with GPT-3's widespread adoption, it is probable that GPT-4's training data contains content generated by its predecessor. This phenomenon extends to fine-tuning processes like RLHF, where earlier versions' output influences the cognition of subsequent iterations.
Researchers are likely to use the most capable models available to them in whatever AI-driven improvement process they devise. Thus, such a process is iterative, with the first AI shaping the training of the second AI, shaping the training of the third AI, and so ...
view more