Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Tips for Empirical Alignment Research, published by Ethan Perez on February 29, 2024 on The AI Alignment Forum.
TLDR: I've collected some tips for research that I've given to other people and/or used myself, which have sped things up and helped put people in the right general mindset for empirical AI alignment research. Some of these are opinionated takes, based largely on what has helped me personally. Researchers can be successful in different ways, but I still stand by the tips here as a reasonable default.
What success generally looks like
Here, I've included specific criteria that strong collaborators of mine tend to meet, with rough weightings on their importance, as a rough north star for people who collaborate with me (especially if you're new to research). These criteria are for the specific kind of research I do (highly experimental LLM alignment research, excluding interpretability); examples of research areas where this applies include scalable oversight, adversarial robustness, chain-of-thought faithfulness, process-based oversight, and model organisms of misalignment. The exact weighting will also vary heavily depending on what role you're serving on the team/project; e.g., I'd probably upweight criteria where you're differentially strong or differentially contributing on the team, since I generally guide people towards working on things that line up with their skills.
For more junior collaborators (e.g., someone doing their first research project, where I've scoped out the project), I generally weigh execution-focused criteria more heavily than direction-setting criteria (since here I'm often the person doing the direction setting). Also, some of the criteria outlined below are a really high bar; e.g., some of them I only started to meet myself after 5 years of doing research, and others I still don't meet.
This is mainly written to be a north star for targets to aim for. That said, I think most people can get to a good-to-great spot on these criteria within 6-18 months of trying, and I don't currently think that many of them are particularly talent/brains-bottlenecked so much as bottlenecked on deliberate practice and working to get better (I was actively bad at some of the criteria below, like implementation speed, even ~6 months into doing research, but have improved a lot since then with practice). With that context, here are the rough success criteria I'd outline:
[70%] Getting ideas to work quickly
[45%] Implementation speed
Able to quickly implement a well-scoped idea. An example of doing really well here: we talk about an idea one day and decide it's exciting/worth doing, and you tell me the next day whether it worked.
Able to run a high volume of experiments. You're doing really well here if it's hard for your supervisor to keep up with the volume of the experiments/results you're showing; 30m or even 60m weekly 1:1 meetings shouldn't feel like enough time to discuss all of the results you have, and you have to filter what we discuss in our weekly meetings down to just the most important and decision-relevant results.
If some experiments take a while to run, you're running a lot of other project-relevant experiments in parallel or implementing the next experiment (exceptions: the experiments you're running take more than overnight/18h and there's no way to design them to be shorter, or the experiments are very implementation-heavy).
Able to design a minimal experiment to test a mid/high-level idea. You run experiments in a way such that you're rarely compute- or experiment-time-bottlenecked (especially early in a project), and your experiments are designed to be easy/quick to implement (one shape this can take in code is sketched at the end of this list).
You trade off code quality and implementation speed in the best way for long-run productivity. You bias heavily towards speed in general...
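To make the minimal, quick-to-implement experiments point concrete, here's a rough sketch of one shape this can take. This is illustrative only: `run_experiment` is a hypothetical stand-in for your actual experiment, and the config fields and grid are made up.

```python
# Hypothetical sketch: sweep a small grid of cheap experiment configs in
# parallel, so no single slow run blocks the rest of the sweep.
import concurrent.futures
import itertools

def run_experiment(config: dict) -> dict:
    # Stand-in for your real experiment (e.g., query a model on a small
    # eval set and return summary metrics).
    return {"config": config, "accuracy": 0.0}

# Each config is deliberately small/minimal, so a single run takes minutes.
configs = [
    {"prompt_style": p, "n_samples": n}
    for p, n in itertools.product(["zero_shot", "few_shot"], [50, 200])
]

# Launch the whole sweep concurrently rather than waiting on runs serially.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_experiment, configs))

for r in results:
    print(r["config"], r["accuracy"])
```

The point isn't the specific tooling; it's that each experiment is scoped to be small enough to run and interpret the same day, and the sweep structure keeps you from being compute- or wall-clock-bottlenecked on any single job.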