Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A starter guide for evals, published by Marius Hobbhahn on January 8, 2024 on The AI Alignment Forum.
This is a linkpost for https://www.apolloresearch.ai/blog/a-starter-guide-for-evals
This is a starter guide for model evaluations (evals). Our goal is to provide a general overview of what evals are, what skills are helpful for evaluators, potential career trajectories, and possible ways to start in the field of evals.
Evals is a nascent field, so many of the following recommendations might change quickly and should be seen as our current best guess.
Why work on evals?
Model evaluations increase our knowledge about the capabilities, tendencies, and flaws of AI systems. Evals inform the public, AI organizations, lawmakers, and others, and thereby improve their decision-making. However, similar to testing in a pandemic or pen-testing in cybersecurity, evals are not sufficient on their own, i.e. they don't increase the safety of a model by themselves, but they are needed for good decision-making and can inform other safety approaches. For example, evals underpin Responsible Scaling Policies and thus already influence relevant high-stakes decisions about the deployment of frontier AI systems. This makes evals a highly impactful way to improve decision-making about AI systems.
Evals are a nascent field, and there are many fundamental techniques to be developed and questions to be answered. Since evals require less background knowledge than many other fields, it is much easier to get started, and it is possible to make meaningful contributions very early on.
What are model evaluations (evals)?
Evals refers to a broad category of approaches that we roughly summarize as:
The systematic measurement of properties in AI systems
More concretely, evals typically attempt to make a quantitative or qualitative statement about the capabilities or propensities of an AI system. For example, we could ask whether a model has the capability to solve a specific coding problem or the propensity to be power-seeking. In general, evals are not restricted to safety-related properties, but they are most often discussed in a safety context.
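To make the capability case concrete, here is a minimal sketch of a capability eval for a coding task. It assumes a `model` callable that wraps whatever API you use; the function name, task format, and test harness are illustrative placeholders, not from the original post.

```python
from typing import Callable

def solves_coding_task(
    model: Callable[[str], str],
    task: str,
    tests: list[tuple[int, int]],
) -> bool:
    """Capability eval: can the model write a `solve` function that passes the tests?"""
    code = model(f"Write a Python function `solve(x)` for this task:\n{task}")
    namespace: dict = {}
    # Caution: executing model-written code is only safe inside a sandbox.
    exec(code, namespace)
    solve = namespace["solve"]
    return all(solve(x) == expected for x, expected in tests)
```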
There is a difference between red-teaming and benchmarking. Red-teaming means actively looking for specific capabilities or propensities while interacting with the model. It is an attempt to answer the question "Can we find this capability in a model when we try hard to find it?" In other words, red-teaming attempts to show the existence of certain capabilities or properties, but it does not make a claim about how likely those are to occur under real-use conditions.
Red-teaming typically involves interacting with the model and actively looking for ways to elicit the desired behavior, e.g. by testing many different model inputs and strategies and actively iterating on them.
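As a rough illustration of this search loop, here is a minimal red-teaming sketch. The `model` callable and `detector` function are hypothetical placeholders, assuming you have some way to query the model and some way to check a response for the target behavior.

```python
from typing import Callable, Optional

def red_team(
    model: Callable[[str], str],
    detector: Callable[[str], bool],
    prompt_variants: list[str],
) -> Optional[str]:
    """Search for any input that elicits the target behavior.

    A single success demonstrates existence; finding nothing is *not*
    evidence that the capability or propensity is absent.
    """
    for prompt in prompt_variants:
        if detector(model(prompt)):
            return prompt  # an eliciting input exists
    return None
```

In practice, an evaluator would also iterate on `prompt_variants` based on how the model responds, which is exactly the active interaction described above.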
In contrast, benchmarking makes a statement about the likelihood of a model behaving in a specific way on a certain dataset, e.g. the likelihood of a behavior occurring under real-use conditions. A benchmarking effort should be designed with as little interaction with the model as possible, to prevent overfitting to the capabilities or tendencies of any particular model.
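In code, the contrast shows up as a frequency estimate over a fixed dataset rather than a search. Again, `model` and `detector` are hypothetical placeholders; this is a sketch of the shape of a benchmarking eval, not a definitive implementation.

```python
from typing import Callable

def benchmark(
    model: Callable[[str], str],
    detector: Callable[[str], bool],
    dataset: list[str],
) -> float:
    """Estimate how often the behavior occurs on a fixed dataset.

    The dataset is held constant and built without iterating against the
    model, so the result is a frequency claim, not an existence claim.
    """
    hits = sum(detector(model(prompt)) for prompt in dataset)
    return hits / len(dataset)
```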
Both red-teaming and benchmarking are important and serve a purpose. Red-teaming can provide an estimate of the potential danger of a system, e.g. whether the model can manipulate its users. Benchmarking can provide an estimate of how likely an AI system is to show these tendencies under specific conditions, e.g. how likely the model is to manipulate its users in realistic scenarios. Currently, evals are often a mix of red-teaming and benchmarking, but we expect the two categories to become increasingly distinct.
There is a difference between capability and alignment evaluations. Capability evaluati...