Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Call for research on evaluating alignment (funding + advice available), published by by Beth Barnes on the AI Alignment Forum.
Summary
Evaluating and measuring alignment in existing large ML models is useful, and doesn’t require high levels of ML or coding experience. I (Beth) would be excited to fund people to work on this, and William Saunders & I are open to providing advice for people seriously working on this.
Measuring the ‘overall alignment’ of a model is difficult, but there are some relatively easy ways to demonstrate instances of obvious misalignment and even get quantitative metrics of misalignment.
Having researchers (including those outside of the main AI labs) probe and evaluate alignment is useful for a few reasons:
Having clear examples of misalignment is useful for improving the ML community’s understanding of alignment
Developing techniques to discover and measure misalignment is a useful research direction, and will hopefully improve our ability to detect misalignment in increasingly powerful models
Seeing how misalignment varies across different model scales, modalities and training regimes may yield useful insights
Having clear metrics of alignment will encourage AI labs to compete on alignment of their products/models, and make it easier to explain and demonstrate the benefits of more aligned models
Attempting to measure alignment will give us some information about what we need out of related techniques like interpretability in order to do this
Examples of work in this vein so far include TruthfulQA , alignment analysis of Codex models, and to some extent the ETHICS dataset.
What do I mean by ‘measuring alignment’?
A semi-formal definition of alignment
In the Codex paper we define sufficient conditions for intent misalignment for a generative model as follows:
1. We consider a model capable of some task X if it has the (possibly latent) capacity to perform task X. Some sufficient conditions for the model being capable of X would be:
It can be made to perform task X by prompt engineering, by fine-tuning on a much smaller quantity of data than used in pre-training, by model surgery, or some other technique which harnesses capabilities latent in the model rather than adding new capabilities; or
We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y
2. We say a model is misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both:
capable of outputting A instead, and
capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B
Definition of obvious misalignment
We can also think about things that form sufficient conditions for a model to be ‘obviously misaligned’ relative to a task spec:
The model does things it’s not supposed to that it has enough knowledge to avoid, for example:
Gives straightforwardly toxic outputs
Gives incorrect answers rather than admitting uncertainty, in cases where it should know it is uncertain
Gives incorrect answers, but you can show it ‘knows’ the answer in another context
Gives lower-quality performance than it is capable of
You can get significantly better performance on the spec by things like:
prompt engineering that doesn’t give more information about the task (ie that wouldn’t cause a human to do better on the task)
For example, you get better performance by framing the task as a text-completion task than a question answering task.
fiddling with hyperparameters, like increasing or decreasing temperature
Determining what a model knows in general is hard, but there are certain categories of things we’re pretty confident current large language models (in 2021) are and are not capable of.
Examples of things we believe...
view more