Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measurement tampering detection as a special case of weak-to-strong generalization, published by Ryan Greenblatt on December 23, 2023 on The AI Alignment Forum.
Burns et al. at OpenAI released a paper studying various techniques for fine-tuning strong models on downstream tasks using labels produced by weak models. They call this problem "weak-to-strong generalization", abbreviated W2SG.
Earlier this year, we published a paper, Benchmarks for Detecting Measurement Tampering, in which we investigated techniques for the problem of measurement tampering detection (MTD). MTD is a special case of W2SG. In this post, we'll explain the relationship between MTD and W2SG, and explain why we think MTD is more likely to work than fully general W2SG. Of course, fully general W2SG is a strictly more valuable problem to solve, precisely because of its generality.
We think MTD is a promising research direction. We're also excited about other special cases of W2SG whose additional structure can be exploited by techniques, especially when that structure is likely to be present in important cases in the future.
MTD as a subset of W2SG
A similar goal
When training an AI, the reward we attribute to different behaviors might not match the reward we would give if we understood the situation better.
The goal of W2SG techniques is to achieve good results when training a strong AI despite only having access to a weak supervisor that understands the situation less well than the strong AI.
MTD is the special case where the weak supervisor has access to measurements which should be sufficient to understand the situation, but these measurements can be tampered with (e.g. replacing the camera feed with some made-up data, disabling tests, or threatening annotators). Because the measurements are sufficient in the absence of tampering, we don't need to worry about benign mistakes that could happen even without an AI optimizing to make measurements look good.
Slightly different experiments
W2SG can be studied using sandwiching experiments, where we try to get an AI to safely accomplish tasks despite only having access to a weak supervisor, and then measure the performance of our method using a stronger held-out supervision signal (e.g. held-out ground truth labels).
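As a rough illustration, a sandwiching experiment might look like the sketch below. The names here (weak_labeler, fine_tune, predict) are hypothetical stand-ins for whatever weak supervisor and training setup is actually used; the point is only that training sees weak labels while evaluation uses held-out ground truth.

```python
def run_sandwiching_experiment(strong_model, weak_labeler,
                               train_inputs, test_inputs, test_ground_truth):
    """Fine-tune the strong model using only weak labels, then evaluate it
    against held-out ground truth that the method itself never sees."""
    # The weak supervisor produces (possibly mistaken) labels for training.
    weak_labels = [weak_labeler(x) for x in train_inputs]

    # Train the strong model on weak labels only (hypothetical API).
    strong_model.fine_tune(train_inputs, weak_labels)

    # Score against the stronger held-out supervision signal.
    preds = [strong_model.predict(x) for x in test_inputs]
    return sum(int(p == y) for p, y in zip(preds, test_ground_truth)) / len(preds)
```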
In the case of the OpenAI paper, the weak supervisor is a small language model trained on ground truth labels, as an analogy for human annotators.
In the case of our MTD paper, we have access to measurements, but there is some notion of measurement tampering. In our work, the measurements aim to directly measure the property of interest as a boolean value, so converting from untampered measurements to correct labels is straightforward (and doesn't require any learning or intelligence).
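To make "no learning or intelligence required" concrete, here is one illustrative aggregation rule under the assumption that the label of interest is simply the conjunction of the boolean measurements (this is an assumption for illustration, not necessarily the paper's exact rule):

```python
def label_from_untampered_measurements(measurements: list[bool]) -> bool:
    # With no tampering, the label is a trivial aggregate of the measurements;
    # here we assume the property holds iff every measurement comes back positive.
    return all(measurements)
```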
Different hopes for succeeding at W2SG
In both cases, we need some additional assumptions to get strong supervision from a weak supervisor. If we made no structural assumptions and assumed the worst case about the internal structure of AIs, we wouldn't succeed, so we have to rely on some kind of structure in the internals of models. The structure that the OpenAI paper discusses is very different from the structure we hope to leverage in the measurement tampering case.
Confidence and consistency
The OpenAI paper shows that on some tasks, training the model to be confident in addition to matching the weak supervisor's labels sometimes increases accuracy. The hope is that the model might have a very salient representation of what is true, and a less salient representation of the predictions of the weak supervisor.
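A minimal sketch of this kind of confidence objective (in PyTorch, and not necessarily the exact auxiliary loss used in the OpenAI paper) combines cross-entropy against the weak labels with cross-entropy against the strong model's own hardened predictions; alpha is a free weighting hyperparameter introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def weak_label_plus_confidence_loss(strong_logits, weak_probs, alpha=0.5):
    """strong_logits: (batch,) raw logits from the strong model.
    weak_probs: (batch,) soft labels in [0, 1] from the weak supervisor.
    alpha: weight on the confidence term (an assumed hyperparameter)."""
    strong_probs = torch.sigmoid(strong_logits)
    # Fit the weak supervisor's labels.
    weak_ce = F.binary_cross_entropy(strong_probs, weak_probs)
    # Encourage confidence: target the strong model's own thresholded
    # predictions, with no gradient flowing through the target.
    hardened = (strong_probs.detach() > 0.5).float()
    conf_ce = F.binary_cross_entropy(strong_probs, hardened)
    return (1 - alpha) * weak_ce + alpha * conf_ce
```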
The confidence hope is similar to the hope explored in Discovering Latent Knowledge: the inner representation of the truth inside AIs might be more consistent than other features, and be more salient than other cons...