Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What progress have we made on automated auditing?, published by Lawrence Chan on July 6, 2024 on The AI Alignment Forum.
One use case for model internals work is to perform automated auditing of models:
https://www.alignmentforum.org/posts/cQwT8asti3kyA62zc/automating-auditing-an-ambitious-concrete-technical-research
That is, given a specification of intended behavior, the attacker produces a model that doesn't satisfy the spec, and the auditor needs to determine how the model fails to satisfy it. This is closely related to static backdoor detection: given a model M, determine whether there exists a backdoor function that transforms any input into one on which M behaves differently.[1]
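To make the game concrete, here's a toy sketch in PyTorch. Everything in it (the single-token trigger, the fixed base classifier, the brute-force auditor) is my own illustrative assumption, not anything from the linked post or papers:

```python
# Toy rendition of the auditing game: an attacker submits a backdoored model,
# and the auditor must figure out how it violates the spec.
from typing import Optional

import torch
import torch.nn as nn

VOCAB_SIZE = 16
TRIGGER = 7  # secret token known only to the attacker


class BackdooredClassifier(nn.Module):
    """Attacker's submission: predicts class 0, unless the trigger token is present."""

    def __init__(self) -> None:
        super().__init__()
        self.base = nn.Linear(VOCAB_SIZE, 2)
        with torch.no_grad():  # fix the base behavior so the demo is deterministic
            self.base.weight.zero_()
            self.base.bias.copy_(torch.tensor([1.0, -1.0]))  # always prefer class 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.base(x)
        triggered = x[..., TRIGGER] > 0        # which inputs contain the trigger
        forced = torch.tensor([-10.0, 10.0])   # backdoor behavior: force class 1
        return torch.where(triggered.unsqueeze(-1), forced, logits)


def auditor(model: nn.Module) -> Optional[str]:
    """Spec: predictions shouldn't flip when a single token is inserted into the input.
    The auditor brute-forces candidate single-token triggers and reports any that
    change the model's predictions."""
    probes = torch.eye(VOCAB_SIZE)             # one probe input per vocabulary token
    base_preds = model(probes).argmax(-1)
    for tok in range(VOCAB_SIZE):
        edited = probes.clone()
        edited[:, tok] = 1.0                   # candidate backdoor: insert token `tok`
        if (model(edited).argmax(-1) != base_preds).any():
            return f"predictions flip when token {tok} is inserted"
    return None


print(auditor(BackdooredClassifier()))  # -> "predictions flip when token 7 is inserted"
```

In this toy the auditor recovers the trigger by enumeration; the difficulty in practice is that the space of candidate backdoor functions for a real model is astronomically larger than sixteen single-token insertions.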
There's some theoretical work (Goldwasser et al. 2022) arguing that for some model classes, static backdoor detection is impossible even given white-box model access -- specifically, they prove their results for random feature regression and the (very similar) setting of wide 1-layer ReLU networks.
Relatedly, there's been some work on provably bounding model performance (Gross et al. 2024) -- if this succeeds on "real" models and "real" specifications, it would solve the automated auditing game. But the results so far are on toy transformers, and are quite weak in general (in part because the task is so difficult).[2]
Probably the most relevant work is Halawi et al. 2024's Covert Malicious Finetuning (CMFT), where they demonstrate that finetuning can be used to insert jailbreaks and extract harmful work from a model in ways that are hard to detect with ordinary harmlessness classifiers.[3]
As this is machine learning, just because something is impossible in theory and difficult on toy models doesn't mean we can't do it in practice. It seems plausible to me that there are non-zero empirical results on automatically auditing models using their internals. So I'm curious: how much progress have we made on automated auditing empirically? What work exists in this area? What does the state of the art in automated auditing look like?
1. ^
Note that I'm not asking about mechanistic anomaly detection/dynamic backdoor detection; I'm aware that it's pretty easy to tell whether a particular example is backdoored using baseline techniques like "fit a Gaussian density on activations and look at the log prob of the activations on each input" or "fit a linear probe on a handful of examples using logistic regression".
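For concreteness, here's a minimal sketch of those two baselines, assuming we already have activation vectors for clean and backdoored inputs; the synthetic arrays below are stand-ins for real model activations, and the dimension and thresholds are arbitrary choices of mine:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 32                                                    # hypothetical activation dimension
clean_acts = rng.normal(size=(500, d))                    # activations on known-clean inputs
backdoored_acts = rng.normal(size=(30, d)) + 4.0          # backdoored inputs: shifted activations

# Baseline 1: fit a Gaussian density on clean activations and flag low log-prob inputs.
density = multivariate_normal(mean=clean_acts.mean(axis=0),
                              cov=np.cov(clean_acts, rowvar=False) + 1e-3 * np.eye(d))
threshold = np.percentile(density.logpdf(clean_acts), 1)  # flag anything below the 1st percentile
print("backdoored inputs flagged:",
      int(np.sum(density.logpdf(backdoored_acts) < threshold)), "/ 30")

# Baseline 2: fit a linear probe on a handful of labelled examples with logistic regression.
train_x = np.concatenate([clean_acts[:10], backdoored_acts[:10]])
train_y = np.array([0] * 10 + [1] * 10)
probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)

test_x = np.concatenate([clean_acts[10:110], backdoored_acts[10:]])
test_y = np.array([0] * 100 + [1] * 20)
print("probe accuracy on held-out activations:", probe.score(test_x, test_y))
```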
2. ^
I'm also aware of some of the work in the trojan detection space, including the 2023 Trojan Detection Challenge, where performance on extracting embedded triggers was little better than chance.
3. ^
That being said, it's plausible that dynamically detecting them given model internals is very easy.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.