Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What progress have we made on automated auditing?, published by Lawrence Chan on July 6, 2024 on The AI Alignment Forum.
One use case for model internals work is to perform automated auditing of models:
https://www.alignmentforum.org/posts/cQwT8asti3kyA62zc/automating-auditing-an-ambitious-concrete-technical-research
That is, given a specification of intended behavior, the attacker produces a model that doesn't satisfy the spec, and the auditor needs to determine how the model fails to satisfy it. This is closely related to static backdoor detection: given a model M, determine whether there exists a backdoor function that transforms any input into one on which M behaves differently.[1]
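To make the game concrete, here's a toy sketch in PyTorch. Everything in it (the single-token trigger, the fixed base classifier, the brute-force auditor) is my own illustrative assumption, not anything from the linked post or papers:

```python
# Toy rendition of the auditing game: an attacker submits a backdoored model,
# and the auditor must figure out how it violates the spec.
from typing import Optional

import torch
import torch.nn as nn

VOCAB_SIZE = 16
TRIGGER = 7  # secret token known only to the attacker


class BackdooredClassifier(nn.Module):
    """Attacker's submission: predicts class 0, unless the trigger token is present."""

    def __init__(self) -> None:
        super().__init__()
        self.base = nn.Linear(VOCAB_SIZE, 2)
        with torch.no_grad():  # fix the base behavior so the demo is deterministic
            self.base.weight.zero_()
            self.base.bias.copy_(torch.tensor([1.0, -1.0]))  # always prefer class 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.base(x)
        triggered = x[..., TRIGGER] > 0        # which inputs contain the trigger
        forced = torch.tensor([-10.0, 10.0])   # backdoor behavior: force class 1
        return torch.where(triggered.unsqueeze(-1), forced, logits)


def auditor(model: nn.Module) -> Optional[str]:
    """Spec: predictions shouldn't flip when a single token is inserted into the input.
    The auditor brute-forces candidate single-token triggers and reports any that
    change the model's predictions."""
    probes = torch.eye(VOCAB_SIZE)             # one probe input per vocabulary token
    base_preds = model(probes).argmax(-1)
    for tok in range(VOCAB_SIZE):
        edited = probes.clone()
        edited[:, tok] = 1.0                   # candidate backdoor: insert token `tok`
        if (model(edited).argmax(-1) != base_preds).any():
            return f"predictions flip when token {tok} is inserted"
    return None


print(auditor(BackdooredClassifier()))  # -> "predictions flip when token 7 is inserted"
```

In this toy the auditor recovers the trigger by enumeration; the difficulty in practice is that the space of candidate backdoor functions for a real model is astronomically larger than sixteen single-token insertions.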
There's some theoretical work (Goldwasser et al. 2022) arguing that for some model classes, static backdoor detection is impossible even given white-box model access -- specifically, they prove their results for random feature regression and the (very similar) setting of wide 1-layer ReLU networks.
Relatedly, there's been some work on provably bounding model performance (Gross et al. 2024) -- if this succeeds on "real" models and "real" specifications, it would solve the automated auditing game. But the results so far are on toy transformers, and are quite weak in general (in part because the task is so difficult).[2]
Probably the most relevant work is Halawi et al. 2024's Covert Malicious Finetuning (CMFT), where they demonstrate that finetuning can be used to insert jailbreaks and extract harmful work from a model in ways that are hard to detect with ordinary harmlessness classifiers.[3]
As this is machine learning, just because something is impossible in theory and difficult on toy models doesn't mean we can't do it in practice. It seems plausible to me that there are non-zero empirical results on automatically auditing models using their internals. So I'm curious: how much progress have we made on automated auditing empirically? What work exists in this area? What does the state of the art in automated auditing look like?
1. ^
Note that I'm not asking about mechanistic anomaly detection/dynamic backdoor detection; I'm aware that it's pretty easy to tell whether a particular example is backdoored using baseline techniques like "fit a Gaussian density on activations and look at the log prob of the activations on each input" or "fit a linear probe on a handful of examples using logistic regression".
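For concreteness, here's a minimal sketch of those two baselines, assuming we already have activation vectors for clean and backdoored inputs; the synthetic arrays below are stand-ins for real model activations, and the dimension and thresholds are arbitrary choices of mine:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 32                                                    # hypothetical activation dimension
clean_acts = rng.normal(size=(500, d))                    # activations on known-clean inputs
backdoored_acts = rng.normal(size=(30, d)) + 4.0          # backdoored inputs: shifted activations

# Baseline 1: fit a Gaussian density on clean activations and flag low log-prob inputs.
density = multivariate_normal(mean=clean_acts.mean(axis=0),
                              cov=np.cov(clean_acts, rowvar=False) + 1e-3 * np.eye(d))
threshold = np.percentile(density.logpdf(clean_acts), 1)  # flag anything below the 1st percentile
print("backdoored inputs flagged:",
      int(np.sum(density.logpdf(backdoored_acts) < threshold)), "/ 30")

# Baseline 2: fit a linear probe on a handful of labelled examples with logistic regression.
train_x = np.concatenate([clean_acts[:10], backdoored_acts[:10]])
train_y = np.array([0] * 10 + [1] * 10)
probe = LogisticRegression(max_iter=1000).fit(train_x, train_y)

test_x = np.concatenate([clean_acts[10:110], backdoored_acts[10:]])
test_y = np.array([0] * 100 + [1] * 20)
print("probe accuracy on held-out activations:", probe.score(test_x, test_y))
```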
2. ^
I'm also aware of some of the work in the trojan detection space, including the 2023 Trojan Detection Challenge, where performance on extracting embedded triggers was little better than chance.
3. ^
That being said, it's plausible that dynamically detecting them given model internals is very easy.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.