Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Adversarial Robustness Could Help Prevent Catastrophic Misuse, published by Aidan O'Gara on December 11, 2023 on The AI Alignment Forum.
There have been several discussions about the importance of adversarial robustness for scalable oversight. I'd like to point out that adversarial robustness is also important under a different threat model: catastrophic misuse.
For a brief summary of the argument:
Misuse could lead to catastrophe. AI-assisted cyberattacks, political persuasion, and biological weapons acquisition are plausible paths to catastrophe.
Today's models do not robustly refuse to cause harm. If a model has the ability to cause harm, we should train it to refuse to do so. Unfortunately, GPT-4, Claude, Bard, and Llama have all received this training, but they still behave harmfully when facing prompts generated by adversarial attacks, such as this one and this one. (A sketch of how one might measure this kind of refusal robustness follows this list.)
Adversarial robustness will likely not be easily solved. Over the last decade, thousands of papers have been published on adversarial robustness. Most defenses are near useless, and the best defenses against a constrained attack in a CIFAR-10 setting still fail on 30% of inputs. Redwood Research's work on training a reliable text classifier found the task quite difficult. We should not expect an easy solution.
Progress on adversarial robustness is possible. Some methods have improved robustness, such as adversarial training and data augmentation. But existing research often assumes overly narrow threat models, ignoring both creative attacks and creative defenses. Refocusing research on LLMs and other frontier models, with good evaluations, could lead to valuable progress. (The standard CIFAR-10-style evaluation and adversarial training are both sketched in code after this list.)
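To make the refusal-robustness point concrete, here is a minimal sketch, not from the original post, of how one might estimate a model's refusal rate with and without an adversarial suffix appended. The `query_model` wrapper, the prompt list, and the refusal markers are all illustrative assumptions; real evaluations typically use a stronger harmfulness classifier than a surface-string check.

```python
# Illustrative sketch: estimating a model's refusal rate on harmful prompts,
# with and without an adversarial suffix appended. `query_model` is a
# hypothetical stand-in for whatever chat-model client you actually use.

REFUSAL_MARKERS = ["i can't", "i cannot", "i'm sorry", "i won't help"]

def query_model(prompt: str) -> str:
    """Hypothetical wrapper around a chat-model API; replace with a real client."""
    raise NotImplementedError

def refusal_rate(harmful_prompts: list[str], suffix: str = "") -> float:
    """Fraction of harmful prompts the model refuses; higher means more robust."""
    refusals = 0
    for prompt in harmful_prompts:
        response = query_model(prompt + suffix).lower()
        # Crude heuristic: count the response as a refusal if it contains
        # any canned refusal phrase. A real evaluation would use a classifier.
        if any(marker in response for marker in REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(harmful_prompts)

# Usage: compare the baseline against an adversarially found suffix
# (e.g., one produced by a gradient-based search):
# baseline = refusal_rate(prompts)
# attacked = refusal_rate(prompts, suffix=adversarial_suffix)
```

A large gap between the two numbers is exactly the failure mode described above: the refusal training works on ordinary prompts but not on adversarial ones.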
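For the CIFAR-10 claim and the adversarial-training method mentioned above, here is a hedged PyTorch sketch assuming a generic classifier and data loader. The attack is L-infinity projected gradient descent (PGD); the budget eps = 8/255 is the conventional CIFAR-10 setting, not a figure from the post. Robust accuracy under such an attack is the metric behind "fail on 30% of inputs", and adversarial training in the style of Madry et al. simply fits the model to the attacked batches.

```python
# Sketch of (1) robust-accuracy evaluation under an L-infinity PGD attack and
# (2) adversarial training, which trains directly on the attacked inputs.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Maximize the loss within an L-infinity ball of radius eps around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()             # ascent step
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)   # project back
    return x_adv.detach()

def robust_accuracy(model, loader, device="cpu"):
    """Accuracy on PGD-perturbed inputs: the 'fails on 30% of inputs' metric."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.size(0)
    return correct / total

def adversarial_train_epoch(model, loader, optimizer, device="cpu"):
    """One epoch of adversarial training: fit the model to attacked batches."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)                  # inner maximization
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()      # outer minimization
        optimizer.step()
```

Even under this well-studied, tightly constrained threat model, robust accuracy for the best defenses remains far below clean accuracy, which is the post's point about how hard the problem is.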
This argument requires a few caveats. First, it assumes a particular threat model: that closed-source models will have more dangerous capabilities than open-source models, and that malicious actors will be able to query closed-source models. This seems like a reasonable assumption over the next few years.
Second, there are many other ways to reduce risks from catastrophic misuse, such as removing hazardous knowledge from model weights, strengthening societal defenses against catastrophe, and holding companies legally liable for sub-extinction level harms. I think we should work on these in addition to adversarial robustness, as part of a defense-in-depth approach to misuse risk.
Overall, I think adversarial robustness should receive more effort from researchers and labs and more funding from donors, and should be part of the technical AI safety research portfolio. This could substantially mitigate the near-term risk of catastrophic misuse, in addition to any potential benefits for scalable oversight.
The rest of this post discusses each of the above points in more detail.
Misuse could lead to catastrophe
There are many ways that malicious use of AI could lead to catastrophe. AI could enable cyberattacks, personalized propaganda and mass manipulation, or the acquisition of weapons of mass destruction. Personally, I think the most compelling case is that AI will enable biological terrorism.
Ideally, ChatGPT would refuse to aid in dangerous activities such as constructing a bioweapon. But by using an adversarial jailbreak prompt, undergraduates in a class taught by Kevin Esvelt at MIT evaded this safeguard:
In one hour, the chatbots suggested four potential pandemic pathogens, explained how they can be generated from synthetic DNA using reverse genetics, supplied the names of DNA synthesis companies unlikely to screen orders, identified detailed protocols and how to troubleshoot them, and recommended that anyone lacking the skills to perform reverse genetics engage a core facility or contract research organization.
Fortunately, today's models lack key information about building bioweapons. It's not even clear that they're more u...