Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Was Releasing Claude-3 Net-Negative?, published by Logan Riggs on March 28, 2024 on LessWrong.
Cross-posted to EA forum
There's been a lot of discussion among safety-concerned people about whether it was bad for Anthropic to release Claude-3. I felt like I didn't have a great picture of all the considerations here, and I felt that people were conflating many different types of arguments for why it might be bad. So I decided to try to write down an at-least-slightly-self-contained description of my overall views and reasoning here.
Tabooing "Race Dynamics"
I've heard a lot of people say that this "is bad for race dynamics". I think that this conflates a couple of different mechanisms by which releasing Claude-3 might have been bad.
So, tabooing "race dynamics", a common narrative behind these words is:
"As companies release better & better models, this incentivizes other companies to pursue more capable models at the expense of safety. Eventually, one company goes too far, produces unaligned AGI, and we all die."
It's unclear what "at the expense of safety" means, so we can investigate two different interpretations:
If X increases "race dynamics", X causes an AGI company to
Invest less in evals/redteaming models before deployment
Divert resources away from alignment research & into capabilities research
Did releasing Claude-3 cause other AI labs to invest less in evals/redteaming models before deployment?
Suppose OpenAI releases their next model 3 months earlier as a result. These 3 months need to come from *somewhere*, such as:
A. Pre-training
B. RLHF-like post-training
C. Redteaming/Evals
D. Product development/User Testing
OpenAI needs to release a model better than Claude-3, so cutting corners on pre-training or RLHF likely won't happen. It seems possible that (C) or (D) would be cut short. If I believed GPT-5 would end the world, I would be concerned about cutting corners on redteaming/evals. Most people are not concerned.
However, this could set a precedent of investing less in redteaming/evals for GPT-6 onwards until AGI, which could lead to the deployment of actually dangerous models (where counterfactually, these models would've been caught in evals).
Alternatively, investing less in redteaming/evals could lead to more of a Sydney moment for GPT-5, creating a backlash that instead pushes for more investment in redteaming/evals for the next-generation model.
Did releasing Claude-3 divert resources away from alignment research & into capabilities research?
If the alignment teams (or the 20% of GPUs pledged for superalignment) got repurposed for capabilities or productization, I would be quite concerned. We also would've heard if this happened! Additionally, it doesn't seem possible to convert alignment teams into capability teams efficiently, due to different skill sets & motivation.
However, *future* resources haven't been given out yet. OpenAI might counterfactually have invested more GPUs & researchers (either people switching from other teams or new hires) in alignment if they had a larger lead. Who knows!
Additionally, OpenAI can take resources from other parts of the company, such as business-to-business products, Sora, and other AI-related projects, in order to avoid backlash from cutting safety. But whether a repurposed team could actually help with capabilities research is very specific to that team. If this happens, then that does not seem bad for existential risk.
Releasing Very SOTA Models
Claude-3 isn't very far past the frontier, so OpenAI does have less pressure to make any drastic changes. If, however, Anthropic released a model as good as [whatever OpenAI would release by Jan 2025], then this could cause a bit of a re-evaluation of OpenAI's current plan. I could see a much larger percentage of future resources going to capabilities research & attempts to poach Anthropic employees in-the-know.
Anthropic at the Frontier is Good?
Hypothe...