AF - The Checklist: What Succeeding at AI Safety Will Involve by Sam Bowman
Published by Sam Bowman on September 3, 2024 on The AI Alignment Forum. Crossposted by habryka with Sam's permission; Sam is less likely to respond to comments here than if he had posted it himself.

Preface

This piece reflects my current best guess at the major goals that Anthropic (or another similarly positioned AI developer) will need to accomplish to have things go well with the development of broadly superhuman AI. Given my role and background, it's disproportionately focused on technical research and on averting emerging catastrophic risks.

For context, I lead a technical AI safety research group at Anthropic, and that group has a pretty broad and long-term mandate, so I spend a lot of time thinking about what kind of safety work we'll need over the coming years. This piece is my own opinionated take on that question, though it draws very heavily on discussions with colleagues across the organization: medium- and long-term AI safety strategy is the subject of countless leadership discussions and Google docs and lunch-table conversations within the organization, and this piece is a snapshot (shared with permission) of where those conversations sometimes go.

To be abundantly clear: nothing here is a firm commitment on behalf of Anthropic, and most people at Anthropic would disagree with at least a few major points here, but this can hopefully still shed some light on the kind of thinking that motivates our work.

Here are some of the assumptions that the piece relies on. I don't think any one of these is a certainty, but all of them are plausible enough to be worth taking seriously when making plans:

- Broadly human-level AI is possible. I'll often refer to this as transformative AI (or TAI), roughly defined as AI that could function as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D.[1]
- Broadly human-level AI (or TAI) isn't an upper bound on most AI capabilities that matter, and substantially superhuman systems could have an even greater impact on the world along many dimensions.
- If TAI is possible, it will probably be developed this decade, in a business and policy and cultural context that's not wildly different from today.
- If TAI is possible, it could be used to dramatically accelerate AI R&D, potentially leading to the development of substantially superhuman systems within just a few months or years after TAI.
- Powerful AI systems could be extraordinarily destructive if deployed carelessly, both because of new emerging risks and because of existing issues that become much more acute. This could be through misuse of weapons-related capabilities, by disrupting important balances of power in domains like cybersecurity or surveillance, or by any of a number of other means.
- Many systems at TAI and beyond, at least under the right circumstances, will be capable of operating more-or-less autonomously for long stretches in pursuit of big-picture, real-world goals. This magnifies these safety challenges.
- Alignment - in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy - requires some non-trivial effort to get right, and it gets harder as systems get more powerful.
Most of the ideas here ultimately come from outside Anthropic, and while I cite a few sources below, I've been influenced by far more writings and people than I can credit here or even keep track of.

Introducing the Checklist

This lays out what I think we need to do, divided into three chapters, based on the capabilities of our strongest models:

Chapter 1: Preparation

You are here. In this period, our best models aren't yet TAI. In the language of Anthropic's RSP, they're at AI Safety Level 2 (ASL-2), ASL-3, or maybe the early stages of ASL-4. Most of the wor...
AF - Survey: How Do Elite Chinese Students Feel About the Risks of AI? by Nick Corvino
Published by Nick Corvino on September 2, 2024 on The AI Alignment Forum.

Intro

In April 2024, my colleague and I (both affiliated with Peking University) conducted a survey involving 510 students from Tsinghua University and 518 students from Peking University - China's two top academic institutions. Our focus was on their perspectives regarding the frontier risks of artificial intelligence. In the People's Republic of China (PRC), publicly accessible survey data on AI is relatively rare, so we hope this report provides some valuable insights into how people in the PRC are thinking about AI (especially the risks). Throughout this post, I'll do my best to weave in other data reflecting the broader Chinese sentiment toward AI. For similar research, check out The Center for Long-Term Artificial Intelligence, YouGov, Monmouth University, The Artificial Intelligence Policy Institute, and notably, a poll conducted by Rethink Priorities, which closely informed our survey design.

You can read the full report published in the Jamestown Foundation's China Brief here: Survey: How Do Elite Chinese Students Feel About the Risks of AI?

Key Takeaways

- Students are more optimistic about the benefits of AI than concerned about the harms. 80 percent of respondents agreed or strongly agreed with the statement that AI will do more good than harm for society, with only 7.5 percent actively believing the harms could outweigh the benefits. This, similar to other polling, indicates that the PRC is one of the most optimistic countries concerning the development of AI.
- Students strongly believe the Chinese government should regulate AI. 85.31 percent of respondents believe AI should be regulated by the government, with only 6 percent actively believing it should not. This contrasts with trends seen in other countries, where there is typically a positive correlation between optimism about AI and calls for minimizing regulation. The strong support for regulation, even as optimism about AI remains high, suggests a distinct perspective on the role of government oversight in the PRC context.
- Students ranked AI the lowest among all possible existential threats to humanity. When asked about the most likely causes of human extinction, misaligned artificial intelligence received the lowest score. Nuclear war, natural disaster, climate change, and pandemics all proved more concerning for students.
- Students lean towards cooperation between the United States and the PRC as necessary for the safe and responsible development of AI. 60.7 percent of respondents believe AI will not be developed safely without cooperation between China and the U.S., with 25.68 percent believing it will develop safely regardless of the level of cooperation.
- Students are most concerned about the use of AI for surveillance. This was followed by misinformation, existential risk, wealth inequality, increased political tension, and various issues related to bias, with the suffering of artificial entities receiving the lowest score.
Background

As the recent decision (决定) document from the Third Plenum meetings in July made clear, AI is one of eight technologies that the Chinese Communist Party (CCP) leadership sees as critical for achieving "Chinese-style modernization (中国式现代化)," and it is central to the strategy of building the country's economic future around breakthroughs in frontier science (People's Daily, July 22). The PRC also seeks to shape international norms on AI, including on AI risks. In October 2023, Xi Jinping announced a "Global AI Governance Initiative (全球人工智能治理倡议)" (CAC, October 18, 2023). Tsinghua and Peking University are the two most prestigious universities in the PRC (by far), many of whose graduates will be very influential in shaping the cou...
AF - Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024) by Matt MacDermott
Published by Matt MacDermott on September 1, 2024 on The AI Alignment Forum.

Yoshua Bengio wrote a blogpost about a new AI safety paper by him, various collaborators, and me. I've pasted the text below, but first here are a few comments from me aimed at an AF/LW audience.

The paper is basically maths plus some toy experiments. It assumes access to a Bayesian oracle that can infer a posterior over hypotheses given data, and can also estimate probabilities for some negative outcome ("harm"). It proposes some conservative decision rules one could use to reject actions proposed by an agent, and proves probabilistic bounds on their performance under appropriate assumptions.

I expect the median reaction in these parts to be something like: OK, I'm sure there are various conservative decision rules you could apply using a Bayesian oracle, but isn't obtaining a Bayesian oracle the hard part here? Doesn't that involve advances in Bayesian machine learning, and also probably solving ELK to get the harm estimates?

My answer to that is: yes, I think so. I think Yoshua does too, and that that's the centre of his research agenda. Probably the main interest of this paper to people here is to provide an update on Yoshua's research plans. In particular, it gives some more context on what the "guaranteed safe AI" part of his approach might look like - design your system to do explicit Bayesian inference, and make an argument that the system is safe based on probabilistic guarantees about the behaviour of a Bayesian inference machine. This is in contrast to more hardcore approaches that want to do formal verification by model-checking. You should probably think of the ambition here as more like "a safety case involving proofs" than "a formal proof of safety".

Bounding the probability of harm from an AI to create a guardrail

Published 29 August 2024 by yoshuabengio

As we move towards more powerful AI, it becomes urgent to better understand the risks, ideally in a mathematically rigorous and quantifiable way, and to use that knowledge to mitigate them. Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees, i.e., would be provably unlikely to take a harmful action?

Current AI safety evaluations and benchmarks test the AI for cases where it may behave badly, e.g., by providing answers that could yield dangerous misuse. That is useful and should be legally required with flexible regulation, but it is not sufficient. These tests only tell us one side of the story: if they detect bad behavior, a flag is raised and we know that something must be done to mitigate the risks. However, if they do not raise such a red flag, we may still have a dangerous AI in our hands, especially since the testing conditions might be different from the deployment setting, and attackers (or an out-of-control AI) may be creative in ways that the tests did not consider. Most concerningly, AI systems could simply recognize they are being tested and have a temporary incentive to behave appropriately while being tested. Part of the problem is that such tests are spot checks: they are trying to evaluate the risk associated with the AI in general by testing it on special cases.
Another option would be to evaluate the risk on a case-by-case basis and reject queries or answers that are considered to potentially violate a safety specification. With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we thus consider in this new paper (see reference and co-authors below) the objective of estimating a context-dependent upper bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at ru...
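To give a feel for the kind of rule involved, here is a minimal toy sketch (my own illustration, not the construction or bounds from the paper): keep every hypothesis whose posterior is within some factor of the most probable one, take the worst-case harm estimate over that plausible set, and reject the agent's proposed action if the pessimistic bound exceeds a risk threshold. The finite hypothesis set, uniform prior, harm model, and threshold are all assumptions of the example.

```python
import math

# Toy sketch of a "reject if plausibly harmful" guardrail built on a
# Bayesian posterior. Illustrative only: the hypothesis set, harm model,
# and thresholds are assumptions, not the paper's construction.

def posterior(hypotheses, log_likelihood, data):
    """Posterior over a finite hypothesis set given data (uniform prior)."""
    log_post = {h: log_likelihood(h, data) for h in hypotheses}
    m = max(log_post.values())
    unnorm = {h: math.exp(lp - m) for h, lp in log_post.items()}
    z = sum(unnorm.values())
    return {h: w / z for h, w in unnorm.items()}

def cautious_harm_bound(post, harm_prob, action, alpha=0.1):
    """Worst-case harm probability over every hypothesis whose posterior
    is within a factor `alpha` of the most probable hypothesis."""
    cutoff = alpha * max(post.values())
    plausible = [h for h, p in post.items() if p >= cutoff]
    return max(harm_prob(h, action) for h in plausible)

def allow_action(post, harm_prob, action, risk_threshold=1e-3, alpha=0.1):
    """Accept the agent's proposed action only if the pessimistic bound
    on harm probability stays below the acceptable risk threshold."""
    return cautious_harm_bound(post, harm_prob, action, alpha) <= risk_threshold
```

The interesting content of the paper is in proving when a bound of this general shape actually upper-bounds the true harm probability; the sketch above only shows where a Bayesian oracle's posteriors and harm estimates would plug in.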
AF - Epistemic states as a potential benign prior by Tamsin Leake
Published by Tamsin Leake on August 31, 2024 on The AI Alignment Forum.

Malignancy in the prior seems like a strong crux of the goal-design part of alignment to me. Whether your prior is going to be used to model:

- processes in the multiverse containing a specific "beacon" bitstring,
- processes in the multiverse containing the AI,
- processes which would output all of my blog, so I can make it output more for me, or
- processes which match an AI chatbot's hypotheses about what it's talking with,

you have to sample hypotheses from somewhere; and typically, we want to use either Solomonoff induction or time-penalized versions of it such as Levin search (penalized by log of runtime) or what QACI uses (penalized by runtime, but with quantum computation available in some cases), or the implicit prior of neural networks (large sequences of multiplying by a matrix, adding a vector, and applying ReLU, often with a penalty related to how many non-zero weights are used). And the Solomonoff prior is famously malign. (A toy sketch of the runtime-penalized options appears at the end of this excerpt.)

(Alternatively, you could have Knightian uncertainty about parts of your prior that aren't nailed down enough, and then do maximin over your Knightian uncertainty (like in infra-Bayesianism), but then you're not guaranteed that your AI gets anywhere at all; its Knightian uncertainty might remain so immense that the AI keeps picking the null action all the time because some of its Knightian hypotheses still say that anything else is a bad idea. Note: I might be greatly misunderstanding Knightian uncertainty!)

(It does seem plausible that doing geometric expectation over hypotheses in the prior helps "smooth things over" in some way, but I don't think this particularly removes the weight of malign hypotheses in the prior? It just allocates their steering power in a different way, which might make things less bad, but it sounds difficult to quantify.)

It does feel to me like we do want a prior for the AI to do expected value calculations over, either for prediction or for utility maximization (or quantilization or whatever).

One helpful aspect of prior-distribution-design is that, in many cases, I don't think the prior needs to contain the true hypothesis. For example, if the problem that we're using a prior for is to model processes which match an AI chatbot's hypotheses about what it's talking with, then we don't need the AI's prior to contain a process which behaves just like the human user it's interacting with; rather, we just need the AI's prior to contain a hypothesis which:

- is accurate enough to match observations, and
- is accurate enough to capture the fact that the user (if we pick a good user) implements the kind of decision theory that lets us rely on them pointing back to the actual real physical user when they get empowered - i.e. in CEV(user-hypothesis), user-hypothesis builds and then runs CEV(physical-user), because that's what the user would do in such a situation.

Let's call this second criterion "cooperating back to the real user". So we need a prior which:

- Has at least some mass on hypotheses which correspond to observations, cooperate back to the real user, and can eventually be found by the AI, given enough evidence (enough chatting with the user). Call this the "aligned hypothesis".
- Before it narrows down the hypothesis space to mostly just aligned hypotheses, doesn't give enough weight to demonic hypotheses which output whichever predictions cause the AI to brainhack its physical user, or escape using rowhammer-type hardware vulnerabilities, or other failures like that.

Formalizing the chatbot model

First, I'll formalize this chatbot model. Let's say we have a magical inner-aligned "soft" math-oracle: which, given a "scoring" mathematical function from a non-empty set a to real numbers (not necessarily one that is tractably ...
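To make the runtime-penalized priors mentioned near the top of this excerpt concrete, here is a toy sketch (my own illustration, not from the post or from QACI): a Levin-style weight penalizes each program by its description length plus the log of its runtime, so a short but very slow program can end up with less prior mass than a longer, faster one. Swapping the log-of-runtime term for the raw runtime gives the harsher penalty the post attributes to QACI.

```python
import math

def levin_style_weight(length_bits, runtime_steps):
    """Unnormalized prior weight 2^-(length + log2(runtime)): penalized by
    description length and by the log of runtime (Levin-search-style)."""
    return 2.0 ** -(length_bits + math.log2(runtime_steps))

def normalize(weights):
    """Turn unnormalized weights over a finite hypothesis set into a prior."""
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

# Toy comparison: a 10-bit program taking ~a million steps ends up with
# less weight than a 25-bit program taking 16 steps.
prior = normalize({
    "short_but_slow": levin_style_weight(10, 2**20),   # 2^-(10+20) = 2^-30
    "longer_but_fast": levin_style_weight(25, 2**4),   # 2^-(25+4) = 2^-29
})
```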
AF - AIS terminology proposal: standardize terms for probability ranges by Egg Syntax
Published by Egg Syntax on August 30, 2024 on The AI Alignment Forum.

Summary: The AI safety research community should adopt standardized terms for probability ranges, especially in public-facing communication and especially when discussing risk estimates. The terms used by the IPCC are a reasonable default.

Science communication is notoriously hard. It's hard for a lot of reasons, but one is that laypeople aren't used to thinking in numerical probabilities or probability ranges. One field that's had to deal with this more than most is climatology; climate change has been rather controversial, and a non-trivial aspect of that has been lay confusion about what climatologists are actually saying[1]. As a result, the well-known climate assessment reports from the UN's Intergovernmental Panel on Climate Change (IPCC) have, since the 1990s, used explicitly defined terms for probability ranges[2] (see below for the full figure[3]).

Like climatology, AI safety research has become a topic of controversy. In both cases, the controversy includes a mix of genuine scientific disagreement, good-faith confusion, and bad-faith opposition. Scientific disagreement comes from people who can deal with numerical probability ranges. Those who are arguing in bad faith from ulterior motives generally don't care about factual details. But I suspect that the large majority of those who disagree, especially laypeople, are coming from a place of genuine, good-faith confusion. For those people, anything we as practitioners can do to communicate more clearly is quite valuable.

Also like climatology, AI safety research, especially assessments of risk, fundamentally involves communicating about probabilities and probability ranges. Therefore I propose that the AIS community follow climatologists in adopting standard terms for probability ranges, especially in position papers and public-facing communication. In less formal and less public-facing contexts, using standard terminology still adds some value but is less important; in sufficiently informal contexts it's probably not worth the hassle of looking up the standard terminology. Of course, in many cases it's better to just give the actual numerical range! But especially in public-facing communication it can be more natural to use natural-language terms, and in fact this is already often done. I'm only proposing that when we do use natural-language terms for probability ranges, we use them in a consistent and interpretable way (feel free to link to this post as a reference for interpretation, or point to the climatology papers cited below[2]).

Should the AIS community use the same terms? That's a slightly harder question. The obvious first-pass answer is 'yes'; it's a natural Schelling point, and terminological consistency across fields is generally preferable when practically possible. The IPCC terms also have the significant advantage of being battle-tested; they've been used over a thirty-year period in a highly controversial field, and terms have been refined when they were found to be insufficiently clear. The strongest argument I see against using the same terms is that the AIS community sometimes needs to deal with more extreme (high or low) risk estimates than these.
If we use 'virtually certain' to mean 99-100%, what terms can we use for 99.9-100.0%, or 99.99-100.00%? On the other hand, plausibly once we're dealing with such extreme risk estimates, it's increasingly important to communicate them with actual numeric ranges. My initial proposal is to adopt the IPCC terms, but I'm very open to feedback, and if someone has an argument I find compelling (or which gets strong agreement in votes) for a different or extended set of terms, I'll add it to the proposal. If no su...
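As a rough illustration of what adopting the convention could look like in tooling or write-ups, here is a small helper that maps a point probability to IPCC-style calibrated language. The cutoffs reflect my reading of the commonly cited IPCC likelihood scale and should be checked against the guidance cited above[2] before being relied on.

```python
def ipcc_likelihood_term(p):
    """Map a probability in [0, 1] to IPCC-style calibrated language.
    Cutoffs follow the commonly cited IPCC likelihood scale; verify
    against the cited guidance note before using in a report."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if p > 0.99:
        return "virtually certain"
    if p > 0.90:
        return "very likely"
    if p > 0.66:
        return "likely"
    if p >= 0.33:
        return "about as likely as not"
    if p >= 0.10:
        return "unlikely"
    if p >= 0.01:
        return "very unlikely"
    return "exceptionally unlikely"

# e.g. ipcc_likelihood_term(0.7) -> "likely"
```

Note that the IPCC bands nest (for instance, 'very likely' is a subset of 'likely'), so a helper like this has to pick the narrowest applicable term.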