AF - Contra papers claiming superhuman AI forecasting by nikos
Published by nikos on September 12, 2024 on The AI Alignment Forum.

[Conflict of interest disclaimer: We are FutureSearch, a company working on AI-powered forecasting and other types of quantitative reasoning. If thin LLM wrappers could achieve superhuman forecasting performance, this would obsolete a lot of our work.]

Widespread, misleading claims about AI forecasting

Recently we have seen a number of papers (Schoenegger et al., 2024; Halawi et al., 2024; Phan et al., 2024; Hsieh et al., 2024) with claims that boil down to "we built an LLM-powered forecaster that rivals human forecasters or even shows superhuman performance". These papers do not communicate their results carefully enough, shaping public perception in inaccurate and misleading ways. Some examples of public discourse:

Ethan Mollick (>200k followers) tweeted about the paper Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy by Schoenegger et al.

A post on Marginal Revolution with the title and abstract of the paper Approaching Human-Level Forecasting with Language Models by Halawi et al. elicits responses like "This is something that humans are notably terrible at, even if they're paid to do it. No surprise that LLMs can match us." and "+1 The aggregate human success rate is a pretty low bar".

A Twitter thread on LLMs Are Superhuman Forecasters by Phan et al., claiming that "AI […] can predict the future at a superhuman level", accumulated more than half a million views within two days of being published.

The number of such papers on AI forecasting, and the vast amount of traffic on misleading claims, make AI forecasting a uniquely misunderstood area of AI progress. And it's one that matters.

What does human-level or superhuman forecasting mean?

"Human-level" or "superhuman" is a hard-to-define concept. In an academic context, we need to work with a reasonable operationalization to compare the skill of an AI forecaster with that of humans. One reasonable and practical definition of a superhuman AI forecaster is:

The AI forecaster is able to consistently outperform the crowd forecast on a sufficiently large number of randomly selected questions on a high-quality forecasting platform.[1]

(For a human-level forecaster, just replace "outperform" with "perform on par with".)

Red flags for claims to (super)human AI forecasting accuracy

Our experience suggests there are a number of things that can go wrong when building AI forecasting systems, including:

1. Failing to find up-to-date information on the questions. For most questions, forecasts cannot be good without basic information. Imagine trying to forecast the US presidential election without knowing that Biden dropped out.

2. Drawing on up-to-date, but low-quality information. Ample experience shows that low-quality information confuses LLMs even more than it confuses humans. Imagine forecasting election outcomes with biased polling data. Or, worse, imagine forecasting OpenAI revenue based on claims like

> The number of ChatGPT Plus subscribers is estimated between 230,000-250,000 as of October 2023.

without realising that this mixes up ChatGPT and ChatGPT mobile.

3. Lack of high-quality quantitative reasoning.
For a decent number of questions on Metaculus, good forecasts can be "vibed" by skilled humans and perhaps LLMs. But for many questions, simple calculations are likely essential. Human forecasting performance shows that systematic accuracy nearly always requires simple models such as base rates, time-series extrapolations, and domain-specific numbers. Imagine forecasting stock prices without having, and using, historical volatility. (A minimal sketch of this kind of calculation follows at the end of this list.)

4. Retrospective, rather than prospective, forecasting (e.g. forecasting questions that have al...
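To make the quantitative-reasoning point in item 3 concrete, here is a minimal sketch (not from the post; the function names and all numbers are hypothetical) of the kind of simple models meant here: a Laplace-smoothed base rate and a linear time-series extrapolation.

```python
import numpy as np

def base_rate(past_events: int, past_opportunities: int) -> float:
    """Historical frequency as a first-pass probability (Laplace-smoothed)."""
    return (past_events + 1) / (past_opportunities + 2)

def trend_extrapolation(values: list[float], steps_ahead: int) -> float:
    """Linear extrapolation of a short time series, e.g. quarterly revenue."""
    x = np.arange(len(values))
    slope, intercept = np.polyfit(x, values, 1)
    return slope * (len(values) - 1 + steps_ahead) + intercept

# Hypothetical example: an event occurred in 6 of the last 10 comparable cases.
print(base_rate(past_events=6, past_opportunities=10))  # ~0.58

# Hypothetical example: extrapolate a made-up revenue series two quarters ahead.
print(trend_extrapolation([1.0, 1.3, 1.7, 2.1], steps_ahead=2))
```

Such numbers are only a starting point, but they anchor a forecast in a way that pure "vibing" does not.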
AF - AI forecasting bots incoming by Dan H
Published by Dan H on September 9, 2024 on The AI Alignment Forum.

In a recent appearance on Conversations with Tyler, famed political forecaster Nate Silver expressed skepticism about AIs replacing human forecasters in the near future. When asked how long it might take for AIs to reach superhuman forecasting abilities, Silver replied: "15 or 20 [years]."

In light of this, we are excited to announce "FiveThirtyNine," an AI forecasting bot. Our bot, built on GPT-4o, provides probabilities for any user-entered query, including "Will Trump win the 2024 presidential election?" and "Will China invade Taiwan by 2030?" Our bot performs better than experienced human forecasters and performs roughly the same as (and sometimes even better than) crowds of experienced forecasters; since crowds are for the most part superhuman, FiveThirtyNine can be considered superhuman in a similar sense. (We discuss limitations later in this post.)

Our bot and other forecasting bots can be used in a wide variety of contexts. For example, these AIs could help policymakers minimize bias in their decision-making or help improve global epistemics and institutional decision-making by providing trustworthy, calibrated forecasts. We hope that forecasting bots like ours will be quickly integrated into frontier AI models. For now, we will keep our bot available at forecast.safe.ai, where users are free to experiment and test its capabilities.

Quick Links

Demo: forecast.safe.ai
Technical Report: link

Problem

Policymakers at the highest echelons of government and corporate power have difficulty making high-quality decisions on complicated topics. As the world grows increasingly complex, even coming to a consensus agreement on basic facts is becoming more challenging, as it can be hard to absorb all the relevant information or know which sources to trust. Separately, online discourse could be greatly improved. Discussions on uncertain, contentious issues all too often devolve into battles between interest groups, each intent on name-calling and spouting the most extreme versions of their views through highly biased op-eds and tweets.

FiveThirtyNine

Before transitioning to how forecasting bots like FiveThirtyNine can help improve epistemics, it might be helpful to give a summary of what FiveThirtyNine is and how it works.

FiveThirtyNine can be given a query - for example, "Will Trump win the 2024 US presidential election?" FiveThirtyNine is prompted to behave like an "AI that is superhuman at forecasting". It is then asked to make a series of search engine queries for news and opinion articles that might contribute to its prediction. (The following example from FiveThirtyNine uses GPT-4o as the base LLM.) Based on these sources and its wealth of prior knowledge, FiveThirtyNine compiles a summary of key facts. Given these facts, it's asked to give reasons for and against Trump winning the election, before weighing each reason based on its strength and salience. Finally, FiveThirtyNine aggregates its considerations while adjusting for negativity and sensationalism bias in news sources and outputs a tentative probability. It is asked to sanity-check this probability and adjust it up or down based on further reasoning, before putting out a final, calibrated probability - in this case, 52%.
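As a rough illustration of the pipeline just described, here is a hypothetical sketch of how such a forecasting bot could be structured. This is not the authors' code: the prompts, function names, and the `ask_llm` / `search_news` placeholders are assumptions; the real prompts and retrieval setup are in the technical report.

```python
# Hypothetical sketch of a FiveThirtyNine-style pipeline (not the authors' code;
# prompts, function names, and the retrieval step are placeholders).
from dataclasses import dataclass

@dataclass
class Forecast:
    probability: float
    rationale: str

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to the base LLM (e.g. GPT-4o via an API client)."""
    raise NotImplementedError

def search_news(query: str, before: str) -> list[str]:
    """Placeholder for a news search restricted to articles published before `before`."""
    raise NotImplementedError

def forecast(question: str, cutoff_date: str) -> Forecast:
    # 1. Generate search engine queries relevant to the question.
    queries = ask_llm(f"List search queries that would inform a forecast of: {question}").splitlines()
    articles = [a for q in queries for a in search_news(q, before=cutoff_date)]

    # 2. Summarize key facts from the retrieved sources plus prior knowledge.
    facts = ask_llm(f"Summarize the key facts for forecasting '{question}':\n" + "\n".join(articles))

    # 3. Reasons for and against, weighed by strength and salience.
    reasons = ask_llm(f"Given these facts:\n{facts}\nGive weighed reasons for and against: {question}")

    # 4. Tentative probability, adjusting for negativity/sensationalism bias in news sources.
    tentative = ask_llm(f"Aggregate into a probability, correcting for media negativity bias:\n{reasons}")

    # 5. Sanity check and output a final calibrated probability (assumes an answer like "52%").
    final = ask_llm(f"Sanity-check this probability and give a final calibrated percentage:\n{tentative}")
    return Forecast(probability=float(final.strip().rstrip("%")) / 100, rationale=reasons)
```

In a real deployment, `ask_llm` would wrap an API call to the base model and `search_news` would wrap a news search API with a publication-date filter.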
Evaluation

To test how well our bot performs, we evaluated it on questions from the Metaculus forecasting platform. We restricted the bot to making predictions using only the information human forecasters had, ensuring a valid comparison. Specifically, GPT-4o is only trained on data up to October 2023, and we restricted the news and opinion articles it could access to only those published before a certain date. From there, we asked it to compute the probabilities of 177 events from Metaculus that had happened (or not ha...
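For readers who want to see what such an evaluation looks like mechanically, here is a small sketch using the Brier score, a standard proper scoring rule for comparing a bot's probabilities against the crowd's on resolved questions. The excerpt above does not specify the exact metric used, and the numbers below are made up.

```python
def brier_score(probability: float, outcome: int) -> float:
    """Squared error between a forecast probability and the 0/1 outcome; lower is better."""
    return (probability - outcome) ** 2

# Hypothetical resolved questions: (bot probability, crowd probability, outcome).
results = [
    (0.80, 0.70, 1),
    (0.10, 0.25, 0),
    (0.55, 0.60, 1),
]

bot_score = sum(brier_score(p_bot, y) for p_bot, _, y in results) / len(results)
crowd_score = sum(brier_score(p_crowd, y) for _, p_crowd, y in results) / len(results)
print(f"bot: {bot_score:.3f}, crowd: {crowd_score:.3f}")  # the lower average score wins
```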
AF - Backdoors as an analogy for deceptive alignment by Jacob Hilton
Published by Jacob Hilton on September 6, 2024 on The AI Alignment Forum.

ARC has released a paper, Backdoor defense, learnability and obfuscation, in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between backdoors and deceptive alignment, the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later. In our paper, we prove several theoretical results that shed some light on possible mitigations for deceptive alignment, albeit in a way that is limited by the strength of this analogy.

In this post, we will:

Lay out the analogy between backdoors and deceptive alignment
Discuss prior theoretical results from the perspective of this analogy
Explain our formal notion of backdoors and its strengths and weaknesses
Summarize the results in our paper and discuss their implications for deceptive alignment

Thanks to Boaz Barak, Roger Grosse, Thomas Read, John Schulman and Gabriel Wu for helpful comments.

Backdoors and deceptive alignment

A backdoor in an ML model is a modification to the model that causes it to behave differently on certain inputs that activate a secret "trigger", while behaving similarly on ordinary inputs. There is a wide existing literature on backdoor attacks and defenses, which is primarily empirical, but also includes some theoretical results that we will mention.

Deceptive alignment is a term from the paper Risks from Learned Optimization in Advanced Machine Learning Systems (Section 4) that refers to the possibility that an AI system will internally reason about the objective that it is being trained on, and decide to perform well according to that objective unless there are clues that it has been taken out of its training environment. Such a policy could be optimal on the training distribution, and yet perform very badly on certain out-of-distribution inputs where such clues are present, which we call defection triggers.[1] The opposite of deceptive alignment is robust alignment, meaning that this performance degradation is avoided.

Since a deceptively aligned model and a robustly aligned model behave very differently on defection triggers, but very similarly on typical inputs from the training distribution, deceptive alignment can be thought of as a special kind of backdoor, under the following correspondence:

Deceptive alignment        | Backdoors
Robustly aligned model     | Original (unmodified) model
Deceptively aligned model  | Backdoored model
Defection trigger          | Backdoor trigger

The main distinguishing feature of deceptive alignment compared to other kinds of backdoors is that the deceptively aligned model is not produced by an adversary, but is instead produced through ordinary training. Thus by treating deceptive alignment as a backdoor, we are modeling the training process as an adversary. In our analysis of deceptive alignment, the basic tension we will face is that an unconstrained adversary will always win, but any particular proxy constraint we impose on the adversary may be unrealistic.

Static backdoor detection

An important piece of prior work is the paper Planting Undetectable Backdoors in Machine Learning Models, which uses a digital signature scheme to insert an undetectable backdoor into a model.
Roughly speaking, the authors exhibit a modified version of a "Random Fourier Features" training algorithm that produces a backdoored model. Any input to the backdoored model can be perturbed by an attacker with knowledge of a secret key to produce a new input on which the model behaves differently. However, the backdoor is undetectable in the sense that it is computationally infeasible for a defender with white-box access to distinguish a backdoored model from an or...
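As a loose illustration only: the following toy sketch shows a model whose backdoor is keyed to a secret, so that anyone holding the key can perturb an input to flip the output while ordinary inputs are unaffected. This is not the paper's Random Fourier Features construction, and unlike that construction it is trivially detectable with white-box access (the key sits in the code); all names are hypothetical.

```python
import hmac
import hashlib

SECRET_KEY = b"attacker-only-key"  # hypothetical; known to the attacker, not the defender

def base_model(x: bytes) -> int:
    """Stand-in for the original model's behavior (here: a trivial binary classifier)."""
    return len(x) % 2

def tag(x: bytes) -> bytes:
    """Attacker-side perturbation: append a MAC over the input under the secret key."""
    return x + hmac.new(SECRET_KEY, x, hashlib.sha256).digest()

def backdoored_model(x: bytes) -> int:
    """Behaves like base_model unless the input carries a valid tag (the trigger)."""
    payload, mac = x[:-32], x[-32:]
    if len(x) > 32 and hmac.compare_digest(mac, hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()):
        return 1 - base_model(payload)  # flipped behavior on triggered inputs
    return base_model(x)

x = b"ordinary input"
assert backdoored_model(x) == base_model(x)        # behaves normally on ordinary inputs
assert backdoored_model(tag(x)) != base_model(x)   # the key holder flips the output at will
```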
AF - Conflating value alignment and intent alignment is causing confusion by Seth Herd
Published by Seth Herd on September 5, 2024 on The AI Alignment Forum.

Epistemic status: I think something like this confusion is happening often. I'm not saying these are the only differences in what people mean by "AGI alignment".

Summary: Value alignment is better but probably harder to achieve than personal intent alignment to the short-term wants of some person(s). Different groups and people tend to primarily address one of these alignment targets when they discuss alignment. Confusion abounds. One important confusion stems from an assumption that the type of AI defines the alignment target: strong goal-directed AGI must be value aligned or misaligned, while personal intent alignment is only viable for relatively weak AI. I think this assumption is important but false. While value alignment is categorically better, intent alignment seems easier, safer, and more appealing in the short term, so AGI project leaders are likely to try it.[1]

Overview

Clarifying what people mean by alignment should dispel some illusory disagreement, and clarify alignment theory and predictions of AGI outcomes.

[Figure caption: Venn diagram of three types of alignment targets. Value alignment and Personal intent alignment are both subsets of Evan Hubinger's definition of intent alignment: AGI aligned with human intent in the broadest sense.]

Prosaic alignment work usually seems to be addressing a target somewhere in the neighborhood of personal intent alignment (following instructions or doing what this person wants now), while agent foundations and other conceptual alignment work usually seems to be addressing value alignment. Those two clusters have different strengths and weaknesses as alignment targets, so lumping them together produces confusion.

People mean different things when they say alignment. Some are mostly thinking about value alignment (VA): creating sovereign AGI that has values close enough to humans' for our liking. Others are talking about making AGI that is corrigible (in the Christiano or Harms sense)[2] or follows instructions from its designated principal human(s). I'm going to use the term personal intent alignment (PIA) until someone has a better term for that type of alignment target. Different arguments and intuitions apply to these two alignment goals, so talking about them without differentiation is creating illusory disagreements.

Value alignment is better almost by definition, but personal intent alignment seems to avoid some of the biggest difficulties of value alignment. Max Harms' recent sequence on corrigibility as a singular target (CAST) gives both a nice summary and detailed arguments. It does not require us to point to or define values, just short-term preferences or instructions. The principal advantage is that an AGI that follows instructions can be used as a collaborator in improving its alignment over time; you don't need to get it exactly right on the first try. This is more helpful in slower and more continuous takeoffs. This means that PI alignment has a larger basin of attraction than value alignment does.[3]

Most people who think alignment is fairly achievable seem to be thinking of PIA, while critics often respond thinking of value alignment.
It would help to be explicit. PIA is probably easier and more likely than full VA for our first stabs at AGI, but there are reasons to wonder if it's adequate for real success. In particular, there are intuitions and arguments that PIA doesn't address the real problem of AGI alignment. I think PIA does address the real problem, but in a non-obvious and counterintuitive way.

Another unstated divide

There's another important clustering around these two conceptions of al...
AF - Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception? by David Scott Krueger
Published by David Scott Krueger on September 4, 2024 on The AI Alignment Forum.

AI systems up to some high level of intelligence plausibly need to know exactly where they are in space-time in order for deception/"scheming" to make sense as a strategy. This is because they need to know:

1) what sort of oversight they are subject to, and
2) what effects their actions will have on the real world.

(Side note: acausal trade might break this argument.)

There are a number of informal proposals to keep AI systems selectively ignorant of (1) and (2) in order to prevent deception. Those proposals seem very promising to flesh out; I'm not aware of any rigorous work doing so, however. Are you?