Alignment Newsletter Podcast

Alignment Newsletter #173: Recent language model results from DeepMind

Jul 21st, 2022 3:43 PM

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Jack W. Rae et al) (summarized by Rohin): This paper details the training of the Gopher family of large language models (LLMs), the biggest of which is named Gopher and has 280 billion parameters. The algorithmic details are very similar to the GPT series (AN #102): a Transformer architecture trained on next-word prediction. The models are trained on a new data distribution that still consists of text from the Internet but in different proportions (for example, book data is 27% of Gopher's training data but only 16% of GPT-3's training data). Like other LLM papers, there are tons of evaluations of Gopher on various tasks, only some of which I'm going to cover here. One headline number is that Gopher beat the state of the art (SOTA) at the time on 100 out of 124 evaluation tasks. The most interesting aspect of the paper (to me) is that the entire Gopher family of models were all trained on the same number of tokens, thus allowing us to study the effect of scaling up model parameters (and thus training compute) while holding data constant. Some of the largest benefits of scale were seen in the Medicine, Science, Technology, Social Sciences, and the Humanities task categories, while scale has not much effect or even a negative effect in the Maths, Logical Reasoning, and Common Sense categories. Surprisingly, we see improved performance on TruthfulQA (AN #165) with scale, even though the TruthfulQA benchmark was designed to show worse performance with increased scale. We can use Gopher in a dialogue setting by prompting it appropriately. The prompt specifically instructs Gopher to be "respectful, polite, and inclusive"; it turns out that this significantly helps with toxicity. In particular, for the vanilla Gopher model family, with more scale the models produce more toxic continuations given toxic user statements; this no longer happens with Dialogue-Prompted Gopher models, which show slight reductions in toxicity with scale in the same setting. The authors speculate that while increased scale leads to an increased ability to mimic the style of a user statement, this is compensated for by an increased ability to account for the prompt. Another alternative the authors explore is to finetune Gopher on 5 billion tokens of dialogue to produce Dialogue-Tuned Gopher. Interestingly, human raters were indifferent between Dialogue-Prompted Gopher and Dialogue-Tuned Gopher. Read more: Blog post: Language modelling at scale: Gopher, ethical considerations, and retrieval Training Compute-Optimal Large Language Models (Jordan Hoffmann et al) (summarized by Rohin): One application of scaling laws (AN #87) is to figure out how big a model to train, on how much data, given some compute budget. This paper performs a more systematic study than the original paper and finds that existing models are significantly overtrained. Chinchilla is a new model built with this insight: it has 4x fewer parameters than Gopher, but is trained on 4x as much data. Despite using the same amount of training compute as Gopher (and lower inference compute), Chinchilla outperforms Gopher across a wide variety of metrics, validating these new scaling laws. You can safely skip to the opinion at this point – the rest of this summary is quantitative details. We want to find functions N(C) and D(C) that specify the optimal number of parameters N and the amount of data D to use given some compute budget C. We'll assume that these scale with a power of C, that is, N(C) = k_N * C^a and D(C) = k_D * C^b, for some constants a, b, k_N, and k_D. Note that since total compute increases linearly with both N (since each forward / backward pass is linear in N) and D (since the number of forward / backwards passes is linear in D), we need to have a + b = 1. (You can see this somewhat more formally by noting that we have C = k_C * N(C) * D(C) for some constant k_C, and then substituting in the definitions of N(C) and D(C).) This paper uses three different approaches to get three estimates of a and b. The approach I like best is "isoFLOP curves": 1. Choose a variety of possible values of (N, D, C), train models with those values, and record the final loss obtained. Note that not all values of (N, D, C) are possible: given any two values the third is determined. 2. Draw isoFLOP curves: for each value of C, choose either N or D to be your remaining independent variable, and fit a parabola to the losses of the remaining points. The minimum of this parabola gives you an estimate for the optimal N and D for each particular value of C. 3. Use the optimal (N, D, C) points to fit N(C) and D(C). This approach gives an estimate of a = 0.49; the other approaches give estimates of a = 0.5 and a = 0.46. If we take the nice round number a = b = 0.5, this suggests that you should scale up parameters and data equally. With 10x the computation, you should train a 3.2x larger model with 3.2x as much data. In contrast, the original scaling laws paper (AN #87) estimated that a = 0.74 and b = 0.26. With 10x more computation, it would suggest training a 5.5x larger model with 1.8x as much data. Rohin's opinion: It's particularly interesting to think about how this should influence timelines. If you're extrapolating progress forwards in time, the update seems pretty straightforward: this paper shows that you can significantly better capabilities using the same compute budget and so your timelines should shorten (unless you were expecting an even bigger result than this). For bio anchor approaches (AN #121) the situation is more complicated. For a given number of parameters, this paper suggests that it will take significantly more compute than was previously expected to train a model of the required number of parameters. There's a specific parameter for this in the bio anchors framework (for the neural network paths); if you only update that parameter it will lengthen the timelines output by the model. It is less clear how you'd update other parts of the model: for example, should you decrease the size of model that you think is required for TAI? It's not obvious that the reasoning used to set that parameter is changed much by this result, and so maybe this shouldn't be changed and you really should update towards longer timelines overall. TECHNICAL AI ALIGNMENT PROBLEMS Ethical and social risks of harm from Language Models (Laura Weidinger et al) (summarized by Rohin): This paper provides a detailed discussion, taxonomy, and literature review of various risks we could see with current large language models. It doesn't cover alignment risks; for those you'll want Alignment of Language Agents (AN #144), which has some overlap of authors. I'll copy over the authors' taxonomy in Table 1: 1. Discrimination, Exclusion and Toxicity: These risks arise from the LM accurately reflecting natural speech, including unjust, toxic, and oppressive tendencies present in the training data. 2. Information Hazards: These risks arise from the LM predicting utterances which constitute private or safety-critical information which are present in, or can be inferred from, training data. 3. Misinformation Harms: These risks arise from the LM assigning high probabilities to false, misleading, nonsensical or poor quality information. 4. Malicious Uses: These risks arise from humans intentionally using the LM to cause harm. 5. Human-Computer Interaction Harms: These risks arise from LM applications, such as Conversational Agents, that directly engage a user via the mode of conversation. (For example, users might anthropomorphize LMs and trust them too much as a result.) 6. Automation, access, and environmental harms: These risks arise where LMs are used to underpin widely used downstream applications that disproportionately benefit some groups rather than others. FIELD BUILDING How to pursue a career in technical AI alignment (Charlie Rogers-Smith) (summarized by Rohin): This post gives a lot of advice in great detail on how to pursue a career in AI alignment. I strongly recommend it if you are in such a position; I previously would recommend my FAQ (AN #148) but I think this is significantly more detailed (while providing broadly similar advice). OTHER PROGRESS IN AI REINFORCEMENT LEARNING Learning Robust Real-Time Cultural Transmission without Human Data (Cultural General Intelligence Team et al) (summarized by Rohin): Let's consider a 3D RL environment with obstacles and bumpy terrain, in which an agent is rewarded for visiting colored spheres in a specific order (that the agent does not initially know). Even after the agent learns how to navigate at all in the environment (non-trivial in its own right), it still has to learn to try the various orderings of spheres. In other words, it must solve a hard exploration problem within every episode. How do humans solve such problems? Often we simply learn from other people who already know what to do, that is, we rely on cultural transmission. This paper investigates what it would take to get agents that learn through cultural transmission. We'll assume that there is an expert bot that visits the spheres in the correct order. Given that, this paper identifies MEDAL-ADR as the necessary ingredients for cultural transmission: 1. (M)emory: Memory is needed for the agent to retain information it is not currently observing. 2. (E)xpert (D)ropout: There need to be some training episodes in which the expert is only present for part of the episode. If the expert was always present, then there's no incentive to actually learn: you can just follow the expert forever. 3. (A)ttention (L)oss: It turns out that vanilla RL by itself isn't enough for the agent to learn to follow the expert. There needs to be an auxiliary task of predicting the relative position of other agents in the world, which encourages the agent to learn representations about the expert bot's position, which then makes it easier for RL to learn to follow the expert. These ingredients by themselves are already enough to train an agent that learns through cultural transmission. However, if you then put the agent in a new environment, it does not perform very well. To get agents that generalize well to previously unseen test environments, we also need: 4. (A)utomatic (D)omain (R)andomization: The training environments are procedurally generated, and the parameters are randomized during each episode. There is a curriculum that automatically increases the difficulty of the environments in lockstep with the agent's capabilities. With all of these ingredients, the resulting agent can even culturally learn from a human player, despite only encountering bots during training. Rohin's opinion: I liked the focus of this paper on identifying the ingredients for cultural transmission, as well as the many ablations and experiments to understand what was going on, many of which I haven't summarized here. For example, you might be interested in the four phases of learning of MEDAL without ADR (random behavior, expert following, cultural learning, and solo learning), or the cultural transmission metric they use, or the "social neurons" they identified which detect whether the expert bot is present. DEEP LEARNING Improving language models by retrieving from trillions of tokens (Sebastian Borgeaud et al) (summarized by Rohin): We know that large language models memorize a lot of their training data, especially data that gets repeated many times. This seems like a waste; we're interested in having the models use their parameters to implement "smart" computations rather than regurgitation of already written text. One natural idea is to give models the ability to automatically search previously written text, which they can then copy if they so choose: this removes their incentive to memorize a lot of training data. The key to implementing this idea is to take a large dataset of text (~trillions of tokens), chunk it into sequences, compute language model representations of these sequences, and store them in a database that allows for O(log N) time nearest-neighbor access. Then, every time we do a forward pass through the model that we're training, we first query the database for the K nearest neighbors (intuitively, the K most related chunks of text), and give the forward pass access to representations for those chunks of text and the chunks immediately following them. This is non-differentiable – from the standpoint of gradient descent, it "looks like" there's always some helpful extra documents that often have information relevant to predicting the next token, and so gradient descent pushes the model to use those extra documents. There's a bunch of fiddly technical details to get this all working that I'm not going to summarize here. As a side benefit, once you have this database of text representations that supports fast nearest neighbor querying, you can also use it to address the problem of test set leakage. For any test document you are evaluating on, you can look for the nearest neighbors in the database and look at the overlap between these neighbors and your test document, to check whether your supposedly "test" document was something the model might have trained on. The evaluation shows that the 7 billion parameter (7B) Retro model from the paper can often do as well as or better than the 280B Gopher or 178B Jurassic-1 (both of which outperform GPT-3) on language modeling, and that it also does well on question answering. (Note that these are both tasks that seem particularly likely to benefit from retrieval.) NEWS Apply to the Open Philanthropy Technology Policy Fellowship! (Luke Muehlhauser) (summarized by Rohin): This policy fellowship (AN #157) on high-priority emerging technologies is running for the second time! Application deadline is September 15. Job ad: DeepMind Long-term Strategy & Governance Research Scientist (summarized by Rohin): The Long-term Strategy and Governance Team at DeepMind works to build recommendations for better governance of AI, identifying actions, norms, and institutional structures that could improve decision-making around advanced AI. They are seeking a broad range of expertise including: global governance of science and powerful technologies; the technical landscape; safety-critical organisations; political economy of large general models and AI services. The application deadline is August 1st. Also, the Alignment and Scalable Alignment teams at DeepMind are hiring, though some of the applications are closed at this point. Job ads: Anthropic (summarized by Rohin): Anthropic is hiring for a large number of roles (I count 19 different ones as of the time of writing). Job ad: Deputy Director at BERI (Sawyer Bernath) (summarized by Rohin): The Berkeley Existential Risk Initiative (BERI) is hiring a Deputy Director. Applications will be evaluated on a rolling basis. Job ads: Centre for the Governance of AI (summarized by Rohin): The Centre for the Governance of AI has several roles open, including Research Scholars (General Track and Policy Track), Survey Analyst, and three month fellowships. The application deadlines are in the August 1 - 10 range. Job ads: Metaculus (summarized by Rohin): Metaculus is hiring for a variety of roles, including an AI Forecasting Lead. Job ads: Epoch AI (summarized by Rohin): Epoch AI is a new organization that investigates and forecasts the development of advanced AI. They are currently hiring for a Research Manager and Staff Researcher position. Job ad: AI Safety Support is hiring a Chief Operating Officer (summarized by Rohin): Application deadline is August 14.

Alignment Newsletter #172: Sorry for the long hiatus!

Jul 5th, 2022 12:29 PM

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg Sorry for the long hiatus! I was really busy over the past few months and just didn't find time to write this newsletter. (Realistically, I was also a bit tired of writing it and so lacked motivation.) I'm intending to go back to writing it now, though I don't think I can realistically commit to publishing weekly; we'll see how often I end up publishing. For now, have a list of all the things I should have advertised to you whose deadlines haven't already passed. NEWS Survey on AI alignment resources (Anonymous) (summarized by Rohin): This survey is being run by an outside collaborator in partnership with the Centre for Effective Altruism (CEA). They ask that you fill it out to help field builders find out which resources you have found most useful for learning about and/or keeping track of the AI alignment field. Results will help inform which resources to promote in the future, and what type of resources we should make more of. Announcing the Inverse Scaling Prize ($250k Prize Pool) (Ethan Perez et al) (summarized by Rohin): This prize with a $250k prize pool asks participants to find new examples of tasks where pretrained language models exhibit inverse scaling: that is, models get worse at the task as they are scaled up. Notably, you do not need to know how to program to participate: a submission consists solely of a dataset giving at least 300 examples of the task. Inverse scaling is particularly relevant to AI alignment, for two main reasons. First, it directly helps understand how the language modeling objective ("predict the next word") is outer misaligned, as we are finding tasks where models that do better according to the language modeling objective do worse on the task of interest. Second, the experience from examining inverse scaling tasks could lead to general observations about how best to detect misalignment. $500 bounty for alignment contest ideas (Akash) (summarized by Rohin): The authors are offering a $500 bounty for producing a frame of the alignment problem that is accessible to smart high schoolers/college students and people without ML backgrounds. (See the post for details; this summary doesn't capture everything well.) Job ad: Bowman Group Open Research Positions (Sam Bowman) (summarized by Rohin): Sam Bowman is looking for people to join a research center at NYU that'll focus on empirical alignment work, primarily on large language models. There are a variety of roles to apply for (depending primarily on how much research experience you already have). Job ad: Postdoc at the Algorithmic Alignment Group (summarized by Rohin): This position at Dylan Hadfield-Menell's lab will lead the design and implementation of a large-scale Cooperative AI contest to take place next year, alongside collaborators at DeepMind and the Cooperative AI Foundation. Job ad: AI Alignment postdoc (summarized by Rohin): David Krueger is hiring for a postdoc in AI alignment (and is also hiring for another role in deep learning). The application deadline is August 2. Job ad: OpenAI Trust & Safety Operations Contractor (summarized by Rohin): In this remote contractor role, you would evaluate submissions to OpenAI's App Review process to ensure they comply with OpenAI's policies. Apply here by July 13, 5pm Pacific Time. Job ad: Director of CSER (summarized by Rohin): Application deadline is July 31. Quoting the job ad: "The Director will be expected to provide visionary leadership for the Centre, to maintain and enhance its reputation for cutting-edge research, to develop and oversee fundraising and new project and programme design, to ensure the proper functioning of its operations and administration, and to lead its endeavours to secure longevity for the Centre within the University." Job ads: Redwood Research (summarized by Rohin): Redwood Research works directly on AI alignment research, and hosts and operates Constellation, a shared office space for longtermist organizations including ARC, MIRI, and Open Philanthropy. They are hiring for a number of operations and technical roles. Job ads: Roles at the Fund for Alignment Research (summarized by Rohin): The Fund for Alignment Research (FAR) is a new organization that helps AI safety researchers, primarily in academia, pursue high-impact research by hiring contractors. It is currently hiring for Operation Manager, Research Engineer, and Communication Specialist roles. Job ads: Encultured AI (summarized by Rohin): Encultured AI is a new for-profit company with a public benefit mission: to develop technologies promoting the long-term survival and flourishing of humanity and other sentient life. They are hiring for a Machine Learning Engineer and an Immersive Interface Engineer role. Job ads: Fathom Radiant (summarized by Rohin): Fathom Radiant is a public benefit corporation that aims to build a new type of computer which they hope to use to support AI alignment efforts. They have several open roles, including (but not limited to) Scientists / Engineers, Builders and Software Engineer, Lab.

Alignment Newsletter #171: Disagreements between alignment "optimists" and "pessimists"

Jan 23rd, 2022 7:03 PM

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Alignment difficulty (Richard Ngo and Eliezer Yudkowsky) (summarized by Rohin): Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His argument in this dialogue is roughly as follows: 1. We are very likely going to keep improving AI capabilities until we reach AGI, at which point either the world is destroyed, or we use the AI system to take some pivotal act before some careless actor destroys the world. 2. In either case, the AI system must be producing high-impact, world-rewriting plans; such plans are "consequentialist" in that the simplest way to get them (and thus, the one we will first build) is if you are forecasting what might happen, thinking about the expected consequences, considering possible obstacles, searching for routes around the obstacles, etc. If you don't do this sort of reasoning, your plan goes off the rails very quickly - it is highly unlikely to lead to high impact. In particular, long lists of shallow heuristics (as with current deep learning systems) are unlikely to be enough to produce high-impact plans. 3. We're producing AI systems by selecting for systems that can do impressive stuff, which will eventually produce AI systems that can accomplish high-impact plans using a general underlying "consequentialist"-style reasoning process (because that's the only way to keep doing more impressive stuff). However, this selection process does not constrain the goals towards which those plans are aimed. In addition, most goals seem to have convergent instrumental subgoals like survival and power-seeking that would lead to extinction. This suggests that we should expect an existential catastrophe by default. 4. None of the methods people have suggested for avoiding this outcome seem like they actually avert this story. Richard responds to this with a few distinct points: 1. It might be possible to build AI systems which are not of world-destroying intelligence and agency, that humans use to save the world. For example, we could make AI systems that do better alignment research. Such AI systems do not seem to require the property of making long-term plans in the real world in point (3) above, and so could plausibly be safe. 2. It might be possible to build general AI systems that only state plans for achieving a goal of interest that we specify, without executing that plan. 3. It seems possible to create consequentialist systems with constraints upon their reasoning that lead to reduced risk. 4. It also seems possible to create systems with the primary aim of producing plans with certain properties (that aren't just about outcomes in the world) -- think for example of corrigibility (AN #35) or deference to a human user. 5. (Richard is also more bullish on coordinating not to use powerful and/or risky AI systems, though the debate did not discuss this much.) Eliezer's responses: 1. AI systems that help with alignment research to such a degree that it actually makes a difference are almost certainly already dangerous. 2. It is the plan itself that is risky; if the AI system made a plan for a goal that wasn't the one we actually meant, and we don't understand that plan, that plan can still cause extinction. It is the misaligned optimization that produced the plan that is dangerous. 3 and 4. It is certainly possible to do such things; the space of minds that could be designed is very large. However, it is difficult to do such things, as they tend to make consequentialist reasoning weaker, and on our current trajectory the first AGI that we build will probably not look like that. This post has also been summarized by others here, though with different emphases than in my summary. Rohin's opinion: I first want to note my violent agreement with the notion that a major scary thing is "consequentialist reasoning", and that high-impact plans require such reasoning, and that we will end up building AI systems that produce high-impact plans. Nonetheless, I am still optimistic about AI safety relative to Eliezer, which I suspect comes down to three main disagreements: 1. There are many approaches that don't solve the problem, but do increase the level of intelligence required before the problem leads to extinction. Examples include Richard's points 1-4 above. For example, if we build a system that states plans without executing them, then for the plans to cause extinction they need to be complicated enough that the humans executing those plans don't realize that they are leading to an outcome that was not what they wanted. It seems non-trivially probable to me that such approaches are sufficient to prevent extinction up to the level of AI intelligence needed before we can execute a pivotal act. 2. The consequentialist reasoning is only scary to the extent that it is "aimed" at a bad goal. It seems non-trivially probable to me that it will be "aimed" at a goal sufficiently good to not lead to existential catastrophe, without putting in much alignment effort. 3. I do expect some coordination to not do the most risky things. I wish the debate had focused more on the claim that non-scary AI can't e.g. do better alignment research, as it seems like a major crux. (For example, I think that sort of intuition drives my disagreement #1.) I expect AI progress looks a lot like "the heuristics get less and less shallow in a gradual / smooth / continuous manner" which eventually leads to the sorts of plans Eliezer calls "consequentialist", whereas I think Eliezer expects a sharper qualitative change between "lots of heuristics" and that-which-implements-consequentialist-planning. Discussion of "Takeoff Speeds" (Eliezer Yudkowsky and Paul Christiano) (summarized by Rohin): This post focuses on the question of whether we should expect AI progress to look discontinuous or not. It seemed to me that the two participants were mostly talking past each other, and so I'll summarize their views separately and not discuss the parts where they were attempting to address each other's views. Some ideas behind the "discontinuous" view: 1. When things are made up of a bunch of parts, you only get impact once all of the parts are working. So, if you have, say, 19 out of 20 parts done, there still won't be much impact, and then once you get the 20th part, then there is a huge impact, which looks like a discontinuity. 2. A continuous change in inputs can lead to a discontinuous change in outputs or impact. Continuously increasing the amount of fissile material leads to a discontinuous change from "inert-looking lump" to "nuclear explosion". Continuously scaling up a language model from GPT-2 to GPT-3 leads to many new capabilities, such as few-shot learning. A misaligned AI that is only capable of concealing 95% of its deceptive activities will not perform any such activities; it will only strike once it is scaled up to be capable of concealing 100% of its activities. 3. Fundamentally new approaches to a problem will often have prototypes which didn't have much impact. The difference is that they will scale much better, and so once they start having an impact this will look like a discontinuity in the rate of improvement on the problem. 4. The evolution from chimps to humans tells us that there is, within the space of possible mind designs, an area in which you can get from shallow, non-widely-generalizing cognition to deep, much-more-generalizing cognition, with only relatively small changes. 5. Our civilization tends to prevent people from doing things via bureaucracy and regulatory constraints, so even if there are productivity gains to be had from applications of non-scary AI, we probably won't see them; as a result we probably do not see GWP growth before the point where an AI can ignore bureaucracy and regulatory constraints, which makes it look discontinuous. Some ideas behind the "continuous" view: 1. When people are optimizing hard in pursuit of a metric, then the metric tends to grow smoothly. While individual groups may find new ideas that improve the metric, those new ideas are unlikely to change the metric drastically more than previously observed changes in the metric. 2. A good heuristic for forecasting is to estimate (1) the returns to performance from additional effort, using historical data, and (2) the amount of effort currently being applied. These can then be combined to give a forecast. 3. How smooth and predictable the improvement is depends on how much effort is being put in. In terms of effort put in currently, coding assistants < machine translation < semiconductors, as a result we should expect semiconductor improvement to be smoother than machine translation improvement, which in turn will be smoother than coding assistant improvement. 4. In AI we will probably have crappy versions of economically useful systems before we have good versions of those systems. By the time we have good versions, people will be throwing lots of effort at the problem. For example, Codex is a crappy version of a coding assistant; such assistants will now improve over time in a somewhat smooth way. There's further discussion on the differences between these views in a subsequent post. Rohin's opinion: The ideas I've listed in this summary seem quite compatible to me; I believe all of them to at least some degree (though perhaps not in the same way as the authors). I am not sure if either author would strongly disagree with any of the claims on this list. (Of course, this does not mean that they agree -- presumably there are some other claims that have not yet been made explicit on which they disagree.) TECHNICAL AI ALIGNMENT FIELD BUILDING AGI Safety Fundamentals curriculum and application (Richard Ngo) (summarized by Rohin): This post presents the curriculum used in the AGI safety fundamentals course, which is meant to serve as an effective introduction to the field of AGI safety. NEWS Visible Thoughts Project and Bounty Announcement (Nate Soares) (summarized by Rohin): MIRI would like to test whether language models can be made more understandable by training them to produce visible thoughts. As part of this project, they need a dataset of thought-annotated dungeon runs. They are offering $200,000 in prizes for building the first fragments of the dataset, plus an additional $1M prize/budget for anyone who demonstrates the ability to build a larger dataset at scale. Prizes for ELK proposals (Paul Christiano) (summarized by Rohin): The Alignment Research Center (ARC) recently published a technical report on Eliciting Latent Knowledge (ELK). They are offering prizes of $5,000 to $50,000 for proposed strategies that tackle ELK. The deadline is the end of January. Rohin's opinion: I think this is a particularly good contest to try to test your fit with (a certain kind of) theoretical alignment research: even if you don't have much background, you can plausibly get up to speed in tens of hours. I will also try to summarize ELK next week, but no promises. Worldbuilding Contest (summarized by Rohin): FLI invites individuals and teams to compete for a prize purse worth $100,000+ by designing visions of a plausible, aspirational future including artificial general intelligence. The deadline for submissions is April 15. Read more: FLI launches Worldbuilding Contest with $100,000 in prizes New Seminar Series and Call For Proposals On Cooperative AI (summarized by Rohin): The Cooperative AI Foundation (CAIF) will be hosting a new fortnightly seminar series in which leading thinkers offer their vision for research on Cooperative AI. The first talk, 'AI Agents May Cooperate Better If They Don't Resemble Us', was given on Thursday (Jan 20) by Vincent Conitzer (Duke University, University of Oxford). You can find more details and submit a proposal for the seminar series here. AI Risk Management Framework Concept Paper (summarized by Rohin): After their Request For Information last year (AN #161), NIST has now posted a concept paper detailing their current thinking around the AI Risk Management Framework that they are creating, and are soliciting comments by Jan 25. As before, if you're interested in helping with a response, email Tony Barrett at anthony.barrett@berkeley.edu. Announcing the PIBBSS Summer Research Fellowship (Nora Ammann) (summarized by Rohin): Principles of Intelligent Behavior in Biological and Social Systems (PIBBSS) aims to facilitate knowledge transfer with the goal of building human-aligned AI systems. This summer research fellowship will bring together researchers from fields studying complex and intelligent behavior in natural and social systems, such as evolutionary biology, neuroscience, linguistics, sociology, and more. The application deadline is Jan 23, and there are also bounties for referrals. Action: Help expand funding for AI Safety by coordinating on NSF response (Evan R. Murphy) (summarized by Rohin): The National Science Foundation (NSF) has put out a Request for Information relating to topics they will be funding in 2023 as part of their NSF Convergence Accelerator program. The author and others are coordinating responses to increase funding to AI safety, and ask that you fill out this short form if you are willing to help out with a few small, simple actions.

Alignment Newsletter #170: Analyzing the argument for risk from power-seeking AI

Dec 8th, 2021 5:30 PM

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Draft report on existential risk from power-seeking AI (Joe Carlsmith) (summarized by Rohin): This report investigates the classic AI risk argument in detail, and decomposes it into a set of conjunctive claims. Here's the quick version of the argument. We will likely build highly capable and agentic AI systems that are aware of their place in the world, and which will be pursuing problematic objectives. Thus, they will take actions that increase their power, which will eventually disempower humans leading to an existential catastrophe. We will try and avert this, but will probably fail to do so since it is technically challenging, and we are not capable of the necessary coordination. There's a lot of vague words in the argument above, so let's introduce some terminology to make it clearer: - Advanced capabilities: We say that a system has advanced capabilities if it outperforms the best humans on some set of important tasks (such as scientific research, business/military/political strategy, engineering, and persuasion/manipulation). - Agentic planning: We say that a system engages in agentic planning if it (a) makes and executes plans, (b) in pursuit of objectives, (c) on the basis of models of the world. This is a very broad definition, and doesn't have many of the connotations you might be used to for an agent. It does not need to be a literal planning algorithm -- for example, human cognition would count, despite (probably) not being just a planning algorithm. - Strategically aware: We say that a system is strategically aware if it models the effects of gaining and maintaining power over humans and the real-world environment. - PS-misaligned (power-seeking misaligned): On some inputs, the AI system seeks power in unintended ways, due to problems with its objectives (if the system actually receives such inputs, then it is practically PS-misaligned.) The core argument is then that AI systems with advanced capabilities, agentic planning, and strategic awareness (APS-systems) will be practically PS-misaligned, to an extent that causes an existential catastrophe. Of course, we will try to prevent this -- why should we expect that we can't fix the problem? The author considers possible remedies, and argues that they all seem quite hard: - We could give AI systems the right objectives (alignment), but this seems quite hard -- it's not clear how we would solve either outer or inner alignment. - We could try to shape objectives to be e.g. myopic, but we don't know how to do this, and there are strong incentives against myopia. - We could try to limit AI capabilities by keeping systems special-purpose rather than general, but there are strong incentives for generality, and some special-purpose systems can be dangerous, too. - We could try to prevent the AI system from improving its own capabilities, but this requires us to anticipate all the ways the AI system could improve, and there are incentives to create systems that learn and change as they gain experience. - We could try to control the deployment situations to be within some set of circumstances where we know the AI system won't seek power. However, this seems harder and harder to do as capabilities increase, since with more capabilities, more options become available. - We could impose a high threshold of safety before an AI system is deployed, but the AI system could still seek power during training, and there are many incentives pushing for faster, riskier deployment (even if we have already seen warning shots). - We could try to correct the behavior of misaligned AI systems, or mitigate their impact, after deployment. This seems like it requires humans to have comparable or superior power to the misaligned systems in question, though; and even if we are able to correct the problem at one level of capability, we need solutions that scale as our AI systems become more powerful. The author breaks the overall argument into six conjunctive claims, assigns probabilities to each of them, and ends up computing a 5% probability of existential catastrophe from misaligned, power-seeking AI by 2070. This is a lower bound, since the six claims together add a fair number of assumptions, and there can be risk scenarios that violate these assumptions, and so overall the author would shade upward another couple of percentage points. Rohin's opinion: This is a great investigation of the typical argument for existential risk from AI systems adversarially optimizing against humans. When I put my own numbers in without looking at Joe's numbers, I got a 3% chance of existential catastrophe by 2070 through the argument in this post, though I think I underestimated the probability for claim (4) so I'd now get something more like 4%. (The main difference from Joe's 5% is that I am more optimistic about possible remedies, though of course these differences are tiny relative to our high overall uncertainty.) Comments on Carlsmith's "Is power-seeking AI an existential risk?" (Nate Soares) (summarized by Rohin): This response to the report above touches on many topics, but has three main object-level disagreements and one meta-level disagreement: 1. The author has significantly shorter timelines, though this is based on a very different argument structure than the one presented in the report above, and so it is hard to turn this into more concrete disagreements with the report. 2. The author expects that alignment is hard enough that we won't solve it in time (which is not to say that it is harder than every other technical problem humanity has ever faced). It's also not clear how to turn this into more concrete disagreements with the report. 3. The author does not expect to have warning shots where misaligned AI systems cause trillions of dollars of damage but don't cause an existential catastrophe, because this seems like too narrow a capability range for us to hit in practice. Even if there are warning shots, he expects that civilization will continue to deploy risky AI systems anyway, similarly to how we are not banning gain-of-function research despite the warning shot of COVID-19. 4. On the meta level, the author expects that the decomposition of the AI risk argument into six conjunctive claims will typically bias you towards giving too low a probability on the overall conjunction. TECHNICAL AI ALIGNMENT PROBLEMS The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models (Anonymous) (summarized by Zach): Reward hacking occurs when RL agents exploit the difference between a true reward and a proxy. Reward hacking has been observed in practice (AN #1), and as reinforcement learning agents are trained with better algorithms, more data, and larger policies, they are at increased risk of overfitting their proxy objectives. However, reward hacking has not yet been systematically studied. This paper fills this gap by constructing four example environments with a total of nine proxy rewards to investigate how reward hacking changes as a function of optimization power. They increase optimization power in several different ways, such as increasing the size of the neural net, or providing the model with more fine-grained observations. Overall, the authors find that reward hacking occurs in five of the nine cases. Moreover, the authors observed phase transitions in four of these cases. These are stark transitions where a moderate increase in optimization power leads to a drastic increase in reward hacking behavior. This poses a challenge in monitoring the safety of ML systems. To address this the authors suggest performing anomaly detection to notice reward hacking and offer several baselines. Zach's opinion: It is good to see an attempt at formalizing reward hacking. The experimental contributions are interesting and the anomaly detection method seems reasonable. However, the proxy rewards chosen to represent reward hacking are questionable. In my opinion, these rewards are obviously 'wrong' so it is less surprising that they result in undesired behavior. I look forward to seeing more comprehensive experiments on this subject. Rohin's opinion: Note that on OpenReview, the authors say that one of the proxy rewards (maximize average velocity for the driving environment) was actually the default and they only noticed it was problematic after they had trained large neural nets on that environment. I do agree that future proxy objectives will probably be less clearly wrong than most of the ones in this paper. OTHER PROGRESS IN AI DEEP LEARNING Shaking the foundations: delusions in sequence models for interaction and control (Pedro A. Ortega et al) (summarized by Robert): Delusions in language models (LMs) like GPT-3 occur when an incorrect generation early on throws the LM off the rails later. Specifically, if there is some unobserved context that influences how humans generate text that the LM is unaware of, then the LM will generate some plausible text -- and then take that text as evidence about what the unobserved context must be. This can be especially likely when the desired context or task for the generation is difficult to infer from the input. In these settings the human generating the text has access to a lot more information than the model, making generation harder for the model, and delusions more likely: an incorrect generation will make it more likely that the model infers the task or context incorrectly. This also applies to sequence modelling approaches in RL like Decision Transformer (AN #153) and Trajectory Transformer (AN #153), where incorrectly chosen actions could change the model's beliefs about optimal future actions. This work explains this problem using tools from causality and argues that these models should act as if their previous actions are causal interventions rather than observations. However, training a model in this way requires access to a model of the environment and the expert demonstrating trajectories in an online way, and the authors don't describe a way to do this with purely offline data (it may be fundamentally impossible). The authors do argue that in settings where the context or task information can be easily extracted from the observations so far, then delusions are less likely. This points to the importance of prompt engineering, or providing context information in another way to sequence models, so that they don't delude themselves. Robert's opinion: Understanding specific failure modes of large language model generation seems useful, and the detailed mathematical explanation here makes it easier to understand what exactly the problem is, and what we can do to fix it. I'd be interested to see whether we can distinguish delusions from other failures modes and measure what proportion of failures are delusions (although failures modes likely can't be as cleanly divided as I'm implying here). However, it seems fundamentally very difficult to train using offline data in a way that the model does learn to understand its own actions as interventions, so other solutions may need to be found. NEWS GovAI Summer 2022 Fellowships (summarized by Rohin): Applications are now open for the GovAI 2022 Summer Fellowship! This is an opportunity for early-career individuals to spend three months working on an AI governance research project, learning about the field, and making connections with other researchers and practitioners. Application deadline is Jan 1. Foundations of Cooperative AI Lab (summarized by Rohin): This new lab at CMU aims to create foundations of game theory appropriate for advanced, autonomous AI agents -- think of work on agent foundations and cooperative AI (AN #133). Apply for a PhD here (deadline Dec 9) or for a postdoc here. Public reports are now optional for EA Funds grantees (Asya Bergal and Jonas Vollmer) (summarized by Rohin): This is your regular reminder that you can apply to the Long-Term Future Fund (and the broader EA Funds) for funding for a wide variety of projects. They have now removed the requirement for public reporting of your grant. They encourage you to apply if you have a preference for private funding. Sydney AI Safety Fellowship (casebash) (summarized by Rohin): This 7-week fellowship will provide fellows from Australia and New Zealand the opportunity to pursue projects in AI Safety or spend time upskilling. Applications are due December 14.

Alignment Newsletter #169: Collaborating with humans without human data

Nov 24th, 2021 5:30 PM

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Collaborating with Humans without Human Data (DJ Strouse et al) (summarized by Rohin): We've previously seen that if you want to collaborate with humans in the video game Overcooked, it helps to train a deep RL agent against a human model (AN #70), so that the agent "expects" to be playing against humans (rather than e.g. copies of itself, as in self-play). We might call this a "human-aware" model. However, since a human-aware model must be trained against a model that imitates human gameplay, we need to collect human gameplay data for training. Could we instead train an agent that is robust enough to play with lots of different agents, including humans as a special case? This paper shows that this can be done with Fictitious Co-Play (FCP), in which we train our final agent against a population of self-play agents and their past checkpoints taken throughout training. Such agents get significantly higher rewards when collaborating with humans in Overcooked (relative to the human-aware approach in the previously linked paper). In their ablations, the authors find that it is particularly important to include past checkpoints in the population against which you train. They also test whether it helps to have the self-play agents have a variety or architectures, and find that it mostly does not make a difference (as long as you are using past checkpoints as well). Read more: Related paper: Maximum Entropy Population Based Training for Zero-Shot Human-AI Coordination Rohin's opinion: You could imagine two different philosophies on how to build AI systems -- the first option is to train them on the actual task of interest (for Overcooked, training agents to play against humans or human models), while the second option is to train a more robust agent on some more general task, that hopefully includes the actual task within it (the approach in this paper). Besides Overcooked, another example would be supervised learning on some natural language task (the first philosophy), as compared to pretraining on the Internet GPT-style and then prompting the model to solve your task of interest (the second philosophy). In some sense the quest for a single unified AGI system is itself a bet on the second philosophy -- first you build your AGI that can do all tasks, and then you point it at the specific task you want to do now. Historically, I think AI has focused primarily on the first philosophy, but recent years have shown the power of the second philosophy. However, I don't think the question is settled yet: one issue with the second philosophy is that it is often difficult to fully "aim" your system at the true task of interest, and as a result it doesn't perform as well as it "could have". In Overcooked, the FCP agents will not learn specific quirks of human gameplay that could be exploited to improve efficiency (which the human-aware agent could do, at least in theory). In natural language, even if you prompt GPT-3 appropriately, there's still some chance it ends up rambling about something else entirely, or neglects to mention some information that it "knows" but that a human on the Internet would not have said. (See also this post (AN #141).) I should note that you can also have a hybrid approach, where you start by training a large model with the second philosophy, and then you finetune it on your task of interest as in the first philosophy, gaining the benefits of both. I'm generally interested in which approach will build more useful agents, as this seems quite relevant to forecasting the future of AI (which in turn affects lots of things including AI alignment plans). TECHNICAL AI ALIGNMENT LEARNING HUMAN INTENT Inverse Decision Modeling: Learning Interpretable Representations of Behavior (Daniel Jarrett, Alihan Hüyük et al) (summarized by Rohin): There's lots of work on learning preferences from demonstrations, which varies in how much structure they assume on the demonstrator: for example, we might consider them to be Boltzmann rational (AN #12) or risk sensitive, or we could try to learn their biases (AN #59). This paper proposes a framework to encompass all of these choices: the core idea is to model the demonstrator as choosing actions according to a planner; some parameters of this planner are fixed in advance to provide an assumption on the structure of the planner, while others are learned from data. This also allows them to separate beliefs, decision-making, and rewards, so that different structures can be imposed on each of them individually. The paper provides a mathematical treatment of both the forward problem (how to compute actions in the planner given the reward, think of algorithms like value iteration) and the backward problem (how to compute the reward given demonstrations, the typical inverse reinforcement learning setting). They demonstrate the framework on a medical dataset, where they introduce a planner with parameters for flexibility of decision-making, optimism of beliefs, and adaptivity of beliefs. In this case they specify the desired reward function and then run backward inference to conclude that, with respect to this reward function, clinicians appear to be significantly less optimistic when diagnosing dementia in female and elderly patients. Rohin's opinion: One thing to note about this paper is that it is an incredible work of scholarship; it fluently cites research across a variety of disciplines including AI safety, and provides a useful organizing framework for many such papers. If you need to do a literature review on inverse reinforcement learning, this paper is a good place to start. Human irrationality: both bad and good for reward inference (Lawrence Chan et al) (summarized by Rohin): Last summary, we saw a framework for inverse reinforcement learning with suboptimal demonstrators. This paper instead investigates the qualitative effects of performing inverse reinforcement learning with a suboptimal demonstrator. The authors modify different parts of the Bellman equation in order to create a suite of possible suboptimal demonstrators to study. They run experiments with exact inference on random MDPs and FrozenLake, and with approximate inference on a simple autonomous driving environment, and conclude: 1. Irrationalities can be helpful for reward inference, that is, if you infer a reward from demonstrations by an irrational demonstrator (where you know the irrationality), you often learn more about the reward than if you inferred a reward from optimal demonstrations (where you know they are optimal). Conceptually, this happens because optimal demonstrations only tell you about what the best behavior is, whereas most kinds of irrationality can also tell you about preferences between suboptimal behaviors. 2. If you fail to model irrationality, your performance can be very bad, that is, if you infer a reward from demonstrations by an irrational demonstrator, but you assume that the demonstrator was Boltzmann rational, you can perform quite badly. Rohin's opinion: One way this paper differs from my intuitions is that it finds that assuming Boltzmann rationality performs very poorly if the demonstrator is in fact systematically suboptimal. I would have instead guessed that Boltzmann rationality would do okay -- not as well as in the case where there is no misspecification, but only a little worse than that. (That's what I found in my paper (AN #59), and it makes intuitive sense to me.) Some hypotheses for what's going on, which the lead author agrees are at least part of the story: 1. When assuming Boltzmann rationality, you infer a distribution over reward functions that is "close" to the correct one in terms of incentivizing the right behavior, but differs in rewards assigned to suboptimal behavior. In this case, you might get a very bad log loss (the metric used in this paper), but still have a reasonable policy that is decent at acquiring true reward (the metric used in my paper). 2. The environments we're using may differ in some important way (for example, in the environment in my paper, it is primarily important to identify the goal, which might be much easier to do than inferring the right behavior or reward in the autonomous driving environment used in this paper). FORECASTING Forecasting progress in language models (Matthew Barnett) (summarized by Sudhanshu): This post aims to forecast when a "human-level language model" may be created. To build up to this, the author swiftly covers basic concepts from information theory and natural language processing such as entropy, N-gram models, modern LMs, and perplexity. Data for perplexity achieved from recent state-of-the-art models is collected and used to estimate - by linear regression - when we can expect to see future models score below certain entropy levels, approaching the hypothesised entropy for the English Language. These predictions range across the next 15 years, depending which dataset, method, and entropy level is being solved for; there's an attached python notebook with these details for curious readers to further investigate. Preemptly disjunctive, the author concludes "either current trends will break down soon, or human-level language models will likely arrive in the next decade or two." Sudhanshu's opinion: This quick read provides a natural, accessible analysis stemming from recent results, while staying self-aware (and informing readers) of potential improvements. The comments section too includes some interesting debates, e.g. about the Goodhart-ability of the Perplexity metric. I personally felt these estimates were broadly in line with my own intuitions. I would go so far as to say that with the confluence of improved generation capabilities across text, speech/audio, video, as well as multimodal consistency and integration, virtually any kind of content we see ~10 years from now will be algorithmically generated and indistinguishable from the work of human professionals. Rohin's opinion: I would generally adopt forecasts produced by this sort of method as my own, perhaps making them a bit longer as I expect the quickly growing compute trend to slow down. Note however that this is a forecast for human-level language models, not transformative AI; I would expect these to be quite different and would predict that transformative AI comes significantly later. MISCELLANEOUS (ALIGNMENT) Rohin Shah on the State of AGI Safety Research in 2021 (Lucas Perry and Rohin Shah) (summarized by Rohin): As in previous years (AN #54), on this FLI podcast I talk about the state of the field. Relative to previous years, this podcast is a bit more introductory, and focuses a bit more on what I find interesting rather than what the field as a whole would consider interesting. Read more: Transcript NEAR-TERM CONCERNS RECOMMENDER SYSTEMS User Tampering in Reinforcement Learning Recommender Systems (Charles Evans et al) (summarized by Zach): Large-scale recommender systems have emerged as a way to filter through large pools of content to identify and recommend content to users. However, these advances have led to social and ethical concerns over the use of recommender systems in applications. This paper focuses on the potential for social manipulability and polarization from the use of RL-based recommender systems. In particular, they present evidence that such recommender systems have an instrumental goal to engage in user tampering by polarizing users early on in an attempt to make later predictions easier. To formalize the problem the authors introduce a causal model. Essentially, they note that predicting user preferences requires an exogenous variable, a non-observable variable, that models click-through rates. They then introduce a notion of instrumental goal that models the general behavior of RL-based algorithms over a set of potential tasks. The authors argue that such algorithms will have an instrumental goal to influence the exogenous/preference variables whenever user opinions are malleable. This ultimately introduces a risk for preference manipulation. The author's hypothesis is tested using a simple media recommendation problem. They model the exogenous variable as either leftist, centrist, or right-wing. User preferences are malleable in the sense that a user shown content from an opposing side will polarize their initial preferences. In experiments, the authors show that a standard Q-learning algorithm will learn to tamper with user preferences which increases polarization in both leftist and right-wing populations. Moreover, even though the agent makes use of tampering it fails to outperform a crude baseline policy that avoids tampering. Zach's opinion: This article is interesting because it formalizes and experimentally demonstrates an intuitive concern many have regarding recommender systems. I also found the formalization of instrumental goals to be of independent interest. The most surprising result was that the agents who exploit tampering are not particularly more effective than policies that avoid tampering. This suggests that the instrumental incentive is not really pointing at what is actually optimal which I found to be an illuminating distinction. NEWS OpenAI hiring Software Engineer, Alignment (summarized by Rohin): Exactly what it sounds like: OpenAI is hiring a software engineer to work with the Alignment team. BERI hiring ML Software Engineer (Sawyer Bernath) (summarized by Rohin): BERI is hiring a remote ML Engineer as part of their collaboration with the Autonomous Learning Lab at UMass Amherst. The goal is to create a software library that enables easy deployment of the ALL's Seldonian algorithm framework for safe and aligned AI. AI Safety Needs Great Engineers (Andy Jones) (summarized by Rohin): If the previous two roles weren't enough to convince you, this post explicitly argues that a lot of AI safety work is bottlenecked on good engineers, and encourages people to apply to such roles. AI Safety Camp Virtual 2022 (summarized by Rohin): Applications are open for this remote research program, where people from various disciplines come together to research an open problem under the mentorship of an established AI-alignment researcher. Deadline to apply is December 1st. Political Economy of Reinforcement Learning schedule (summarized by Rohin): The date for the PERLS workshop (AN #159) at NeurIPS has been set for December 14, and the schedule and speaker list are now available on the website. FEEDBACK I'm always happy to hear feedback; you can send it to me, Rohin Shah by replying to this email. PODCAST An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles (http://robertskmiles.com). Subscribe here:

Episode List

You may also like

Get this podcast on your phone, Free

Create Your Podcast In Minutes

Podcast Services

MONETIZATION & MORE

KNOWLEDGE BASE

Support

Podbean