The Nonlinear Library: Alignment Forum

https://storage.googleapis.com/rssfile/nonlinear-library-aggregated-AF.xml
3 Followers · 384 Episodes
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

Episode List

AF - Solving adversarial attacks in computer vision as a baby version of general AI alignment by stanislavfort

Aug 29th, 2024 5:58 PM

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving adversarial attacks in computer vision as a baby version of general AI alignment, published by stanislavfort on August 29, 2024 on The AI Alignment Forum.

I spent the last few months trying to tackle the problem of adversarial attacks in computer vision from the ground up. The results of this effort are written up in our new paper Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness (explainer on X/Twitter). Taking inspiration from biology, we reached state-of-the-art or above state-of-the-art robustness at 100x - 1000x less compute, got human-understandable interpretability for free, turned classifiers into generators, and designed transferable adversarial attacks on closed-source (v)LLMs such as GPT-4 or Claude 3. I strongly believe that there is a compelling case for devoting serious attention to solving the problem of adversarial robustness in computer vision, and I try to draw an analogy to the alignment of general AI systems here.

1. Introduction

In this post, I argue that the problem of adversarial attacks in computer vision is in many ways analogous to the larger task of general AI alignment. In both cases, we are trying to faithfully convey an implicit function locked within the human brain to a machine, and we do so extremely successfully on average. Under static evaluations, the human and machine functions match up exceptionally well. However, as is typical in high-dimensional spaces, some phenomena can be relatively rare and basically impossible to find by chance, yet ubiquitous in their absolute count. This is the case for adversarial attacks - imperceptible modifications to images that completely fool computer vision systems and yet have virtually no effect on humans. Their existence highlights a crucial and catastrophic mismatch between the implicit human vision function and the function learned by machines - a mismatch that can be exploited in a dynamic evaluation by an active, malicious agent. Such failure modes will likely be present in more general AI systems, and our inability to remedy them even in the more restricted vision context (yet) does not bode well for the broader alignment project. This is a call to action to solve the problem of adversarial vision attacks - a stepping stone on the path to aligning general AI systems.

2. Communicating implicit human functions to machines

The basic goal of computer vision can be viewed as trying to endow a machine with the same vision capabilities a human has. A human carries, locked inside their skull, an implicit vision function mapping visual inputs into semantically meaningful symbols, e.g. a picture of a tortoise into a semantic label tortoise. This function is represented implicitly and while we are extremely good at using it, we do not have direct, conscious access to its inner workings and therefore cannot communicate it to others easily. To convey this function to a machine, we usually form a dataset of fixed images and their associated labels. We then use a general enough class of functions, typically deep neural networks, and a gradient-based learning algorithm together with backpropagation to teach the machine how to correlate images with their semantic content, e.g. how to assign a label parrot to a picture of a parrot.
This process is extremely successful in communicating the implicit human vision function to the computer, and the implicit human and explicit, learned machine functions agree to a large extent. The agreement between the two is striking. Given how different the architectures are (a simulated graph-like function doing a single forward pass vs the wet protein brain of a mammal running continuous inference), how different the learning algorithms are (gradient descent with backpropagation vs something completely different but still unknown), a...
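
The post above describes adversarial attacks as imperceptible image modifications that flip a classifier's prediction while leaving human judgment untouched. As a purely illustrative aid (not the paper's ensemble method), here is a minimal sketch of the classic fast gradient sign method (FGSM) in PyTorch; the `model`, `images`, and `labels` names are placeholder assumptions, not artifacts from the paper.

```python
# Illustrative FGSM-style attack: an imperceptible perturbation that raises the
# classifier's loss on a correctly labeled image. This is NOT the defense from
# "Ensemble everything everywhere"; it only shows what an adversarial attack is.
# `model`, `images`, and `labels` are assumed placeholders.
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon=8 / 255):
    """Return adversarially perturbed copies of `images` (pixel values in [0, 1])."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    # Step each pixel in the direction that increases the loss, then clip back
    # to the valid image range so the change stays visually negligible.
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Usage sketch, assuming a trained classifier and a labeled batch:
# adv = fgsm_attack(classifier, images, labels)
# classifier(images).argmax(-1) and classifier(adv).argmax(-1) will often disagree.
```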

AF - Would catching your AIs trying to escape convince AI developers to slow down or undeploy? by Buck Shlegeris

Aug 26th, 2024 4:46 PM

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would catching your AIs trying to escape convince AI developers to slow down or undeploy?, published by Buck Shlegeris on August 26, 2024 on The AI Alignment Forum.

I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. As an extreme thought experiment, I'll argue this could be hard even if you caught your AI red-handed trying to escape.

Imagine you're running an AI lab at the point where your AIs are able to automate almost all intellectual labor; the AIs are now mostly being deployed internally to do AI R&D. (If you want a concrete picture here, I'm imagining that there are 10 million parallel instances, running at 10x human speed, working 24/7. See e.g. similar calculations here). And suppose (as I think is 35% likely) that these models are egregiously misaligned and are actively looking for ways to subvert your safety measures and escape your control. You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigated this by running other instances on the inputs they'd see if that backdoored code had actually been deployed; the other AI instances never point out the security flaw, and they often try to exploit that code to escalate their privileges. (See here for more on this.) At this point you're almost sure the AI is scheming. So you try to call the US President and they hear you out, and then they go off and call one of your competitors to ask for their take. They say something like this:

It would be extremely costly to shut down all the AI projects. China is racing to catch up and won't stop for this. If you force us to stop deploying our frontier models, they'll outcompete us within a year, and you won't be able to get China to pause without substantial risk of war.

AIs are well known to do weird stuff. It would be irresponsible to assume one instance of anomalous behavior meant AIs were systematically going to act that way. I can put you in touch with top AI experts and they'll back me up.

Even if the AI is indeed doing something systematically funny, we have no evidence that it has the ambition to seize huge amounts of power, and it's not clear that it would be able to. It seems implausible that AIs would do that; there's been no concrete evidence that the AIs are in fact power-hungry.

Maybe our competitor just messed up their alignment, which would make sense because, as we've always told you, they're not very competent (especially compared to us). If they want to shut down, they're welcome to. But it doesn't make sense to penalize us for their shoddy work.

Maybe they're just lying; this could be a doomer scheme to shut down AI, which the doomers have wanted to do for years. The logs they sent you could be complete forgeries. Or for all we know, someone there (with or without leadership approval) intentionally backdoored their RL data to make their model do this. It would be a mistake to take rash action before independent experts confirm that this wasn't somehow rigged.

I'm sympathetic to all of these arguments.
The main reason I'd be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn't change my mind based on one weird observation. (I think it's pretty plausible that news of the escape attempt wouldn't even make it out of the AI lab: all the above arguments could happen inside the AI lab, between the safety-concerned people and people who wanted to push forward.) Now, I'm not sure I'm right to be so pessimistic here. But I think a lot of misalignment-concerned people take for granted that we'd be okay in this situation, an...

AF - Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs by Michaël Trazzi

Aug 24th, 2024 4:30 AM

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Owain Evans on Situational Awareness and Out-of-Context Reasoning in LLMs, published by Michaël Trazzi on August 24, 2024 on The AI Alignment Forum.

Owain Evans is an AI Alignment researcher, research associate at the Center for Human-Compatible AI at UC Berkeley, and now leading a new AI safety research group. In this episode we discuss two of his recent papers, "Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs" (LW) and "Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data" (LW), alongside some Twitter questions. Below are some highlighted quotes from our conversation (available on YouTube, Spotify, Apple Podcasts). For the full context for each of these quotes, you can find the accompanying transcript.

Situational Awareness Definition

"What is situational awareness? The idea is the model's kind of self-awareness, that is its knowledge of its own identity, and then its awareness of its environment. What are the basic interfaces that it is connected to? [...] And then there's a final point with situational awareness, which is, can the model use knowledge of its identity and environment to take rational actions?"

"Situational awareness is crucial for an AI system acting as an agent, doing long-term planning. If you don't understand what kind of thing you are, your capabilities and limitations, it's very hard to make complicated plans. The risks of AI mostly come from agentic models able to do planning."

Motivation

"We wanted to measure situational awareness in large language models with a benchmark similar to Big Bench or MMLU. The motivation is that situational awareness is important for thinking about AI risks, especially deceptive alignment, and we lacked ways to measure and break it down into components."

"Situational awareness is relevant to any situation where the model needs to do agentic long-term planning. [...] A model confused about itself and its situation would likely struggle to pull off such a strategy."

On Claude 3 Opus Insightful Answers

"Let me explain [the Long Monologue task]. Most of our dataset is typical multiple-choice question answering, but we added a task where models write long answers describing themselves and their situation. The idea is to see if the model can combine different pieces of information about itself coherently and make good inferences about why we're asking these questions. Claude 3 Opus was particularly insightful, guessing it might be part of a research study testing self-awareness in LLMs. These were true inferences not stated in the question. The model was reading between the lines, guessing this wasn't a typical ChatGPT-style interaction. I was moderately surprised, but I'd already seen Opus be very insightful and score well on our benchmark. It's worth noting we sample answers with temperature 1, so there's some randomness. We saw these insights often enough that I don't think it's just luck. Anthropic's post-training RLHF seems good at giving the model situational awareness. The GPT-4 base results were more surprising to us."
What Would Saturating The Situational Awareness Benchmark Imply For Safety And Governance

"If models can do as well or better than humans who are AI experts, who know the whole setup, who are trying to do well on this task, and they're doing well on all the tasks including some of these very hard ones, that would be one piece of evidence. [...] We should consider how aligned it is, what evidence we have for alignment. We should maybe try to understand the skills it's using."

"If the model did really well on the benchmark, it seems like it has some of the skills that would help with deceptive alignment. This includes being able to reliably work out when it's being evaluated by humans, when it has a lot of oversight, and when it needs to...

AF - Showing SAE Latents Are Not Atomic Using Meta-SAEs by Bart Bussmann

Aug 24th, 2024 12:56 AM

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Showing SAE Latents Are Not Atomic Using Meta-SAEs, published by Bart Bussmann on August 24, 2024 on The AI Alignment Forum.

Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda's streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback!

TL;DR: Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable property of the input). It is sometimes implicitly assumed that they are therefore atomic, i.e. simple, irreducible units that make up the model's computation. We provide evidence against this assumption by finding sparse, interpretable decompositions of SAE decoder directions into seemingly more atomic latents, e.g. Einstein -> science + famous + German + astronomy + energy + starts with E. We do this by training meta-SAEs: SAEs trained to reconstruct the decoder directions of a normal SAE. We argue that, conceptually, there's no reason to expect SAE latents to be atomic - when the model is thinking about Albert Einstein, it likely also thinks about Germanness, physicists, etc. Because Einstein always entails those things, the sparsest solution is to have the Albert Einstein latent also boost them.

Key results

SAE latents can be decomposed into more atomic, interpretable meta-latents.

We show that when latents in a larger SAE have split out from latents in a smaller SAE, a meta-SAE trained on the larger SAE often recovers this structure.

We demonstrate that meta-latents allow for more precise causal interventions on model behavior than SAE latents on a targeted knowledge editing task.

We believe that the alternate, interpretable decomposition using meta-SAEs casts doubt on the implicit assumption that SAE latents are atomic.

We show preliminary results that meta-SAE latents have significant overlap with latents in a normal SAE of the same size but may relate differently to the larger SAEs used in meta-SAE training.

We made a dashboard that lets you explore meta-SAE latents.

Terminology: Throughout this post we use "latents" to describe the concrete components of the SAE's dictionary, whereas "feature" refers to the abstract concepts, following Lieberum et al.

Introduction

Mechanistic interpretability (mech interp) attempts to understand neural networks by breaking down their computation into interpretable components. One of the key challenges of this line of research is the polysemanticity of neurons, meaning they respond to seemingly unrelated inputs. Sparse autoencoders (SAEs) have been proposed as a method for decomposing model activations into sparse linear sums of latents. Ideally, these latents should be monosemantic, i.e. respond to inputs that clearly share a similar meaning (implicitly, from the perspective of a human interpreter). That is, a human should be able to reason about the latents both in relation to the features to which they are associated, and also use the latents to better understand the model's overall behavior. There is a popular notion, both implicitly in related work on SAEs within mech interp and explicitly by the use of the term "atom" in sparse dictionary learning as a whole, that SAE features are atomic or can be "true features". However, monosemanticity does not imply atomicity.
Consider the example of shapes of different colors - the set of shapes is [circle, triangle, square], and the set of colors is [white, red, green, black], each of which is represented with a linear direction. 'Red triangle' represents a monosemantic feature, but not an atomic feature, as it can be decomposed into red and triangle. It has been shown that sufficiently wide SAEs on toy models will learn 'red triangle', rather than representing 'red' and 'triangle' with separate latents. Furthermore, whilst one may naively re...
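
To make the core construction concrete, here is a minimal sketch of the meta-SAE idea described above: treat the decoder directions of an ordinary SAE as a dataset and train a second sparse autoencoder on them. The architecture sizes, L1 coefficient, and training loop below are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of a meta-SAE: train a second sparse autoencoder whose
# "dataset" is the decoder directions of an ordinary SAE. Sizes, the L1
# coefficient, and the training loop are illustrative placeholders only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in, bias=False)

    def forward(self, x):
        latents = torch.relu(self.enc(x))   # sparse, non-negative latent activations
        return self.dec(latents), latents

# Base SAE over model activations (assumed to be already trained in practice).
d_model, n_latents, n_meta = 768, 16384, 1024
base_sae = SparseAutoencoder(d_model, n_latents)

# The meta-SAE's training data: one row per base-SAE decoder direction.
decoder_dirs = base_sae.dec.weight.T.detach()              # shape (n_latents, d_model)
decoder_dirs = decoder_dirs / decoder_dirs.norm(dim=-1, keepdim=True)

meta_sae = SparseAutoencoder(d_model, n_meta)
opt = torch.optim.Adam(meta_sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity pressure; value is an arbitrary placeholder

for step in range(1000):
    recon, latents = meta_sae(decoder_dirs)
    loss = (recon - decoder_dirs).pow(2).mean() + l1_coeff * latents.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

On this sketch's reading of the post, each base-SAE latent (e.g. "Albert Einstein") ends up reconstructed as a sparse combination of meta-latents (e.g. "German", "physicist"), which is what the decomposition examples above illustrate.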

AF - Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025) by Linda Linsefors

Aug 23rd, 2024 2:18 PM

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025), published by Linda Linsefors on August 23, 2024 on The AI Alignment Forum.

Do you have AI Safety research ideas that you would like to work on with others? Is there a project you want to do and you want help finding a team? AI Safety Camp could be the solution for you!

Summary

AI Safety Camp Virtual is a 3-month long online research program from January to April 2025, where participants form teams to work on pre-selected projects. We want you to suggest the projects! If you have an AI Safety project idea and some research experience, apply to be a Research Lead. If accepted, we offer some assistance to develop your idea into a plan suitable for AI Safety Camp. When project plans are ready, we open up team member applications. You get to review applications for your team and select who joins as a team member. From there, it's your job to guide work on your project.

Who is qualified?

We require that you have some previous research experience. If you are at least 1 year into a PhD, or if you have completed an AI Safety research program (such as a previous AI Safety Camp, PIBBSS, MATS, and similar), or done a research internship with an AI Safety org, then you are qualified already. Other research experience can count, too. More senior researchers are of course also welcome, as long as you think our format of leading an online team inquiring into your research questions suits you and your research. We require that all Research Leads are active participants in their projects and spend at least 10h/week on AISC.

Apply here

If you are unsure, or have any questions, you are welcome to:
Book a call with Robert
Send an email to Robert

Choosing project idea(s)

AI Safety Camp is about ensuring future AIs are either reasonably safe or not built at all. We welcome many types of projects, including projects aimed at stopping or pausing AI development, aligning AI, deconfusion research, or anything else you think will help make the world safer from AI. If you like, you can read more of our perspectives on AI safety, or look at past projects.

If you already have an idea for what project you would like to lead, that's great. Apply with that one! However, you don't need to come up with an original idea. What matters is that you understand the idea you want to work on, and why. If you base your proposal on someone else's idea, make sure to cite them.

1. For ideas on stopping harmful AI, see here and/or email Remmelt.
2. For some mech-interp ideas, see here.
3. We don't have specific recommendations for where to find other types of project ideas, so just take inspiration wherever you find it.

You can submit as many project proposals as you want. However, you are only allowed to lead one project. Use this template to describe each of your project proposals. We want one document per proposal.

We'll help you improve your project

As part of the Research Lead application process, we'll help you improve your project. The organiser whose ideas match best with yours will work with you to create the best version of your project. We will also ask for assistance from previous Research Leads, and up to a handful of other trusted people, to give you additional feedback.

Timeline

Research Lead applications:
September 22 (Sunday): Application deadline for Research Leads.
October 20 (Sunday): Deadline for refined proposals.

Team member applications:
October 25 (Friday): Accepted proposals are posted on the AISC website. Application to join teams open.
November 17 (Sunday): Application to join teams closes.
December 22 (Sunday): Deadline for Research Leads to choose their team.

Program:
Jan 11 - 12: Opening weekend.
Jan 13 - Apr 25: Research is happening. Teams meet weekly, and plan in their own work hours.
April 26 - 27 (preliminary dates):...
