The Nonlinear Library: Alignment Forum

https://storage.googleapis.com/rssfile/nonlinear-library-aggregated-AF.xml
3 Followers · 384 Episodes
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

Episode List

AF - Interoperable High Level Structures: Early Thoughts on Adjectives by johnswentworth

Aug 22nd, 2024 9:12 PM

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Interoperable High Level Structures: Early Thoughts on Adjectives, published by johnswentworth on August 22, 2024 on The AI Alignment Forum.

Meta: This post is a relatively rough dump of some recent research thoughts; it's not one of our more polished posts, in terms of either clarity or rigor. You've been warned.

The Interoperable Semantics post and the Solomonoff Inductor Walks Into A Bar post each tackled the question of how different agents in the same world can coordinate on an ontology, so that language can work at all given only a handful of example usages of each word (similar to e.g. children learning new words). Both use natural latents as a central mathematical tool - one in a Bayesian probabilistic framework, the other in a minimum description length framework. Both focus mainly on nouns, i.e. interoperable-across-minds clusters of "objects" in the environment.

… and the two propose totally different models. In one, the interoperability of cluster labels (i.e. nouns) follows from natural latent conditions over different features of each object. In the other, interoperability follows from natural latent conditions across objects, with no mention of features. The two models are not, in general, equivalent; they can't both be both correct and complete.

In this post, we'll propose that while the natural latent conditions over objects still seem to intuitively capture the rough notion of nouns, the natural latent conditions over features seem much better suited to adjectives. We'll briefly lay out two different potential ways to use natural latents over features as semantic values for adjectives. Then we'll talk a bit about implications, open threads, and how this fits into a broader research gameplan.

The Problem

When children learn language, the cognitive process seems to go:
  • Observe the world a bunch
  • … organize knowledge of the world according to some categories, concepts, ontology, etc
  • … those categories, concepts, ontology, etc match other humans' categories, concepts, ontology, etc reasonably well
  • … so it only takes a handful of examples (1-3, say) of the use of a given word in order for the child to learn what the word refers to.

The crucial point here is that the categories/concepts/ontology are mostly learned before a word is attached; children do not brute-force learn categories/concepts/ontology from "labeled data". We can tell this is true mainly because it typically takes so few examples to learn the meaning of a new word.

The big puzzle, then, is that different humans learn mostly approximately the same categories/concepts/ontology - i.e. the same "candidates" to which words might point - as required for language to work at all with so few examples. How does that work? Mathematically, what are those "interoperable" categories/concepts/ontology, which different humans mostly convergently learn? How can we characterize them?

Or, somewhat earlier on the tech tree: can we find even a single model capable of accounting for the phenomenon of different minds in the same environment robustly converging on approximately the same categories/concepts/ontology? Forget whether we can find a model which correctly captures the ontology converged upon by humans: can we even find any model capable of accounting for any sort of robust ontological convergence? Can we find such a model for which the convergent ontology even vaguely resembles the sorts of things in human language (nouns, verbs, adjectives, etc)? What would such a model even look like? That's roughly the stage we're at in this post.

Two Previous Models: Naturality Over Objects vs Features

Our main tool is (deterministic) natural latents. The usage looks like: Suppose the different minds each look for (and find) a latent variable which satisfies the natural latent conditions over some lower-level variab...
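The natural latent conditions mentioned here are made quantitative in the next episode's summary (mediation and redundancy, each measured by a KL divergence). As a purely illustrative aside, not part of the post, here is a minimal sketch of what "satisfying the natural latent conditions over some lower-level variables" can look like for a toy candidate latent; the binary variables, the noise level, and the per-variable redundancy form (taken from the next episode's derivation) are all assumptions of the sketch.

```python
import numpy as np

# Toy sketch of the "natural latent conditions over lower-level variables" idea.
# Lambda is a candidate latent over two low-level binary variables X1, X2 which
# are noisy copies of Lambda. Everything below (the distributions, the noise
# level) is an illustrative assumption, not taken from the post.

p_lam = np.array([0.5, 0.5])                        # P[Lambda]
p_x_given_lam = np.array([[0.9, 0.1],               # P[X_i = x | Lambda = lam],
                          [0.1, 0.9]])              # rows indexed by lam
# joint[x1, x2, lam] = P[Lambda] * P[X1|Lambda] * P[X2|Lambda]
joint = p_lam[None, None, :] * p_x_given_lam.T[:, None, :] * p_x_given_lam.T[None, :, :]

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Mediation condition: X1 and X2 should be (approximately) independent given Lambda.
p_lam_marg = joint.sum(axis=(0, 1))
p_x1_given_lam = joint.sum(axis=1) / p_lam_marg
p_x2_given_lam = joint.sum(axis=0) / p_lam_marg
mediation = kl(joint, p_lam_marg[None, None, :]
                      * p_x1_given_lam[:, None, :]
                      * p_x2_given_lam[None, :, :])
print("mediation error:", mediation)                # ~0 here, by construction

# Redundancy condition w.r.t. X1 (in the form used in the next episode's summary):
# P[Lambda | X1, X2] should be (approximately) recoverable from X1 alone.
p_x = joint.sum(axis=2)
p_x1_lam = joint.sum(axis=1)
p_lam_given_x1 = p_x1_lam / p_x1_lam.sum(axis=1, keepdims=True)
redundancy_1 = kl(joint, p_x[:, :, None] * p_lam_given_x1[:, None, :])
print("redundancy error w.r.t. X1:", redundancy_1)  # nonzero: X1 alone is noisier than (X1, X2)
```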

AF - A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed by johnswentworth

Aug 22nd, 2024 7:19 PM

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed, published by johnswentworth on August 22, 2024 on The AI Alignment Forum.

This post walks through the math for a theorem. It's intended to be a reference post, which we'll link back to as-needed from future posts. The question which first motivated this theorem for us was: "Redness of a marker seems like maybe a natural latent over a bunch of parts of the marker, and redness of a car seems like maybe a natural latent over a bunch of parts of the car, but what makes redness of the marker 'the same as' redness of the car? How are they both instances of one natural thing, i.e. redness? (or 'color'?)". But we're not going to explain in this post how the math might connect to that use-case; this post is just the math.

Suppose we have multiple distributions P_1, …, P_k over the same random variables X_1, …, X_n. (Speaking somewhat more precisely: the distributions are over the same set, and an element of that set is represented by values (x_1, …, x_n).) We take a mixture of the distributions: P[X] := Σ_j α_j P_j[X], where Σ_j α_j = 1 and α is nonnegative.

Then our theorem says: if an approximate natural latent exists over P[X], and that latent is robustly natural under changing the mixture weights α, then the same latent is approximately natural over P_j[X] for all j.

Mathematically: the natural latent over P[X] is defined by (x,λ) ↦ P[Λ=λ|X=x], and naturality means that the distribution (x,λ) ↦ P[Λ=λ|X=x] P[X=x] satisfies the naturality conditions (mediation and redundancy). The theorem says that, if the joint distribution (x,λ) ↦ P[Λ=λ|X=x] Σ_j α_j P_j[X=x] satisfies the naturality conditions robustly with respect to changes in α, then (x,λ) ↦ P[Λ=λ|X=x] P_j[X=x] satisfies the naturality conditions for all j.

"Robustness" here can be interpreted in multiple ways - we'll cover two here, one for which the theorem is trivial and another more substantive, but we expect there are probably more notions of "robustness" which also make the theorem work.

Trivial Version

First notion of robustness: the joint distribution (x,λ) ↦ P[Λ=λ|X=x] Σ_j α_j P_j[X=x] satisfies the naturality conditions to within ϵ for all values of α (subject to Σ_j α_j = 1 and α nonnegative). Then: the joint distribution (x,λ) ↦ P[Λ=λ|X=x] Σ_j α_j P_j[X=x] satisfies the naturality conditions to within ϵ specifically for α_j = δ_{jk}, i.e. α which is 0 in all entries except a 1 in entry k. In that case, the joint distribution is (x,λ) ↦ P[Λ=λ|X=x] P_k[X=x], therefore Λ is natural over P_k. Invoke for each k, and the theorem is proven.

... but that's just abusing an overly-strong notion of robustness. Let's do a more interesting one.

Nontrivial Version

Second notion of robustness: the joint distribution (x,λ) ↦ P[Λ=λ|X=x] Σ_j α_j P_j[X=x] satisfies the naturality conditions to within ϵ, and the gradient of the approximation error with respect to (allowed) changes in α is (locally) zero. We need to prove that the joint distributions (x,λ) ↦ P[Λ=λ|X=x] P_j[X=x] satisfy both the mediation and redundancy conditions for each j. We'll start with redundancy, because it's simpler.

Redundancy

We can express the approximation error of the redundancy condition with respect to X_i under the mixed distribution as

D_KL(P[Λ,X] || P[X] P[Λ|X_i]) = E_X[D_KL(P[Λ|X] || P[Λ|X_i])]

where, recall, P[Λ,X] := P[Λ|X] Σ_j α_j P_j[X].
We can rewrite that approximation error as:

E_X[D_KL(P[Λ|X] || P[Λ|X_i])]
= Σ_j α_j Σ_X P_j[X] D_KL(P[Λ|X] || P[Λ|X_i])
= Σ_j α_j E_{X∼P_j}[D_KL(P[Λ|X] || P[Λ|X_i])]

Note that P_j[Λ|X] = P[Λ|X] is the same under all the distributions (by definition), so:

= Σ_j α_j D_KL(P_j[Λ,X] || P_j[X] P[Λ|X_i])

and by factorization transfer:

≥ Σ_j α_j D_KL(P_j[Λ,X] || P_j[X] P_j[Λ|X_i])

In other words: if ϵ_{ji} is the redundancy error with respect to X_i under distribution j, and ϵ_i is the redundancy error with respect to X_i under the mixed distribution P, then ϵ_i ≥ Σ_j α_j ϵ_{ji}. The redundancy error of the mixed distribution is a...
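As an illustrative aside (not from the post), the reconstructed inequality ϵ_i ≥ Σ_j α_j ϵ_{ji} can be sanity-checked numerically on a toy discrete example. The two binary variables, the randomly drawn component distributions, and the shared conditional P[Λ|X] below are assumptions of the sketch; only the definitions of the redundancy error and the mixture come from the post.

```python
import numpy as np

# Toy numerical check of the reconstructed inequality eps_i >= sum_j alpha_j * eps_ji.
rng = np.random.default_rng(0)

p_lam_given_x = rng.dirichlet(np.ones(2), size=(2, 2))        # P[Lambda | X1, X2], shared
components = [rng.dirichlet(np.ones(4)).reshape(2, 2) for _ in range(2)]  # P_j[X]
alpha = np.array([0.3, 0.7])
mixture = sum(a * P for a, P in zip(alpha, components))       # P[X] = sum_j alpha_j P_j[X]

def redundancy_error(p_x, axis):
    """eps_i = E_X[ D_KL( P[Lambda|X] || P[Lambda|X_i] ) ] under the distribution p_x."""
    joint = p_x[:, :, None] * p_lam_given_x                   # joint over (X1, X2, Lambda)
    p_xi_lam = joint.sum(axis=1 - axis)                       # marginalize the other X
    p_lam_given_xi = p_xi_lam / p_xi_lam.sum(axis=-1, keepdims=True)
    q = p_lam_given_xi[:, None, :] if axis == 0 else p_lam_given_xi[None, :, :]
    return float(np.sum(joint * (np.log(p_lam_given_x) - np.log(q))))

for i in (0, 1):
    eps_mix = redundancy_error(mixture, i)
    bound = sum(a * redundancy_error(P, i) for a, P in zip(alpha, components))
    print(f"X_{i+1}: mixture error {eps_mix:.4f} >= weighted component errors {bound:.4f}")
    assert eps_mix >= bound - 1e-9
```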

AF - Measuring Structure Development in Algorithmic Transformers by Jasmina Nasufi

Aug 22nd, 2024 9:45 AM

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring Structure Development in Algorithmic Transformers, published by Jasmina Nasufi on August 22, 2024 on The AI Alignment Forum.

tl;dr: We compute the evolution of the local learning coefficient (LLC), a proxy for model complexity, for an algorithmic transformer. The LLC decreases as the model learns more structured solutions, such as head specialization.

This post is structured in three main parts: (1) a summary, giving an overview of the main results, (2) the Fine Print, which delves into various cross-checks and details, and (3) Discussion and Conclusions.

Structure Formation in Algorithmic Transformers

In this work we study the development of simple algorithmic transformers, which are transformers that learn to perform algorithmic tasks. A major advantage of this setup is that we can control several (hyper)parameters, such as the complexity of the training data and network architecture. This allows us to do targeted experiments studying the impacts of these parameters on the learning dynamics. The main tool we use to study the development is the Local Learning Coefficient (LLC), and we choose cases where we have a reverse-engineered solution.

Why use the LLC for this purpose? It is a theoretically well-motivated measure of model complexity defined by Lau et al. For an overview of Singular Learning Theory (which serves as the theoretical foundation for the LLC) see Liam Carroll's Distilling SLT sequence. For a brief overview of the LLC see e.g. this post.

We use the same setup as CallumMcDougall's October Monthly Algorithmic Mech-Interp Challenge. The model is an attention-only transformer, trained on sorting numbers with layer norm and weight decay on a cross-entropy loss function using the Adam optimizer. The residual stream size is 96 and the head dimension is 48. It is trained on sequences of the form [unsorted list] [SEP] [sorted list], predicting the next token starting at the separation token. The numbers in the list are sampled uniformly from 0 to 50, which together with the separation token produce a vocabulary of 52 tokens. Numbers do not repeat in the list. The images making up the gifs can be found here.

1-Head Model

Let's first look at the case of a 1-head transformer: The model reaches 100% accuracy around training step 100, confirming that a single attention head is sufficient for sorting, as noted in previous work. Once maximum accuracy is reached, the full QK and OV circuits[2] behave as described by Callum for the 2-head model: In the QK circuit, source tokens attend more to the smallest token in the list larger than themselves. This results in a higher value band above the diagonal and a lower value band below the diagonal. The OV circuit copies tokens, as seen by the clear positive diagonal pattern.

In addition, we observe a transition around training step 1000, where the LLC decreases while the accuracy stays unchanged. This is supported by a drop in the sum of the ranks[3] of the matrices in the heat maps. It also coincides with the formation of the off-diagonal stripes in the OV circuit. We speculate that these are simpler than the noisier off-diagonal OV pattern observed at peak LLC, and correspond to the translational symmetry of the problem. We define a Translational Symmetry measure[1] (see purple line in the plot) to capture the degree to which the circuits obey this symmetry. It increases throughout most of the training, even after the other measures stabilize.

2-Head Model

Let's now turn our attention to the 2-head transformer in Callum's original setup. We see a lot of qualitative similarities to the evolution of the full QK and OV circuits for the 1-head model. As the LLC begins to drop (around training step 1000), we note the following: QK circuit: Slight changes[5] to the attention pattern, which crystallize into triangular regions late in the training, long aft...
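For concreteness, here is a small sketch (not from the post) of the sorting-task data described above: numbers drawn uniformly from 0 to 50 without repeats, a separation token giving a 52-token vocabulary, and next-token prediction starting at the separator. The list length and the helper name make_batch are assumptions of the sketch, not values stated in the episode text.

```python
import numpy as np

# Sketch of the sorting-task training data described above (my reconstruction).
rng = np.random.default_rng(0)

NUM_TOKENS = 51     # number tokens 0..50
SEP = 51            # separation token -> vocabulary of 52 tokens total
LIST_LEN = 10       # assumed list length (not stated in the episode text)

def make_batch(batch_size):
    """Return token sequences of the form [unsorted list] [SEP] [sorted list]."""
    seqs = np.empty((batch_size, 2 * LIST_LEN + 1), dtype=np.int64)
    for b in range(batch_size):
        nums = rng.choice(NUM_TOKENS, size=LIST_LEN, replace=False)  # no repeats
        seqs[b] = np.concatenate([nums, [SEP], np.sort(nums)])
    return seqs

# The model is trained to predict the next token starting at the separation token,
# i.e. the loss would be applied only to positions from SEP onwards.
print(make_batch(4)[0])
```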

AF - AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work by Rohin Shah

Aug 20th, 2024 4:22 PM

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, published by Rohin Shah on August 20, 2024 on The AI Alignment Forum.

We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours.

Who are we?

We're the main team at Google DeepMind working on technical approaches to existential risk from AI systems. Since our last post, we've evolved into the AGI Safety & Alignment team, which we think of as AGI Alignment (with subteams like mechanistic interpretability, scalable oversight, etc.), and Frontier Safety (working on the Frontier Safety Framework, including developing and running dangerous capability evaluations). We've also been growing since our last post: by 39% last year, and by 37% so far this year.

The leadership team is Anca Dragan, Rohin Shah, Allan Dafoe, and Dave Orr, with Shane Legg as executive sponsor. We're part of the overall AI Safety and Alignment org led by Anca, which also includes Gemini Safety (focusing on safety training for the current Gemini models), and Voices of All in Alignment, which focuses on alignment techniques for value and viewpoint pluralism.

What have we been up to?

It's been a while since our last update, so below we list out some key work published in 2023 and the first part of 2024, grouped by topic / sub-team. Our big bets for the past 1.5 years have been 1) amplified oversight, to enable the right learning signal for aligning models so that they don't pose catastrophic risks, 2) frontier safety, to analyze whether models are capable of posing catastrophic risks in the first place, and 3) (mechanistic) interpretability, as a potential enabler for both frontier safety and alignment goals. Beyond these bets, we experimented with promising areas and ideas that help us identify new bets we should make.

Frontier Safety

The mission of the Frontier Safety team is to ensure safety from extreme harms by anticipating, evaluating, and helping Google prepare for powerful capabilities in frontier models. While the focus so far has been primarily around misuse threat models, we are also working on misalignment threat models.

FSF

We recently published our Frontier Safety Framework, which, in broad strokes, follows the approach of responsible capability scaling, similar to Anthropic's Responsible Scaling Policy and OpenAI's Preparedness Framework. The key difference is that the FSF applies to Google: there are many different frontier LLM deployments across Google, rather than just a single chatbot and API (this in turn affects stakeholder engagement, policy implementation, mitigation plans, etc.). We're excited that our small team led the Google-wide strategy in this space, and demonstrated that responsible capability scaling can work for large tech companies in addition to small startups.

A key area of the FSF we're focusing on as we pilot the Framework is how to map between the critical capability levels (CCLs) and the mitigations we would take. This is high on our list of priorities as we iterate on future versions. Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn't include commitments. This is because the science is in early stages and best practices will need to evolve. But ultimately, what we care about is whether the work is actually done. In practice, we did run and report dangerous capability evaluations for Gemini 1.5 that we think are sufficient to rule out extreme risk with high confidence.

Dangerous Capability Evaluations

Our paper on Evaluating Frontier Models for Dangerous Capabilities is the broadest suite of dangerous capability evaluati...

AF - Finding Deception in Language Models by Esben Kran

Aug 20th, 2024 9:42 AM

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Finding Deception in Language Models, published by Esben Kran on August 20, 2024 on The AI Alignment Forum.

This June, Apart Research and Apollo Research joined forces to host the Deception Detection Hackathon, bringing together students, researchers, and engineers from around the world to tackle a pressing challenge in AI safety: preventing AI from deceiving humans and overseers. The hackathon took place both online and in multiple physical locations simultaneously. Marius Hobbhahn, the CEO of Apollo Research, kicked off the hackathon with a keynote talk about evaluating deception in AI with white-box and black-box methods. You can watch his talk here. We also had talks by Jacob Haimes, an Apart fellow, and Mikita Balesni, a research scientist at Apollo Research.

This post details the top 8 projects, several of which are currently being developed further in Apart Lab. In total, we had 189 participants and 21 final entries.

Sandbagging Detection through Model Degradation

By Cam Tice, Philipp Alexander Kreer, Fedor Ryzhenkov, Nathan Helm-Burger, Prithviraj Singh Shahani

The team introduced a new approach to detect AI sandbagging, a form of deception where an AI model strategically underperforms during evaluation to hide its true capabilities. The assumption behind their project is that sandbagging is a more complex task than showing the true capabilities. In the graph above, you can see how they gradually add noise to the model's parameters and evaluate the models during this process. Sandbagging models consistently improve in performance right before the noise level causes a normal model to lose the capability. It is a straightforward, inexpensive technique that doesn't require prior assumptions about how sandbagging is represented within the model. They successfully detected sandbagging behavior in two open-source LLMs: an 8B Llama 3 Instruct model and a Phi 3 small instruct model. This approach offers a practical and accessible method for identifying potential deception in AI systems. Read the full project here.

Detecting and Controlling Deceptive Representation in LLMs with Representational Engineering

By Avyay M Casheekar, Kaushik Sanjay Prabhakar, Kanishk Rath, Sienka Dounia

The team focused on detecting and mitigating sandbagging in AI models using recent methods for understanding activations. They fine-tuned a Llama 3 8B model on a custom dataset of deceptive and honest scenarios, then used Linear Artificial Tomography (LAT) scans to detect deceptive behaviors in the internal representations of the LLM. The team's technique not only identified deceptive behaviors, but also showed promise in mitigating them. Importantly, their approach demonstrated good generalization to out-of-distribution scenarios, suggesting its potential robustness in real-world applications. Read the full project here.

Detecting Deception in GPT-3.5-turbo: A Metadata-Based Approach

By Siddharth Reddy Bakkireddy, Rakesh Reddy Bakkireddy

This team tackled the challenge of detecting deception in closed-source, black-box models like GPT-3.5-turbo. They investigated whether LLM API response metadata such as token count, response time, and tokens per second could be used to detect intentional deceptive behavior. The team analyzed 300 prompts and generated 1,200 responses from GPT-3.5-turbo, comparing the metadata of baseline and deceptive outputs. Their results revealed that deceptive outputs tend to have increased response times and altered token usage. This approach demonstrates that deception detection is possible without accessing a model's internal representation, opening up new avenues for monitoring and safeguarding AI systems, even when their inner workings are not accessible. Read the full project here.

Modelling the Oversight of Automated Interpretability Against Deceptive Agents on Sp...
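As an illustrative aside (not the team's released code), the noise-injection probe from the first project above can be sketched as a simple parameter-noise sweep; the evaluate callable and the noise_sweep helper below are hypothetical placeholders.

```python
import copy
import torch

def noise_sweep(model, evaluate, sigmas):
    """Evaluate the model after adding Gaussian noise of increasing scale to its weights.

    `evaluate` is a hypothetical placeholder (a callable returning benchmark accuracy);
    neither it nor this helper comes from the team's project code.
    """
    scores = []
    for sigma in sigmas:
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(sigma * torch.randn_like(p))  # perturb weights in place
        scores.append((sigma, evaluate(noisy)))
    return scores

# Reading the sweep (per the project summary): an honest model's score decays roughly
# monotonically as sigma grows, while a sandbagging model's score rises at intermediate
# sigma, just before the noise level destroys the capability for a normal model.
```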
