The Nonlinear Library: Alignment Forum
Education
AF - Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant by Olli Järviniemi
AF - Mechanistic Interpretability Workshop Happening at ICML 2024! by Neel Nanda
AF - Why I am no longer thinking about/working on AI safety by Jack Koch
AF - Take SCIFs, it's dangerous to go alone by latterframe
AF - Mechanistically Eliciting Latent Behaviors in Language Models by Andrew Mack
AF - Transcoders enable fine-grained interpretable circuit analysis for language models by Jacob Dunefsky
AF - Towards a formalization of the agent structure problem by Alex Altair
AF - AISC9 has ended and there will be an AISC10 by Linda Linsefors
AF - [Aspiration-based designs] Outlook: dealing with complexity by Jobst Heitzig
AF - [Aspiration-based designs] 3. Performance and safety criteria, and aspiration intervals by Jobst Heitzig
AF - [Aspiration-based designs] 2. Formal framework, basic algorithm by Jobst Heitzig
AF - [Aspiration-based designs] 1. Informal introduction by B Jacobs
AF - Refusal in LLMs is mediated by a single direction by Andy Arditi
AF - Superposition is not "just" neuron polysemanticity by Lawrence Chan
AF - An Introduction to AI Sandbagging by Teun van der Weij
AF - AXRP Episode 29 - Science of Deep Learning with Vikrant Varma by DanielFilan
AF - Improving Dictionary Learning with Gated Sparse Autoencoders by Neel Nanda
AF - Simple probes can catch sleeper agents by Monte MacDiarmid
AF - Dequantifying first-order theories by Jessica Taylor