Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My favorite safety papers & sequences - Living post (vers.: May 15, 8 recs), published by the gears to ascension on May 15, 2023 on LessWrong.
Local mad librarian here. I've been urged to hurry up and make a reading list of favorite papers and thoughts on them. This is the first version; I'll improve it over time with occasional post refreshes, especially when the order changes or I improve an explanation. It will be most useful for those experienced with reading papers, but if you're new to the field, I still recommend reading things for which you lack the prereqs, in as much depth as your limited context allows, as it will tune your appetite for those prereqs.
I see all of these as interesting ideas that I think every alignment researcher in any subfield should read, but not as concluded answers to the questions we're trying to ask, and potentially not even the right kind of framings. If you're interested in doing a call where I watch you read through and comment on any of these, I'd be very eager, assuming timezones work out - leave a comment or send a DM! That goes especially if you're not sure which paper is worth the effort for you.
The initial list
Stuff I've read and recommend:
Discovering agents: agency definition I suspect is close to ideal
detailed reasoning below
Agents and devices: agency definition I find useful for rough thinking
detailed reasoning below
Logical induction for software engineers: induction definition I see as critical hunchbuilding and likely also a critical math tool
todo: write detailed reasoning and pitch for reading it
Unifying bargaining notions: working through the math of different game theory perspectives defined here has been insightful
todo: write reasoning and pitch
Finite factored sets in pictures: definition of causal inference based on direction of entropy instead of anything to do with time
todo: write reasoning and pitch, especially for why to start with this one. see also the full sequence
Other agent foundations related work I have high on my todo list:
Geometric Rationality sequence: I haven't actually finished anything besides the first post; I'm including it based on a strong hunch
Boundaries sequence: has been high on my todo list for a while but keeps getting bumped down
Andrew Critch's Löbposting - all of his posts that mention Löb.
There are a bunch more things I'm planning to add, but I need to move on to other tasks right now, so I'm posting this as is, with far less exposition on why I like each one than I intend to have eventually. Hope it helps someone!
Discovering Agents (Kenton et al)
Arxiv pdf
Lesswrong / Alignmentforum summary and comments
My absolute favorite paper in the safety literature right now, and worth a level 3 read. Discovering Agents attempts to pin down the word "agency" by describing how to recognize it in a learning system that interacts with an environment. It has the limitation that it only works if you already know which part of the system is the potentially agentic learner, and you merely want to discover what it seeks. Also, the algorithm it defines is computable but thoroughly intractable for agents of any significant size, mostly limiting it to a formal definition, though substituting fast approximate causal inference may make it fairly practical. The comments on both LessWrong and AlignmentForum are essential for understanding what it actually gives you, which is not quite as general as one might hope from the abstract.
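To give a flavor of the core idea (this is my toy illustration, not the paper's actual algorithm, and all function names here are hypothetical): the paper's test is roughly "would this part of the system adapt its behavior if the mechanism determining its consequences were changed?" A system that re-optimizes under such a mechanism intervention is a candidate agent; a fixed device is not. A minimal sketch in Python:

```python
# Toy sketch of the "adapts under mechanism intervention" intuition
# behind Discovering Agents. NOT the paper's algorithm: the real
# version operates on mechanised causal graphs, not payoff dicts.

def agent_policy(utility):
    """An 'agent': re-optimizes, picking the highest-payoff action."""
    return max(utility, key=utility.get)

def device_policy(utility):
    """A 'device': always outputs action 'a', ignoring payoffs."""
    return "a"

def adapts_to_mechanism_intervention(policy):
    """Intervene on the utility mechanism (flip which action pays off)
    and check whether the decision changes in response."""
    base = {"a": 1.0, "b": 0.0}
    intervened = {"a": 0.0, "b": 1.0}
    return policy(base) != policy(intervened)

print(adapts_to_mechanism_intervention(agent_policy))   # True: adapts
print(adapts_to_mechanism_intervention(device_policy))  # False: inert
```

The real criterion is stated in terms of mechanism nodes in a causal game and is what makes the algorithm intractable at scale, but this captures the behavioral distinction it formalizes.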
As with most of the incredible papers from the causal incentives group, it's not the easiest to understand unless you're quite experienced at reading math papers; it took me several passes of reading it with people to be fairly sure I got it, and it was well worth the significant effort. I had to skim the related paper Reasoning about Causality in Games to refresh my understanding of ...