Download - Can you get AGI from a Transformer? by Steve Byrnes

Discover

Podcast Features
Your all-in-one podcasting solution.

Podcast Studio
Easy-to-use audio recorder app.
Livestream
High-performing audio live, without limits.

Podcast App
The best podcast player & podcast app.
Podbean AI
AI-Enhanced Audio Quality and Content Generation.

Ads Marketplace
Join Ads Marketplace to earn money
through sponsorship on your podcast.

PodAds
Manage your ads with dynamic ad insertion capability.
Patron & Paid Content
The seamless way for fans to support you directly
from your podcast.
Apple Podcasts Subscriptions Integration
Effortlessly publish and manage exclusive episodes for your
Apple Podcasts subscribers directly from Podbean.

All Arts Business Comedy Education
Fiction Government Health & Fitness History Kids & Family
Leisure Music News Religion & Spirituality Science
Society & Culture Sports Technology True Crime TV & Film
Live

How to Start a Podcast
How to Start a Live Podcast
How to Monetize a podcast
How to Promote Your Podcast
How to Use Group Recording

Log in
Start your podcast for free

Podcasting
Monetization
Enterprise
Pricing
Discover

The Nonlinear Library: Alignment Forum Top Posts

Education

Can you get AGI from a Transformer? by Steve Byrnes

2021-12-05

Download Right click and do "save link as"

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Can you get AGI from a Transformer? , published by Steve Byrnes on the AI Alignment Forum. Introduction I want to share my thoughts about the calculations that Transformers (such as GPT-3) do, and the calculations that I think are required for general intelligence, and how well they line up, and what I think GPT-3 is doing under the hood, and why I think an arbitrary transformer-based GPT-N might be incapable of doing certain tasks are seemingly essential for a system to qualify as an AGI. Epistemic status: Very low confidence, to the point that I almost decided to delete this without posting it. I think some of my opinions here are very unpopular, and I would love any feedback or discussion. Before we get into it, I want to make a couple background claims. The point here is basically to argue that the question “Can you get general intelligence by sufficiently scaling up a Transformer?” is worth asking, and does not have an answer of “Obviously yes, duh!!!” You can skip this part if you already agree with me on that. Background Claim 1: There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (= matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty. (Update: After discussion in the comments section here, I should have framed this section differently: I should have said "For any given specific learning algorithm / architecture framework (e.g. "GPT-like transformer architectures"), there are types of information processing that cannot be performed within that specific framework, except with an exorbitant performance penalty". I should never have mentioned "DNN-type calculations (=matrix multiplications, ReLUs, etc.)", because that's awfully vague (what exactly does the "etc." mean?), and anyway it's irrelevant to this post. Thanks gwern.) By “information processing” I mean anything from sorting algorithms to data compression, random access memories, hash tables, whatever. Let’s take Monte Carlo Tree Search (MCTS) as an example. AlphaZero does MCTS because DeepMind engineers explicitly programmed it to do MCTS—not because a generic RNN or other deep learning system spontaneously discovered, during gradient descent, that MCTS is a good idea. Now, in principle, DNNs are universal function approximators, and more to the point, RNNs are Turing complete. So an RNN can emulate any other algorithm, including MCTS. But that doesn’t mean it can emulate it efficiently! Let’s say we take a generic (PyTorch default) RNN, and train it such that it is incentivized to discover and start using MCTS. Assuming that the gradient flows converge to MCTS (a big "if"!), I believe (low confidence) that its only method for actually executing the MCTS involves: Taking a certain highly parallelized algorithm running on a GPU (matrix multiplications, ReLUs, etc.)... ...and using it to emulate a Turing-machine-type serial computer. ...and using that to emulate a different highly parallelizable algorithm! This is absurdly inefficient when compared to MCTS written by a DeepMind engineer and compiled to run directly on bare hardware with appropriate parallelization. Like, maybe, factor-of-a-million inefficient—this is not the kind of inefficiency where you can just shrug it off and wait a year or two for Moore's law to care of it. MCTS is just one example. Again, you can open up your algorithms textbook and find thousands of ways to process information. What fraction of these can be implemented reasonably well in the form of DNN-type matrix multiplications / ReLUs / etc.? I expect 100%. If any such type of information processing is essential for AGI, then we should expect that we won’t get AGI in a pure DNN. (We could still get it in a DNN-plus-other-stuff, e.g. DNN-plus-MCTS, DNN-plus-random-access-memory, etc.)...