Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Can you get AGI from a Transformer?, published by Steve Byrnes on the AI Alignment Forum.
Introduction
I want to share my thoughts about the calculations that Transformers (such as GPT-3) do, and the calculations that I think are required for general intelligence, and how well they line up, and what I think GPT-3 is doing under the hood, and why I think an arbitrary transformer-based GPT-N might be incapable of doing certain tasks that are seemingly essential for a system to qualify as an AGI.
Epistemic status: Very low confidence, to the point that I almost decided to delete this without posting it. I think some of my opinions here are very unpopular, and I would love any feedback or discussion.
Before we get into it, I want to make a couple of background claims. The point here is basically to argue that the question “Can you get general intelligence by sufficiently scaling up a Transformer?” is worth asking, and does not have an answer of “Obviously yes, duh!!!” You can skip this part if you already agree with me on that.
Background Claim 1: There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (= matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty.
(Update: After discussion in the comments section here, I should have framed this section differently: I should have said "For any given specific learning algorithm / architecture framework (e.g. "GPT-like transformer architectures"), there are types of information processing that cannot be performed within that specific framework, except with an exorbitant performance penalty". I should never have mentioned "DNN-type calculations (=matrix multiplications, ReLUs, etc.)", because that's awfully vague (what exactly does the "etc." mean?), and anyway it's irrelevant to this post. Thanks gwern.)
By “information processing” I mean anything from sorting algorithms to data compression, random access memories, hash tables, whatever.
Let’s take Monte Carlo Tree Search (MCTS) as an example. AlphaZero does MCTS because DeepMind engineers explicitly programmed it to do MCTS—not because a generic RNN or other deep learning system spontaneously discovered, during gradient descent, that MCTS is a good idea.
Now, in principle, DNNs are universal function approximators, and more to the point, RNNs are Turing complete. So an RNN can emulate any other algorithm, including MCTS. But that doesn’t mean it can emulate it efficiently!
Let’s say we take a generic (PyTorch default) RNN, and train it such that it is incentivized to discover and start using MCTS. Assuming that the gradient flows converge to MCTS (a big "if"!), I believe (low confidence) that its only method for actually executing the MCTS involves:
Taking a certain highly parallelized algorithm running on a GPU (matrix multiplications, ReLUs, etc.)...
...and using it to emulate a Turing-machine-type serial computer.
...and using that to emulate a different highly parallelizable algorithm!
This is absurdly inefficient when compared to MCTS written by a DeepMind engineer and compiled to run directly on bare hardware with appropriate parallelization. Like, maybe, factor-of-a-million inefficient: this is not the kind of inefficiency where you can just shrug it off and wait a year or two for Moore's law to take care of it.
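To make the flavor of this overhead concrete, here is a minimal toy sketch in Python. It is not an RNN and not MCTS; it only compares a plain random-access memory read with the same read written as a dense dot product against a one-hot address vector, which is roughly how a matrix-multiplication-only system would have to express it. The function names (direct_read, one_hot, matmul_read) and the "cells touched" counter are made up for illustration, and the linear cost shown here says nothing about the factor-of-a-million guess above.

```python
def direct_read(memory, address):
    """Random-access read: one indexing operation, whatever the memory size."""
    return memory[address], 1  # (value, number of cells touched)


def one_hot(address, size):
    """Encode an address as a one-hot vector, the way a dense layer would see it."""
    return [1.0 if i == address else 0.0 for i in range(size)]


def matmul_read(memory, address):
    """The same read expressed as a dot product of a one-hot address with memory.

    Every cell is multiplied and accumulated, so the work grows linearly
    with the size of the memory even though only one cell matters.
    """
    selector = one_hot(address, len(memory))
    value = 0.0
    touched = 0
    for weight, cell in zip(selector, memory):
        value += weight * cell
        touched += 1
    return value, touched


if __name__ == "__main__":
    memory = [float(i * i) for i in range(1000)]
    print(direct_read(memory, 7))   # touches 1 cell
    print(matmul_read(memory, 7))   # touches all 1000 cells
```

Both functions return the same value; the difference is only in how much work is done to get it. Stacking several such emulation layers on top of each other is the kind of thing the three-step list above is describing.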
MCTS is just one example. Again, you can open up your algorithms textbook and find thousands of ways to process information. What fraction of these can be implemented reasonably well in the form of DNN-type matrix multiplications / ReLUs / etc.? I expect less than 100%. If any such type of information processing is essential for AGI, then we should expect that we won’t get AGI in a pure DNN.
(We could still get it in a DNN-plus-other-stuff, e.g. DNN-plus-MCTS, DNN-plus-random-access-memory, etc.)...