The following is a rough transcript which has not been revised by The Jim Rutt Show or David Krakauer. Please check with us before using any quotations from this transcript. Thank you.
Jim: Today’s guest is David Krakauer. David is the president of the Santa Fe Institute and SFI’s William H. Miller Professor of Complex Systems. He’s got all kinds of other fancy educational credentials, but trust me on this, he’s one of the smartest motherfuckers around.
You can read more details up on the episode page at jimruttshow.com. David’s a returning guest. I went back and looked at the records and he appeared way back in time in episode 10, when I had no idea what the hell I was doing, and we talked broadly about complexity science. Actually a pretty good episode considering the fact that I was a complete amateur at the time. Welcome back, David.
David: I remember the episode that was a lot of fun. Let’s see if we can improve on it.
Jim: Yeah, we went every which way. This time, I think we may be a little bit more focused, at least to start, but when you and I get talking, we do tend to wander off into the wild blue yonder. That’s entirely okay. We ain’t building a piano here. All we’re doing is doing a podcast.
Originally, I think, we were going to talk about, and it’s probably where we’re starting, something I reached out to you on: the distinction between theory-driven science and data-driven science. And in response, you sent me back some comments and a paper that you’re working on, which is called The Structure of Complexity in Machine Learning Science, not yet published, but quite interesting. And I’ll allude to some of the things in there.
But before we get into more detail from the paper and from other sources, maybe just start with this idea of the distinction between theory-driven and data-driven science, and how you pushed back. I said, “Oh, new kind of science, data-driven science.” You said, “Oh, you’re full of shit.” And then I thought about it, really, and I said, “Duh, you’re right.” Data-driven science is where science always starts. Every science starts out as button collecting, right? Anyway, take it away. Data-driven science versus theory-driven.
David: I’ll respond. Right. You sent that email and I think there’s something to what you said, that we’re approaching something like a bifurcation. I think that’s your language.
A bifurcation in science. And I don’t think that bifurcation, incidentally, is about theory and data. I think it’s about something else, and we might get onto that. But the way I’d express this distinction, which I agree with in some measure, is the fine-grained paradigms of prediction, large models that have practical value, versus the coarse-grained paradigms of understanding. And the history of science, physical science, was this very lucky conjunction of the two.
In other words, the fundamental theory turned out to be very useful. And it could be that the divergence is, we now have two styles of science. We’ve always had it, right? If you think about philosophy and metaphysics and ethics, that doesn’t rely on neuroscience and I don’t think it should. But we might be moving to a point where, for any topic, the same topic, you have to go to two classes: the class which allows you to understand it and the class that helps you use it. And I think that’s an interesting moment in time, and to that extent, I agree with you.
Jim: Yeah, that’s exactly right. And I think one of the things that highlights this for me is, by the chance of my trajectory through life, I got involved, in one case as a witness and in one case as a participant, in two of these domains where theory failed but data science won. The first is AlphaFold and its ability to do protein folding. I was actually a consultant to Cray, the supercomputer company, back when I was at SFI, when I was living in Santa Fe. I did some consulting with them on how to use FPGAs to accelerate some of the calculations that were being done in computational chemistry. We weren’t working on the protein folding problem, but that was everybody’s holy grail. It was the Fermat’s last theorem of computational chemistry, right? That, “Oh, maybe someday we can solve the protein folding problem,” but we realized it was grossly far out, at least until we got quantum computing; the first useful application of quantum computing is probably going to be computational chemistry. But here along comes this relatively small group of people, with a load of computation, and comes up with AlphaFold, which blows away, by gross orders of magnitude, any previous ability to solve the problem, but provides zero theoretical insight. God damn, that’s interesting.
And then, the other one, fortunately, I never actually got involved in hands-on. I’d been involved with a project that was a sub-project of trying to understand language via traditional computational linguistics methods, building parsers with rules upon rules upon rules. And the parser they eventually built was one of the best, but it still sucked, right? Because it could not get at the fundamentally, only marginally, lawful nature of human language, particularly human language in the wild.
And then, yet again, very simple-minded technologies, these transformer technologies, plus brute force data and brute force computation, have produced unbelievably powerful language models. That’s one of the reasons I’m so interested in this. I’ve seen close at hand really smart, well-funded people who were only making a little bit of progress, and these brute force methods have just blown them away, but to your point, provide, at least initially, little insight into the mechanisms. But I also do point out that one of the nice things about these is, they will throw out so many examples that we can then start to apply induction and abduction to the results, maybe. And I think that’s interesting.
David: It’s worth actually just thinking a little bit about origins, where these things are coming from: statistics, the development of neural networks, and a surprising fact, right? Neural networks had absolutely nothing to do with induction. They were deductive frameworks in the 1940s, as I’ll come to. The origin of induction as we know it is Hume in the 18th century.
And Hume didn’t believe for a moment that humans were rational or that the world we had created was rational, by which he meant we couldn’t come to some understanding of it through deduction, and hence his focus on associations and all that stuff. Interestingly, statistics developed as a mathematical technique for reducing error in measurements in celestial mechanics. So in other words, this inductive approach to understanding reality comes out of the most deductive of the empirical sciences, namely classical physics, and gets developed in the 1920s by people like Fisher, who comes up with very deep concepts like the concept of sufficiency, which I very much like, which is that there is a minimal function you require that is the best estimator of some parameter, et cetera.
That all starts happening in the ’20s, and by the ’30s, you’ve got Neyman and Pearson developing hypothesis testing. But the point being that these inductive mathematical frameworks were about parameter estimation, which, by the way, is what modern neural nets are about, in a different sense.
Then, you get this really fantastic thing that happens in the ’40s. And I want to dwell on this, Jim, because people think of neural networks as these deep neural networks that are trained by large data sets, which has absolutely nothing to do with their history. This weird conjunction of two absolutely eccentric figures happens in the ’40s: Warren McCulloch, who was a neurophysiologist working at Yale, and who was really interested in the epistemological problems associated with neurophysiology. In other words, could you ground an epistemological theory in neurons? That’s what he wanted to do.
In the meantime, this young weird genius, Walter Pitts, from Detroit, was running away from bullies and found himself in a library. And while he was in the library hiding from thugs, he started reading Whitehead and Russell’s Principia Mathematica, essentially their attempt to ground mathematics in logic. And he reads it for three days and he finds errors in it. He writes a letter to Bertrand Russell when he is 12. Bertrand Russell writes back and says, “Dear Professor Pitts-
Jim: I’ve never heard this story.
David: “Thank you for your errata list. I would love you to come and work with me in Cambridge.” At this point, of course, he’s 12. Eventually, he finds himself in Chicago in the group of Rashevsky, who’s a very important figure in the history of mathematical biology, and meets up with McCulloch, and they basically conjoin their interests: McCulloch, interested in epistemology and neuroscience; Pitts, interested in logic, reading Turing and Boole and all that good stuff.
And they write this paper in 1943. And if you read that paper, everyone thinks, “Oh, it’s going to be all neuro in that paper.” No, it’s this weird conjunction between Boole’s Laws of Thought, which he wrote in 1854, which is unreadable, which is the origin of Boolean logic, and the Principia, which is unreadable, trying to understand how a brain might reason propositionally. I want to make this point that the history of neural nets, which led to the world you’re describing, was actually about deduction. They were about logic, much closer to what we now think of as symbolic AI. So there was statistics and parameter estimation on one side, and the origin of neural nets, which had nothing to do with that, on the other.
And then, what happens, very late, I guess in the ’70s and ’80s, and we could talk about that history, is that these things start to fuse again. So neural nets, deduction, induction, and then, eventually, in the ’90s, it hits big data and GPUs and leads to this very complicated zoo of technical phenomena, which I guess we’ll talk about.
Jim: Yeah. It’s very interesting. And in fact, I remember, when I showed up, in 2002, at the Santa Fe Institute. I, at the time, was working with evolutionary neural nets, and you and I had a conversation and you were saying, “Oh, yeah, this ain’t going to work. Those neural nets are so biologically unrealistic, they ain’t going to do shit. You have to make them much, much more complicated to be useful. Great detail down there in the very nuanced aspects of the syntax of the synapses.” And I was saying, “Well, yeah, maybe, but they do solve some problems.” And we did find that even less biologically realistic neurons than I was using work; at least I had tanh transfer functions and things like that. Now, they’re using ridiculous stuff like ReLU, right? Because it’s easily computable on a GPU. That’s got nothing to do with biological realism, yet somehow, when you stack them up, they can do magical stuff.
David: Yeah, but it’s [inaudible 00:11:24] on that. Because again, what was the objection? What led to that AI winter? And it really does come out of this fact that they were purporting to be deductive models. And when Papert and Minsky wrote that famous book on Perceptrons in ’69, they were critical of their mathematical capabilities, the famous linear inseparability of the XOR function, that kind of stuff. And they said, “This is never going to be overcome because the only way you overcome it is with deep neural nets, and there are just too many parameters to train,” right?
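A minimal sketch of that XOR point, in Python with scikit-learn (my illustration, not something discussed on the show): a single linear unit cannot separate XOR, while one small hidden layer can.

```python
# Minimal sketch of the XOR point: a single linear unit cannot separate
# XOR (the best a linear boundary can do is 3 of 4 points), while one
# small hidden layer can.
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR truth table

linear = Perceptron(max_iter=1000).fit(X, y)
print("single-layer accuracy:", linear.score(X, y))   # at most 0.75

mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", max_iter=5000,
                    random_state=0).fit(X, y)
print("two-layer accuracy:", mlp.score(X, y))          # typically 1.0
```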
Jim: Yeah. It could be 100. How are you going to train 100 parameters?
David: It’s off the table. And overcoming that took a long time; basically, it’s the mid-’80s before it happens.
Jim: Even in 2002, when I was fooling with it, three and four levels was as much as anybody was doing at that point. It wasn’t until the GPU breakthrough just a few years later that people started to be able to move up to eight and nine and much, much, much bigger models. And now, we have something like GPT-4. While the information isn’t public, my spies tell me it’s on the order of 1.3 trillion parameters, which is like, what? Which gets us into this space that is way beyond human understanding. You posited in your paper a term you called superhuman models. Talk about that a little bit. That was actually an interesting concept.
David: Just a remark, because there are domains which have regularities, but which are incompressible, right? So human behavior, this is why people like Freud and Jung exist, because they’re capturing something, but that thing can’t be expressed as F = ma. The important point for me is, in the complex domain, it’s not that there aren’t regularities, it’s that they’re fundamentally high dimensional regularities. And the conventional approaches to mathematization rely on comprehension; like you said, we want to be able to write them on the back of an envelope. And you don’t get to write a trillion parameters on the back of an envelope, unless you write it in atoms. That’s the first point.
What we’ve discovered is something very strange, which is, by the way, completely at odds with fundamental statistical theory. And we could go there, though we might not want to, people would get bored. But there was this whole development, by the way, which you didn’t mention, of support vector machines in the ’90s, which was the mathematical version of what we’re talking about, and the so-called VC dimension, all these really interesting ideas. Okay.
What we’ve discovered is that if you go across the statistical uncanny valley, and that is the valley where you add parameters and you do badly out of sample, so you don’t generalize, if you keep adding parameters, at a certain point you do well again. And I call those superhuman models, which have actually, in some sense, solved the problem of induction that Hume first pointed out. Because induction always suffers from this problem that it’s contingent, right? Yes, that’s true, but there will always be a circumstance where it’s not.
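A rough numerical sketch of that uncanny valley, what the machine learning literature calls double descent (my toy construction, not anything from the paper): minimum-norm regression on random ReLU features typically does worst when the parameter count is near the sample size and improves again as parameters keep growing.

```python
# Toy double descent: test error of a minimum-norm least-squares fit on
# random ReLU features usually peaks near the interpolation threshold
# (features ~ samples) and falls again beyond it. Exact numbers vary
# with the random seed.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 10

def target(X):                       # the unknown "truth" to be learned
    return np.sin(X @ np.ones(d))

X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = target(X_tr) + 0.1 * rng.normal(size=n_train)
y_te = target(X_te)

for n_feat in [20, 50, 90, 100, 110, 200, 1000, 5000]:
    W = rng.normal(size=(d, n_feat)) / np.sqrt(d)   # random projection
    F_tr = np.maximum(X_tr @ W, 0)                  # ReLU random features
    F_te = np.maximum(X_te @ W, 0)
    beta = np.linalg.pinv(F_tr) @ y_tr              # minimum-norm fit
    err = np.mean((F_te @ beta - y_te) ** 2)
    print(f"{n_feat:5d} features  test MSE {err:.3f}")
```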
Jim: Yeah, you can never say that the black swan won’t appear.
David: Exactly. And I think that … So unless you have a first principles deductive framework, you can’t ever be certain. And what’s weird about complex phenomenology is, because there are regularities in the high dimensions, that’s why these models can work, so they’re actually telling us something about the nature of complexity itself.
Jim: And just a detail, it turns out that, particularly when using backprop, which was another one of these bits of technology that started to be developed in the late ’80s, it only works on differentiable functions. And that turned out to be a big bottleneck on small models, but on very, very, very large models, it’s essentially a non-problem. And the reason being, if you have 100,000 dimensions, there’s always one that’s pointing down. All it takes is just one. So all you have to do is have a gradient in one dimension and you can navigate. There’s this miracle of ultra high dimensionality, which allows gradient descent to work on a lot more things than we used to think it would be able to work on. That may be in some way related to this issue of the uncanny valley, because again, talking about my own personal history, in addition to neural nets, I was also working with social, political, economic agent-based models before I came out to SFI.
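A toy numerical check of that claim (my construction, not from the conversation): model a random critical point by a random symmetric Hessian and count how often it has no downhill direction at all. That fraction collapses as the dimension grows, so in high dimension almost every critical point is a saddle with an escape route.

```python
# How often does a random symmetric "Hessian" have all-positive
# eigenvalues (i.e. no downhill direction)? The fraction falls off
# extremely fast with dimension, roughly exponentially in dim^2.
import numpy as np

rng = np.random.default_rng(1)

def frac_true_minima(dim, trials=2000):
    hits = 0
    for _ in range(trials):
        A = rng.normal(size=(dim, dim))
        H = (A + A.T) / 2                      # random symmetric matrix
        if np.all(np.linalg.eigvalsh(H) > 0):  # no escape direction at all
            hits += 1
    return hits / trials

for dim in [1, 2, 4, 8, 16]:
    print(dim, frac_true_minima(dim))
# By dim = 8 or 16 the empirical fraction is essentially zero in this sample.
```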
And like a typical amateur idiot, I’d have 47-parameter models, and I’d be talking to the SFI people, and they’d laugh and say, “Jim, 47 parameters, that’s garbage in, garbage out. Three is about the most parameters that are reasonable, two is better and one is best.” And that was essentially the SFI view of agent-based modeling in those days. And again, these things are all related, this idea of trying to get to highly concise ways of looking at the world, while the big data, reinforcement learning, deep learning world went past that to some other side, where there’s something which we don’t very well understand how it’s doing what it does. And I do talk to a number of fairly leading lights in these fields, and the honest ones admit they don’t really understand how these things do what they do. They’ve stumbled into a space that may be conceptually similar to the invention of mathematics.
David: Yeah. But I do think that there are pros and cons in all of this. By the way, there’s a lot to talk about here. But one thing just to point out, it’s another field that I know you are interested in, which is adaptive computation. And if you remember, the big critique, both of Darwin and of adaptive computation, was the local minimum problem. The landscape with all these minima is highly frustrated. How do you ever find your way to the global one?
And as you said, it turns out that problem is ameliorated in high dimension; it’s not made worse, it’s better. Which, incidentally, is a recapitulation of early debates in the 1930s and 1940s between Fisher and Sewall Wright about the role of dimensionality in evolutionary optimization. I think this is one of the reasons, by the way, that genetic algorithms are going to come back, because one of their weaknesses was this. And we’ll get there when we talk about symbolic regression, which is how you get science out of neural nets. But I think that the tools that we sort of forfeited might come back for similar reasons.
Jim: They are starting to come back. You made a very interesting point for a different purpose. I wouldn’t call it quite an identity between Darwinian evolution and gradient descent deep learning, but they’re fairly closely related. And it turns out that anything you can solve via deep learning, gradient descent, you can also solve via evolutionary computing on a neural net. And in some domains they’re about equally fast, interestingly.
And it used to be, the advantage was that the evolutionary approach would work on non-differentiable problems, and that’s what we used it for. But now, since there really are very few non-differentiable problems in high dimensionality, the question is, can you use it again as a general tool? And here’s why it may be making a comeback. The gradient descent methods require unified memory, which is why these damn machines are so expensive, these H100 NVIDIA machines we read about, why they’re on these really tight clusters with really high-speed gigabit fiber between them, because you have to have very high-speed communication amongst the memory elements.
With evolutionary approaches, it’s the embarrassingly parallel approach to problems. So you could have the equivalent of SETI@home working on building large language models if you use an evolutionary approach. I do believe these things may be coming back. They’re not quite isomorphic, but they’re close to it in terms of the class of problems they can solve.
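A minimal sketch of that embarrassingly parallel structure (an OpenAI-style evolution strategy on a toy problem; all numbers and the toy "network" are made up for illustration): each perturbed copy of the parameters could be scored on a separate machine, and only scalar fitness values need to be gathered back.

```python
# Toy evolution strategy: no gradients and no shared memory; each fitness
# evaluation below could run on a different machine in parallel.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w

def fitness(w):
    return -np.mean((X @ w - y) ** 2)   # higher is better

w = np.zeros(5)
sigma, lr, pop = 0.1, 0.05, 100
for step in range(300):
    eps = rng.normal(size=(pop, 5))                          # population of perturbations
    scores = np.array([fitness(w + sigma * e) for e in eps])  # embarrassingly parallel part
    z = (scores - scores.mean()) / (scores.std() + 1e-8)
    w = w + lr / (pop * sigma) * eps.T @ z                   # ES parameter update
    if step % 100 == 0:
        print(f"step {step:3d}  mse {-fitness(w):.3f}")
print("final mse:", round(-fitness(w), 3))  # should fall well below the starting value (about 14 here)
```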
David: Well, yeah, I find that really interesting. But also, I suspect that statistical learning theory and support vector machines might come back too, because their big advantage was always that they have their own vocabulary. They take data, they do the so-called shattering operations where they put hyperplanes between points to separate them. But the problem with SVMs was that the scaling in the input was quadratic.
So under the computational power of the time, it was a little bit like your situation of, “My God, you’ve got four parameters? Forget it, man.” I think that things have changed in the way that we can parallelize processes and so forth. So I think a lot of techniques which were a little bit closer to science, before the bifurcation, as you describe it, might come back and actually ameliorate the bifurcation. We might actually achieve a kind of consilience, paradoxically, by virtue of having more computing power, where techniques that were analytical died because you couldn’t run them on a VIC-20, but now, maybe we could. That’s an interesting issue.
Jim: It is interesting. Maybe you could apply that insight to language. Again, as we know, there have been thousands upon thousands of PhD computational linguists beavering away at trying to understand language in formal, closed-form ways. And all that beavering must amount to a million man-years or something, and they made no progress. Or I shouldn’t say no progress, they made progress. But a small team of people with lots of computers and lots of data just crushed the problem. The leap in capacity is just unbelievably big. How would you imagine possibly having that consilience between traditional linguistics and an actual understanding of language?
David: Yeah. Let me make an empirical reservation first, which is that, if you think about something like language, the current estimate of the memory footprint of the training set for a human is 1.5 megabytes. Okay? The footprint of memory for these large language models is hundreds of gigabytes, and a megabyte is 10 to the minus three gigabytes. We’re talking about a scale difference of five orders of magnitude, hundreds of thousands of times more presentations required to train large language models. And that’s a very important point. And Chomsky, if you remember, made that very astute observation that what suggests that we have strong priors or innate aptitudes in language is the so-called poverty of the stimulus. That was his phrase, right?
Jim: Yeah. I had a very good conversation with John Holland out at SFI one time about that, and his view was that Chomsky was wrong, and what he didn’t include in his calculations were the constant instruction that was going on between parents and peers and kids and language, so that what they were getting was not just raw data, they were also getting constant instruction on generalities. Weak ones, but …
David: Right. That’s a very interesting literature. Pinker talks about this a lot. It’s not actually that they were getting corrected on their grammar, but it could be that when I said something, you grimaced, right? So there is a reward signal.
Jim: And it’s around the learning of the rule, or the loose learning of the rule. You used approximately the right tense, and plurals and singulars, and you’re getting constant feedback when you violate those things, and that is gigantic. And in fact, this points to one of the giant limitations of the deep learning paradigm, which is exactly this point, that it typically takes five or six orders of magnitude more X to get the same result. How does a kid progress from an 18-month-old to a three-year-old? The amount of stimulus they get is nowhere near what is used to train a chess-playing program or something, and yet, the kid learns all kinds of stuff. Humans and animals in general have some algorithms very, very different from deep learning that allow them to lever up from much, much smaller numbers of examples.
And in fact, I actually had this insight when I, again, got hired by a game company to improve their AIs. And when I was talking to them, I said, “I think I’ve played a lot of these military war games in my life,” and I did a back of the envelope calculation. I said, “I think I probably played 3,000 sessions of these games.” And then, I realized, 3,000 is nothing for a deep learning system, and yet, I have this deep body of structured ideas such that I can play any military game and often beat them, because they fit some patterns that I’ve deduced from a mere 2,500 or 3,000 games. So what I’m doing when I’m learning the zen of playing a new war game and adding it to my repertoire of techniques is vastly more efficient than a deep learning approach. And I was using something like deep learning, I was using an evolutionary variant on it, but you had to play these games millions of times to get any signal out of the evolutionary search, while good old human … a few thousand, and you have a pretty rich repertoire. And to make that even more crazy, my 3,000 games were across probably 200 games.
My millions were on a single game. And for it to gain any significant lift at all, you had to have it play millions of times. And again, that’s qualitatively utterly distinct. So it tells us there’s a whole lot we don’t yet know. And this is where I do tend to argue with the deep learning uber alles guys, that all we need is deep learning, all we need is attention, all we need is whatever it used to be, support vector machines. No, these things are miraculous when the datasets are large enough. But most real world problems don’t have those [inaudible 00:25:09].
David: That’s a very … This is something that we’ve obsessed over at SFI since its founding, which is very contra to the empiricism of deep learning, which is the simple to the complex, right? If you think about the early work of John Holland and then Eigen, certainly through their books, like Laws of the Game and so on, they were interested in how very simple rule systems could produce extraordinarily complicated patterns.
There’s actually a very nice paper by Springer and Kenyon on using convolutional networks to encode the Game of Life. So we happen to know that there are simple deterministic rules that generate very complicated patterns. And you can ask, “Okay, let’s give the patterns to a convolutional neural network,” where we know that there are simple rules that are analogous, I think, to the rules of grammar that we might have retrospectively inferred from language use. And they’re terrible at it. They are terrible at it.
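For concreteness (my sketch, not code from the Springer and Kenyon paper): the entire generative rule of the Game of Life can be written as a single 3x3 convolution plus a threshold, the kind of minimal program a network trained only on the patterns would ideally have to recover.

```python
# Conway's Game of Life as one tiny convolution: count the 8 neighbours,
# then apply the birth/survival rule. This is the whole generative process.
import numpy as np
from scipy.signal import convolve2d

KERNEL = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])        # neighbour-counting kernel

def step(board):
    n = convolve2d(board, KERNEL, mode="same", boundary="wrap")
    # born with exactly 3 neighbours, survive with 2 or 3: the entire rule
    return ((n == 3) | ((board == 1) & (n == 2))).astype(int)

rng = np.random.default_rng(0)
board = rng.integers(0, 2, size=(32, 32))
for _ in range(10):
    board = step(board)
print(board.sum(), "live cells after 10 steps")
```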
Jim: Here’s the test case. Give it Mandelbrot set-
David: Well, I think it’s a similar concept.
Jim: … see what it comes up with. Yeah, it’s an extremely compact representation. Just play it a zillion of those videos of the Mandelbrot set and say, “All right, what process created that?” I suspect they wouldn’t even come close.
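Jim's thought experiment in miniature (my sketch): the entire process behind those videos is the iteration z to z squared plus c, a few lines of code, which is what a parsimonious theory of the data would have to recover.

```python
# The Mandelbrot set's generative rule is just: iterate z <- z**2 + c and
# ask whether z escapes. Everything below the loop is bookkeeping.
import numpy as np

def mandelbrot(width=80, height=40, max_iter=50):
    xs = np.linspace(-2.0, 0.6, width)
    ys = np.linspace(-1.2, 1.2, height)
    c = xs[None, :] + 1j * ys[:, None]
    z = np.zeros_like(c)
    escaped = np.zeros(c.shape, dtype=bool)
    for _ in range(max_iter):
        z = np.where(escaped, z, z * z + c)   # the entire generative rule
        escaped |= np.abs(z) > 2
    return ~escaped

for row in mandelbrot():
    print("".join("#" if cell else " " for cell in row))
```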
David: This is important, right? Because what they conclude in that study relates to this idea, I’m not sure you’re familiar with it, called the lottery ticket hypothesis for machine learning. And it says, for patterns that we would encode minimally, they encode maximally. And the question that they ask, though, is within that maximal encoding, the so-called lottery ticket, where you just buy tons of tickets hoping that you’ll win, is there one winning ticket? And I think the question that you’ve been asking, and I’m asking now, is can we use deep neural networks as pre-processors for parsimonious science? Another way to put the question is, within the lottery ticket model, is there one winning ticket, and can we find it somewhere buried there in those models?
Jim: Yeah, that’s an interesting question. And the way the models work, I would guess the answer is no, right? These things are extraordinarily interwoven and ridiculously high dimensional and unstructured, and whatever they’ve developed in terms of basins of attraction around semantics, let’s say in the language case, has been, representationally, so far impossible to extract.
David: Well, no, but let me give you an example, actually.
Jim: Okay, I would love to have an example.
David: There are good examples of where it has been done. You have to do some tricks. There’s a cosmologist, Miles Cranmer, and he’s asking the following question: if I give you loads of cosmological astronomical data, can you infer Newton’s laws of motion or something more? If I give you data, for example, pertaining to dark matter and dark energy, can you give me a function that will be a Newton’s law for dark energy? Which we don’t have. And the way it works, essentially, is that you take a special kind of neural net. These are called graph neural nets. We’d be familiar with them from our world, because it’s basically a social network that you impose a neural net on top of. And so the net encodes things like particle interactions. So there’s an explicit encoding of the process in the topology of the neural net. So there’s a bit of human intervention going on here.
You then sparsify it, you remove edges, you quantize it, and then what you do is … That now is your data, right? That’s now the lossy encoding of the astronomical dataset. Now what you do is you run a genetic algorithm on that to do symbolic regression, and that means produce formulae, algebraic formulae, that encode the regularities in the quantized graph neural network, and out pop equations of motion. And through this process, Cranmer and colleagues discover a new parsimonious encoding for the behavior of dark energy. And that to me is really tantalizing, right?
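A toy stand-in for that last step (far simpler than Cranmer's actual pipeline, which runs genetic-programming symbolic regression, for instance his PySR library, on the sparsified graph network): given input-output data from a surrogate, search a small space of candidate formulas and keep the most parsimonious one that still fits.

```python
# Toy "symbolic regression": pick, from a small hand-written set of
# candidate formulas, the one that best fits the data after a parsimony
# penalty. The hidden law here is an inverse-square relation.
import numpy as np

rng = np.random.default_rng(0)
r = rng.uniform(0.5, 5.0, size=200)
force = 3.0 / r**2 + 0.01 * rng.normal(size=200)   # hidden "law": F = 3 / r^2

candidates = {              # name: (complexity, basis function of r)
    "a*r":       (1, r),
    "a*r**2":    (2, r**2),
    "a/r":       (2, 1 / r),
    "a/r**2":    (3, 1 / r**2),
    "a*log(r)":  (3, np.log(r)),
    "a*exp(-r)": (3, np.exp(-r)),
}

best = None
for name, (complexity, basis) in candidates.items():
    a = np.sum(basis * force) / np.sum(basis * basis)   # least-squares fit of a
    mse = np.mean((force - a * basis) ** 2)
    score = mse + 1e-3 * complexity                      # parsimony penalty
    if best is None or score < best[0]:
        best = (score, name, a)
print("selected formula:", best[1], "with a ~", round(best[2], 2))
# Should recover a/r**2 with a near 3 in this toy setting.
```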
His name’s Miles-
Jim: What’s this guy’s name?
David: … Cranmer.
Jim: Cranmer?
David: He did work for many years with David Spergel at Princeton, and Spergel is now the director of the Simons Foundation. So he’s involved in that work too.
Jim: Interesting. It’s been a while since I’ve looked at symbolic regression. I have a friend who’s quite an expert on it, but it’s been probably 10 years since I’ve looked at it. It’s probably a good time to go revisit the field.
David: Well, but I think actually, tracking this literature, both in terms of geological estimation of the components of rocks found on Mars, or threading algorithms for inferring the 3D structure of proteins, they’re all playing the same game now, which is you take a neural net, you train it, and it’s the surrogate for reality. You sparsify it and you do symbolic regression on it. And what’s interesting about that to me is those two worlds we started with, right? You keep the neural net that does the prediction for you, but then you do this other thing that does the understanding for you.
Jim: Yeah, that’s really interesting. That may be a whole new way of doing [inaudible 00:30:40].
David: I consider that the new way. Now you could argue, Jim, that that’s sort of what Pearson and Fisher were alluding to early on in factor analysis and statistics. When you do things like principal component analysis, you say, “Okay, all this data,” but somewhere hiding in that is what they would call an orthonormal basis, some sort of lower dimensional manifold. And machine learning is obsessed with manifolds. And what you do is you write down your mechanistic causal theory using the effective dimensions, the dimensions of the manifold, not of the raw data. So we’ve been doing that, I think, for decades, but not for complex phenomena of the … as you say, with language or whatever, social institutions.
Jim: Folding of proteins, right? These problems that were just totally intractable to try to do that way, because we couldn’t process anywhere near enough data. You couldn’t build anywhere near rich enough a model that [inaudible 00:31:44]-
David: I do want to push back, though, on your point about things like AlphaFold. Because, as you know, one of our faculty at SFI won the Nobel Prize in 2009 for the structure and function of the ribosome, and that’s Venki Ramakrishnan. And Venki consults to those companies. So this is a huge area of expertise. So what goes into those models? Well, symmetries, conservation laws, right? Oh, the right data is the amino acid distance matrix. So there’s a huge amount of human expertise that’s in some sense establishing constraints and priors on how the model is built. And that sometimes gets a little bit swept under the rug when you say, “This is pure induction.”
Jim: Well, yes, of course. The no free lunch theorem would tell us that, right? Wolpert and friends. And it’s always best to incorporate as much knowledge of a domain as you can into your representation. And so that doesn’t surprise me in the slightest, actually. I would certainly expect them to do that. You have this gigantic high dimensional space. Anything you can do to reduce the dimensionality of the problem by binding it with reality has got to help. And again, this will become an art form. Where is it useful to bind with reality in the building of these models versus where is it better to do open-ended search?
David: And the other side of this, which I find deeply fascinating, is that in these shareholder meetings with these companies, which are purporting to be research meetings, they’re hiding a very intriguing fact, which is worth deliberation. So take GPT-4: trillions of parameters, hundreds of thousands of kilowatt hours, millions of dollars. So energetically, and we’ll get back to that perhaps in a minute, just a disaster. But they can’t do arithmetic. They can’t add two numbers.
Jim: Well, if it’s only two digits, they do okay. You get to three, they start to run out of gas.
David: [inaudible 00:33:48] what a four-year-old could do, then it becomes problematic. So here’s an interesting fact. The Hewlett-Packard 35 calculator developed in the 1970s, the so-called slide rule killer …
Jim: I was there at MIT when that thing came out and I couldn’t afford one. Unfortunately, MIT had a rule against using them for tests, because at the time they were like $400 and most people couldn’t afford one. So everybody had to learn how to use a slide rule.
David: So these, by current standards, as you say, were thousands of dollars. So the point being this: the HP-35, which does arithmetic pretty well, calculates logarithms, solutions to transcendental functions, exponentials, had 1K of ROM. Okay? A model like GPT-4 is worse at doing elementary arithmetic than a calculator over 50 years old that had 1K of memory. There’s something very deep in that that gets lost when people say that these models are sentient.
Jim: Yeah, no, they’re certainly not sentient. And actually this is a good setup for an approach I believe is extraordinarily fruitful, and this is a perfect example of it. And that is what my friend Ben Goertzel and his colleagues call cognitive synergy. That for instance, there are many different cognitive skills that something like humans use, and we can offload some of them to pencil and paper, but then we bring them back into an orchestration. And that group, the OpenCog group, and now SingularityNET, are looking at bringing together deep learning, genetic algorithms, symbolic AI representations, math machines, provers, solvers, and having them all interrelate with each other with this hypergraph-based probabilistic logic language. And whether they’ll succeed or not, I don’t know, but it does allow them to address problems exactly like that. There is no reason that you can’t have the perceptual power of deep learning in close proximity to deep mathematical skills and mechanical simulation and genetic algorithms and art models and everything else, and let them all interact through a common ground.
David: Yeah, but then we have to be very careful because … I mean, I don’t disagree. I think that’s clearly what’s going … If I told you and you were the CEO of DeepMind that I can give you a better calculator that’s 1K that will cost you 10 to the minus 9 cents, you’d have to be an idiot not to take my offer seriously. So it’s going to happen for economic and practical reasons. But there’s something else going on here, which is a philosophical issue on the nature of tools, right? Because see, we were raised with the ideas of people like Jerry Fodor on the modularity of mind, and Marvin Minsky on the society of mind. And if you remember, their whole point was, what makes us humans is that we run loads of programs in parallel, that we can do arithmetic, we can choose our favorite ice cream flavor, we can run and do hurdles and the Fosbury flop and all that.
And so if you are going to tell me now that these models constitute, you won’t, but some people will, a general intelligence, but actually they’re outsourcing capabilities to tools, then I think you have to revise your notion of what they’re really capable of. And I think that for me at least, one of the really interesting features of human intelligence is it can internalize modular functions like arithmetic. It can internalize the abacus. You don’t have to use the tool anymore. It’s better, more reliable if you do, but you don’t have to. And so I think that that’s a clue. So there’s the energetic clue, there’s the sparsity of the stimulus clue, and there’s the fact that we can internalize many functions and have a society of mind clue. And this thing is looking less and less and less like a truly intelligent system.
Jim: And here’s my favorite, the one I campaign on whenever I talk about AGI. I suspect that one of the express highways to AGI, if anybody can figure out how to do it, is heuristic induction. We are amazingly good at finding heuristics, often unconsciously. Take the famous heuristic for how an outfielder catches a fly ball in American baseball. It turns out that most players, even great major leaguers, don’t know the heuristic, but they practice it, which is to move such that you maintain a consistent angle between your eye and the ball. If you do that, you’ll end up catching the ball. So if it’s within your athletic ability, that simple-minded heuristic works 99% of the time. And those are just the kinds of things that, so far, deep learning is not particularly good at: explicit creation of heuristics, whether conscious or unconscious. And if someone could figure out the problem of heuristic induction, the rate of acceleration towards AGI [inaudible 00:38:48].
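A rough simulation of that heuristic with made-up numbers (my sketch): the version usually formalized, Chapman's optical acceleration cancellation, has the fielder run so the tangent of the gaze angle rises at a steady rate, which delivers him to the landing spot at landing time without ever computing a trajectory.

```python
# Toy fielder simulation: the fielder never predicts where the ball will
# land; he just keeps the tangent of his gaze angle rising on the schedule
# he measured at pickup, moving at most v_max each instant.
import numpy as np

g, dt = 9.8, 0.01
vx, vy = 15.0, 20.0            # ball velocity off the bat (m/s), made-up numbers
x_f, v_max = 75.0, 7.0         # fielder start position and top running speed
t_pickup = 0.3                 # fielder picks the ball up shortly after contact

def ball(t):
    return vx * t, vy * t - 0.5 * g * t * t

# measure the initial gaze angle and its rate while standing still
(xb0, yb0), (xb1, yb1) = ball(t_pickup), ball(t_pickup + dt)
tan0 = yb0 / (x_f - xb0)
rate = (yb1 / (x_f - xb1) - tan0) / dt

t = t_pickup
while True:
    t += dt
    xb, yb = ball(t)
    if yb <= 0:                               # ball has landed
        break
    target_tan = tan0 + rate * (t - t_pickup)
    target_x = xb + yb / target_tan           # position that keeps the gaze on schedule
    x_f += np.clip(target_x - x_f, -v_max * dt, v_max * dt)

print(f"ball lands at {xb:.1f} m, fielder at {x_f:.1f} m")
# With these numbers the miss distance comes out well under a metre.
```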
David: And the modular composition of them. Again, back to one of our favorite topics, which is constrained minimality. This is one of the things we learned from games, from chess and Go: think about timeless games with near infinite strategic capability. You can learn them in a minute. The rules of Go are pretty trivial, and chess only marginally more complicated, but less generative, interestingly. So our fascination, certainly at SFI, with things like that, generative systems based on rules which are parsimonious, I think needs to be exported to these systems. Because at the moment, what they are discovering is compressed rule systems. That is what they are. But they’re still massive. And our experience with games is that those massive models are not strategically open-ended. The difference between playing, as you said, I don’t know, a war game and chess.
Jim: The thing about war games is they’re ridiculously high-dimensional. Chess … When I did my report to this game company, one of the things I pointed out to them was that, “In your game, very swag estimate, per turn, there’s about 10 to the 60th moves,” as opposed to chess, where I think the average is like 22 or something like that. And so these many-piece, turn-based war games, the dimensionality of the problem space is just utterly in a different league than even Go or chess or anything else. And so you can’t even imagine using even things like the current class of deep learning. Well, you probably could use reinforcement learning; haven’t tried that yet. But the dimensionality of the problem space is way bigger than it is for these canned games. And I think it is important. And this is where, though, deep learning has been able to show tremendous traction on things like language learning, where the amount of data is just so huge that trying to find some tractable way to reduce it to a small number of rules has so far been unsuccessful.
I’m babbling here a little bit. Sorry about this. But this gets back to your concept that what complexity science and deep learning have in common is that they are both ways of getting handles on incompressible representations. So talk about that a little bit.
David: We don’t need deep learning to simulate a billiard table, but we might need it to simulate societies. And in the language of complexity science, we’d say that one domain is typified by simple rule systems and symmetries and simple initial conditions, and the other one by elaborate rule systems and broken symmetries and evolving initial conditions. So in that sense, the complex world that we study is quite different from the physical world. And as I said earlier, the way that we have proceeded at SFI is by being clever at taking averages. Think about Geoffrey West’s work on scaling, right? He doesn’t look at every blood vessel in the body and describe the hydrodynamics of every artery. He looks at bulk properties. So complexity science has dealt with complexity by taking averages. But sometimes you throw the baby out with the bathwater. And I think what neural nets will do for us, if we use them in a kind of mercenary way, is that they’ll allow us to take slightly less draconian averages, like I said.
So we take them, we keep sparsifying them, we mutilate their connections until we reach the absolute minimum that they can sustain as predictive engines, and then take those as coarse-grainings in our theories. So I think that’s the key use. And I think what society then has to accept, as I said to you at the very beginning, is you use the big engine to do prediction and you use the small sparsified model to do science. And the question then is, will society be tolerant of a pursuit that’s not fundamentally utilitarian, but about understanding the world we live in? And science has had this really interesting grace period where it got to be both.
Jim: Because it at least was thought to pay off.
David: It did on average.
Jim: And it did on average.
David: And now the question is, “Wait a minute, are you telling me that you are just no better than a metaphysician?”
Jim: Well, I know you love the history of science and technology. The great example was that the mechanics were way ahead of the scientists in the early days of the heat engine world. Before Carnot and those guys had any representations at all, Watt and his friends were making progress. And so the mechanics were well ahead of the scientists. And in the early days of electricity, that was true as well. I mean, Edison was pushing ahead without knowing anything about Faraday, or very little. But then by the ’20s, science started to lead technology in many ways. You were not going to get electronics, or certainly not microelectronics, and certainly not GPS, and certainly not superconductivity, via the tinkering approach. You needed science first and then technology.
And so I suppose what you’re alluding to here is that the world of big data and useful artifacts produced from data could potentially sideline science again, as being behind the technological instantiations. And, just putting words in your mouth, spit them back out if they don’t fit, society might say, “Well, maybe we don’t need you guys.” Right? Because explaining this shit ain’t paying anybody’s bills. Right?
David: I actually think that these are cyclical, to your point. I mean, yeah. So complexity science’s roots are entirely in the Industrial Revolution, the study of machines natural and engineered. So yes, statistical mechanics and thermodynamics came out of the study of engines that were developed by engine builders. And so that’s true. But look what happened. We realized that there were other kinds of engine and other principles. So if you think about electric cars that are driven by batteries, that’s not the same principle, but there are shared principles, right?
And I think that the same thing happens with telescopes, right? So you have lens grinders that lead to theories of optics, you get big … But what happens then is you get radio telescopes, which generalize the principles of understanding electromagnetism and so forth. And now gravitational lensing, which by the way is one of the uses of machine learning, to discover gravitational lenses. So there is this back and forth between design based on basic principles, and then the discovery of new engineered devices through human ingenuity and tinkering, which then gets formalized. And what’s so fascinating, exactly to your point, is if large language models are the steam engines of the 21st century, what is the statistical mechanics they will lead to?
Jim: Yeah, I love that. I love this. This is a wonderful framing. This conversation is worthwhile for that one idea alone, right? That in the same way the heat engines led to a tremendous amount of modern science, what new science will LLMs and related technologies lead to? Do you have any thoughts, speculations?
David: As I said, I think we’re already starting to see it, for example, at the level of the study of cognitive phenomena, where we are using scans which we interpret with large learning models, not language models necessarily, machine learning models, which we then sparsify or reduce. I think, for me, the question is not can we do low dimensional causal mechanistic modeling, but do we derive new principles for explaining adaptive reality? Are there new effective laws of that world? What if there were an entirely new, actually working economic theory that wasn’t neuroeconomic BS, but actually worked? Imagine. And I don’t mean just in terms of tiny quantitative improvements on market gambling. I mean actually a real theory of how the market works. And I think that’s the kind of thing that I expect we might actually come up with.
Jim: Now let’s switch a little bit, get a little bit more to principles. In your paper, you dive into the concept of constructs and you generalize them a little bit. So talk about what you mean when you say construct. How do you think that lens is useful for the discussion we’re having right now?
David: Yeah, I was talking to Sean Carroll actually on his show about some of these issues. And one of the things that distinguishes a complex system from a purely physical system is that complex systems encode reality. They encode history. If you open up a rock, there’s nothing in the rock about the hill that it lives on. If you open up a microbe or a worm or a human brain, inside them, you find simulacra, mirrors of reality.
David: Mirrors of reality. That’s right, you find the history in it.
Jim: 3.5 billion years of history, right?
David: So complex systems encode adaptive history. But they don’t encode it all; it’s not like you can open up a dinosaur skull and find the environment of the Triassic or something. And what we now know is that they find these parsimonious encodings: they discover things like frequency codes, phase codes. They have beautiful pinwheel representations of space. There are ways in which we encode time that are quite rational. And Murray Gell-Mann and John Holland got very interested in this. When you asked them what a complex system is, they’d never say to you, “Well, networks and all this and that.” They’d say, “Actually, a complex system is very simple. It’s the thing that, when you open it up, has a schema in it, and that schema encodes coarse-grained history.”
And in the 1950s a physicist realized that, actually, that’s what physical theory does. What is the theory of gravity? What is quantum mechanics? Actually, it’s a schema of reality, it’s a propositional schema. So there’s this really interesting connection between what a complex system is, and what a good theory is. So what a complex system theory is, it’s a theory about theorizers.
Jim: Oh, okay. So what you’re saying is that every element in a complex system is a theorizer of a sort?
David: Correct, is a theory of the world.
Jim: Yeah, so the bacteria that follows the glucose has a theory that following glucose is a useful thing to do.
David: And we know what the mechanism is in that particular case. So in a way, there’s this very deep correspondence between the entire scientific enterprise and all of the systems that we study. And again, it makes this really interesting connection to deep learning. Because my point is, by training a deep neural net, you are actually creating a theory of the phenomenon. You’re creating a rule system, you’re creating a complex system.
Jim: Yeah, you actually are creating an organism. In the same category as an organism, it’s something that responds to stimulus, has a response, and the response is not simply tractable. In the same way that even a bacterium is an extraordinarily complex piece of machinery compared to anything humans fool with. And the fact that a big language model is unbelievably complex in some sense doesn’t matter, because its inputs and outputs are fairly simple.
David: Exactly, and a key point about the schema, by the way, is that they have to be robust and evolvable, key features. And this is partly what Margenau pointed out: they have to be extensible in a certain way, they have to be composable in a certain way. And one of the questions about neural nets is, is that true? Are their internal representations composable? And there was some very beautiful work done early on by Paul Smolensky on what he called compositional semantics. Which is, if you in a rigorous way interrogate the interior of a trained neural net using these generalized matrices called tensors, you can actually demonstrate that they have a kind of internal compositional encoding, which is much closer to what Murray and John thought of as schema and Margenau thought of as constructs. It’s not just this massive diffuse encoding by many, many, many cells, which you can’t really interrogate, but actually a concentrated encoding through a schema, through a kind of rational basis.
Jim: Well, sort of-ish, but not exactly. It’s still extraordinarily dirty today, and everything is still fully connected between the levels, which is kind of nutty. I think the next generation [inaudible 00:52:00].
David: Well, no. That’s why people are using variational autoencoders, because what you’re doing with autoencoders is you’re bottlenecking the network in order to be able to then interrogate those layers which capture the regularities.
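A minimal sketch of that bottlenecking idea in PyTorch (a plain autoencoder rather than the variational kind David mentions; all sizes are arbitrary): squeeze the data through a very narrow layer, then read the regularities off that layer rather than off the raw weights.

```python
# Toy autoencoder with a 2-unit bottleneck: the data secretly live on a
# 2-dimensional manifold inside 100 dimensions, and the bottleneck layer
# is the place to interrogate for those effective dimensions.
import torch
import torch.nn as nn

torch.manual_seed(0)
latent = torch.randn(1000, 2)              # 2 hidden generative factors
mixing = torch.randn(2, 100)
data = latent @ mixing + 0.05 * torch.randn(1000, 100)

model = nn.Sequential(
    nn.Linear(100, 32), nn.Tanh(),
    nn.Linear(32, 2),                      # the bottleneck we will interrogate
    nn.Linear(2, 32), nn.Tanh(),
    nn.Linear(32, 100),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for epoch in range(300):
    recon = model(data)
    loss = ((recon - data) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    codes = model[:3](data)                # activations at the 2-unit bottleneck
print("reconstruction MSE:", float(loss))  # should fall well below the raw data variance (~2 here)
print("bottleneck code shape:", tuple(codes.shape))  # (1000, 2): the effective dimensions
```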
Jim: Yeah, that is true, that is interesting. And then if you go up a level, again, think of each one of these models as the equivalent of an individual organism. We then have the relationships among them, and then with us. And this is some of the work I’m currently doing. My current hot project started out just as a hobby project, but people seem to be demanding that I turn it into a business. It’s a tool for letting scriptwriters use large language models in the development of movie scripts. And sometimes it’ll hit the model a hundred times while it gradually skyhooks itself up and then interacts with the person at the right time in the process. And already I’ve realized that some models are good for X, and some models are good for Y, so I’m about to add a link to Anthropic’s LLM, which has some different capacities than OpenAI’s.
And OpenAI is good for this part of the problem, but Anthropic is good for this other part of the problem. So this program will actually be using both, using OpenAI probably 90% and Anthropic about 10%. And we’re a few months past the Wright brothers in 1903 when it comes to these models. The amount of work that’s going on in the wild, of people building models for different purposes, trained on specialty data sets, using different algorithms. Trading off data versus processing turns out to be very interesting; maybe there’s a fairly smooth trade-off between data set size and the amount of processing you do. This is going to lead to a bestiary, a [inaudible 00:53:44] explosion of types. And then the resulting society of minds that comes from them interacting with each other and with us, and us connecting them, is taking us into a whole new manifold of possibilities as well.
David: I’m interested in what you think about the following, because you know as well as I do, and there’s lots of data here on metascience that my colleague James Evans has done, for example, in terms of scientific discovery, that the way you do discovery is by not being a part of the herd. I mean, that’s very well attested, and in every domain. These models are pure herd, they’re the ultimate herd. As Alison Gopnik says, “They are libraries.” And what libraries are good at is being reference material for established knowledge; they are not discovery engines, they’re the background to your discovery. And I’m curious what you think about this, because people are talking about these in terms of creativity, but actually creativity is almost the opposite. Everyone I know who’s really made a breakthrough has done everything in their power to not read the literature, right?
Jim: Yeah. People ask me why I moved to the mountains of Virginia, and I said, “To escape the conventional wisdom.” I actually wrote an essay on that, on Medium, on my work with the movie writing program. And I think you have to make some distinctions. Because, “Oh, they’re not creative.” Compared to the vast herd of humans, they’re very creative: their ability to come up with clever plot ideas, to do extraordinarily interesting things with dialogue, to plot actions. And things that were never done before; there’s compositionality, particularly if you orchestrate them to accentuate the compositionality of their capabilities. And so within what most humans do, 99.9% of the time, they’re very creative. They could design a better house than the typical architect, if you could encode the problem correctly. I’m already getting feedback from actual people in Hollywood that even at the relatively crude level we’re at now, it’s as good as a paid Hollywood screenwriter for a first draft. We get to the first draft level in about 20 hours of human labor combined with the LLMs, as compared to a thousand hours of a human doing it on their own.
And it’s at least as good, maybe more creative. On the other hand, it seems to me that unless you very carefully… As you talked about with the astronomical data, the dark energy model. They’re not going to solve what dark energy is by themselves, because that’s not what they’re designed for. And they’re not libraries either; I think that’s a very bad metaphor. They’re actually terrible libraries: they lie all the time, they confabulate like crazy. They’re getting better, but they’re not reliable as sources of data; they’re better as sources of analysis. So if you say, “What did David Krakauer actually say about X?” They’re [inaudible 00:56:42]. Oh, right. But if you say, “All right, explain the theory of complexity science versus Newtonian mechanics.” They’re actually pretty good at that. So I don’t think of them as libraries, I think of them as kind of quick and dirty analysts and synthesizers, and they’re really good at that.
Again, if you just want to do something fun: the very first thing I did with ChatGPT, in fact, when it first came out, the first week, I decided, all right, I’m going to take an exercise from Mrs. Carr’s 12th grade English class. And I said, “Compare and contrast Moby Dick and Lord Jim.” Just a classic 12th grade English exercise. And it did a great job, it really did. It would’ve gotten an A minus probably from Mrs. Carr, and would probably have gotten a B at a mid-grade university for freshman English. And since then they’ve gotten… That was 3.5 when it was brand new; they’ve gotten way better. The other thing I do quite regularly when I’m using it as a generative tool for my own thinking is I’ll say, “Let’s describe Marxism, capitalism, and Game B.” This thing I’ve been working on for 10 years. Surprisingly, it knows a lot about Game B, which surprised the shit out of me. I guess just because there’s been a lot written about it and stuff.
And then I say, “Compare and contrast them. You choose the categories…” This is a very clever thing. “You choose the categories somewhere between 7 and 10, create a table, and then fill in all the cells.” And it does it. And it’s way better than… Well, it’s at least as good as if you’d gotten a postdoc in sociology to do it for you. And it does it in 90 seconds for less than 2 cents. And so for things like that, it’s amazing. And those strike me as creative work, so it depends what you mean by creative. Is it the next Einstein, can you get to what Einstein did, which is to visualize the world geometrically to see relativity, which is still one of the most fucking amazing things in human intellectual history? Nope, LLMs ain’t going to do that, because they don’t have any geometry, they’re just language correlation models. But if you had cognitive synergy that included the ability to manipulate language with the ability to do physical simulations to, “Oh, let’s throw in non-Euclidean models of space-time in our simulator.” Then have it talk back, it [inaudible 00:58:53].
David: It’s interesting, but this raises, again, this question that I’ve obsessed over, which is the importance of bandwidth limitation and constraints in scientific revolutions and creative breakthroughs. In other words, if you look at the history of science carefully, it’s a history of constraint, it’s not a history of excess power. I mean, I’ve written about this a lot, that if we could travel back in a time machine and give Tycho Brahe massive computing power and telescopes, he would never have hired Kepler. Because Kepler was his calculator, because Tycho Brahe wasn’t very good at math; he was good at observation. And similarly, Kepler then had Kepler’s laws, and Newton comes along and says, “That’s a bit complicated and phenomenological, let’s come up with the inverse square law.” And on and on it goes, right?
Jim: It’s funny, I actually have a whole page of notes on that for this conversation, Tycho Brahe, Kepler, and Newton, and how that’s an example of…
David: It’s an example of bottlenecking of the phenomenon into increasingly simple sets of causal relationships and constraints. And my point is that that’s the history of human invention, and if you did it the other way round and gave more computing power and more data to humans, we’d still be in the Stone Age. And I am curious about that.
Jim: Also, you’re saying if we had the ability to [inaudible 01:00:23] planetary motions, we would never have bothered to learn Newton’s law.
David: [inaudible 01:00:27]. The development of the calculus was a human’s way of dealing with complicated data sets. And it goes on, and on, and on. I mean, essentially the whole notion of a negative number, which seems very counterintuitive, an imaginary number, which is even more counterintuitive.
Jim: Had to do with mathematics, right? With inverse electrical flows and things of that sort.
David: So we should be very clear about this distinction that… I mean, let’s take another example that interests me, like music, for example. I mean, Western music is based on a diatonic, heptatonic, seven-interval notation system.
Jim: And it doesn’t quite mathematically work either, which is one of the more amazing things about it.
David: Right, but it’s a set of constraints that we’ve worked with now for a long time, and most of the music we listen to works within that set of absolutely weird, contrived constraints. If you move to five intervals, a pentatonic scale, then it sounds like Chinese music or Japanese music. So human thought and creativity is based on the huge limitations of memory, huge limitations of inference, and self-imposed constraints. And that’s why I mentioned earlier Go, or The Game of Life, and chess, and now music. And my question to the Pollyannas of machine learning is, do we think we’ll get true scientific revolutions, not predictive revolutions, which is what statistics is always doing at some level, by adding data and computational power as opposed to subtracting it? Maybe the next generation of machine learning really is, “Okay, let’s hobble them.”
Jim: You could incrementally degrade them. There may actually be a sharp phase transition at some point, where they stop working at all, and then you back up to where they work just barely well enough. That might be the regime in which it’s useful to understand what’s going on.
David: [inaudible 01:02:30], but I would argue let’s reduce them until they get smart. I mean, as you know, I always make this joke. If I said, “Hey, Jim, I’ve got this exam for you. You can take that exam in the library and use Google to search for answers. Or Jim, you have to take this in a dark cave on your own with a pencil.” Which one of those two exams would be revealing of your intellectual ability? The latter. And I think that maybe the next stage in the evolution of these models is, let’s do the equivalent of giving them a pencil and put them in a dark cave.
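One way to picture the hobbling experiment the two of them are gesturing at: train a small model, then zero out an increasing fraction of its weights and watch where performance collapses. A minimal sketch, entirely illustrative (plain numpy logistic regression on synthetic data, not anything discussed in the episode):

```python
# Toy "degrade it until it breaks" probe: magnitude-prune a trained model and
# look for the point where accuracy falls off sharply. Illustrative only.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification task with 200 features.
n, d = 2000, 200
true_w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ true_w + 0.5 * rng.normal(size=n) > 0).astype(float)
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Plain logistic regression by gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X_train @ w)))
    w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

def accuracy(weights):
    return np.mean((X_test @ weights > 0) == y_test)

# Prune the smallest-magnitude weights in increasing fractions.
order = np.argsort(np.abs(w))  # least to most important
for frac in [0.0, 0.5, 0.8, 0.9, 0.95, 0.98, 0.99]:
    pruned = w.copy()
    pruned[order[: int(frac * d)]] = 0.0
    print(f"pruned {frac:.0%} of weights -> test accuracy {accuracy(pruned):.2f}")
```

The printout typically shows accuracy holding up under fairly heavy pruning and then dropping steeply, which is the kind of "just barely good enough" boundary Jim describes.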
Jim: See, I’m going to push back on this one a little bit. And again, this experience with the screenwriting, and now talking to Hollywood people, talking to Netflix later in the week, has led me to believe that this is like trying to solve planetary mechanics while eschewing calculus. These tools are a gigantic level-up. And people who aren’t going to use them are going to be left behind. In the same way that if someone were trying to do, say, scientific research without having access to online databases of papers, they would be left behind. If you had to go trudge to the library down in Albuquerque every time you wanted to track the history of an idea, your ability to work forward in your scientific endeavors would be tremendously [inaudible 01:03:54].
David: It’s a subtle distinction, I’m not denying the value of the models or the library, which I’m very dependent on. But I’m making this empirical point about intelligence…
Jim: There’s a certain kind of intelligence that comes from having just the pencil. Why were the Russians so good at mathematics? Because they were too poor to afford computers, so they had chalk. [inaudible 01:04:17] would be happy to send chalk and blackboards out all over the country, so Russia had a tremendous number of very talented mathematicians, because they were constrained by the lack of computers, and laboratory equipment, and everything else. So I do get that point, that there are times when constraints matter, whether artificially applied or simply found. Harold Morowitz’s idea was that emergences often came about around pruning rules that basically diverted the world in one direction or another. Not that they were a hard limit, but they made the flow easier in one direction. And that’s how the world evolved, around those pruning rules. That’s probably true for a certain class, a certain kind of breakthrough: the Einsteinian breakthrough of relativity, maybe Darwin. Let’s talk about Darwin.
Here’s an example, it’s one of the ones I prepared actually. If we think about data-first science, Linnaeus and his taxonomy is an interesting one. They had no idea why animals varied, but they saw that they did, and in a somewhat systematic way. And I mean, Linnaeus’s contribution was actually huge, his and his collaborators’. They had it fairly close: they had the kingdoms, and the phyla, and the orders, and the classes, and all that stuff. And yet from a theological, metaphysical perspective, they thought it was all created by Jehovah in this weird kind of taxonomically structured way. They had no principled account of how it could have happened otherwise, and yet they got it close to right. I mean, we now know, of course, that some things that look like they’re taxonomically related aren’t, that they’re parallel convergent evolution. And now with things like next-generation sequencing, we can see, that doesn’t belong here in the tree, it belongs way over there.
And we also now have the ability, at least at some level, to get the timings down based on genetic evolution, et cetera. But anyway, so you go from Linnaeus, then you get to Darwin. Who goes out and famously just wanders around for five years looking at stuff, comes home, sits in his library for a couple of years, and then, “Aha,” has this astounding idea that is so simple. Thomas Huxley famously slapped himself in the forehead and said, “How could I be so stupid as not to have thought about that?” Because that idea had been sitting on the table since Aristotle. Anybody could have picked it up, it wasn’t even very hard. But to have had that breakthrough somehow, to have put that together with just the pencil, is I think another interesting example of what you’re talking about. And I suppose we do need to be careful enough to provide the ecological, economic, and social niches for the pencil crowd.
David: Yeah, the Darwin one’s very interesting. And it’s worth bearing in mind that there was the data, as you say, but Darwin was not very good that way. He was not a big-data person at all. His book, Origin of Species, is full of these beautiful anecdotes. In fact, you can now go online and read the books that Darwin read, and read his annotations, which are fascinating, incidentally. He was always looking for examples after the fact for his theories. But it’s worth remembering what Darwin was reading. The 19th century was obsessed with design. Paley writes Natural Theology in 1802, that’s a huge book that Darwin reads, that says, “The natural world is designed, it has logical principles, it’s logical because God designed it.”
And that was a very important thing for the Victorians, because it wasn’t just a high-dimensional mess, it was an exquisitely designed set of machines. Which then got elaborated in these beautiful eight-volume Bridgewater Treatises between 1833 and ’36. So Darwin’s already steeped in this milieu where anyone with a brain who was designing something would do it beautifully, and geometrically, and wouldn’t waste any effort or energy. That’s the background. Then he gets Lamarck, who’s interested in transmission. So it was a conjunction of a set of beliefs and principles. And in the Origin of Species, he actually says explicitly, in the last chapter, that the mechanism he’s suggesting is as elegant as gravity, and he was right, yes.
Jim: He was right. We could put a little promotion here for one of our SFI affiliates. Daniel Dennett’s book, Darwin’s Dangerous Idea is just a phenomenal exploration of how fundamentally powerful that idea is, and how it impacts a lot of things that we don’t even think about as evolutionary, [inaudible 01:08:45].
David: Yeah, and we might get there because one of the concepts I introduced in that paper I sent you is what I call meta-Occam.
Jim: Let’s go there, that’s on my list of things. First, let’s for the audience remind people what Occam’s razor is, and what it’s not. Because people think it’s a law, it’s not, it’s a heuristic. And then this idea of meta Occam, let’s do that.
David: Yeah. So Occam’s razor is the principle associated with a British scholastic of the 13th to 14th century, who makes the claim, sometimes it’s described as, “One should not advocate or generate a plurality without necessity.” I think that’s how he said it. And the idea there is: always take the simpler explanation for a phenomenon over a more complex one that offers no more insight. And all of mathematical and physical science is full of this assumption of parsimony. And it’s parsimonious by virtue of applying the razor. So if you have 10 alternative ideas and they vary in the number of parameters, take the one with the fewest. We now call that model selection or regularization, but it’s all the application of this notion. And if you talk to a physicist, they’ll say, “Look at our physical theories. We’ve applied Occam’s razor, and we’ve produced beautiful, simple things like the Dirac theory of relativistic quantum mechanics, and so on.”
Okay. The complex world, well, as we’ve already agreed, is full of these horrible high dimensions, which seem irreducible, so Occam’s razor doesn’t seem to apply. But wait a minute, you mentioned Dan Dennett, you mentioned Darwin, they came up with processes which are very parsimonious. You can express and explain the theory of evolution by natural selection in a few sentences, and you don’t need more natural selection to explain a worm than an elephant. It’s not a more complicated theory because it’s a more complicated object, and this is the notion of meta-Occam. There are domains of human inquiry where the parsimony is in the process, not in the final object. So physics has parsimonious theories of the atom; we have parsimonious theories for generating complicated objects. And it turns out that machine learning and evolution by natural selection share one, which is reinforcement learning, or selective feedback. Which, incidentally, turn out to be mathematically equivalent. So I’ve been very interested in these sciences which move parsimony away from the primary object of scrutiny, which is expressed by the minimal theory, to a process which is minimal but which can generate arbitrarily complicated objects.
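Krakauer’s remark that Occam’s razor survives today as model selection and regularization can be made concrete with a small sketch. This is purely illustrative, not from his paper: fit polynomials of increasing degree to noisy data and let an information criterion (BIC here, a standard choice) charge each extra parameter for its keep.

```python
# Occam's razor as model selection: among polynomial fits of rising degree,
# pick the one the Bayesian Information Criterion prefers. Illustrative only.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
# The true generating model is quadratic, plus noise.
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

def bic(degree):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    n, k = x.size, degree + 1
    sigma2 = np.mean(resid**2)
    # Gaussian log-likelihood up to constants, plus the complexity penalty.
    return n * np.log(sigma2) + k * np.log(n)

scores = {deg: bic(deg) for deg in range(0, 9)}
for deg, s in scores.items():
    print(f"degree {deg}: BIC {s:7.1f}")
print("selected degree:", min(scores, key=scores.get))  # usually 2
```

The penalty term is the razor: more parameters must buy a genuinely better fit, or the simpler model wins.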
Jim: I did a really good podcast, it came out yesterday, with Sara Walker, who’s also affiliated with SFI, and Lee Cronin, on their idea of time as a compositional object of complexity. That’s not quite doing full justice to their idea. But they’re doing some very, very interesting work on how an extraordinarily simple concept of composition can lead to what-
Jim: … concept of composition can lead to whatever level of complexity the universe is capable of evolving, but from very simple principles, which is to your point.
David: Yeah. Their idea is that evolutionary time is measured through the assembly index, which you talked about. And that complexity is, in some sense, collinear with the assembly index. And so, physical theory produces structures at equilibrium with low assembly index.
Jim: And then without life, you can’t get to an assembly index of any higher than 13 or 14. With life, you can get to maybe 35. With human technology, maybe 50 or 100. And the thing, of course, that was so amazing for me when I read their papers was that they used this to make a quite principled overthrow of Einstein’s block universe on the grounds that it can’t all happen at once. You can’t get the complexity other than through evolutionary time. And that’s really the guts of their story which is quite amazing.
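For readers who want the assembly index made concrete, here is a toy version for strings, an illustration of the concept only, not Walker and Cronin’s published algorithm for molecules: the index is the smallest number of join operations needed to build the target, where anything already built can be reused.

```python
# Toy assembly index for strings: fewest concatenation steps to build `target`
# from its characters, reusing previously built pieces. Exhaustive search, so
# only sensible for very short strings.

from itertools import product

def assembly_index(target):
    basic = frozenset(target)  # the basic building blocks: single characters

    def search(pool, depth, limit):
        if target in pool:
            return depth
        if depth == limit:
            return None
        best = None
        for a, b in product(pool, repeat=2):
            new = a + b
            # In a shortest pathway to a string, every useful intermediate is a
            # substring of the target, so anything else can be pruned.
            if new in pool or new not in target:
                continue
            found = search(pool | {new}, depth + 1, limit)
            if found is not None and (best is None or found < best):
                best = found
        return best

    # Iterative deepening: the first depth at which a pathway exists is minimal.
    for limit in range(len(target)):
        result = search(basic, 0, limit)
        if result is not None:
            return result
    return len(target) - 1  # joining one character at a time always works

if __name__ == "__main__":
    for s in ["ABAB", "NANANANA", "BANANA"]:
        print(s, "->", assembly_index(s))
```

Repetitive strings come out cheap because sub-assemblies get reused, which is the intuition behind the claim that only evolution and technology reach high assembly indices.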
David: Yeah, it’s a second-law-like argument. Yeah, it’s true. But notice that, again, what they’re doing is trying to explain why life is not parsimonious, but the mechanism there would be natural selection for them, and that’s the meta-Occam.
Jim: So look for simple processes that can elaborate over time and produce arbitrary complexities.
David: Yeah, and notice the difference to physics there. The great paradox that is often not commented upon is that physical theory has infinite models for minimal objects. So for example, if you ask, “Okay…” It’s called the fine-tuning problem. “Where did these constants come from?” They’ll say, “Oh, no problem. There are an infinite number of universes, and the one that has the fine-tuned variable is the one we live in. So we can drop it.”
Jim: Yeah, I loved your send-up of that. I’m not a buyer of those answers. Multiverses, simulations, continuous inflation. To my mind, they’re all just hand waving. I think we just don’t fucking know. So let’s just leave it. Don’t be afraid to say we don’t know.
David: [inaudible 01:14:07] I agree with you obviously, but notice that the interesting point there is that complexity science and physical theory swap their positions with respect to parsimony: they have parsimonious theory with an infinitely complicated meta-Occam; we have a parsimonious meta-Occam, natural selection, reinforcement learning, but very non-parsimonious objects. And it’s another way, by the way, in which those two sciences are different.
Jim: That’s very interesting. Yeah. They’re the opposite. So hence one could say that complexity science is the search for meta-Occam-ish processes?
David: Yeah, I think we are. We’re looking for by and large, relatively simple rule systems that have open-ended properties.
Jim: In the same way that alphabets are very simple, and they’re open-ended in their composability and their evolution. We go from bookkeeping in Sumer to Finnegans Wake in a few thousand years. Oh, yeah. The other example I had in my little set of examples to try to probe on this is Mendeleev’s development of the periodic table without any understanding of atomic number, protons, electrons. And not only was it useful immediately, but it also predicted elements that did not yet exist, all of which were eventually found. Whoa, that’s really interesting. How do you think about that as something that fits into your schema?
David: Again, I think it has that very… It’s called the periodic table for a reason, and human beings are good at observing repeat occurrences. And it was almost a quasi-harmonic, melodic, geometric pattern instinct that he had. And that transcended, as you say, counting neutrons and protons, or understanding electron orbits. It’s not clear to me, Jim, whether, if we had had all of the microscopic data, we would’ve had such a neat taxonomy. It required him not knowing. It’s, again, that really deep insight.
Jim: Well, we know for sure that he didn’t know, but whether the reverse is true or not, I don’t know. That’s a conjecture I’d suggest.
David: It probably wouldn’t be a simple spreadsheet. It might be a vastly more complicated object. In fact, nowadays when you look at the representations of things like the Standard Model, the new graphical visualizations, they’re a bit of a mess. They’re very hard to read because they are taking into account more parameters. I have one topic I want to raise with you.
Jim: Okay, let’s go with it.
David: Existential risk. So there’s so much talk these days about the existential risk of these models, and it’s all completely undisciplined. And I have some thoughts on it, but I’m curious to know yours. Do you buy this hand-wringing, slightly neurotic narrative, or do you think it’s a weird form of marketing ploy?
Jim: It’s a bunch of different things, including working out a mental illness in some cases, having no [inaudible 01:17:22] some of the people involved actually. But I do believe there are significant risks, including the full-on Eliezer paperclip maximizer, but they’re not imminent. From what I know about human cognitive science and the study of consciousness, et cetera, humans are pretty damn weak general intelligences. We are just over the line. Nature is seldom profligate in its evolutionary steps. You look at things like the size of our working memory, the low fidelity of our long-term memories. A $1 calculator can outdo us at any kind of mathematical task. The idea of machine intelligence vastly greater than humans’ is certainly something that will happen someday unless we were to consciously stop it. And for those kinds of risks, I’m glad there are people like Tegmark and Bostrom and Yudkowsky, et cetera, who are working on them. However, the current overheated speculations about the current state of AI strike me as essentially marketing to get more resources to work on those longer-term problems.
And again, yes, but there are a number of other risks that we really should be cognizant of. One is just the very obvious thing of people doing bad things with narrow AI. Take the example of China building a state-of-the-art police state that Stalin would look at with great envy. The ability to say, “Oh, that’s an Uyghur. We know where they are. Let’s track them in real time, in multiple dimensions.” We can track them physically through facial recognition. We can track their data paths, et cetera. That’s certainly a risk, but is it existential? Probably not. But it’s a risk that could fuck up the world. Another one, and this is, I think, a very interesting one, I call the idiocracy risk: in the same way that very few Americans could survive by growing their own food or making their own clothing these days.
And kids today, as us boomers will complain, half of them can’t even read a map. You take away their phone, and some of them couldn’t get home. If even narrow AI continues getting better and better at more and more tasks, and humans stop investing in acquiring those intellectual skills, we may actually devolve toward the very humorous movie Idiocracy, which looks out at the world 500 years from now, where people have forgotten how to do essentially everything other than watch TV and do drugs and have sex. And that could happen if we’re not careful, if we continue, a step at a time, to delegate our capacities to the machines. And then, keep in mind, we think of our technosphere as this great accomplishment of wonderful Homo sapiens. It’s pretty damn fragile. A solar flare on the scale of the Carrington Event of 1859.
A solar flare at that level, estimated by astronomers to happen about once every 500 years, would be enough to totally destroy the grid, possibly knock it down for years. What happens if that occurs in 100 years, when we’ve forgotten how to do anything? Oops. We’d be in deep doo-doo. Another risk, and again, this is fairly subtle. In our work in the Game B movement, we argue that the status quo, so-called Game A, is heading off the cliff. It has done an amazing job of bringing mankind a long way over the last 300 years, from a time when there were only 600 million of us, consuming a relatively small amount of resources and energy each, to now, when we’re well past the carrying capacity of the earth with no real sign of slowing down. Well, here’s one of the problems with all this, even narrow AI.
It’s going to accelerate Game A. Say Game A was going to hit the edge of the cliff by the end of the century. If it accelerates, if manufacturing accelerates, the cost of extracting raw materials goes down, some people come up with clever ways of doing room-temperature superconductors, and everything just becomes more and more acceleration, then these enabling technologies make Game A run off the cliff in half the time. So instead of having 80 years to figure out how to solve it, we only have 40. That’s a fairly subtle one. And there are a few others. So I look at it the same way I look at prepping. You look at a bundle of trajectories. You don’t just say there’s one trajectory. There’s a bundle of trajectories. There’s a bundle of risks. And that’s what I find so unprincipled and unsatisfying about so much of the public speculation on AI risk. I know that’s a long and not simple answer to your question.
David: No, it is. But what’s interesting, what’s not present, I have to say, in a lot of these narratives, is precedent. And if you look, it’s interesting. The world that I was very steeped in was genetic engineering: recombinant DNA was invented in the seventies, and CRISPR in the late eighties. The nuclear weapon was first tested at the Trinity site here in New Mexico in 1945. And the automobile. So look at what we actually have done. It’s quite interesting that more or less within a few years of recombinant DNA being made, the Asilomar Conference takes place. And there is a kind of coming together of morally responsible humans, and they impose a moratorium on themselves. There is no regulation. Now, same with CRISPR. CRISPR is legal in the United States. Of course, if a drug is developed, it has to be FDA-approved. But you can use CRISPR.
Jim: You could do CRISPR in your basement these days.
David: You could do CRISPR in your basement. Right.
Jim: For 50 grand, you could be doing CRISPR.
David: So I think we’ve learned. And the atomic bomb, it was very interesting. In 1946, we already had from Truman an idea of how to handle the lifecycle of nuclear materials. And after the Cuban missile crisis in ’62, we get, in ’63 and through the sixties, a whole number of non-proliferation treaties. The automobile is interesting. Again, as data, looking at this over the 100 years from about the 1920s to now, there’s been a 95% reduction in fatalities per mile. Why? Traffic lights. Drunk-driving laws. Seat belts.
Jim: Airbags. Antilock brakes.
David: Driving tests.
Jim: Better tires.
David: So we actually know from history the kinds of regulations that work, that minimize true risks like atomic bombs… And by the way, cars kill about 50,000 Americans a year. About 2 million for-
Jim: No, not anymore. It’s about 25,000 now. 30,000.
David: All in the tens of thousands. And I think it’s small regulatory interventions that are fairly unobjectionable. What I find ridiculous about the current debate around these models is these super-draconian proposals, as opposed to an empirically informed discussion like we had with automobiles, like we had with atomic bombs. So just a plea for the use of empirical precedent, as opposed to science-fiction prognostication, when it comes to regulating new technologies.
Jim: I will say, to take one hat off and put the other hat back on, the argument for why to be more concerned about this than about the car or even the atomic bomb is that it’s moving way faster. We are probably somewhere in the very steep part of the curve, and we just don’t know what a GPT-5 or a GPT-6 could do. And the idea of Future Shock, which came out in the seventies, I think, is now at a point where it could just be staggering in its implications. Now, on the other side, there are people who argue that, like all technologies, this is an S-curve, and language models in particular may be getting pretty close to the top of the S-curve. Don’t know that for sure. GPT-4 in particular is very, very impressive in its ability to use language. Not as a library, not as a math machine, not as a scientific engine of discovery, but as a device for manipulating language.
It’s extraordinarily fluid and directable in ways that you can’t imagine until you actually start to try to do it. So there may not be a lot of improvement possible there, but perhaps in these other parts. For instance, we know there is going to be a GPT-5. It’s probably going to be trained on video, so it may be able to induce physics, and that may be a qualitative phase change. The people who are exploring these models, the ones I’ve talked to, including some of the people in the open source world, seem to see a phase change at around 13 billion parameters: a good model above 13 billion will be relatively competent with language, and the difference between that and GPT-4 is a matter of degree rather than a matter of kind. Is there another such phase transition when you start training it on video, where it starts to understand reality in a completely new way? We don’t know.
And things are moving so rapidly, and the costs, on the scale of things, are so low. The atomic bomb cost the equivalent of about $300 billion in current money, I believe, something like that. And you can build a state-of-the-art small model for $1,000 now: take the LLaMA model, which escaped confinement, and do a large amount of fine-tuning for some special purpose for $1,000. Build a full model from scratch for a few million dollars these days. And those prices are going to come down via Moore’s law. So: exponential growth, things already escaping confinement, and the fundamental technology itself isn’t hard at all, and it’s all published. So this will be qualitatively different from those other cases. That’s the counterargument there. But I do take your point that historically we have managed a number of risks.
Remember Bill Joy and the gray goo back in the nineties? He was convinced that nanotech was about to destroy the universe. And as far as I know, there was never any regulation on nanotech, and the universe has not yet been destroyed. So we shall see. And the flip side is the good; we talked about this in the pre-game. We do talk about the potential negatives, particularly from large language models, which I believe are not even close to sentient. Not even anything like sentient. There’s very simple evidence: they’re feed-forward-only networks. There are no feedback loops in them. They can’t modify themselves. They can’t learn. Though you can embed them in contexts that do all those things, which are quite interesting, the models themselves are something you put stuff into, and stuff comes out the bottom in a purely feed-forward kind of way.
But the negatives certainly include the flood of sludge. It’s noticeable if you’re on the internet: the number of just bullshit websites that pretend to be news sites is going up exponentially, the quality of spam is increasing, and the proliferation of low-end direct-response advertising is too, because the cost to create this shit has now dropped by a factor of 10, essentially. However, for any of you young people out there who want to be the first trillionaire in the world: take these technologies and use them to build info agents that we could put around ourselves. My hypothesis is that the rise of the flood of sludge will produce, as a natural evolutionary reaction, the development of info agents that we will all surround ourselves with. And we will no longer go out and deal with things like Facebook or Twitter directly.
Instead, they will be filtered by AI agents working on our behalf. And further, we will be able to build connections to other people’s info agents, so we can build constructive networks of mutual curation amongst each other. I would pay $10 a month to get the output of the Krakauer info agent, for instance. And one could imagine an ecosystem where this occurs, and also where people set themselves up as curators in a domain, with pipes coming out of their bubbles. Anyone who wants to can subscribe to them and add them to their own bubbles, and we could see a world where we are no longer being overwhelmed by these attacks on our attention.
David: Here’s the Krakauer solution for you, Jim. I don’t have any social media. I don’t use any of those platforms. It’s really cheap and super effective.
Jim: Yeah, that’s true. But on the other hand, you’re missing out on certain things too. I can guarantee it.
David: I don’t mind.
Jim: No, it is true. I’ve just started my annual six-month social media sabbatical, from the 1st of July to the 2nd of January. I’ve been doing it for the last four years, and this will be the fifth year. Complete cold turkey. I don’t go on them at all. And I find the combination of the two is actually very interesting. People say, “Well, don’t you miss it?” And I go, “When I go back, it’ll be exactly the same. Same shit, different day.” I have made a number of useful connections, and I’ve learned a lot of useful things. But on the other hand, it’s a constant depletion of our attentional resources.
And this idea of an information agent that surrounds us and buffers us from the wilds, I can see it. I can smell it. I can taste it. We’ll end up better than we were before. So I think this could be a magnificent opportunity to use the technologies as they exist today. They’re good enough: latent semantic vector-space databases, plus large language models for doing summarization and rough-and-dirty curation. Combined with a few other interesting technologies, you could build this right now.
David: What I love is you’ve just told me, which is wonderful, that all of this technology really comes down to being a very refined spam filter.
Jim: That’s what you should use it for, essentially. And I use that metaphor because, if you remember, in the mid-nineties email almost melted down. 1995, ’96. It wasn’t clear whether spam filters were going to be able to stay ahead of spam, but then there was a breakthrough in old-school computational linguistics. I forget which one it was, but it turned the tide, and spam has never flooded over the seawall since. And I use this exact analogy for this info agent idea: it’s essentially God’s own spam filter for everything electronic, including TV. It would tell me what TV shows I might be interested in and provide me with one-paragraph summaries, created with a large language model that knows me and knows everything I’ve ever watched, and tell me why I might or might not be interested. And it also rates how high each item sits relative to my threshold. So on a day when I want to work, I say, “I only want to get five messages a day from the outside world, period.” And the algorithms will probably do a fairly good job of picking out the top five.
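The info-agent-as-spam-filter idea in the last few exchanges reduces to a small pipeline: embed incoming items, score them against a profile of your interests, keep only a daily budget of the best, and summarize the survivors. The sketch below is illustrative only; the hashed bag-of-words embed() is a crude stand-in for a real embedding model, summarize() is a stub where a large language model call would go, and the item list is invented.

```python
# Toy info agent: vector-space relevance filtering plus a summarization stub.

import re
import numpy as np

DIM = 512

def embed(text):
    """Crude stand-in for a semantic embedding: hashed bag of words."""
    vec = np.zeros(DIM)
    for word in re.findall(r"[a-z']+", text.lower()):
        vec[hash(word) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def summarize(text):
    """Stub: in a real agent this would be an LLM summarization call."""
    return text[:120] + ("..." if len(text) > 120 else "")

interests = embed("complexity science evolution machine learning podcasts Santa Fe Institute")

incoming = [
    "New paper on reinforcement learning and natural selection as shared formalisms",
    "You won't believe these 10 celebrity diet tips",
    "Santa Fe Institute announces a workshop on assembly theory and the origin of life",
    "Limited-time offer on extended car warranties",
    "Long interview about large language models and scientific discovery",
]

DAILY_BUDGET = 2  # "I only want five messages a day", scaled down for the demo

# Rank by cosine similarity to the interest profile; pass the daily budget only.
scored = sorted(incoming, key=lambda item: float(interests @ embed(item)), reverse=True)
for item in scored[:DAILY_BUDGET]:
    print("PASS  :", summarize(item))
for item in scored[DAILY_BUDGET:]:
    print("FILTER:", item)
```

Swapping in a real embedding model and an LLM summarizer would turn this toy into roughly the architecture described above.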
David: It is interesting, though, because this is where I don’t disagree with the end point you want to reach, but I have a very different philosophy: the problem with the model you are suggesting is the outsourcing of human judgment. Let me give you an alternative future scenario, a kind of paleolithic alternative, which is that many of us… It’s interesting, when I talk to young students here at SFI, I tell them, “Forget all this rubbish. Go and read books, physical ones.” It’s not that this will go away. It’s not that we’re returning to the Stone Age. It’s just that in our life decisions, we simplify radically. And that’s certainly what I’ve done, and I don’t think scientifically or institutionally it’s been negative. I just have to deal with much less BS than other people. So you can counter technology with technology, but you can also counter it by ignoring it.
Jim: Yep. Absolutely correct. Though if you look at the numbers, you’re a real outlier. But think how much time you’ve earned back by not being involved with beefs on Twitter. It’s probably a good 45 minutes a day one could easily get sucked into. Some of your colleagues at SFI have periodically gone and become full-on Twitter animals. It’s been pretty funny. Most of them, though, because they’re smart people, have eventually pulled back. But it can be a really open-ended time and attention suck. And I guess this also goes to your point that we do know how to deal with technologies. We learn how to deal with them incrementally. Fire. Think about how fucking dangerous fire is. And yet it’s really the whole basis of human civilization, because we’ve gradually learned how to control fire. And maybe the kids, because I am hearing some signs of the Zoomers, people 22 and under, abandoning online dating, learning not to bring the phone into the bedroom, just basic hygiene.
And they may figure it out, being true digital natives. Of course, there will probably be a bifurcation. One group will be completely hypnotized by psychologically very, very advanced, direct-response attention hijacking. And the others will develop a reaction against it and a whole different pattern away from it. Because humans are smart. We do know how to adapt. I think that was your point earlier, that we have a track record of dealing with all kinds of dangerous stuff, starting with fire. And language! That was a goddamn radical idea, which can be misused in all kinds of ways. Think of every demagogue and dictator and religious crank who’s indoctrinated people by the millions. On the other hand, without it we wouldn’t be humans.
David: Yeah. All right.
Jim: Well, David, this went this way and that way, and every which way. In fact, it kind of modeled the kind of conversations we’d have out at SFI.
David: Yeah. No, I’m wondering what kind of edit you will do. But yeah, of course.
Jim: None. None. We won’t do any edit. We’ll get rid of the ums and the ahs, and we’ll run it. That’s what we always do.
David: Do it. Fantastic.
Jim: All right. Thank you very much, David Krakauer, one of the most interesting guys out there.
David: Yeah. And thank you so much, Jim. It’s wonderful to see you actually.