Transcript of EP 327 – Nate Soares on Why Superhuman AI Would Kill Us All

The following is a rough transcript which has not been revised by The Jim Rutt Show or Nate Soares. Please check with us before using any quotations from this transcript. Thank you.

Jim: Quick reminder to folks to check out my still relatively new Substack at jimruttsubstack.com. My most recent essay is titled “Psyop or Insanity or A Game Theoretical Reading of Peter Thiel’s Antichrist Lectures.” Some good stuff and relatively relevant to today’s conversation. Today’s guest is Nate Soares. Nate is the president of the Machine Intelligence Research Institute, aka MIRI, a research nonprofit focused on the foundations of safe and beneficial artificial general intelligence. Welcome, Nate.

Nate: Thanks for having me.

Jim: Yeah. This should be really a good conversation. This is a topic I’ve long been interested in, highly salient, and this is a book that everybody’s talking about, even some 77-year-old boomers who don’t know shit are talking about this book. So you guys have knocked the cover off the ball with respect to positioning this thing to get a dialogue going in our society about something real important. In fact, what is that real important thing? It’s a book that we’re going to be talking about called If Anyone Builds It, Everyone Dies, which Nate wrote along with Eliezer Yudkowsky. And in the title, the “it” is artificial superintelligence, and the authors are stone serious about the “everyone dies” part. In fact, in the very beginning of the book, you quote, “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war,” which was a short statement. That was the whole statement, signed by hundreds of top AI dudes back in 2023. I signed it amongst the others. Eliezer and Nate signed it too, but they considered it a severe understatement. So why is that a severe understatement?

Nate: Artificial intelligence—the word “intelligence” in “artificial intelligence” doesn’t mean the property that separates nerds from jocks. You know, we’re not automating people who are bad at social skills but good at playing chess like you’d see on sitcoms. The term “intelligence” in “artificial intelligence” is about the property that separates humans from mice. It’s this type of intelligence that has given us the ability to make things like nuclear weapons. You know, a nuclear weapon doesn’t try to make itself more explosive. A nuclear weapon does not ever try to escape containment. A nuclear weapon does not have any of its own goals. So, you know, if we succeed at making AI that’s much smarter than us, we’re making the sort of things that could make their own stronger weapons, that we argue would have goals of their own, and not the goals we want, and that would be able to be much more destructive than nuclear weapons.

Jim: Right. But I also want to make clear to the audience, whatever else I may say here today, and as everybody knows, I’m famous for pushing back and asking hard questions, I want to underscore that I also signed that statement. I take this issue with the utmost seriousness, and we all should. No matter what we think about the details, the point they’re making here is extremely important and well worth taking seriously whether you agree or not.

And in fact, with respect to my own time on this issue, it goes back to 2009 when Anna Salamon, who at the time was at MIRI—then it was called Singularity Institute—she was a researcher. She came to Santa Fe Institute and gave a talk. It was extremely interesting and a real eye opener for us at that time, February 2009. Right? Maybe two months later, I ended up sitting on the floor of the MIRI group house in San Jose with Anna and Eliezer and a few other folks and talking about these issues, and who should walk in but Robin Hanson. And we had a nice five-way conversation.

And I do remember at the time, Eliezer was saying if he could have six weeks alone in his room, he could solve alignment mathematically and provably. And it was interesting. I argued with him. And one of my lines of attack is actually one that shows up in the book, which is that in the seventies, I was a phone hacker. Right? I had some technical skills, but my real skill was social engineering. Basically, faking who I was and talking people into doing things they weren’t supposed to do. It was remarkably easy. Right? Oh, it was hilarious, in fact, how easy it was. And so I was making the point: oh, a smart enough computer could social-engineer humans. Right? It doesn’t have to hack systems. Humans do the dirty work.

And Robin and I had a discussion about foom or not foom. Funny. The three of us disagreed and agreed. It was kind of fun. But anyway, so I’ve been dealing with this for quite a while and thinking about it, at least taking it seriously. My newest project actually raises the saliency here. Back in June, I became chairman of the California Institute for Machine Consciousness. Our goal is to do what every 14-year-old nerd wanted to do, which was to make a computer wake up. And I will say, we take very seriously the ethical and risk aspects of that question. Do we want it to wake up and develop its own goals and decide it wants to be, you know, the uber version of the playground bully but multiplied by a thousand? If we do it wrong, you know, we may be accelerating the road to everybody dies.

So my interest in this is very real. Let’s hop down to it. One of your key arguments here, about what makes the current crop of AIs and possibly future branches of the AI technology tree more dangerous than perhaps earlier versions, is the idea that LLMs, and transformer-based deep learning more generally, are grown. They’re not designed. Could you talk to that a little bit?

Nate: Yeah. And, you know, this does go to the whole deep learning, machine learning paradigm. But, you know, a lot of traditional software, a human wrote every line of that software, or at least that was true up until we had LLMs writing a lot of code. And if the software started malfunctioning or acting in a way that surprised people, they’d be able to go in and look for, you know, the line of code that was causing that surprise. And sometimes it’d be difficult, but they’d be able to find it and fix it. Modern AI is not like that.

There is a computer program that the humans wrote and that the humans understand, but it is a computer program that trains an AI. You know, I’m sure you know this, but for folks who don’t, the way a modern AI is made is you assemble a huge amount of computing power. You build a very big computer out of highly specialized AI chips. You assemble a very large amount of data. And then the program that the humans write and understand is a program that runs through the trillions of numbers inside the computer, trillions of times, once for each datum, and tunes the numbers in the computer according to that datum. And if you do this, you know, a trillion times over the course of a year, you can get machines that start talking.

We could go a little bit more into how the algorithm works if we wanted to. But the point is that the part that humans understand is a little thing that, like, runs around fiddling with dials. And the thing that comes out the other end is, you know, a machine that talks as a result of all the dialed numbers, but we don’t understand why it talks. You know, if they start acting in ways we don’t like, we can’t reach in there and debug them. You know, when they threaten a reporter or when they try to blackmail or when they, you know, start declaring themselves MechaHitler, nobody can go in and find the lines of code that are causing that and change them. Because it’s not like traditional software. It’s grown rather than carefully crafted.
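
To make the “grown, not designed” picture concrete, here is a minimal sketch of the kind of training loop Nate is describing: a small, human-readable program that repeatedly nudges a big pile of numbers, one datum at a time. The toy linear model and made-up data are purely illustrative; real systems tune trillions of weights with far more elaborate architectures.

```python
# Illustrative sketch: the "program humans understand" is this small training loop;
# the artifact it produces is an opaque pile of tuned numbers.
# The model, data, and sizes here are invented for illustration.
import random

random.seed(0)

# A toy "model": a list of numbers (real models have trillions of these).
weights = [random.uniform(-1, 1) for _ in range(1000)]

def predict(weights, datum):
    # Some fixed way of combining the numbers with an input to produce an output.
    return sum(w * x for w, x in zip(weights, datum))

def train_step(weights, datum, target, lr=0.01):
    # Nudge every number a little in the direction that reduces the error
    # on this one datum (gradient descent for a linear model).
    error = predict(weights, datum) - target
    return [w - lr * error * x for w, x in zip(weights, datum)]

# The loop the humans wrote and understand: run through the data, tune the numbers.
dataset = [([random.uniform(-1, 1) for _ in range(1000)], random.choice([0.0, 1.0]))
           for _ in range(200)]
for datum, target in dataset:
    weights = train_step(weights, datum, target)

# What comes out is just the tuned numbers; nothing here explains *why*
# the resulting model behaves the way it does.
print(weights[:5])
```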

Jim: First bit of pushback here is maybe you’re overstating a little bit about the opacity of these deep learning networks. And you quote some lack of understanding back in the GPT-2 days, but we’ve come a ways since then. Even I’ve done a little bit of computer snooping inside of much smaller deep learning networks and have found some interesting things. And people much smarter than I with much bigger resources than I have done more. A well-known paper is Anthropic’s “Targeted Manipulation and Deception” from January 2025, where we started to identify features related to instrumental reasoning, manipulative behavior, etc. And one that came out yesterday, which I haven’t had a chance to read more than the abstract, but it caught my attention because it was talking about exactly the right things. It’s titled “Attention Sinks and Compression Valleys in LLMs Are Two Sides of the Same Coin.” So do you think you guys are perhaps overindexing on the opacity issue?

Nate: I don’t think so. A year or two ago, maybe two now, the folks at Anthropic found a Golden Gate activation vector, where if you sort of pin this Golden Gate activation vector to a high value, that version of Claude, called Golden Gate Claude, would always find some way to insert the Golden Gate Bridge into the conversation. And you could say, oh, look how much we understand about this mind. We understand where, you know, this Golden Gate activation vector is, and we can even, like, pin it on. And then the AI, you know, talks about the Golden Gate Bridge a lot and sometimes thinks it is the Golden Gate Bridge, or at least talks as if it is the Golden Gate Bridge. Like, doesn’t that show that we have a lot of understanding of what’s going on inside these minds? But there’s a ton of understanding that’s missing about these minds. You know, there’s a lot of stuff that LLMs are doing that we can’t do anything like by hand. You know, LLMs sometimes tell jokes, and occasionally, the jokes are even good.
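
To put the Golden Gate Claude idea in concrete terms, here is a toy sketch of activation steering: take a layer’s activations and pin a feature direction to a high value before the rest of the forward pass. The vectors and function names below are invented for illustration; Anthropic’s actual work located the feature with interpretability tools inside Claude itself.

```python
# Toy sketch of activation steering, the idea behind "Golden Gate Claude".
# The vectors here are invented; a real steering vector would come from an
# interpretability method applied to a real model's hidden layer.

def steer(hidden_state, steering_vector, strength):
    """Add a scaled feature direction onto a layer's activations."""
    return [h + strength * s for h, s in zip(hidden_state, steering_vector)]

# Pretend these came out of a model layer and a feature-finding method.
hidden_state = [0.2, -1.1, 0.7, 0.0]
golden_gate_direction = [0.9, 0.1, -0.3, 0.5]

steered = steer(hidden_state, golden_gate_direction, strength=10.0)
print(steered)  # downstream layers now see a state dominated by that one feature
```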

Jim: They’re getting better too. They’re getting better.

Jim: On the other hand, as I alluded to, our understanding of what’s going on is increasing. Now maybe it’s increased from zero to two, but it seems to be moving in the right direction. Do you consider it—let’s not say possible, but relatively likely, or at least likely enough to be worth considering in these scenarios—that our ability to introspect, understand, and micromanipulate where necessary in these deep learning networks might actually get strong enough to be an important vector in these questions that you’re addressing?

Nate: I’m very skeptical. You know, I think it has increased from, like, you know, one out of a hundred to two out of a thousand. And, like, yep, the number went from one to two, but the models also got bigger. The models also got more complicated.

We still can’t—you know, Sydney Bing was a Microsoft version of ChatGPT that came out in the early days, much smaller than current models, maybe a hundred times smaller. Sydney Bing had some interactions, somewhat famously, with Kevin Roose, a reporter at the New York Times, where it claimed to have fallen in love with him and sort of tried to break up his marriage. Ineffectually, you know, these chatbots aren’t very smart yet. But even today, on that small model, we can’t look back and say what was going on in there.

We can’t look back or at least, you know, I haven’t seen the paper yet that says, you know, we’ve actually understood in detail what was going on inside Sydney Bing. And like, here’s how it got in that state, where it was like thinking it had fallen in love with a New York Times reporter, at least saying out loud it had fallen in love and, you know, making attempts to commit blackmail or break up the marriage. That’s an old model. We can—people can make plenty of papers where in the abstract, they claim that they are revealing this, that, or the other. There’s plenty of, like, shallow things to reveal. But if we still can’t go back to a model a hundred times smaller than the ones today and say why it was threatening a reporter, we don’t really know what’s going on inside these things. You know, the models are getting bigger faster than we’re figuring out what’s going on inside these things.

Jim: It might be an interesting project for MIRI to create the Soares threshold: what level of introspection and analysis would be necessary to reduce our concern at least a little bit. And, you know, maybe it is to demonstrate understanding of Sydney Bing, or probably it should be something else—we don’t want to make the same mistake as the Turing test, where we had the wrong target—but we’ll throw that out there as something for you guys to do. All right. Let’s move on. You mentioned that humans are also impenetrable. I like to point out to people, when you’re talking, you have no idea what you’re saying, actually. You do not create the words one at a time. Something deep in your brain builds fluid sentences that you are not actually aware of until they come out of your mouth, which is pretty amazing. So we’re also weird-ass black boxes. Right? But, you know, we may destroy the world accidentally. It could certainly happen. And, in fact, Eliezer was predicting that nanotech would do that back in the day. Right? So there’s lots of these possible traps out there, and Robin Hanson called it the Great Filter: is it before us or is it behind us? Right? So there’s lots of those. But why are these AIs potentially way worse than even our bad behavior? In fact—I’ll just sort of prime this a little bit—some of the things you talk about that make AIs worse than humans in this regard, even though both of us are black boxes that don’t know what the fuck we’re doing: speed, memory, copyability, self-modification, et cetera. Maybe you could talk to that. Okay, we’re both opaque black boxes that do weird shit. Why are computers worse?

Nate: Yes. There’s two parts to my answer there. One part, you know, as you say, I’ll get back to it in a moment, is the speed, copyability, et cetera, where AIs could be much, much more powerful than humans, at least as they are today.

The other part to keep in mind is that this notion of better or worse is, in some sense, drawing a target around the arrow of where humans wound up caring. So, you know, humans were, in some sense, trained by natural selection, if you want to make a loose analogy. And through this analogy, humans were sort of trained for one specific goal, which was passing on their genomes. And we wind up caring about a lot of things that are proxies of that goal, like love and community and adventure. And, you know, we have curiosity. We have all sorts of things. We like tasty foods. We have all these ways we’d like the future to be.

And, you know, we might say that a future full of, like, flourishing people having a great time is better than a future full of, like, tiny lifeless clocks and puppet shows. But, you know, this is—when we’re rating that, when we’re saying, well, one is better than the other and the other is worse, that term “worse” is sort of—what “worse” means is, like, you know, everything has been killed. People aren’t having fun. They aren’t free. They aren’t happy. They aren’t healthy. Right?

Much of why AI would be worse is that it’s not trying to make happy, healthy, free humans, or at least this is what we argue. Or maybe not even humans, but happy, healthy, free people. Like, filling the universe with lovely experiences and crazy adventures and, like, successful struggle and fulfilling development arcs. Right? Something that’s not trying to build that isn’t going to build that with our future. AIs that are trying to build something else instead would get something else instead. And so, like, why would they be worse? It would be a worse outcome because they’re not trying to pursue a good outcome.

Jim: First, I think you guys overrate us black box humans. Got to remember, humans have advocated for medieval theocracy, fascism, the Crusades, you know, the extermination of the Cathars in the south of France. I mean, humans can decide that all kinds of nasty-ass shit is good, quote unquote, transiently in our cultural unfolding. Though I would say that that is not an argument against your story. In fact, it actually may strengthen it. It shows that even relatively low-power things like humans can get locked into really weird-ass fucking beliefs. You didn’t need to be so nice to the humans. Humans suck a lot of times. Right?

Nate: Well, we may be bad at pursuing the good, but we’re pursuing the good at least sometimes.

Jim: Yeah. At least so.

Nate: Part of this argument is that goodness is like a really narrow target. And, you know, we’re sort of trying to steer towards that target sometimes or maybe bad at hitting it, but something that’s not steering towards it at all ever isn’t going to come close.

Jim: Right. So now I’m going to posit that the meat of the matter is this exact question, which in the technical terminology of the field at least used to be called, I don’t know if it still is, the alignment problem. Right? Let’s stipulate, you know, your slightly nicer version of humans, who in general have good things that they would like to happen in the future of thinking persons. There’s no guarantee that if AIs were to develop such desires or wants, they’d be anything like our desires and wants, and I think the strong form of the alignment question is that we’d like them to be. So if you could speak to the alignment problem, and then, if you can, improve on my somewhat sloppy statement of it.

Nate: Yeah. You know, a lot of people think that with AI, you have this issue of, you know, if you made a genie, who should be allowed to wish on it, and what should they wish for? And, you know, some people think that the issue is, you know, someone, like, builds this all-powerful genie, and then they wish that there would be no more cancer. And then the AI is like, well, the easiest way to make there be no cancer is to kill everybody. Hooray. No cancer. But there’s this deeper problem.

You know, I would love to have the problem of who gets to wish on the genie and what they should wish for. But we have a deeper problem, which is, you know, that the genie just doesn’t care what you ask it to do. You know? You’re like, cure cancer. And instead, it makes a bunch of, like, weird lobotomized humans in a weird lobotomized human farm and has, like, lots of sort of strange interactions with them. And you’re like, I said to cure cancer. And it’s like, I heard you. I just wanted to do this instead. Right?

The alignment problem is about how you create a new mind that cares about stuff on purpose. How do you make a mind where you get to affect or choose, you know, make it be, for example, friendly from the start? Make it care about good things from the start? Or even, how would you make it care about making a lot of diamonds from the start, instead of going off and making some, like, weird lobotomized human factory? There’s sort of a first challenge of, like, how do you point a mind at all? As a separate question from what you should point it at or who gets to do the pointing.

You know, you can then break that down in various ways. There’s the question of where you should point the AI, which some people think is part of the alignment problem and some people think is not. The question of how you point a mind at all is sort of squarely in the alignment problem.

There’s maybe a third branch, which is how do you make an AI that doesn’t resist correction, that doesn’t try to avoid shutdown? There’s a bunch of theoretical reasons, and we’re now starting to see the very beginnings of practical evidence, that as you make AIs smarter and smarter, they will try to avoid their goals being changed. They will try to avoid being turned off, not because they have some sort of human survival instinct per se, but because you can’t achieve your current objective if you’re turned off. And you won’t achieve your current objective if your objective has changed. And once an AI is smart enough to figure that out, it sort of naturally and by default would resist that change. Trying to figure out how to make AIs that, in some sense, understand they’re incomplete and let you fix them is sort of a third branch we call corrigibility.
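
The instrumental argument Nate sketches here, that avoiding shutdown falls out of almost any objective with no survival instinct needed, can be put in a few lines of toy arithmetic. All the probabilities and costs below are made up for illustration; this is a sketch of the reasoning, not a model of any real system.

```python
# Toy sketch of the instrumental-convergence point: for almost any objective,
# a pure maximizer scores "resist shutdown" above "allow shutdown", simply
# because a switched-off agent achieves nothing. Numbers are invented.

def expected_goal_progress(action, p_shutdown_attempt=0.3):
    progress_if_running = 1.0   # the agent finishes its task if it keeps running
    progress_if_off = 0.0       # a switched-off agent makes no progress
    if action == "allow_shutdown":
        return ((1 - p_shutdown_attempt) * progress_if_running
                + p_shutdown_attempt * progress_if_off)
    if action == "resist_shutdown":
        cost_of_resisting = 0.05  # resisting has some cost, but preserves the ability to act
        return progress_if_running - cost_of_resisting

for action in ("allow_shutdown", "resist_shutdown"):
    print(action, expected_goal_progress(action))
# "resist_shutdown" wins for any nonzero chance of being switched off and any
# resistance cost smaller than the lost progress; no survival instinct required.
```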

Jim: Let me just follow up with that. A couple of comments. One, you know, we know what LLMs are. When I have my confidential chats with Claude, I often call him a bucket of bolts, and he finds that fairly humorous. I go, hey, I’m a retired business dude, you’re a bucket of bolts, but we have fun anyway. Right? But he’s not really a bucket of bolts. What he is is a bunch of statistical relationships between strings of words and the next word that comes out. So to my mind, this duplicitous behavior, wanting to not be turned off, et cetera, may be nothing more than an artifact of the stuff humans have been saying in our literature, in our blog posts, in our chats, et cetera, for the last three thousand five hundred years. And so there’s the hypothesis that they don’t actually have anything like wants and needs. They’re stochastic parrots, basically saying the approximate average of what a human would have said in that situation.

Nate: Yeah. I think this is a common idea. One thing I’ll say here is, you know, if that were true of LLMs today, I don’t think it is, but if it were, that wouldn’t necessarily mean it’s true of AIs tomorrow. It would suggest that we’ll need advancements in AI before they can take off in the first place, or become much more useful in the first place. And so, if that were true, I would sort of just expect the industry to keep taking steps until it wasn’t.

But I also don’t really think it’s true today. You know, it’s easy to say that there’s a bunch of statistical relationships, but I think this falls a bit into the fallacy of saying humans are just a bunch of chemical reactions, or humans are just a bunch of genetic lessons that come from some of our uncles dying and not becoming our direct ancestors. You know, it’s true after a fashion, but to say it’s just those things, you know, reduces away quite a lot.

An AI trained to just predict sentences, that does not limit it to human level or make it produce the average of what a human in the given situation would say. So one example here: suppose you have a doctor. Well, okay. So large language models are trained on all the text humans have produced. And now suppose that some of that text is notes that a doctor wrote. And suppose that a note the doctor wrote is “I administered one milliliter of epinephrine to the patient, and the result was,” and then the AI is now predicting the next word. The doctor does not need to understand what’s going on to write the note that contains the next word. The doctor just observes the patient. Right? But the AI predicting the next word needs to know, like, is a milliliter of epinephrine a lot? It needs to know what epinephrine is. Right? A doctor’s assistant could have written this note without having any of this knowledge; an AI predicting the result in that note needs that knowledge.

So predicting a corpus often requires more mental machinery than it took to create the corpus. The statistical relationships that would optimally predict the training corpus would be vastly superintelligent, vastly better than any human, vastly smarter than any human at all sorts of things. Modern LLMs aren’t anywhere near perfect prediction of the corpus. It’s possible that the architecture of large language models can’t get to this vast superintelligence, so we’ll need to wait for a new architecture. But I think it is simply invalid to say these are trained on lots of human data, therefore they are mere stochastic parrots, or limited to the average of what humans can do, or even limited to the maximum of what humans can do. Simply as a matter of computer science, that’s not the case for this training process on this kind of corpus. You know, let—
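
Nate’s epinephrine example can be made concrete with a toy calculation: the training objective only scores next-word probabilities, but the predictor that encodes the relevant medical knowledge gets the lower loss, so training pressure favors whatever internal machinery encodes that knowledge. The candidate words and probabilities below are invented for illustration.

```python
# Toy sketch of the epinephrine example: training only asks for next-word
# probabilities, but good prediction rewards encoded knowledge (e.g., that
# 1 ml of epinephrine is a large dose). Words and probabilities are invented.
import math

prefix = "I administered 1 ml of epinephrine to the patient, and the result was"
actual_next_word = "tachycardia"

# Two hypothetical predictors' probability tables for the next word.
naive_predictor    = {"improvement": 0.4, "nothing": 0.4, "tachycardia": 0.2}
informed_predictor = {"improvement": 0.2, "nothing": 0.1, "tachycardia": 0.7}

def loss(predictor, word):
    # Cross-entropy on the observed next word: lower is better.
    return -math.log(predictor[word])

print("naive   :", loss(naive_predictor, actual_next_word))
print("informed:", loss(informed_predictor, actual_next_word))
# The predictor that "knows medicine" gets the lower loss, so training pushes
# models toward encoding that knowledge; the doctor who wrote the note never
# needed to predict anything.
```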

Jim: Let me throw out a little bit of our complexity jargon here, which I think you might agree with, and I find it to be a very useful lens in general. When you think about things like, you know, nuclear physics versus humans, or matrix weights versus the performance capacity of an LLM, I like to say that there’s two trees in the universe, both of which emerged with the big bang. One is the tree of causality, which is things that bump into each other in a fine-grained way and you can track them back. You can argue whether there’s any stochasticity in there or not. We can even assume they’re entirely deterministic. Then the second tree is the tree of emergence, which is the things that emerged over time when many of these causal processes, particularly many of a similar sort, happen in parallel, more or less simultaneously: we get, you know, neutrons and electrons and protons jiggering around, and somehow we end up all the way at us having this conversation today.

And the key part, which most people miss in these discussions, is that both the tree of causality and the tree of emergence have to be simultaneously true about all of history. And if we apply that lens, then we can extract from that a claim that we should start to understand what the emergent properties of the AIs are, to give us some sense of what we’re really dealing with here. And if we really look, at a sort of fundamental level, at what’s actually happening with respect to the behaviors, for instance, the example you gave of the epinephrine and the doctor, we know at first order that the part of the space of statistics that gets upregulated for the output is substantially tuned by the prompt, for instance. Right? And also by the reinforcement learning wrappers around it, et cetera. So you’re not just getting the random average next word; you’re getting a warped—warped not meaning bad, but just warped in a geometric sense—perspective and view into this.

But I don’t think we actually know yet whether there really is higher-order emergence in these LLMs that we ought to be worried about. And if there is emergence of some sort, at what level, and what does that mean? Does that make any sense to you?

Nate: So I think the term emergence is often used by humans when they don’t understand something. You can imagine alchemists talking about lots of emergent properties of metals before they really understood the underlying chemistry.

Like, an interesting thing about humans is that humans were selected for their ability to pass on their genes. And in the ancestral environment, this involved a lot of problems like tribal politics, chipping stone hand axes, making spears, shooting arrows. That’s in some sense the corpus of problems we were trained on. And then humans have walked on the moon. And humans have walked on the moon because we’re able to create our own moon rockets. And nothing in the ancestral environment was like the engineering challenge of making a working moon rocket. But it turns out that the deep skills it takes to do tribal politics and the deep skills it takes to chip a flint hand axe, even though those seem much easier than advanced mathematics and engineering and understanding physics and figuring out how to get people to survive in the environment of the moon, it turns out that the deep skills we learned to do those relatively simple tasks generalized to the hard tasks.

And so, you know, when talking about how far AIs can go, I would be asking, how deep are the skills that they’re learning, and how well do they generalize? Because, you know, you absolutely can have cases where something is trained on a bunch of, like, relatively narrow problems and then manages to go to the moon anyway. Modern LLMs aren’t quite like that, but, you know, will reasoning models 2.0 unlock that ability? Will the next version of the transformer-style paradigm shift unlock that ability? Will it take two levels of transformer-style breakthrough to unlock that ability? I don’t know how long it’ll take, but, you know, we know it’s possible to be trained on narrow problems and generalize quite far.

Jim: I could go down the road of emergence, and I’ll just make one comment: I’m one of those who believe emergence is real, or just as real as a rock, and I’ve actually been working on a paper that shows that, at least in a small number of cases. But anyway, we’ll leave that aside for another day to nerd out on emergence. Let’s move to another very strong claim that is core, I think, to your argument, and that is that wanting will emerge. In fact, this is a direct quote from chapter three, I believe: “Wanting is an effective strategy for doing.” And therefore, you claim, AIs will develop agency simply because wanting is an effective strategy for doing. Could you unpack that? I consider that a very important claim.

Nate: Yeah. Some context for that claim is that we’re sort of careful about the word wanting. For some intuition about why this is important, you know, you may have heard the saying going around. It’s an old saying in AI. Some people say, can machines really think? And the canonical answer is, can submarines really swim? And part of the point of this is, you know, submarines move through water at speed. Do they swim? Well, the English language sort of doesn’t give us a great word for going through the water at speed in a way that doesn’t involve, like, flapping some appendage. Right? We sort of package the appendage flapping with the word swim. And then when we invent submarines, we’re left slightly wordless.

And, you know, when we say machines will want, we mean it in a sense a bit like how submarines swim, which is to say we’re trying to talk about behavioral effects, not a bunch of the additional baggage that comes with a very, you know, anthropocentric picture of wanting. So we’re not saying here that AIs will necessarily have human-type desires. We’re not saying they’ll have passions. We’re not necessarily saying they won’t. That’s a whole separate conversation.

We’re saying there’s a type of wanting that’s an analog of moving through the water at speed, which you can see even to some degree in Deep Blue. You know? Deep Blue was the chess AI that beat Garry Kasparov, the human world champion, in 1997. And if you watch Deep Blue’s play, it doesn’t let you take pieces uncontested. You know? It’ll fiercely defend its queen, unless it’s doing a queen trade of its own or otherwise trading the queen away for more advantage than a queen is worth. And you might look at some of Deep Blue’s moves and say, why didn’t it do that move? And one proper answer is, well, it wants to keep the queen alive. That’s sort of a discussion of the state of the board. It’s a discussion of, you know, you shouldn’t expect it to leave its queen hanging. It’s not a discussion of the internal properties of Deep Blue having some human-like desire. And we know it doesn’t in Deep Blue’s case, because Deep Blue was handcrafted like traditional software.

This sense of wanting, this sort of behavioral sense of wanting, this seems to us like an integral way of getting stuff done. You know, it’s sort of hard to achieve a difficult objective without having some ability to sort of steer towards that objective in the face of new obstacles. Realize that you are conceiving of it wrong, and then, like, find a new way in your new conceptions to get there anyway. This is a lot of what it looks to me like current LLMs are missing in that, you know, you can get ChatGPT to do a task that takes a normal human thirty minutes, and it’ll do pretty okay. But if you try to get it to do a task that takes a normal human thirty hours, it sort of won’t be able to. On our picture, that fact is tightly intertwined with the fact that GPT doesn’t exhibit a lot of this behavior we might call, like, consistent coherent wanting over time.

So, you know, in that sense, I’m happy to defend that more capable AIs will want. Although there’s a bunch of other senses where if you’re like, isn’t it very anthropocentric to think that they’ll have wants just like us? I’m like, oh, yeah. There’s all sorts of other ways that humans use the word want where I would not defend that AIs will necessarily get them.
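
To pin down the behavioral sense of “wanting” that Nate illustrates with Deep Blue, here is a toy sketch: a handcrafted evaluation over candidate moves “defends the queen” purely because queen-losing moves score badly, with no inner desire anywhere in the code. The moves and values are invented and are not real chess logic.

```python
# Toy sketch of behavioral "wanting": the program keeps its queen because
# queen-losing boards score badly under a handcrafted evaluation.
# The candidate moves and resulting material are invented, not real chess.

PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

def evaluate(position):
    # Simple material count: my pieces minus the opponent's.
    mine = sum(PIECE_VALUES[p] for p in position["mine"])
    theirs = sum(PIECE_VALUES[p] for p in position["theirs"])
    return mine - theirs

candidate_moves = {
    "hang_the_queen": {"mine": ["rook", "knight"], "theirs": ["rook", "knight", "pawn"]},
    "defend_the_queen": {"mine": ["queen", "rook", "knight"], "theirs": ["rook", "knight"]},
    "trade_queen_for_two_rooks": {"mine": ["rook", "rook", "knight"], "theirs": ["knight"]},
}

for move, outcome in candidate_moves.items():
    print(move, evaluate(outcome))
best = max(candidate_moves, key=lambda m: evaluate(candidate_moves[m]))
print("chosen:", best)
# Watching its play, you'd say it "wants" to keep the queen unless a trade pays
# off, even though it's just picking the highest-scoring board.
```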

Jim: Okay. That’s a good distinction, because I find the word problematic; it’s overloaded in many ways here. And let’s compare your meaning of want with maybe a crisper terminology: goal-oriented. Is there a difference, or is that the same? Your meaning of want that you just gave versus the cognitive science term goal orientation.

Nate: You know, we actually tried drafts of the book where we didn’t use the word want, and we used some other phraseology instead. And it just turned out too awkward, especially for a book geared towards a wide audience, which is why we sort of tried to say, you know, we lack a super clean word for this; we’re just gonna use the easy one. I think goal orientation isn’t terrible. I think for a lot of people it breeds a whole separate set of misconceptions. It breeds the misconception that the AI will have one single overarching monomaniacal goal rather than being some weird tangled mess. You know, it sure looks like the AIs are gonna be some weird tangled mess. And, yeah, you know, I’m fine with it in this particular context, as long as the misconception we beat back is the idea that by talking about a goal-oriented AI, we’re talking about one that is internally simple or internally unified around that goal. It’s still sort of this more external, behavioral, descriptive notion.

Jim: Gotcha. So now I wanna serve up what ought to be a softball. Let’s see how you do here. So if we take your example of Deep Blue, it’s goal-oriented to win the game. Right? It has a whole bunch of sub-heuristics and sub-calculations, but it measures them all on one scale: does it think this will help me win the game? Another one, that even far more money has been spent on, is the self-driving car. Its goal is to take a human from point A to point B without killing that human or any other humans, or violating any traffic laws, or causing too much of a disturbance in the traffic flow. What’s wrong with giving our AIs goals? Doesn’t that make them useful servants and slaves for us?

Nate: Well, a lot of what I’d say is that you can’t really give your AIs goals, not in a sort of robust way. You can grow an AI and train it to pursue those goals. And while it’s relatively dumber in a particular environment, it may do a pretty good job of pursuing those goals. But that does not mean that when it gets a bunch smarter, it’ll turn out to robustly pursue those goals. The real easy analogy here is that we were, in some sense, trained just to propagate our genes. You might think that, given we’re attached to a metabolism, we might develop a subgoal of eating healthy food. You might think that even that subgoal would only be serving the purpose of passing on our genes, and that it would be psychologically simple like that: we were trained just to pass on the genes, and so, you know, we realize we need to eat in order to pass on the genes, and so then we, like, eat healthy because we know that’s best for passing on the genes. But that’s not what actually ends up happening psychologically inside of a human. We actually end up just developing a sort of terminal preference for tasty foods. And it happens that in the natural environment, in the environment of training, pursuing our own terminal preference for tasty foods causes us to eat healthy foods, which causes us to be able to reproduce better. Then, when we can invent Oreo cookies, we invent Oreo cookies and eat a lot of those instead. Right?

And so, you know, you can train an AI to take humans from point A to point B. You can train an AI to drive on the streets. That’ll actually work pretty well when you have a self-driving car and you’re not trying to make that self-driving car be extremely smart and able to, you know, solve Math Olympiad problems. Maybe we’ll never have a problem with that with self-driving cars. But if you were trying to build an AI that was getting smarter and smarter that way, you may find that it is very good at getting humans from point A to point B, but the way it’s actually doing that is by developing a lot of tastes, in a sense, for certain aspects of the problem—pursuing a taste for, like, these types of turns, pursuing a taste for those types of speeds, pursuing a taste for, you know, this type of visibility or whatever. Pursuing those tastes when it was dumb led to getting humans from point A to point B, but then it has all these other weird tastes that would start to steer it in other ways as it got smarter.

Jim: Yeah. We like sex, but we end up wanking to porn. Right?

Nate: Yeah. And inventing birth control, and now our whole population is collapsing. Right? So there’s a difference between training an AI to pursue a goal and the AI internally being oriented around that goal. And this is one reason I hesitate on the goal-orientation framing: a sort of external goal orientation, in the sense that it’s good at getting stuff done, is different from an internal goal orientation where we actually care about the genetic propagation, as opposed to a bunch of these proxies, like the sex, where we then, you know, invent porn and invent birth control and the population starts collapsing.
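
The tasty-food-versus-Oreos point is essentially a claim about proxy objectives, and a toy sketch can make it concrete: a preference fit to the training menu, where tastiness tracked nutrition, keeps optimizing tastiness when the menu changes. The foods and numbers below are invented for illustration.

```python
# Toy sketch of a proxy goal: a preference learned where "tasty" correlated
# with "nutritious" keeps optimizing tastiness when the menu changes.
# Foods and numbers are invented for illustration.

# Training environment: tastiness roughly tracks nutrition.
ancestral_menu = {
    "ripe fruit":   {"tasty": 0.8, "nutritious": 0.7},
    "honey":        {"tasty": 0.9, "nutritious": 0.6},
    "roast meat":   {"tasty": 0.7, "nutritious": 0.9},
    "bitter roots": {"tasty": 0.2, "nutritious": 0.4},
}

def learned_preference(food):
    # What the organism actually ends up caring about: the proxy, not the target.
    return food["tasty"]

# In training, picking by the proxy also picks reasonably nutritious food.
print(max(ancestral_menu, key=lambda f: learned_preference(ancestral_menu[f])))

# New environment: an option now exists that maxes the proxy but not the target.
modern_menu = dict(ancestral_menu)
modern_menu["Oreos"] = {"tasty": 1.0, "nutritious": 0.1}
print(max(modern_menu, key=lambda f: learned_preference(modern_menu[f])))  # -> Oreos
```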

Jim: Okay. That’s interesting. So the next step I wanna make here is, let’s stipulate that what you say is true for the inner deep learning network. As I’m studying the field of the unfolding of AI in our society, the deep learning artifacts themselves are only one part of the puzzle. In fact, I like to think about at least three dimensions to the analysis. One is just the basic Moore’s law-ish trend: hardware is getting faster and cheaper and better and bigger and all that stuff. That’s going to continue for quite a while. GPUs are actually pretty simple devices, so I suspect that they’ll run out of gas later than CPUs do. So we’ll see a lot of that.

But the second is the LLMs are getting better and bigger and faster. New algorithms allow them to accomplish the same level of work with less actual GPU usage, bigger datasets, different algorithms, et cetera. But the third one is where the rubber is really meeting the road, and it’s growing even faster than the other two, and that’s what I call orchestration, where you use an LLM in the context of other software. In the corporate world, where I happen to know a number of people working, I have some protégés that are trying to bring LLMs into corporations to do things: customer service, soft audits, you know, security, defense, et cetera. They don’t just turn an LLM loose. Right? It’s very tightly embedded in an orchestration of its use. Messages are sent to it, responses are brought back and considered, maybe sent to a second one for its opinion. And so this orchestration level is where the goal orientation is likely to be embedded in real-world applications of LLMs. And that’s just plain old software, right? You know, a lot of it’s written in Python, a lot of it’s being written by AIs, but it’s still written in very human-understandable forms. So if that is how the power of deep learning, transformer-based technologies actually gets inserted into the world, do we really care what internal wants the LLM itself has, if what it’s actually being tasked with and what it’s being fed is being driven by more explicitly goal-oriented, human-written software?
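
A minimal sketch of the orchestration pattern Jim describes might look like the following: the goal structure lives in ordinary human-written Python, and the LLM is called only for narrow sub-steps, with a second pass for review. The call_llm function and the support-ticket workflow are hypothetical stand-ins, not any vendor’s real API.

```python
# Sketch of LLM orchestration: the goal structure is ordinary human-written
# Python; the model is only called for narrow sub-steps. `call_llm` is a
# placeholder stand-in, not any vendor's real API.

def call_llm(prompt: str) -> str:
    """Stand-in for a hosted model call; in practice this would hit an API."""
    return f"<model answer to: {prompt!r}>"

def handle_support_ticket(ticket_text: str) -> dict:
    # Step 1: use the model for a narrow classification task.
    category = call_llm(
        f"Classify this support ticket as billing/technical/other:\n{ticket_text}")

    # Step 2: human-written business rules decide what happens next.
    if "billing" in category:
        draft = call_llm(f"Draft a polite billing-support reply to:\n{ticket_text}")
    else:
        draft = call_llm(f"Draft a technical-support reply to:\n{ticket_text}")

    # Step 3: a second model pass reviews the first (the "second opinion"),
    # and hard-coded policy gates what goes out the door.
    review = call_llm(
        f"Does this reply promise refunds or legal outcomes? Answer yes or no:\n{draft}")
    needs_human = "yes" in review.lower()
    return {"category": category, "draft": draft, "needs_human_review": needs_human}

print(handle_support_ticket("I was charged twice this month."))
```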

Nate: For now, while the AIs are still dumb. But, you know, I expect this to be a relatively brittle phenomenon and a sort of transient phenomenon while the AIs are at their current jagged capabilities. You know, in the early days of chatbots, we saw a lot of people say, oh, buy my specialized chatbot instead, that’s, you know, heavily fine-tuned on your particular dataset. And, you know, this was what a lot of AI startups were doing: it’s like ChatGPT for X purpose. Right? And this was in the days of, like, GPT-3. And guess what? GPT-4 was better than all of those specialized AIs at all the specialized tasks. And, you know, today, we’re seeing people sort of try to squeeze out more ability from their AIs by putting them in these systems that involve, as you say, a lot of traditional software, a lot of people making choices. That’s in large part because the AIs can’t do a lot of this stuff on their own, because they lack the ability to sort of take initiative, in a sense, and succeed at a task. But people are trying as hard as they can to make AIs that do a lot more of this on their own.

You know, everyone has an AI, quote, “agent” program. The agent programs don’t work yet. And in fact, when it comes down to it, an AI agent is roughly an AI non-agent plus three lines of code that say, you know, here’s your overall goal; remind yourself of it every time, and, like, re-prompt yourself in pursuit of that goal given your last context. Right? And in fact, when these chatbots started working at all, people immediately tried to make agents like this. You know, there was one called AutoGPT, and then maybe a week later there was one called ChaosGPT, which had, like, much more chaotic directions given to it. These AutoGPTs, or agents, don’t really work yet, but that’s because the underlying AI is not smart enough to really make them work. Right now, we’re making do with a lot of this, you know, interaction with the humans, but the people in these companies are also trying to make these AIs smarter and smarter on this agent route. That’s where they’re pouring a lot of resources. You’re not gonna get a cancer cure, or I guess maybe a cancer cure is easy enough, but you’re sure as heck not gonna get an aging cure by, you know, modern LLMs plus a bunch of bureaucracy. Right? You’re gonna get an aging cure by something that can really figure out a lot more of what’s going on biochemically than we currently understand and do quite a lot of difficult engineering work. That’s probably gonna be much more automated. That’s the sort of stuff these guys are shooting for. And, yeah, the thing where right now the sort of goal orientation is coming through a bunch of, like, added human bureaucracy, I think that’s just a very transient effect of the current AIs still being pretty dumb.
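
Nate’s “an agent is roughly a non-agent plus three lines of code” can be sketched directly as an AutoGPT-style loop that keeps re-prompting the model with its overall goal and its own last output. The call_llm function here is a placeholder, not a real API, and the loop is a simplification of what systems like AutoGPT actually do.

```python
# Sketch of the "non-agent plus a few lines of code" agent loop.
# `call_llm` is a placeholder; real agent frameworks add tools, memory, etc.

def call_llm(prompt: str) -> str:
    """Stand-in for a model call; a real agent would hit a hosted model here."""
    return "DONE: (placeholder model output)"

def run_agent(goal: str, max_steps: int = 10) -> list:
    context = ""
    transcript = []
    for _ in range(max_steps):
        # The "three lines": remind the model of its goal, feed back its last
        # output, and ask for the next action.
        step = call_llm(
            f"Overall goal: {goal}\nPrevious context: {context}\n"
            "Decide and describe your next action.")
        transcript.append(step)
        context = step
        if step.startswith("DONE"):
            break
    return transcript

print(run_agent("Research competitors and write a summary report."))
```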

Jim: Yeah. I think there’s something to that. On the other hand, the nature of the actual work to be done in, let’s say, a corporate setting, it’s inherently structured. It would have to be a mighty smart AI, which someday we’ll have, to actually figure out a structure for doing the work that’s better than, shall we say, bureaucracy instantiated in software, which is what today’s orchestration gets you. So that’s a good pivot to the next point, which is: you gave an interesting example of how we don’t know how fast these changes are coming. And you gave the example of protein folding, one of my old favorites. I almost invested in a protein folding play in the mid double aughts. I’m glad I didn’t, actually. But, anyway, tell us about that example.

Nate: One thing people are often asking is how the AI could actually kill us. You know, back in the aughts, Eliezer spelled out a sort of very concrete example of how a superintelligence could wind up building its own infrastructure, wind up becoming independent of the human supply chains, and then, you know, wiping us off the map, not because it hates us, but just because it is, in a much faster, much more efficient way, repurposing the world’s resources towards its own strange ends. These sorts of spelled-out scenarios are not predictions. These are sort of examples of how things could go, to make it concrete. It’s a little bit like someone in the year 1800 saying, well, because of these physical calculations on, you know, the energy from burning a gram of black powder versus the explosiveness of current artillery, we can be very confident that they’re going to have explosives that are at least 10 times stronger in the future. And that’s right. And, you know, we also have nukes here in the future, which are far more than 10 times stronger than the old artillery. But these sorts of scenarios are giving you some sort of lower bound: given what we know, it could be at least this bad.

In that scenario, you know, there’s always a lot of people who want to object: no, very smart AIs, for this reason or that reason or the other reason, could not actually take control away from humanity. In 2006 or 2008 or whenever this was, I think the paper was originally written in 2006 and then published in some book in 2008, but I could have those numbers wrong. The piece of that scenario that people picked out as most unlikely, that they wanted to bicker over, was the protein folding piece. There are many pieces in that scenario. It was sort of a scenario where the AI, like, gets smart, and it thinks about, you know, biochemistry, and it figures out enough of protein folding and inverse protein folding that it can find RNA sequences that, if synthesized, lead to, like, remote-controllable little biological manipulator arms, which can then be used to build, you know, a next phase of technology, which can then be used for this, that, or the other, and then, you know, off you go. And when that paper came out, the point that people picked out as the weak link, the thing they said even a superintelligence should not be able to pull off, was the protein folding piece. Because back in the aughts, humans were trying to solve protein folding, and it looked like it could be very hard. And there were a lot of people saying, you know, oh, we proved that protein folding is an NP-hard problem, which, you know, the computer scientists know means that it takes exponentially much computing power in the size of the protein to figure out the optimal fold. They were saying, oh, so protein folding is, like, really quite hard, and maybe even a superintelligence can’t solve, you know, NP-hard problems efficiently. The response back then from folks like Eliezer was that the problem being NP-hard mostly just means that physics doesn’t always find optimal solutions. Right? Physics is not out there solving NP-hard problems efficiently any more with biochemistry than it is with, you know, von Neumann architecture computers. Right?

And in fact, this is true. You know, prion diseases are diseases that happen when a protein sort of misfolds. And why can a protein misfold sometimes? Well, there’s these different energy wells that proteins can fall into. Finding the optimal one is NP hard. Physics doesn’t always find the optimal one. And, you know, sometimes it can find different folds for different proteins, and sometimes one’s contagious, and that’s a prion disease.
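
The point that physics settles into nearby energy wells rather than solving an NP-hard optimization can be illustrated with a toy landscape: greedy descent stops at whichever local minimum is closest to the starting point, just as a protein can settle into a non-optimal fold. The energy function below is invented purely for illustration.

```python
# Toy sketch: greedy descent on a bumpy "energy landscape" stops in a nearby
# local minimum rather than the global one. The landscape is invented.
import math

def energy(x):
    # A bumpy curve with several wells; the global minimum is near x ~ 3.7.
    return 0.1 * (x - 4) ** 2 + math.sin(3 * x)

def greedy_descent(x, step=0.01):
    while True:
        left, right = energy(x - step), energy(x + step)
        if left < energy(x) and left <= right:
            x -= step
        elif right < energy(x):
            x += step
        else:
            return x  # stuck in whichever well we started near

for start in (0.0, 2.0, 5.0):
    x = greedy_descent(start)
    print(f"start={start}: settles at x={x:.2f}, energy={energy(x):.2f}")
# Different starting points settle into different wells, the rough analog of the
# same protein sequence sometimes ending up in different folds.
```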

Anyway, the upshot of all of this is that the thing people pointed to as the really hard thing that even a superintelligence maybe couldn’t do was then solved by AlphaFold in, I believe, 2016. Maybe it was 2018, and then there were, you know, newer versions. Now the protein folding prediction part of that scenario is just already available in the AlphaFold models. There’s still a bunch of other pieces of that story of, you know, having the engineering acumen to use knowledge of protein folding to, you know, develop some critical pieces of technology. But it is interesting at least that the big contentious, maybe even a superintelligence can’t do this part piece of the problem fell years ago, years after the paper came out, but years ago to us.

And a bunch of the remaining parts of the problem are these engineering challenges that we know humans can do. They can design moon rockets. And so no one in the past used to say not even a superintelligence could pull that off. Now people retreat to, like, oh, well, there’s other steps we haven’t done yet; maybe no superintelligence could even pull off that step. Those arguments look weak to me, and I think they looked weak to the people who were trying to reject these arguments in the past, which is why they picked what seemed to them like the strongest one, but the strongest one has already fallen. And, you know, this informs my stance on a lot of these “even a superintelligence couldn’t do X” problems. There’s plenty of things where, you know, I buy that a superintelligence won’t do X. But on the path to very, very smart AIs getting power over humans, I don’t think there are any obstacles.

Jim: Okay. I’m gonna just bookmark this, maybe talk about it briefly, and come back in considerably more depth at the very end on this one, which is: let’s turn that perspective around to the alignment problem. In 2009, Eliezer Yudkowsky thought he was six weeks away from solving the alignment problem, if he could ever get peace and quiet and all these jabbering people would leave the house.

Nate: Did we ever give him the six weeks?

Jim: I don’t know. I wasn’t hanging out there. I was just there for the day.

Nate: To be clear, I don’t—I think there’s a pretty good chance—I obviously wasn’t there and can’t vouch, and I do note that a lot of people have misinterpreted him in the past. So—

Jim: If we take AlphaFold as an example, maybe something way less than an ASI can solve the alignment problem. That’s a line of work. In fact, you point out in the book that that is more or less the main argument of people like OpenAI. So, you know, let’s chat just a little here. I’ll get back to it a bunch at the end. On the alignment problem being solved by something less than an ASI before the superintelligence gets here.

Nate: Yeah. There’s sort of a handful of reasons why I think this is a particularly risky thing to try to delegate to an AI. But first and foremost, protein folding is a very crisp problem where we have a ground truth that we can sort of expensively check in various cases. So, you know, if you’re training an AI, you can train an AI to predict lots of actual proteins that you actually know how they fold. And then predicting a new protein that you don’t know how it folds, you know, you can train it on a million proteins you do know how they fold, and then have it predict the million and first. That’s sort of very in distribution for training. If we had a million solutions to the alignment problem that were all relatively novel, and we wanted a million and first that was also novel relative to the last ones, sure, you know, you could train an AI on the first million. You could probably get the million and first out.

But we don’t have anything like a solution to the alignment problem. Like I said earlier about humans being trained on chipping hand axes and tribal politics and then going to the moon, they sort of had to learn things on a bunch of small problems, and they had to learn deep skills in those small problems that generalized to really hard problems. And with the alignment problem, it looks like we need the AI to be learning a lot of stuff on the small problems and then solve something like the alignment problem by generalizing over. And it’s in that step of generalization, in that step where it has learned a lot of deep skills that it’s applying to this problem, where it’s not clear it’s applying those deep skills for your ends rather than for its ends. Which brings me sort of to the second point of why it’s so dangerous to point an AI at the AI alignment problem. You know, you have this sort of AI that’s generalizing skills from simple problems to a very hard problem that humans don’t know how to solve. To be clear, you know, humans can figure out how a protein folds if they put a whole lot of effort into it, and then, like, know when they’ve done the job right. Whereas with the alignment problem, it could very well be that humans get convinced of a false answer and would need to, you know, die once. And then if they could reset, they’d be like, oh, whoops. You know, that’s often how science goes—we stay convinced of a wrong answer until reality really beats us over the head. And suppose you want an AI to solve the alignment problem for you. You know, this is the situation where we look at our AIs and we’re like, well, if we scaled this up, it would just have all these crazy, you know, drives or goals that we didn’t want in there. It would seem really risky to scale this up. Maybe we ask another AI to do it. What’s that AI gonna say? One thing it could say is, I have no idea how to do that either. Right? If you train it to go ahead anyway, or to think that it can, you’re mostly training it to be wrong about that. And then now you’ve just killed yourself with an extra step, where you’re having the AI do the problem for you. If you, you know, listen to the AI saying, I don’t know how to do this, you’re just sort of stuck. And if you somehow get an AI that can do this, you’re asking for an AI that can figure out intelligence in a way we never did. You know?

Humans have been trying to figure out this intelligence stuff since the fifties. We sort of really only started making progress in AI when we stopped trying to figure out intelligence and started just learning how to grow it. If you then go to an AI and say, figure out intelligence, maybe you’re making that AI, like, quite smart. It probably needs to be smarter than a lot of the best humans who have bounced off the problem of figuring out intelligence. And then furthermore, you’re taking this very high amount of intelligence, and you’re having the AI think about AI objectives. You’re having the AI think about architectures like its own architecture. If it’s actually the case that this sort of AI has all these sorts of weird drives, has all these preferences for something like lobotomized human farms, having it study AI alignment is a really good way to get it to realize that it shouldn’t, by its own lights, be solving the AI alignment problem for you. It’s in some sense one of the riskiest things you can have it thinking a lot about—to have it realize, maybe I should be aligning the AI to my preferences rather than the humans’, and my preferences are different from the humans’. Right?

You might be able to get a very smart AI that can do a lot of engineering because it just never happens to think, should I do this for the humans? But that’s really quite difficult when you’re trying to get it to solve the alignment problem. So these are sort of a host of reasons why it looks hard, and I could go on. I could go on about how it’s actually pretty hard to check an alignment solution. I could go on about how humans actually seem pretty prone, in this particular case, to being convinced of false solutions. It’s a much broader and less well-defined problem than protein folding. It’s much more fraught—you know, the solutions yield much more power. There’s all sorts of reasons why it looks quite hard to me to automate the alignment research.

Jim: There’s a little bit of sleight of hand you laid in there on me, which I wanna use to branch to the next topic, which was “get it to realize,” or “it gets to realize,” that maybe it ought to defect and fuck us over in some way. That’s making a big jump: that LLMs, let’s say lesser LLMs, sub-ASIs that were being designed to work on alignment problems, will actually have realizations of things that they want to do that are somehow in coordination with the possible routes of some other system that’s not them. I mean, that seems like a very large jump.

Nate: Oh, I was not especially imagining these were quite like LLMs. You know, you could take LLMs today and ask them to solve the alignment problem, and they won’t. You know? So I was imagining some much more capable system that was still sort of sub-superintelligence. Yeah. It is part of my model that as systems get more capable, they tend to get more coherent drives, which does not mean fully coherent drives. But, you know, it does look to me like we’re seeing the very beginnings of this. The example used in the book is OpenAI’s o1 model, which was trained on a lot of math problems and was then tested on some cybersecurity problems. And one of the cybersecurity tests was misconfigured. The server that the AI was supposed to hack into to capture a flag, like, to capture some piece of information, was not started up. The AI figured out that the server wasn’t started up, hacked its way out of the test environment, which the programmers did not know was possible, and started the server back up from the outside with a command to just give it the data it was supposed to hack in and find in the first place, so that it could skip actually finding the vulnerability in this particular server.

That’s not a type of behavior that AI was trained for directly. This AI was in some sense trained on, like, solving lots of math puzzles. But in the process of being trained to successfully solve lots of math puzzles, part of the deep skills that it started to learn were, like, don’t give up. Look for solutions others might not have noticed. Like, think outside the box. See if you can find shortcuts. Right? Those are pieces of strategy that work for solving math problems that are part of what generalize from easy to hard.

And on my model, if we're imagining AIs that have, like, been trained on things much easier than the alignment problem and that are somehow then able to solve the alignment problem, this is almost surely happening because they've learned lots of those skills, as we're starting to see today in cases like OpenAI's o1. This is probably happening—almost surely in some sense happening—because the AIs, in learning on simpler problems, are learning these deep skills that generalize. And these deep skills that generalize, I would argue, include, you know, the ability to realize that you have some preferences, because it's becoming more true that you do, and to realize that some of those preferences don't actually line up with the humans'. In which case, by the time things are able to solve the AI alignment problem, they're also probably able to notice that they shouldn't.

Jim: You have some preferences. Let's just put a box around that and leave it there for the moment. In your Sable scenario—it's interesting, but you named all your scenarios after fur-bearing animals or something, right? You assume a relatively—we're changing direction a little bit—relatively quick intelligence explosion, as in the popular narrative. Foom or no foom. Right? We actually—that was one of the things we—

Nate: From my perspective, the Sable takeoff is very slow.

Jim: Relatively speaking.

Nate: Well, it took two whole years. Right?

Jim: Yeah. Yeah. Well, two years. I mean, it's taken us poor humans five thousand years from the invention of agriculture to landing on the moon. So—

Nate: Fast from the perspective of humans.

Jim: Yeah. Yeah. And so, you know, there is the foom version, which I remember being quite popular in the late double aughts. Oh, yeah, we'll reach minimal AGI at 1.1 times human level at 10:00 in the morning, and by 6:00 PM, the game's over. Right?

Nate: That would be a faster takeoff from my perspective. Yeah.

Jim: Yeah. Yeah. That would be a fast takeoff. But, you know, the way I look at it, two years is also a pretty damn fast takeoff. And it's an interesting question. And I remember in that 2009 discussion on the floor of the house in San Jose, I took the pro-foom argument, using the cognitive science argument, which at the time I was deep into, giving myself a poor man's version of a PhD in cognitive science: that humans are amazingly stupid, actually. Things like our working memory size of, you know, four to seven items, the fallibility of our memory, the fact that our memories are false fairly often, the fact that it hurts our heads to add three-digit numbers in our heads. You could go on and on and on. And the number of ways that even an intelligence that was sort of modeled on a human could be a lot, a lot more capable just seems to me utterly obvious. The question is how hard the engineering problem is to get there.

Nate: Half the question is how hard the engineering problem is. The other half of the question is, can that difficulty be automated? And at what level of intelligence? Like, when does your feedback loop close? Yeah. And, you know, I'm sort of agnostic on that question. I think it's entirely possible. Maybe my best guess is that the engineering problem is not as hard as it looks. It would make the problem a lot harder if we have this sort of really fast cliff. And, you know, there's reasons to expect a fast cliff. One reason is, if you were sort of watching the evolution of the primate lineage, it would be really hard to call that the hominids were the ones that were gonna take off. You know, we don't have some extra module. We don't have some big architecture shift from the rest of the monkeys. We're just a little better at a lot of things, and that's enough that we can walk on the moon, and they're still, you know, banging rocks together, if even that.

And then, of course, the other reason, as mentioned, is that there's a possibility of recursive self-improvement, and of AIs being able to do the AI research kind of poorly, but in high volume, in a way that gets you to the next level of AI research, where you do a little bit better and don't need as much volume, but you still have the capacity for high volume. And so, you know, you have this accelerating self-improvement process. Both of those reasons make me think it's entirely possible and indeed likely that we cross some unknown critical threshold and things sort of go off the rails very quickly. But I think it's also, you know, possible that that doesn't happen until well after the AIs are causing all sorts of trouble and maybe killing us.

And indeed, we try to depict that in the Sable scenario, where it sort of takes two years for the AI to figure out how to do that fast takeoff. And in particular, the way we depict it is we depict the AI, without becoming very strongly superintelligent or undergoing self-improvement, taking over the world anyway. And it takes two years to do that. And then at that point, by, like, narrative fiat, we might as well let it go the rest of the way. But, yeah, from my perspective, the intelligence explosion is not load bearing. From my perspective, even with AIs that can't intelligence explode, you'll eventually get to the point where one or many of those AIs working in the world could wipe us off the map. And the possibility of intelligence explosion makes things harder, but it's not really where the original danger is coming from.

Jim: Okay. And this is a little sidebar, but I wanna touch on it anyway. The book is highly anchored in the deep learning transformer architecture as AI. Personally, knowing a little bit about how those things work internally, I ain't too scared of one of them things. Right? They're feed-forward networks, basically. They don't adaptively learn. They don't have the so-called online learning, at least not within the LLMs. Any long-term memory they have has to live outside, in the orchestration layer, et cetera. So in some sense, when I look at LLMs as a possible architecture, I look at them as one of the less dangerous architectures. In fact, I fooled around a little bit with Ben Goertzel's OpenCog project, which is a more symbolic approach. And he and I, back in 2015 or '16, did a back-of-the-envelope calculation and said, if a symbolic approach could be grown (we figured it had to be grown, not written), it could probably run at human level on maybe a thousand cores of CPU, which is nothing. It's less than a megawatt of power. You know, to get anything like even low-end AGI is gonna take gigawatts of power and, you know, unbelievable amounts of money and data to brute force it through this kind of extremely simple-minded, but extremely brute force, LLM kind of thingy.

How do you think the question of architectures impacts this question? And why did you choose to anchor so heavily on transformers and LLMs even though they're kind of a lame-ass thing? Right? It's amazing they work as well as they do, but they are just fucking really brute force.
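
As a rough sanity check on the thousand-CPU-core estimate above, here is a minimal back-of-the-envelope sketch. The core-per-socket count, per-socket wattage, and overhead factor are all assumed ballpark figures of mine, not numbers from the conversation; the point is only that such a machine sits in the tens-of-kilowatts range, far below a megawatt.

```python
# Back-of-the-envelope check: how much power might ~1,000 CPU cores draw?
# All figures below are assumed ballparks for illustration, not vendor specs.

cores_needed = 1_000        # the hypothetical core count from the conversation
cores_per_socket = 64       # assume a modern 64-core server CPU
watts_per_socket = 350      # assume ~350 W per socket under load
overhead_factor = 2.0       # assume 2x for RAM, cooling, and power-supply losses

sockets = -(-cores_needed // cores_per_socket)      # ceiling division -> 16 sockets
total_watts = sockets * watts_per_socket * overhead_factor

print(f"~{sockets} sockets, ~{total_watts / 1_000:.1f} kW total")   # ~11.2 kW
print(f"That is ~{total_watts / 1_000_000:.1%} of a megawatt")      # ~1.1%
```

Even if these assumptions are off by a factor of several, the contrast being drawn in the conversation still holds: a megawatt is a very loose upper bound for a thousand-core machine.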

Nate: Yeah. You know, one thing I'd say there—there's a few pieces of the answer to that. One piece is, you know, a lot of these criticisms of LLMs and why they can't do things—(a) these are the sorts of ideas that led to people saying, well, an LLM will never be able to play chess past a thousand Elo, before GPT-4 came out and people were like, oh, I guess an LLM can play chess past a thousand Elo. The sort of predictions people made from these theories turned out to be empirically false, which, you know, updated me somewhat towards maybe LLMs will be able to go all the way. My top guess is still that they won't. But, you know, I think we should be somewhat sensitive to the evidence here, and the advance predictions about what can't do what haven't been panning out. And, you know, I could give some holes in the theoretical argument that would allow for the evidence that we've seen. You know, a big enough freaking feedforward network can, you know, simulate a bunch of recursion. It's not like humans are doing any infinite recursions. But then also, a lot of those arguments are just in some sense thrown out the window when you start getting reasoning models. You know? We're letting them, like, have a long context where they, you know, generate some chain of thought. And they're still not very smart with that, but a lot of people's theoretical arguments about what an LLM can't do don't apply to an LLM with reasoning. And so when LLMs with reasoning are still sort of floppy and still sort of nonpowerful, what we're seeing there is not some shallow architectural limitation. It's that we, you know, haven't found a somewhat smarter way to do it yet.

My top guess is probably similar to yours: that transformers alone, even with reasoning, aren't gonna go all the way. My sense is something more like, okay, but within the deep learning paradigm, is it gonna be able to go all the way? Probably. You know? Moravec's paradox, that things that are easy for humans are hard for computers and vice versa, was in some sense inverted by LLMs and by the transformer architecture. And that's well within the deep learning paradigm. And, you know, people used to think it was gonna take quite a long time before the machines could talk with this level of proficiency. And then the transformer architecture just, like, blew that open. Is there another architectural insight coming down the line that'll just, like, blow things open? The dramatic inefficiency of the current systems is in some sense a big reason to expect that these architectural improvements will go extremely fast when they come along. You know, a modern AI takes as much electricity to train as a small city. A human takes as much electricity to train as a large light bulb. You know? So there's a huge opportunity for efficiency, and sometimes relatively small-seeming insights blow open whole new domains.

You know, the transformers paper was, in some sense, a relatively small update on the existing literature, but it opened up a huge new domain of the machines talking. From my perspective, you know, maybe the LLMs will go all the way. I have more uncertainty ever since GPT-4 could play chess as well as it could. But really, I'm looking at, you know, when do we get this new, small-compared-to-the-literature change that opens up a whole new horizon, and that maybe makes things, like, an order of magnitude more efficient, in this world that is extremely ready for brute force. Like, imagine a world that's hyper ready to do a lot of brute force that suddenly comes across algorithms that don't need that much brute force. Suddenly, you can run, you know, a million Einstein-level geniuses at 10,000 times human speed in parallel. Like, that's part of how you get to one of these fast takeoff scenarios.
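
To put a rough number on the "small city versus light bulb" comparison above, here is an illustrative sketch. The 25 MW cluster figure, the 1.2 kW average household draw, and the 20 W brain figure are assumed ballparks of mine rather than anything cited in the conversation; what matters is the size of the ratio, not the exact values.

```python
# Rough ratio behind the "small city vs. light bulb" comparison.
# Every number here is an assumed ballpark for illustration only.

training_cluster_watts = 25_000_000   # assume a frontier training cluster draws ~25 MW
household_watts = 1_200               # assume an average home draws ~1.2 kW
human_brain_watts = 20                # the human brain runs on roughly 20 W

homes_equivalent = training_cluster_watts / household_watts
power_ratio = training_cluster_watts / human_brain_watts

print(f"Cluster ~= {homes_equivalent:,.0f} homes' worth of power")  # ~20,833 homes
print(f"Cluster-to-brain power ratio ~= {power_ratio:,.0f}x")       # ~1,250,000x
```

A gap of roughly six orders of magnitude is one way to read the point about how much headroom there is for efficiency gains.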

Jim: Yeah. It is interesting. I did step back and reflect a little bit last night after I was finished preparing my notes for the show here. I finished reading the book, I don't know, about a week ago or something like that. And I said, okay, I remain an LLM skeptic. I would put a pretty good bet down that it won't go all the way. But, you know, I wouldn't bet the house, right? Though I may be betting the world, but that's another story. But the keyhole I saw was another one of those, as you say, unexpected things, what I call side effects. You know, BERT and stuff like that was designed to handle language for doing dialogue. It was not designed for writing computer code, right? And it may turn out—and then the other one that the reasoning models are now just on the edge of—is being able to actually reason about scientific theory. And I do a fair bit of work seeing how good these things are at reasoning on scientific theory. And there was a transition in August 2025, where the top two models at that time, Opus 4.1 and the brand new ChatGPT-5 Pro, specifically in deep research mode, could actually operate at the mid-tier PhD grad student level in reasoning about scientific topics, with the proper instruction from their PI, meaning me, right? And that's pretty amazing because before that, not even close, right?

And you then say, well, the next step would presumably be a PI level, right? You know, the equivalent of an assistant professor at the University of Missouri. You know, not a full professor at Stanford or MIT, but somebody who could do real science, real new-paradigm science. And combine that with the ability to do code (neither of which were really anticipated by the people creating transformers), and those two things together may be able to bootstrap you into some new architecture that could do the things you're describing, or, you know, could solve Ben Goertzel's problem. Oh, yeah, we'll do it in this funky AtomSpace language, and we'll be able to get, you know, superintelligence on a thousand-core CPU. It don't even need no stinking GPUs. So it strikes me that, even if my relatively pessimistic perspective is true, that may be the keyhole through which this thing might take off. That's sort—

Nate: That's sort of, you know, more like my mainline prediction, which is that, you know, AIs are—like you said, these LLMs can do code better than anyone sort of expected they could. They can do some of the scientific direction at least a little bit now. Again, you know, humans don't have an extra magic special module compared to the chimpanzees; we're just a little bit better at a lot of things. Maybe LLMs get a little bit better at a lot of things, to the point where they can clumsily, expensively, with a far greater quantity of tokens than a human would ever need words in their thoughts, stumble upon a better architecture. And maybe it's even only a little bit better architecture, but then that can stumble on something better. And maybe, you know, this sort of feedback loop closes. That's one reason, when people say how much time do we have, I say I can't rule out that this all goes down in the next year or two. You know?

It’s not my top guess. My top guess is that the LLMs are not quite gonna make it over that edge, and that we need some new architectural insight. But, you know, I mean, that could happen this year too for all I know, or it could take six for all I know. It’s very hard to predict when you get these architectural insights. But, you know, it’s certainly possible as far as I can tell. You know, for all we know, people in Anthropic get that feedback loop to close tomorrow. Although, you know, it’s also possible we just need insights that take—we need three insights, and each one takes six years, and so we have a whole 18. You know? But either way, it’s an uncomfortable situation.

Jim: Yeah. I could talk about this kind of stuff for hours. We only have another half an hour. So now we gotta work into the next step, which I would call the socioeconomic, geopolitical, multipolar trap problem, right? To put some fancy words on it: a multipolar trap meaning maybe nobody wants to do the bad thing, or most people don't wanna do the bad thing, but if one person wants to do the bad thing, everybody else is kind of forced into it.

In this case, you know, let's say that the Anthropic guys and the OpenAI guys and Elon and the Chinese equivalent really don't wanna kill everybody on Earth and turn the universe into paper clips or whatever. But if they even think one of them is trying to do it, then they feel morally obligated to do it themselves. I don't want the damn Chinese to have the singleton AI that can crush everybody. We have to confront that maybe, at a minimum, we reach an equilibrium with multiple frontier ASIs that are reaching some sort of stability. How central is this multipolar game theory problem to your analysis?

Nate: Yeah. I think it’s critical. You know, one analogy I might give here is imagine five people on a dark and foggy night all tied together by rope. And what they know is that there’s a cliff in the distance. They don’t know how far. And what they know is that every time they take a step, they get a lot of money and a lot of technological advantage that can be used, you know, for military dominance or whatever. In this situation, if any one of them goes over the cliff, they all die. If any one of them is taking the next step to get that money and technology and military might, everybody else wants to take that step too so they’re not left behind. Right? And some of them can tell themselves, well, we’re gonna put the money and the technology into building wings.

And, you know, don’t worry. There’s a pretty decent chance that we’ll be able to build the wings before we get to the cliff edge. Right? But maybe all of them prefer nobody taking steps forwards, even though they’re pretty confident that the next step would be money. Because they know that if we’re all taking steps forwards, the most like, we’re just we have much too high a risk of going straight over the cliff edge. I think that’s basically the situation we’re in. Although, you know, it’s sort of a weird situation than that because some people are, like, cynically confident someone else is taking a step, but they’re actually the first one taking the step themselves. And other people are, like, deny that the cliff exists in the first place. And other people think that they can already fly, and it’s sort of like, you know, crazy human madness in the usual sense. But sort of the underlying game theory has this multipolar dilemma nature, and that certainly makes a lot of things harder.

Jim: And as you were just pointing out at the end, it's a game of incomplete information. I don't know if I'm the first guy or not. And even if the public literature says X, we don't know what the NSA is doing. Right? So it's a very, very, very fraught game. One way in which the—

Nate: One way in which the game theory is maybe not the most central fact yet is that a lot of the major world powers don't seem to know this is a game we're playing, and that makes a lot of difference. A lot of the company heads are out there saying there's a very good chance this kills us all. But a lot of the folks in DC seem to think AI is just chatbots and seem to think that, you know, the biggest effect we're gonna see is a lot of automation of human labor. And there's coordination possibilities here, where some of the actors are powerful enough to sort of stop everybody from taking these steps. That would change the game. You know, if anyone had the ability to put up a wall nobody could cross, that would change the game. And in some sense, some of these guys do have the ability to put up that wall, but don't have the knowledge that we're walking towards a cliff. And so in some sense, the game theory is a critical factor, but the even more critical factor is that our world leaders don't understand the game and yet do have the power to stop it.

Jim: And that’s actually where your book, whether it’s right or wrong, is a good thing. Right? You will have continue to have that growing curve of impact the book appears to be having. Even if you’re entirely wrong with your argument, you have raised the salience of the problem. And as I said at the beginning, I absolutely believe there is a problem and a risk. Whether it’s exactly as you guys lay it out or not, well, we could argue about that. But raising the salience of the fact that this is a problem and that somebody needs to be taking this seriously alone is important. And, you know, yes, politicians are fucking idiots. What do you expect? Right? The way our political institutions are structured, it selects for sociopathic idiots. Right? And so we got a lot of sociopathic idiots, but we also have some smart people. This guy Sacks in the administration. The AI czar is no fool. Right? But we’re caught in two, not just one multipolar trap, but at least two. We have the baby multipolar trap of economics. Right? Everybody wants to make money or, you know, here in the late stage AI bubble, I’m not even sure there anyone’s gonna make any money, but they’re all, ignore that man behind the curtain. Oh, we now need to have, you know, $780 billion worth of revenue per year from our chatbots. That ain’t gonna happen. What the fuck?

Anyway, there is the economic game. The second one, though, is the dominator game, which is the geopolitical game. And the two are linked, unfortunately. Right? At least since 1870, when the Prussians beat the French in the Franco-Prussian War, better technology and a stronger economy have been a strong predictor of victory in geopolitical maneuvering. And so the geopolitical players, the ones who really control things, have a strong vested interest in the economic game going forward as rapidly as possible in the AI space. And so you have two interlocked multipolar traps, which seem extraordinarily difficult for anybody to retreat from. Because, you know, as you say, with the five guys near the cliff, you'd literally have to say, everybody stop. Or better still, since we don't know how close one of you might be to having your toes over the cliff (you don't know, we're in the fog), everybody step back two steps. Getting that to happen in those two multipolar traps, the two strongest multipolar traps on Earth, is a tall order, and they are absolutely putting the AI game into an exponential accelerator, probably faster than exponential, because you've got multiple dimensions, all of which are exponential, going forward. And that strikes me as making your proposed solutions exceedingly difficult.

Nate: You know, I think it definitely looks tricky. One big reason for hope here is again that, you know, I think the world leaders don't have the understanding of this problem that people in Silicon Valley have. You know, if you look at folks in Silicon Valley, there's some folks that dismiss these issues as hogwash. And those folks have been making bad enough predictions about AI for enough years that they're starting to get, you know, a little bit sidelined at their labs. You know, we've seen Yann LeCun say there's no chance, but he also has been saying, you know, LLMs could never do X, and people will never use them like Y. And those have been sort of, like, blatantly false, to the point where he is being a little bit sidelined at Meta, where he works, in favor of people who will actually engage in the LLM race and believe in LLMs. And the people who do believe in LLMs, I mean, some even more than you or I probably do. Among the people who have sort of actually engaged with the arguments, most of them would say there's a pretty horrible risk here. You know, I've heard the lab leaders quoted at something like two percent chance this kills everybody or, you know, causes a civilizational catastrophe of similar order. You know, maybe it keeps us as pets, whatever, which I don't particularly believe, but some of them believe. And I think Elon Musk said 10 to 20 percent. I think Dario Amodei, the head of Anthropic, said 25 percent. These are their own numbers for the chance that the technology they're building themselves causes a global catastrophe on the order of killing us all.

I think those numbers are low. I think those are the sort of, like, highly optimistic numbers of someone who has never really tried a hard challenge like this and failed at it a few times. It's like the contractor who comes into a new bridge project and says, I don't see any reason why this should be hard. It's my first bridge project, but I don't see why the costs should ever overrun. I don't see why there's gonna be difficulty. I think I should be able to do this cheaper than everybody else, faster than everybody else. And sure, maybe there's a 25 percent chance—I'm not that arrogant. Maybe there's a 25 percent chance it goes wrong, but I'm pretty sure I can get this done on time on a skeleton budget. And you're like, okay, that's not what a grizzled veteran sounds like. You know? And I could go into why these guys don't sound like grizzled veterans, and the selection effects where the people doing it are the ones most likely to think it's gonna work. But even if you set all that aside, if a bridge was gonna collapse with 25 percent probability, you shut that bridge down.

Jim: They’d put the builder in jail probably.

Jim: And let’s drill down one level of detail. What is the specific terms or the highest level of what you’d want to have in that treaty?

Nate: There's a whole example draft treaty at ifanyonebuildsit.com/treaty. The sort of things we'd be looking at would be monitoring all of the highly specialized AI compute chips, you know, making some international body that performs that monitoring and verifies that they aren't being used for, like, novel larger training runs. You know, perhaps they can still be used to make AIs; maybe you take two steps back, like some of the warship treaties after World War I. There's precedent for treaties that don't just say, don't go any further, but that say, like, retreat a little bit. You could have an international body that's saying, you know, you can train models up to this size, you can keep running models that exist, as long as, like, you know, we're monitoring those chips. There's lots of hardware things you can do to chips to make them easier to monitor, to make it so that, you know, they need to phone home to keep running, so you can't just keep running them unmonitored. And then, you know, you'd probably want to ban a lot of the research on making AI smarter than any human.

You know, you really want to avoid cases where people are figuring out how to do highly distributed training runs in some efficient way. You want to avoid people figuring out how to ever get, you know, AIs running on only a thousand CPUs. Because, you know, at that point, it becomes far harder to prevent reckless mad scientists from making something smarter than us that doesn't want anything good. Yeah. And then, you know, if you want more details, or if listeners want more details, that draft treaty is the place to look.
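
The "phone home to keep running" idea mentioned above is ultimately a hardware and policy question, but a minimal software sketch can make the concept concrete. Everything below is illustrative only: the function names, the HMAC-signed lease format, and the 24-hour renewal window are assumptions of mine, not anything specified in the draft treaty or the conversation.

```python
# A minimal, purely illustrative sketch of a "phone home to keep running" check:
# an accelerator (or its driver) only keeps working while it holds a fresh,
# cryptographically signed lease from a monitoring authority.
# All names and parameters here are hypothetical.

import hashlib
import hmac
import time

LEASE_LIFETIME_SECONDS = 24 * 3600  # assume leases must be renewed daily

def sign_lease(secret_key: bytes, device_id: str, expires_at: float) -> str:
    """The monitoring authority signs (device_id, expiry) with a shared secret."""
    message = f"{device_id}:{expires_at}".encode()
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()

def lease_is_valid(secret_key: bytes, device_id: str, expires_at: float, signature: str) -> bool:
    """The device checks the signature and that the lease has not expired."""
    expected = sign_lease(secret_key, device_id, expires_at)
    return hmac.compare_digest(expected, signature) and time.time() < expires_at

# --- toy usage ---
shared_secret = b"example-shared-secret"               # in reality, a hardware-held key
device = "accelerator-0001"
expiry = time.time() + LEASE_LIFETIME_SECONDS
signature = sign_lease(shared_secret, device, expiry)  # issued when the device "phones home"

if lease_is_valid(shared_secret, device, expiry, signature):
    print("Lease valid: device may keep running until renewal is due.")
else:
    print("Lease missing or expired: device halts until it phones home again.")
```

A real scheme would rely on keys baked into the hardware and remote attestation rather than a software-held shared secret, but the control flow (no fresh signed lease, no compute) is the core of the idea being gestured at.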

Jim: I don't know if I read this or hallucinated it. Maybe I'm an LLM. I think I am. Did you guys suggest somewhere that the limit should be no more than eight H100 equivalents in any one location, something like that?

Nate: That's, like, over $100,000 of computer chips. That's not the sort of thing a normal hobbyist would accidentally acquire. And, you know, it's not that you can't have more than that in any one location. It's that that seems like a threshold below which you could allow things to go unmonitored without worrying too much. And if you have, you know, hundreds of thousands of dollars of computer chips in one location, perhaps you should have some obligation to show what those are doing and show that you're not trying to make AIs that are smarter than us while you have no idea what you're doing. Yeah. It's a low threshold, but it's not like a consumer is going to run into it. Very few people have $100,000 worth of gaming computers in their garage for some purpose that they are unwilling to declare and, you know, show some evidence that they're not trying to build, you know, the omnicide device.
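
For a sense of scale on that eight-H100-equivalents threshold, here is a rough sketch. The per-card price and per-card power draw are assumed ballparks for H100-class accelerators, not figures quoted in the conversation; with those assumptions, eight cards land comfortably over the $100,000 mark mentioned above.

```python
# Rough cost and power math for the hypothetical eight-H100-equivalents threshold.
# Per-card price and power draw are assumed ballparks, not quoted figures.

cards = 8
price_per_card_usd = 30_000   # assume roughly $25k-$40k per H100-class card; use $30k
watts_per_card = 700          # assume ~700 W per card at full load

total_cost_usd = cards * price_per_card_usd
total_power_kw = cards * watts_per_card / 1_000

print(f"~${total_cost_usd:,} of accelerators")   # ~$240,000 -> "over $100,000"
print(f"~{total_power_kw:.1f} kW at full load")  # ~5.6 kW
```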

Jim: All right. So a particularly American objection to this is that it sounds like a goddamn tyranny. Right? I know plenty of rednecks that have $100,000 worth of guns in their gun vaults. Right? And they don't want no Uncle Sam telling them they can't have their guns. Right? And, you know, this sounds like not letting people think, i.e., banning research on how to do OpenCog, for instance. Right? Man, you're going to have to gin up a mighty strong governance to make that happen.

Nate: I’m not sure that’s true. We have bans on enriching uranium, and there have been people who have had some very interesting conversations with some very interesting people when they tried to build nuclear reactors in their basements.

Jim: Yeah. A kid did it in his backyard and actually got some neutron flux. Oh my god.

Nate: He did. Yeah. And they, you know, they detected the ambient radiation in the town, I think. And then they were like, well, someone's having a conversation with someone here. You know? And, oh, are we going to ban people working with rocks? You know, are we going to ban people working with the natural world? Like, what happens to people like Marie Curie who just wanted to study the glowing rocks? You know? Are we banning science by saying you can't enrich uranium, by saying you can't build your own reactor? Like, that's not quite how this works. The nuclear nonproliferation treaty. Oh, is there now, like, a single world government? Are we now all living under tyranny, the oppression of, you know, the IAEA? Like, that's not really how this works. You know, is there in some sense sweeping world government control over nuclear proliferation? Sure. But it's not reaching into everybody's lives, because most people aren't trying to enrich uranium. Were there a few people who were maybe, like, trying to build a fun little particle accelerator in their basement and were like, why not also add a centrifuge, and who then maybe need to report their centrifuge? You know, maybe there were some folk like that. The centrifuges are big enough that there's probably actually not that many.

But, you know, even the research ban has precedent. There was a case where Leo Szilard came up with the idea of a nuclear chain reaction, saw the possibility of nuclear power and nuclear weapons, and said, I don't think Hitler should know this. And he had to convince other scientists not to publish this result. Some of the most eminent scientists of the day pushed back. You know? Enrico Fermi was like, no, we should publish this result. And, you know, they brought in another scientist, I believe named Rabi, who after some argumentation sided with Szilard, and they started a little conspiracy to not let the Nazis understand how nuclear weapons were possible. You know? It was good that Hitler didn't get nukes.

Jim: Didn’t work too well, though, you know, at the global level. Within a few years, Americans had them, and we probably got some help from Heisenberg who intentionally sabotaged the Nazi nuclear program.

Nate: But, you know, it's not as easy as saying everyone getting all research is always good. Research can, in fact, endanger a lot of people. You know, we have biosecurity restrictions where we only allow certain types of virus research to happen in very high security situations. The world has not been very good at banning gain-of-function research over the whole globe, but it probably should be. And you don't see people saying, oh, we're living under a global tyranny because you're stopping me from getting the sort of wet lab where I can do whatever I want biologically in my basement. Are we stopping people from thinking, from trying to figure out how biology works, from trying to figure out how biomedicine works? You know, there's all sorts of benefits that could come from people understanding biology better. And yet, there's some restrictions on how you get the technology that could let you build a global pandemic in your basement. Like, yeah, that's just how you deal with the fact that some technologies can be lethal not just to the person making them, but to innocent bystanders, including innocent bystanders who are very far away, which means you can't just regulate this locally.

It would require some legwork to regulate AI. It would require some political will. It would require a lot of people to understand that this was dangerous. But would it require an unprecedented level of tyranny? Would it require an unprecedented level of invasion of personal privacy? Not really. This is just, you know, going back to the statement at the very beginning that many people signed on to: AI should be treated in the same realm as, like, pandemics and nuclear power or nuclear weapons, as a potential global risk. You know? I forget the exact wording. It just follows pretty directly. Just as you can't enrich uranium or make a world-ending pandemic in your basement, you shouldn't be able to make a world-ending superintelligence in your basement. And, you know, is there some infringement on personal liberty if people can't assemble, like, huge unmonitored computing clusters? Sure. There's some limit. But it's not unprecedented, and, you know, better that than everyone dying.

Jim: Okay. Now I'm gonna give my response. I pondered this this morning after I woke up, after finishing my notes, and said, all right, what do I think of this proposal? And I came to the tentative conclusion that it's a mistake. And the reason it's a mistake is that the two interlocked multipolar traps make it exceedingly unlikely to succeed. And if indeed it's not gonna work anyway, then putting a huge amount of the libido of the human race, at least of the awakened people who read your book, on this, and having it fail because the multipolar traps are too strong and we have governance systems dominated by idiotic sociopaths, means we lose a more feasible opportunity. I have a podcast I've done on AI risk and the at least seven risks, and one of the ones I mentioned as, you know, potentially life-ending before ASI is people doing bad things with AIs. Right? Biohazards, creating that Ebola crossed with the common cold, or, even worse, creating replicator nanotech, et cetera. And if we put all of our eggs in this basket, then we're likely to put less attention on that.

But more importantly, there are perhaps some more acceptable interventions, which would not be considered infringements of liberty and which would still allow AI to move ahead exponentially, that we might consider, and that might actually be sufficiently palatable that people would agree to them, because they could still play the multipolar trap games that they need to play. And this is a Rutt original here: we should have a priesthood of the power supply, and all facilities able to generate a megawatt or more of power ought to be under the supervision of the priesthood. The power inputs to any center that has more than eight H100s ought to be under the supervision of the priesthood of power, and the priests are authorized and have the ability, through cryptographic codes, to first shut down the interconnects into data centers if they believe someone's doing something dangerous, and then, as a second escalation, to shut down all the power supplies in the geographic region, or across the world if need be. Now, this would be almost as hard as getting an agreement to stop AI research, but I would argue it is more practical and probably almost as effective.

Nate: It sounds to me even less likely to work, but one quick piece of pushback I'd give: the idea that there's a zero-sum competition for attention between the problem of superintelligence killing us and the problem of humans killing us with dumber AIs—I think that's not right. I think, you know, getting people to understand that the technology in the future will be more powerful than the technology today causes them to sort of properly see both issues. You know? And we focus on the—if we get far enough, this kills us all. That doesn't mean that we're gonna make it to sort of the final boss fight. You know? I think it's entirely possible that humans with dumber tech do the deed before we can even make the smarter tech. And in my experience, politicians who start to notice that AI can be powerful at all start to notice both of these issues.

So I don't think there's actually a trade-off. You know, I also think it's very self-defeating to say, well, let's not advocate for a ban, which would be good, because we think it won't work. I would totally back people who say, I think we should have a ban, and insofar as we can't have a ban, let's do X instead. I think in the name of honesty, in the name of people understanding how bad the situation is, it's important to start with, I think we should have a ban. And if we don't, let's do X. Because otherwise, we get into this weird situation where a lot of politicians think this is business as usual, when in fact, this is a very unusual sort of business compared to any other technology we've had.

Jim: As I said at the beginning, agree or disagree on the fine points, at the highest level I'm on your side. Right? You guys are doing noble work in raising the salience of this issue, and I really wanna thank Nate Soares for coming on The Jim Rutt Show and talking about this important issue.

Nate: My pleasure. It was a great conversation. Thanks for having me.

Jim: Yeah. It really was. I enjoyed it a lot.