The following is a rough transcript which has not been revised by The Jim Rutt Show or Ben Goertzel. Please check with us before using any quotations from this transcript. Thank you.
Jim: Today’s guest is Ben Goertzel. Ben is one of the world’s leading authorities on artificial general intelligence, also known as AGI. Indeed, Ben is the one who coined the phrase, or at least so it’s said on Wikipedia and other places. He’s also the instigator of the OpenCog project, an open source AGI software project, and of SingularityNET, a decentralized network for developing and deploying AI services. Welcome back, Ben.
Ben: Hey. Yeah, good to be here. It’s been only a couple of months since we last chatted, I think, but the AI world is moving fast. It’s just amazing to think about everything my own team has been doing and everything that’s happened in the world just in the last couple of months. An amazing time to be talking about AGI.
Jim: And as you know, I’m working on an LLM-based project, not AGI, and I’ve never seen anything like it: the substrate is changing in material ways, and the strategic decisions you have to make about which way this high-dimensional space will evolve shift on a weekly or biweekly basis, especially now that the open source models are starting to actually bite. Very interesting times we are living in, as I often say. It reminds me, though at 25 times the speed, of the introduction of the PC to the world around 1979, 1980, where there was just so much low-hanging fruit and so many things you could do as these new affordances came up.
But today we’re going to go deep on Ben’s main project. Last time we did a broad horizontal look at various aspects of the road to AGI. Today we’re going to talk mostly from a paper Ben and a bunch of other folks wrote, called “OpenCog Hyperon: A Framework for AGI at the Human Level and Beyond.” Now, we’ve done this before, and I’m sure Ben’s done it so many times he’s ready to vomit, but for the audience, let’s do a very quick recitation of the difference between narrow AI, AGI and ASI.
Ben: Sure. So what I mean by AGI, or artificial general intelligence, is software that can do smart stuff, solve hard problems, achieve complex goals that are qualitatively different from the training data or the programming with which the system started out. So it’s got to be able to make leaps of generalization, and that’s different than just being broad in scope. LLMs, say, are broad in scope, but that’s largely, though not entirely, because their training data was so broad in scope in the first place. So they don’t need to make a big leap to be broad in scope. And humans are not omnipotent at generalization, but we are able to take fairly substantial leaps beyond our programming and training. We’re not the most generally intelligent entities you could imagine. The notion of a superintelligence is an AGI whose general intelligence is far beyond human level, which has many aspects, probably including the ability of that superintelligence to fairly fully understand its own software and hardware and self-improve and self-modify, which humans can do, but only to a quite limited extent.
Jim: Yeah. I was actually providing feedback on a paper this morning, and the author laid it out in a way I thought was pretty nice: AGI is software that can do most economically valuable things that humans can do, at least as well as humans, across a reasonable range, while ASI, artificial superintelligence, is clearly well beyond humans in many, if not all, economically valuable categories. Now, this person is very economically oriented, so maybe they were overweighting that framing.
Ben: But this becomes interesting with LLMs, because you could imagine a deep neural net based system that could do 90, 95% of the jobs that people do for money without being able to generalize very well, because the modern economy has reduced most jobs to something that’s rote and repetitive and similar to what other people have done before. But on the other hand, an economy made up 100% of these deep neural nets would never advance fundamentally, as human society has. So I’m no longer sure you would need AGI to automate the economy very substantially, although I do think you would need it to automate science or art to a full extent.
Jim: Yeah. Robin Hanson makes this point as well, that perhaps the real binding point is the rate of advancement of a society.
Ben: Which is hard to measure, isn’t it? Hard to quantify in an objective way.
Jim: At least when you’re inside of it, right? For the historians, there are all these metrics that are quite interesting. But anyway, we are very short on time today, so while we could talk about this all day, we’re going to move on. Give us the two-minute version of the history of OpenCog that gets us up to about today.
Ben: Sure thing. So my own AGI journey began more on the theory-of-mind side. Through the eighties and nineties I was trying to come up with a mathematical model of how intelligence works, and I was screwing around with a bunch of self-organizing complex-system approaches to AGI in the ’90s. And then when the internet came out, I pivoted a bit and thought, well, we could make a distributed, decentralized AGI system running on a whole big network of processors, feeding in a whole bunch of data from all over the place. I conceived a system called Webmind at a company of that name in the late ’90s, which was basically a huge decentralized knowledge graph of AI agents. That proved a hard architecture to make work efficiently given the computing infrastructure of the time, so I created a new project called Novamente.
Novamente had an architecture where you had a knowledge graph, or knowledge hypergraph, and then a bunch of AI agents carrying out different learning and reasoning algorithms, acting together on this knowledge hypergraph and, at least so was the idea, acting synergetically, helping each other to do their different AI things. Now, OpenCog originated as an open sourcing of parts of a proprietary AI project called the Novamente Cognition Engine, which we open sourced in 2008 or so, after Novamente had been around maybe six, seven years. And it was developed by myself and a number of colleagues. There were some random open source developers, but it was mostly a group that was working with me doing various government and commercial projects using the OpenCog engine. And there was the idea that we’d have OpenCog as a generic AGI platform, and then a particular cognitive architecture developed on top of that that we were calling Cog Prime, and I’m in the middle of trying to come up with a better name for that cognitive architecture at the moment.
Now, we played with OpenCog for quite a while. We built some cool prototypes and the backends of a few commercial systems on it. Around three years ago, a number of us engaged with OpenCog (though not all; not, actually, the lead developer on OpenCog, Linas Vepstas) decided it might be time for a ground-up rewrite of the system, not because OpenCog was bad, just because the world had changed a lot since then and we had learned a lot from our prototyping. So we started building a new version of OpenCog that we called OpenCog Hyperon, and that was a big decision, because it’s a pain in the butt to stop doing AGI R&D for a while and rebuild your infrastructure.
We’re still in the middle of rebuilding the infrastructure, but we’re at least able to prototype new AI algorithms and approaches on the new Hyperon system now, though it’s not yet as scalable as it will be six or 12 months from now. So while building out the new version of OpenCog, we’re starting to flesh out AI learning and reasoning algorithms and the new version of the Cog Prime cognitive architecture on top of OpenCog Hyperon, while daydreaming about the next version, OpenCog Tachyon, which will be quantum computer based, but we’ll leave that for later.
Jim: Yeah. I’ll probably be off to the AI in the sky by then I suspect, but we shall see.
Ben: You’ll be uploaded into OpenCog Hyperon by then. Yeah.
Jim: Okay. That’ll be cool. Okay. Yeah, just for the audience’s benefit, I actually played around with OpenCog a bit back in 2014, maybe early 2015. I used it as a backend for a minimally conscious artificial deer that I had running in Unity, and had to do some of my own systems stuff. And I’d always complained that the thing was, well, in theory multiprocessor; practically it wasn’t, unless you did it yourself. So I had to write my own communications infrastructure to be able to run multiple instances and communicate between them and stuff like that. And I’ve always pushed, from the very early days, as Ben no doubt remembers, tirelessly and tiresomely, for a true distributed AtomSpace. It looks like you guys are making some real progress on that.
Ben: Yeah, it’s been a learning experience in a variety of different ways, trying to build both an AGI and the infrastructure the AGI needs to run on. We built the old version of OpenCog with a small group of people and not that many resources, at a time when AI was not as popular as now. We have a larger and broader team behind the Hyperon effort, including more folks with distributed systems and computing hardware and programming language backgrounds and so forth. So I think we’ve brought a broader group of people into building the infrastructure, which is great. My own personal interest, with my individual contributor and researcher hat on, is still in the AGI running on top of this infrastructure rather than the infrastructure itself. But on the other hand, if you look at what’s allowed transformers and other deep neural nets to do what they’ve done in the last few years, it largely has been advances in the infrastructure.
I mean, it’s Nvidia hardware infrastructure, it’s all these matrix multiplication libraries, it’s MapReduce. It’s a lot of the plumbing for doing things scalably and efficiently that has let modest variations on previously existing deep neural net architectures do such amazing things now, due to being deployed at scale. So I think I’ve gained a healthy respect, even more than I had before, for scale. You really do need to focus on all the plumbing to make it possible to run this stuff on huge data and huge processing infrastructure; otherwise you may have the right algorithm and the right knowledge representation, and it still doesn’t do very much.
You never would have known that it would lead to the Singularity if you just ran it on a shitload more machines. I mean, the continuity of function from BERT to GPT-2 to GPT-3 to GPT-4 wasn’t all that obvious when you ran BERT and the first transformers. It was really cool, but it wasn’t totally obvious what GPT-4 would be able to do once you put that amount of data and firepower behind it, right? I’m hoping we will see similar sorts of phenomena in OpenCog Hyperon, where you have qualitative discontinuities of function as you start to do things at greater scale.
Jim: One of the limitations of the original OpenCog was that it was difficult to scale up; not impossible, but it wasn’t really natively designed for that. The second was that the Atomese language was more than a little peculiar, at least from my perspective. And it looks like one of the things you guys have done is a deep dive into the language you really want, and it’s called MeTTa. Why don’t you talk a little bit about what walls you ran into with Atomese, and what leap do you see coming from MeTTa? That’s M-E-T-T-A, by the way, people.
Ben: Yeah, I think there are really two key technical aspects of OpenCog Hyperon. One is the MeTTa language and the other is the distributed and decentralized AtomSpace infrastructure. The AtomSpace underlies both of these; it’s a knowledge metagraph. A knowledge graph, I’ll assume people know what it is: infrastructure for managing nodes and links. A hypergraph is like a graph, but has links that can span more than two different nodes, like a link among three, five or 12 nodes. A metagraph is a little more broad: you can have links pointing to links, or links pointing to subgraphs. And these are typed, labeled nodes and links, so you can have weights on them, like probabilities or fuzzy values. You can have types associated with nodes and links, where the types themselves could be represented by whole metagraphs. And these AtomSpace metagraphs could live in RAM or they could live on disk; they could live on one machine, or they could live spread across many different machines.
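To make the metagraph data model Ben describes concrete, here is a minimal Python sketch; the class and field names are illustrative stand-ins, not Hyperon’s actual API:

```python
import itertools

class Atom:
    """Base class: nodes and links are both atoms, so links can point at links."""
    _ids = itertools.count()

    def __init__(self, atom_type):
        self.id = next(Atom._ids)
        self.atom_type = atom_type   # type label; in Hyperon, types are themselves graph content
        self.values = {}             # optional weights: probabilities, fuzzy values, attention...

class Node(Atom):
    def __init__(self, atom_type, name):
        super().__init__(atom_type)
        self.name = name

class Link(Atom):
    def __init__(self, atom_type, targets):
        super().__init__(atom_type)
        self.targets = list(targets)  # may span 2, 3, or 12 atoms, and targets may be links

class AtomSpace:
    """A bag of atoms; a real AtomSpace would index these and may shard them across machines."""
    def __init__(self):
        self.atoms = []

    def add(self, atom):
        self.atoms.append(atom)
        return atom

space = AtomSpace()
# A 3-ary link, hypergraph-style:
sat = space.add(Node("Predicate", "sat-on"))
cat = space.add(Node("Concept", "cat"))
mat = space.add(Node("Concept", "mat"))
fact = space.add(Link("Evaluation", [sat, cat, mat]))
fact.values["strength"] = 0.9   # weights are optional, not built in

# A link pointing at another link, metagraph-style:
jim = space.add(Node("Concept", "Jim"))
belief = space.add(Link("Believes", [jim, fact]))
```

The key move beyond a plain graph library is that `Link` targets are `Atom`s, not just `Node`s, so links-about-links come for free.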
Now, MeTTa is basically a programming language designed so that a MeTTa program is a chunk of knowledge metagraph, and what a MeTTa program does is take chunks of knowledge metagraph and output other chunks of knowledge metagraph. So it’s a native language for knowledge graph rewriting, and it’s designed for self-modification, in that a program is a bunch of knowledge metagraph which can then be rewritten by MeTTa programs, and so forth. This in some ways is similar to the vision underlying Lisp, which in many ways is the original AI language. I mean, Lisp was the first really usable language to break down the division between data and program. So a Lisp program is also content that can be fed into Lisp programs, and it’s easy to have Lisp programs generate or rewrite Lisp programs. Lisp stands for “list processing”: Lisp’s most natural native data structure is a list, whereas MeTTa’s most natural native data structure is a typed, weighted metagraph.
And in theory you can boil a typed, weighted metagraph down to a list, which is fine. On the other hand, in actual programming and algorithmic practice, that can be a cumbersome and annoying thing to do. So having something conceptually Lisp-like but operating on a weighted, typed metagraph is different from a practical programming language perspective. Now, in OpenCog Classic, as we’re now calling the old original OpenCog, we basically were using Scheme, which is a dialect of Lisp, as the primary scripting language to manipulate the nodes and links in the OpenCog AtomSpace. And this was okay, but the fact that Scheme is a Lisp, which wants to be a list-processing language, meant that things were a bit awkward here and there. And there was also this division between the scripting language, which was mostly Scheme, and then the cognitive content of the AtomSpace.
I mean, you could write a transform to turn a Scheme script into nodes and links, but it was an extra thing to do. And we also had Python and Haskell shells you could use to manipulate nodes and links in the original AtomSpace, but those were not as well fleshed out as the Scheme shell. So the decision to write our own programming language was made after a lot of painful thought and debate, because there’s a long history in the AI world of people thinking, well, writing my AGI would be much easier if I had the right programming language, so let’s make the right programming language. Then 20 years later you’ve become a programming language researcher and done a lot of amazing programming language research, but you haven’t actually made much progress on the AI, which was your motive for making the programming language in the first place.
So that’s why I didn’t write a custom programming language in the original OpenCog. On the other hand, coming into OpenCog Hyperon, we had a bunch of specific code written for the old OpenCog; we knew what it was, and we knew this was the code that we wanted to make smaller, more compact, more elegant and more efficient. So we weren’t jumping into it just totally out of the clear blue sky. We had a probabilistic logic engine and an evolutionary learning engine; we had some code for operating virtual agents and analyzing biology data, and so on. So then we plunged into the creation of a new programming language with a bunch of practical stuff in mind. One thing I found, much to my delight, as I plunged into the modern world of programming languages, is that the field had actually advanced quite a lot since 2008, when I architected the original OpenCog system, without as much fanfare as advances in AI or some other areas. Quietly, the world of functional programming had just advanced quite dramatically.
So I’ve been a Haskell programmer since ’92, ’93, before monads were there, when Haskell was quite primitive. But as I delved into type theory, which was important to me because I was looking at rich types for the nodes and links in the knowledge metagraph, I could see, wait a minute, we now have dependently typed programming languages like Agda or Idris, and Idris 2 is even reasonably fast to run; this dependently typed language is quite elegant. And there are some academic research languages which have gradual dependent types, which is a very wonky technical thing, but it’s an advance in the design of functional programming languages that just wasn’t there in 2008. And in the Prolog world, you have experimental Prologs that do unification very, very efficiently, accounting for garbage collection in real time while you unify complex expressions.
So I found there had just been a lot of advances in the programming language world that we could draw on to build a language you couldn’t have built in 2008. When we created MeTTa, we were trying to take everything that had been done in the functional and logic programming worlds in the last couple of decades and one-up it by getting even more hardcore weird and abstract. So when you say programming for OpenCog Classic was a bit weird, it was, but there are many ways of being weird. Programming in MeTTa is also very weird, but we’re trying to make it weird in a better way. I mean, it’s a different language than has ever been played with before. It unifies aspects of functional programming and logic programming in different ways than have been done before, but we’re trying to make it weirdly elegant rather than weirdly ugly, and I think we’ve had enough people playing around with the early versions of MeTTa to get some validation that it does work out that way.
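Unification, the core operation that MeTTa shares with the Prolog family, can be sketched in a few lines of Python. This is a textbook version over s-expression-style terms (without an occurs check), not MeTTa’s actual implementation:

```python
def unify(a, b, bindings=None):
    """Structurally unify two terms built from strings and tuples.
    Variables are strings starting with '$'; returns a binding dict, or None on failure."""
    if bindings is None:
        bindings = {}

    def walk(t):
        # Chase variable bindings to their current value
        while isinstance(t, str) and t.startswith("$") and t in bindings:
            t = bindings[t]
        return t

    a, b = walk(a), walk(b)
    if a == b:
        return bindings
    if isinstance(a, str) and a.startswith("$"):
        return {**bindings, a: b}
    if isinstance(b, str) and b.startswith("$"):
        return {**bindings, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            bindings = unify(x, y, bindings)
            if bindings is None:
                return None
        return bindings
    return None

# Matching a query against a fact binds the variables:
result = unify(("sat-on", "$who", "mat"), ("sat-on", "cat", "mat"))
# result == {"$who": "cat"}
```

The experimental Prologs Ben mentions make exactly this operation fast under real-time memory-management constraints.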
So the base version of MeTTa is very, very, very abstract and simple. It’s really just the simplest possible way you could write a language that is a chunk of metagraph and exists to rewrite metagraph into metagraph. Even the notion of equals is left as undefined as possible, so you could use Homotopy Type Theory to overload what equality means, and there’s no notion of type in the base language. You have to introduce type as a certain class of nodes and links, with a certain notion of type equivalence and inheritance in it, which itself is boiled down into a bunch of nodes and links. So we’re really trying to boil down to the lowest possible level of plumbing. And what’s been interesting to see are a couple of use cases. We invited some folks from the NARS community, Non-Axiomatic Reasoning System, which is a different AGI approach than my own Cog Prime AGI approach.
We invited them to implement a version of NARS in OpenCog. And then a guy named Doug Miles, who I’ve known a while because he helped out with Hanson Robotics a long time ago, has his own AGI approach called LogicMoo, which is crisp logic done mostly in Prolog. So we invited these two groups to implement stuff in MeTTa, and they both fell in love with the language. They’re like, well, this is just an easier, more elegant way to implement our AI stuff than anything we’ve seen before. Wow, it’s cool to have functional and logic programming in the same framework. Yes, it’s rough around the edges, but everything is so simple and concise. And we didn’t get that feedback from people doing development in OpenCog Classic, that just the language itself was fun and cool and easy to work with, even if you were bringing your own AI methods to it.
So I think that’s been interesting to see, and we are using that community to help refine the language and how it works. The other interesting feedback we’ve gotten is on the scalability side. We’ve been working with Greg Meredith, who had developed his own programming language called Rholang, which is very efficient for concurrent processing, and what Greg has done is develop a source-to-source rewriting tool that rewrites MeTTa source into Rholang source and then runs the Rholang efficiently using the Rholang runtime. And again, this is enabled by the way that MeTTa was implemented. It would have been hard to do that in OpenCog Classic due to the mix of node and link stuff with Scheme stuff, but we had a formal definition of MeTTa, so you can take MeTTa, from the formal description, rewrite it to a different language, and compile it and run it efficiently on whatever hardware you want.
Then we can compile the Rholang into hypervector manipulations that we’re then running on the APU, which is a custom chip for associative processing made by GSI, right? So I think we’ve managed to create something that is fun for people doing other AI stuff to use, and that is crisp and concise enough that you can compile it into additional forms to run efficiently on different sorts of hardware. That’s certainly still a work in progress, but it’s been gratifying to get that validation, and I’ve seen the same in my own little experiments using MeTTa myself. I’m mostly doing a combination of theory and management stuff now, but I have an interest in what’s called algorithmic chemistry, where you have a soup of little rewrite rules which rewrite each other. So it’s like a pool of rewrite rules rewriting rewrite rules so they can better rewrite rewrite rules. And my experience with implementing algorithmic chemistry systems in MeTTa is that it’s not much code and the code is very transparent, right? It’s fun to do.
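The algorithmic-chemistry idea, a soup of rewrite rules that can rewrite each other, can be illustrated with a toy Python sketch (this is an illustration of the concept, not the MeTTa code Ben describes):

```python
# A rule is a (pattern, replacement) pair over nested tuples of tokens.
# Because a rule is itself just a term, a rule can be applied to another
# rule -- the "rewrite rules rewriting rewrite rules" trick.

def apply_rule(rule, term):
    """Substitute every occurrence of rule's pattern in `term` with its replacement."""
    pattern, replacement = rule
    if term == pattern:
        return replacement
    if isinstance(term, tuple):
        return tuple(apply_rule(rule, t) for t in term)
    return term

ab = ("a", "b")           # a rule: rewrite the token 'a' into 'b'
dup = ("a", ("a", "a"))   # another rule, which itself contains 'a'

rewritten = apply_rule(ab, ("f", "a", "a"))   # rule applied to data: ("f", "b", "b")
mutated = apply_rule(ab, dup)                 # rule applied to a rule: ("b", ("b", "b"))
```

In a full algorithmic-chemistry run you would put many such rules in one pool and repeatedly pick random rule/target pairs, so the rule population evolves as it rewrites itself.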
Jim: One of the things you’ve said that tracks for me, and hey, when you’re ready, I’ll take a look at it, is the idea that the code itself is formally in the metagraph, so you don’t have this external back-and-forth, so fucking nightmarish, between the external scripting language and the AtomSpace, et cetera. That would be gigantic. The other question I have for you: if I remember correctly, in the old OpenCog every atom in the AtomSpace was probabilistic in nature. You didn’t have to use it, but it was there. Have you retained that attribute, or is the probabilistic aspect less pervasive in the design now?
Ben: So I guess it is less pervasive at the level of the plumbing of the AtomSpace and the language, but not less pervasive in the Cog Prime AGI design that I’m implementing, anyway. If we get deep into the weeds: in the old AtomSpace design, when you looked under the hood, each atom was a separate object or entity in the code. If you look under the hood in the default implementation of the Hyperon AtomSpace, which works with the Rust-based interpreter for MeTTa, atoms are not necessarily implemented as distinct software objects.
You have a forest of these trees, and the atoms are represented in there, and if you need to grab an atom to use, it can be plucked out and labeled within MeTTa, but it seems to be more efficient not to necessarily represent the atoms individually. In a similar way, in the old AtomSpace of the original OpenCog, every atom came with a truth value and attention value object, and then you could associate other values with it. But the truth and attention values were built in. We don’t have built-in values like that in the new AtomSpace, because it was just inefficient; there are many cases where you don’t need them. I mean, a blank truth value didn’t take up that much memory, but it’s just not necessary to have it there, right?
Jim: Good. I think those are good design decisions. That was what my take was.
Ben: So if you have something like a while loop or a conditional or something, it doesn’t need a probabilistic truth value necessarily, but you’re still representing it in the AtomSpace. It doesn’t need an attention value at that level. You want to associate attention maybe with the whole MeTTa script, but you don’t need to bother to associate an attention value with each conditional inside.
Jim: Exactly. Exactly.
Ben: What happens, as everyone who does software knows, is that when you build up a complex system over time, assumptions that made sense four years ago get baked deeply into the plumbing, and then it just becomes annoying to get rid of them, right?
Jim: Yeah, we all know that problem. What’s called technical debt, right?
Ben: I mean, that happened a lot with the OpenCog Classic system, because we developed OpenCog Classic and then we developed PLN reasoning and MOSES program learning within that system. So now, if my hypothesis is right that the collection of learning and reasoning algorithms associated with the Cog Prime cognitive design is basically a good way to build human-level AGI, we’ve been able to architect MeTTa and Hyperon so that this collection of algorithms will be very elegant and efficient on this infrastructure. We’ve also built it for efficient interfacing with external deep neural net libraries, and this was actually the tipping point that led us to develop the new version of OpenCog. Alexey Potapov and Vitaly Bogdanov, who were working with me, wanted to connect Torch, the deep neural net library, with OpenCog Classic, and they did it in a fairly nice way.
So each neural net layer within the Torch computation graph corresponds to a different node in the AtomSpace, and they had it so composition of OpenCog nodes maps isomorphically onto composition of layers in Torch. This, by the way, was a big advantage of Torch over TensorFlow: Torch gives you programmatic access to the underlying computation graph and TensorFlow doesn’t, for whatever weird design reason. So they made that work. But to do that, they integrated something into OpenCog Classic that they called a pointer value, which was a value associated with a node that was a pointer to something outside the graph, which was in a way a horrible hack.
Jim: It’s super general, but it is a huge horrible hack.
Ben: It’s bad, but the way they were using it wasn’t bad, right? I mean, it’s just like pointers in C or C++.
Jim: Exactly. Don’t do that, God dammit. But once in a while it’s handy, right?
Ben: I mean, they can be abused in egregious ways, but they’re used all over the implementation of every operating system. But yeah, Linas Vepstas, who was the main developer of the plumbing of OpenCog Classic, basically didn’t want to merge pointer values into the main code base of OpenCog because they’re ugly. But we’re just like, well, wait a minute, but how else do we do this? And Linas was like, “Well, don’t implement your neural net in the AtomSpace.” We’re like, well, in the long run we really do want to do that; that’s the right way to do it. On the other hand, that will take years, and right now we want to experiment with CNNs for computer vision together with declarative logic in OpenCog.
We didn’t want to have to wait several years to be able to do all that neural net stuff in the AtomSpace. So for a while we maintained two forks of OpenCog Classic, but on the other hand, that was a horrible hack too. Now the way grounding of atoms works in OpenCog Hyperon (the association of atoms, meaning nodes or links, with external things) lets you associate a node or link with part of the Torch computation graph in a more direct and natural way, without needing some hack like a pointer value.
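The grounding idea, a symbolic atom whose meaning is an external object such as a neural net layer, with graph composition mirroring layer composition, can be sketched like this. The class names are illustrative, not Hyperon’s API, and plain callables stand in for Torch modules:

```python
class GroundedNode:
    """A symbolic atom whose 'meaning' is an external callable (e.g. a neural net layer)."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

class ComposeLink:
    """A link over grounded nodes: composing atoms composes the external functions,
    the way a composition of AtomSpace nodes can map onto a stack of Torch layers."""
    def __init__(self, *nodes):
        self.nodes = nodes

    def __call__(self, x):
        for node in self.nodes:
            x = node(x)
        return x

# Stand-ins for layers; a real system would ground these in torch.nn modules:
double = GroundedNode("double", lambda x: 2 * x)
inc    = GroundedNode("inc",    lambda x: x + 1)

pipeline = ComposeLink(double, inc)   # symbolic structure...
result = pipeline(3)                  # ...executed via the grounded externals: (3*2)+1 = 7
```

The point is that the symbolic layer can be queried and rewritten like any other graph content, while execution delegates to the external library.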
Jim: There’s also a more general way, with your custom grounded functions, to reach out to any external code in a less hacky way.
Ben: Absolutely. We’re using that to connect with Isabelle, which is a theorem prover, right? If you’re doing common sense logic in OpenCog, sometimes you want to do some very low-level logical reasoning steps where you’re rearranging quantifiers and variables in a mechanical way. You can outsource that to a theorem prover like Isabelle; actually, Isabelle is really good at outsourcing that to other theorem provers using its Sledgehammer tool, right? So yeah, you can connect with a simulation engine, you can connect with a theorem prover, you can connect with a deep neural net library, and the mechanics are there to make that, A, not be hacky on an implementation level, and B, easy to reason about within MeTTa. That’s what is accomplished with monads in Haskell, for example: you can bury within a monad the ability to do a lot of messy stuff.
You identify the properties of that messy stuff within the properties of your monad, and then you can reason about this external stuff within your elegant functional programming system. In a way, Linas was right about pointer values, because they were opaque to the AtomSpace: the properties of what you were doing with that pointer were not exposed to the AtomSpace. On the other hand, we didn’t have a great alternative in the old system. So now we’re doing a bunch of interfacing of OpenCog Hyperon with large language models. We’re also very interested in creating something that does what LLMs do but deployed within the AtomSpace, so that you can do neuro-symbolic learning within the training of a transformer-like model. I think that’s a very important area of research, but we don’t want to be restricted to that. We want to be able to go back and forth with deep neural nets; they’re doing amazing stuff right now.
Jim: Yeah. They’re in a sweet spot between hardware and software right now, where they’re able to just roar ahead in terms of producing useful results, and it would be a bad decision to choke yourself on the performance, or lack thereof, of the AtomSpace. They’re just engineered at totally different levels of intensity and scale.
Ben: Yeah. So this gets into what will maybe be a whole other podcast, but if you look at what we can do with MeTTa to Rholang to hypervectors to the GSI APU chip, you’re getting something that’s conceptually a lot like what’s done in going from deep neural nets to matrix libraries to GPUs, right? So I think we’re going to have to do something for the AtomSpace parallel to what’s been done with that whole stack down to GPUs, which in the end may realize Linas’s idea that, no, you should be doing this in the AtomSpace.
Jim: Someday, but not today, right?
Ben: Not today.
Jim: You have to be able to climb the hills where they are, right?
Ben: Well, that’s right. I mean, it could be a year from now. We actually have prototypes of this stuff on the GSI hardware, which is in production, so it’s not just a concept, but I can’t train a transformer on that yet. Right now we’re still working out kinks in the mapping of hypervectors onto the GSI hardware, right? So yeah, there’s a need for parallelization of development avenues. We failed with the old OpenCog to build a diverse enough, energetic enough open source community, I think partly because the software was a pain to work with in some ways, and partly because it was just a different historical era. In the modern historical era, there are a lot of people who want to jump into AGI. I think things will change once Hyperon is a little further along; we’re planning to launch something we’re willing to call an alpha version in April of this year.
After that, we’re going to start with hackathons and giving some grants to open source developers working on Hyperon. So I’m optimistic we can do a better job of nourishing a broader open source developer community with the new version, partly because it’s easier to work with and partly because more people love AGI now than did before. And part of the parallelization we hope to get from that is, yeah, we want people banging away on using existing deep neural nets together with Hyperon to make things neuro-symbolic. On the other hand, we also want people banging away on the more difficult research of doing hierarchical attention-based pattern recognition networks totally in the AtomSpace, because I think that probably is a better way to get rid of hallucinations and the other irritating problems you see with transformers than just banging on neural nets in their current form or some minor variation thereof.
Jim: And people like Gary Marcus have been continually, sometimes overly tediously, pointing out the limitations of today’s transformer-based LLMs at various kinds of formal reasoning. You can trick them quite easily. That’s not what they’re designed for; they’re just statistical predictors of next tokens. It’s fucking amazing they do anything.
Ben: Well, but they weren’t designed to write Python either, and they can do it, right?
Jim: That’s true. And they can write movie scripts. They weren’t designed for that and they can do that.
Ben: Yeah. I think you could go a lot further than the state of the art with LLMs for complex multistep reasoning. On the other hand, the cost-benefit will get worse and worse. As one example, as Jim, you’d remember, in CogPrime and in the plumbing of OpenCog Classic, we had these two-component truth values where you maintain a strength and a confidence rather than just a probability value. So it’s a probability, and then the probability that that probability is accurate. So I wonder, if you trained a transformer neural net based on two-component truth values at every level, how would that deal with the hallucination problem? Instead of just making a probabilistic prediction, which is what transformers do, you make a probabilistic prediction where you know how much evidence underlies that prediction you’re making. I think if you did things like manage higher-order uncertainty through the neural net training process of a transformer, you might be able to get a transformer that in a way knows when it’s making shit up versus when it’s making a confident prediction.
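The two-component idea Ben describes can be sketched in a few lines of Python. The revision rule and the constant K below are illustrative assumptions, not OpenCog’s actual PLN formulas.

```python
# A two-component truth value: a strength (probability estimate) paired
# with a confidence derived from the amount of underlying evidence.
# The revision rule and the constant K are illustrative assumptions,
# not OpenCog's actual PLN formulas.

K = 10.0  # evidence count at which confidence reaches 0.5 (assumed)

class TruthValue:
    def __init__(self, strength, count):
        self.strength = strength  # estimated probability
        self.count = count        # amount of evidence behind the estimate

    @property
    def confidence(self):
        # More evidence pushes confidence toward 1.
        return self.count / (self.count + K)

    def revise(self, other):
        # Merge two estimates, weighting each by its evidence count.
        total = self.count + other.count
        s = (self.strength * self.count + other.strength * other.count) / total
        return TruthValue(s, total)

weak = TruthValue(0.9, 2)     # high strength, little evidence
strong = TruthValue(0.6, 50)  # moderate strength, lots of evidence
merged = weak.revise(strong)
```

Two estimates with the same strength but different evidence counts now behave differently downstream: a reasoner can discount conclusions whose confidence is low, which is the "knows when it's making shit up" signal Ben describes.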
If you do that, you can probably do more confident multistep reasoning to a certain extent because part of the reason multistep reasoning doesn’t work well is the bullshit blows up exponentially over each step of the reasoning. But the thing is that’s still not going to work as well as a human at complex multistep reasoning, and it gets more and more expensive as you propagate all these higher order uncertainties through the system. Yeah, I think Gary overstates the case against LLMs. On the other hand, he is seeing some fundamental truth there, which is, as you said, these things are not architected to do abstract reasoning. Complex multistep reasoning relies on abstraction in a fundamental way, and transformers are not wired for abstraction. I mean, they’re wired to munch together a massive number of concrete data patterns in a different way than doing abstraction.
And this brings us back to one of the nice things about the AtomSpace: it’s a very elegant way to represent data patterns at multiple levels of abstraction, which I think is important for general intelligence. There’s a whole bunch of learning theory that ties together the ability to abstract with the ability to generalize, which underlies Occam’s razor and other associated heuristics. If you have a simple model of a bunch of complex data, mathematically there’s a lot of reason to believe that gives you the ability to make hypotheses that abstract beyond the data you’ve seen. And I mean, deep neural nets aren’t doing much of that.
Jim: As I’m sure you’ve heard me say more than once, my idea of the quickest road to AGI is to solve heuristic induction. I would say something like the Hyperon architecture is more likely to get you there than LLMs; with LLMs you’re basically using very loosey-goosey basins of attraction in a strange way. I am looking forward to playing with it. Let’s move on; we’re very short on time and I want to hit all the high points here. In your 70-page paper you did mention your old standby, cognitive synergy, six times, but it wasn’t quite as central to the story as it used to be. Could you explain to people your concept of cognitive synergy and how you see it in this next generation?
Ben: The long paper we’re talking about covers both the Hyperon infrastructure and the cognitive architecture, which was historically called CogPrime and which we’ll have a new name for probably within a few weeks, but I don’t know what it will be yet. So I would say cognitive synergy is at the emergent level; it’s not something you wire into the underlying programming language. But the notion of cognitive synergy on the face of it is pretty simple. A human-level AGI system has both a modular, fragmented aspect and a unified aspect, and you need both of those aspects, which is one of the many dialectical dualities in creating human-level intelligence. So for the modular aspect, one way to get at it is to look at different types of memory, which is something that cognitive science and cognitive psychology have delved into in great detail, and this is another one of the largely unsung domains of radical advance.
During my career, when I started studying cognitive science in the ’80s, or even teaching cognitive science in the ’90s, we did not have as solid a knowledge of the different sorts of memory in the human brain: how they work in the mind and brain, how they interact and operate with each other. Cognitive science and cognitive neuroscience have advanced a fuck of a lot, just like, as I was saying, functional programming language infrastructure has. There have been exponential advances in so many of these academic, researchy areas that nobody thinks about, alongside the exponential advances in things like GPUs, and computer vision, that everyone sees. So one thing we see in cognitive science is that the way the human brain does memory is a bit compartmentalized, to the point where a lesion in one part of the brain will cause someone to lose very specific memory, like the ability to conjugate verbs, or the memory of what people were wearing in their life history but not what cars were in their life history, or something.
So we have an episodic memory of the narrative of our life history. We have a procedural memory for how to do stuff. We have a declarative memory, which is facts and beliefs, and it gets much more fine-grained than that, actually, in the different subtypes of memory. Each of these types of memory has a different dynamic between working memory, which you’re using while you’re doing stuff in real time, and long-term memory. Each of these types of memory has different learning and reasoning heuristics associated with it in the human mind, and I think that came about because of evolutionary pressures, which you could go through step by step: what did we need in evolutionary history in terms of episodic memory? What did we need in terms of the interaction between working and long-term memory regarding procedures? So you could choose to ignore all that and just architect an AGI system in some other way, and it might get much smarter than people.
You could try to emulate how the brain works regarding all these types of memory and associated learning and reasoning, which is really interesting computational neuroscience, but then you run into a lot of places where the neuroscience just isn’t well known yet. What I’ve chosen to do in the CogPrime cognitive architecture is try to roughly emulate the way the human mind works in terms of varieties of memory and the associated reasoning and learning heuristics, but not try to drill down to the neural level. So that means, okay, we know we need to deal with declarative memory of facts and beliefs, we know we need to deal with procedural memory, we know we need to deal with episodic and sensory memory. Let’s make sure we have good representations for each of these and good learning and reasoning algorithms for each of these. Now, so far we didn’t even get to cognitive synergy yet.
Take Soar or ACT-R, which are classic cognitive architectures from the good old-fashioned AI world. They also try to break things down into the different kinds of memory, learning, and reasoning that the human mind does, based on cognitive science, but they really break things down into separate modules. The way classic Soar deals with declarative knowledge and the learning of facts is totally separate from the way good old-fashioned Soar deals with learning and representing procedures. And I think that is not going to be scalable, and we’ve already talked about why scalability is so important.
So to make things scalable, I think you need to be able to translate your declarative representation into procedural form and your procedural representation into declarative form, and so on with sensory and episodic and so forth. So you need to be able to translate the representations associated with each type of memory into the representations associated with the other types of memory, which is a matrix of cases. But then, more subtly than that, the learning and reasoning approaches associated with each type of memory need to be able to share their intermediate state with the learning and reasoning approaches associated with the other types of memory, so they can help each other out when they get stuck in their learning or reasoning. You can model this quite abstractly using category theory, which I did in a paper some years ago.
Or you can look at it very nitty-gritty in terms of specific tasks that human-like agents are doing, right? So if you’re learning to serve a tennis ball or pick up a cup or something, on the one hand you’re doing something reinforcement-learning-ish to learn the procedure to pick up that cup. On the other hand, if you formulate that in a sensory way, you’re looking at images of yourself picking up that cup, which you can learn from, and there’s a lot of eye-hand coordination in how people learn this. If you’re looking at it in a declarative way, you’re thinking about what you’re doing wrong and why you’re always spilling the cup, and trying to draw a logical conclusion about it.
So it’s both an abstract, mathy thing and a very concrete, nitty-gritty thing to have sharing of information at the learning and reasoning level and the knowledge representation level across all the different things a human-like mind does. And I would conjecture that any mind that has fairly limited resources and needs to do a sufficient diversity of different tasks is going to end up with a modular architecture that then needs holistic cooperation among the modules, and that leads you to cognitive synergy. Now, it might be that if you have enough resources, you don’t need a modular architecture. Maybe that’s just a hack that you need.
Jim: Nature had to do it because nature had limited resources, right?
Ben: Yeah, yeah. Well, that’s right. So when you get from AGI to ASI, maybe the ASI synthesizes a modular architecture on the fly for each new task it has to do, rather than having a fixed modular architecture. But if you’re following a path of first human-level, vaguely human-like AGI and then ASI, then I think you’re led to an architecture that’s heavy on the cognitive synergy, and I think in LLM land they will be led that way too. If you’re making a GPT-5 that invokes Wolfram|Alpha to do logical reasoning or something, I think after they’ve done that and run across the fact that it’s not very scalable, they’re going to say, well, we want to share the intermediate states of Wolfram|Alpha with the transformer, and we want Wolfram|Alpha to have access to a vector of the internal state of the transformer to guide itself.
And then you’re doing cognitive synergy between Wolfram|Alpha and your transformer. Right now, they’re not there in GPT-4 in terms of how the Wolfram|Alpha plugin is interoperating with the ensemble of transformers inside GPT-4, but you can see how that would naturally evolve within an LLM-centric universe, just from the desire to decrease the amount of processing cycles used by both the LLM and the Wolfram|Alpha theorem prover. When they have to do stuff together, they’ll just be more efficient if they have visibility into what each other is doing internally.
Jim: I don’t see any sign of that happening, at least not in the commercial or even the open source world.
Ben: Stephen Wolfram understands the desirability of this very well, but the software on each side is not architected for that at all, absolutely whatsoever.
Jim: Basically, the model everybody’s using is so-called RAG, retrieval-augmented generation, where something goes out to some software, some latent semantic vector database or Wolfram|Alpha or fucking Bing or something; results come back, they’re manipulated, they’re cleaned up, and then they’re sent to the LLM. But there’s no penetration at this point.
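The cleanly separated pipeline Jim describes can be sketched in a few lines. Every name here is a hypothetical stand-in, and the "vector search" is just word overlap rather than a real embedding database; the point is that the only interface between retrieval and generation is flat text.

```python
# Schematic of the RAG pattern described above: retrieval and generation
# stay cleanly separated, and the "LLM" sees only cleaned-up text. All
# names are hypothetical stand-ins; the search is just word overlap.

def retrieve(query, store, k=2):
    # Toy retrieval: rank stored documents by word overlap with the query.
    words = set(query.lower().split())
    ranked = sorted(store, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]

def build_prompt(query, docs):
    # The only interface to the LLM is this flat text prompt; no internal
    # state crosses the line in either direction.
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

store = [
    "AtomSpace is a weighted labeled hypergraph knowledge store",
    "Transformers predict the next token from context",
    "CMA-ES is a gradient-free optimizer",
]
query = "what is the AtomSpace"
prompt = build_prompt(query, retrieve(query, store))
```

The "no penetration" point is visible in the code: `build_prompt` can only pass strings, so neither side ever sees the other's intermediate state.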
Ben: No. I mean, I’ve talked to Stephen Wolfram about this, and he understands it probably better than the OpenAI guys. He and a number of people within Microsoft Research fully understand the desirability of doing this. But Wolfram|Alpha not only does not have uncertainty built in like OpenCog Classic does; it has no natural way to insert uncertainty into its plumbing of how it does algebraic simplification and so forth. Whereas a transformer is an uncertain inference engine: it’s doing statistical stuff, and Wolfram|Alpha is not doing statistical stuff. This gets back to the advantage of doing everything in the AtomSpace.
I mean, really the right way to get cognitive synergy between a transformer and a logical inference engine is to put the transformer in the AtomSpace and the logical inference engine in the AtomSpace. Then you have the same meta-representational fabric underlying the two of them, and the cognitive synergy comes easy. Now, how we can do that interoperation between a transformer as given now and logical reasoning within the OpenCog AtomSpace is hackier and more partial, right? And we can do some things because we’re doing a statistical logic inside the AtomSpace, so we’re ahead of Wolfram|Alpha; you can do one direction there. But you can’t that well inject logical stuff into the transformer, right?
Jim: This is interesting. This is real cognitive synergy: you get various tools to move down one or two levels in detail and use Hyperon as their meeting ground, essentially. I’m still skeptical about doing a transformer in Hyperon, maybe even with this hardware acceleration, but even if you can’t, if you get down a couple of levels and find representations the tools can interact with, with the structuring from the rest of the Hyperon environment, then maybe you can do some really cool stuff that you can’t do with this clean line between retrieval and generation, which is basically all we have today.
Ben: I think we can do transformers in Hyperon that will be more efficient than the transformers they’re doing now, which, if it’s true, will be because of Rholang rather than MeTTa, because Rholang allows you to exploit parallelism in a more flexible way than MapReduce and similar libraries. If you look at what’s going on inside a transformer when it does inference, which has to be done many times during training as well, it’s doing more stuff in parallel than needs to be done; I mean, it’s multiplying parts of the matrix that don’t need to change. You can leverage your parallel processing resources far more efficiently than is actually done on the Nvidia hardware when doing matrix multiplication now, and Rholang seems to let you do that. So there is a quite interesting story there, which is more about how to do concurrent processing efficiently, beyond MapReduce-type stuff.
But you’re right, even if that doesn’t work, there are a lot of interesting hybrid architectures. Look at how a transformer is trained: when you do fine-tuning, you’re retraining the top few layers of the transformer, or if you do LoRA, low-rank adaptation, you’re doing even less than that; you’re not touching the base model. So one can take the base model of a transformer and replace the top couple of layers with stuff inside Hyperon, and that can still work efficiently. Then you’re using reasoning to update these top few layers, while the base model stays what it is. This is an intermediate level of difficulty. Just doing the top couple of layers of the transformer, which is where a lot of the attention magic is happening, still needs things to be way more scalable than OpenCog Classic, but it doesn’t need things to be as scalable as you’d need to train a whole GPT-4-scale network using OpenCog. So there’s a lot of interesting flexibility to play with, actually.
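The frozen-base pattern Ben describes can be sketched without any deep-learning framework: below, the "base model" is a fixed feature map whose parameters never change, and only the replacement top layer is trained. This is a toy illustration of the division of labor, not Hyperon code; the feature map, target function, and training numbers are all made up for the example.

```python
# Sketch of the hybrid idea: a frozen "base model" plus a small trainable
# top layer, LoRA-like in spirit. Toy numbers only; a real setup would
# use a deep-learning framework and an actual pretrained base.

import random
random.seed(0)

def base_model(x):
    # Frozen feature extractor: its parameters never change.
    return [x, x * x, 1.0]

# Only the replacement top layer's weights get trained.
top = [random.uniform(-1, 1) for _ in range(3)]

def predict(x):
    feats = base_model(x)
    return sum(w * f for w, f in zip(top, feats))

def train_top(data, lr=0.01, epochs=500):
    # Plain SGD on the top layer alone; the base model is never touched.
    for _ in range(epochs):
        for x, y in data:
            feats = base_model(x)
            err = predict(x) - y
            for i in range(3):
                top[i] -= lr * err * feats[i]

# Fit the trainable head to a simple target, y = 2x + 1.
data = [(x, 2 * x + 1) for x in (-1.0, 0.0, 1.0, 2.0)]
train_top(data)
```

In Ben's proposal the SGD step on the head would be replaced by reasoning inside Hyperon, but the scalability point survives the toy: the expensive base computation is reused unchanged while only the small head is updated.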
Jim: Is there anybody in the generative base model world looking at what you guys are doing and thinking things like this?
Ben: Well, our own team is. Within SingularityNET, we have a team of 20 people, maybe 25, doing just transformer neural nets. We have our own server farm training transformers for language, speech, music, financial data, a whole bunch of different things, and then a separate team working on OpenCog AGI stuff, and these teams are run by close friends of each other. So, I mean, we have been playing with it. What I would say is that the people I know in the nuts-and-bolts transformer neural net world have too much other, even lower-hanging fruit to play with, and they haven’t gotten to this yet. I have found people in the hardware space very interested in this, though. We have a number of people we’re working with who are building large server farms aimed at serving transformers to a large number of commercial customers, and these guys are just interested in anything that will decrease the cost of training and inference.
Jim: As you know, also, there are people looking at models other than the traditional multi-head transformer architecture, which could be an order of magnitude or more efficient as well.
Ben: Well, yes. I don’t know if I’ve introduced you to Alex Ororbia, or if you’ve intersected him. He is at RIT. He has a different algorithm from backpropagation for training neural nets, which is based on predictive coding. He’s working loosely with Karl Friston, who you will have run across.
Jim: Yep. I know his work well, I don’t know him.
Ben: Yeah, I mean, I love Friston as a human being. He is a great maverick. I have mixed feelings scientifically toward his approach as computer science; I’m not going to debate him on neuroscience, I’m not a neuroscientist. I would say Ororbia’s stuff is the first thing within the Friston universe that fully makes sense to me on a technical level, in terms of doing predictive-coding-based learning as an alternative to backpropagation. And I think it’s a localized learning method in a way that backpropagation isn’t, right? It’s much more elegant and efficient than, say, Bengio’s difference target propagation, which was along similar lines, but it also has a clear probabilistic semantics, which would make it nice to go back and forth with probabilistic reasoning inside OpenCog.
So this is another possible direction: instead of bringing the transformers into OpenCog, if you manage to make something transformer-like which is trained using Ororbia’s predictive coding rather than backpropagation, you’re getting something that at least has a clear probabilistic semantics throughout the whole network on the neural net side, which then makes it easier to do cognitive synergy with probabilistic reasoning on the OpenCog side.
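A loose sketch of what "localized" means here: in the toy two-stage chain below, a latent state settles against purely local prediction errors, and the weights then update from those same local errors, with no globally backpropagated gradient. This is an illustration of the general predictive-coding idea only, not Ororbia's actual algorithm, and all the numbers are made up.

```python
# Toy predictive-coding chain y -> z -> x: each stage predicts the one
# below it. Both inference (settling the latent z) and learning (updating
# w1, w2) use only the local prediction errors e1 and e2; nothing is
# backpropagated. An illustration of the idea, not Ororbia's algorithm.

def settle_and_learn(w1, w2, x, y, lr=0.05, steps=50):
    z = 0.0
    for _ in range(steps):
        e1 = x - w1 * z              # local error at the bottom stage
        e2 = z - w2 * y              # local error at the middle stage
        z += 0.1 * (w1 * e1 - e2)    # inference: settle the latent state
    # learning: purely local, Hebbian-style weight updates
    w1 += lr * e1 * z
    w2 += lr * e2 * y
    return w1, w2

w1, w2 = 0.5, 0.5
for _ in range(200):
    w1, w2 = settle_and_learn(w1, w2, x=1.0, y=1.0)
```

At convergence both local errors go to zero, so the chain as a whole predicts x from y (here w1 * w2 approaches 1), yet each weight only ever saw its own stage's error.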
Jim: Well, send me the link to him. I’d love to take a look. That’s one of the things I’m keeping my radar out for, just wide scanning, don’t need them right now: different approaches beyond classic gradient-descent-trained forms of transformers.
Ben: What Ororbia has is a different way of training neural nets, and this is a great unknown to me in the neural net field: to what extent have our neural net architectures become overfit to the specific pluses and minuses of backpropagation as a learning algorithm?
Jim: As you know, even Jeff Hinton says that’s a problem, right?
Ben: Yeah. Take something like InfoGAN, which I played with a while ago and probably pointed out to you, which is a cool way of learning GAN-like generative models that automatically learn structured noise; it automatically learns semantic latent variables of the generative model. No one managed to get that to work for really complex stuff. You could use it to generate models of a face that automatically learn latent variables for the nose, mouth, and eyes; you could never use it to automatically generate images of a whole scene, for example. So you couldn’t make an InfoGAN transformer. Now, backprop never converges when you try to train a complex InfoGAN model. Why does it never converge? Is it because the architecture sucks, or because backprop sucks, right? To find that out, you’d want to try to train an InfoGAN architecture with a non-backprop learning mechanism. As far as I know, no one ever tried to train an InfoGAN neural net using CMA-ES or some floating-point GA.
Jim: Yeah, and I always remind people, don’t forget about evolution. Evolution will train anything. It won’t necessarily do it fast, but non-differentiable, fucking complicated-as-shit stuff? No problem. Evolutionary approaches, I still believe, are underutilized in this space.
Ben: Well, that’s right. So when backprop doesn’t converge because the gradient fucks up, right, I mean the gradient vanishes, well, CMA-ES doesn’t care. So maybe you could use a floating-point GA to train an InfoGAN network. Maybe you could use Ororbia’s predictive-coding-based method to train an InfoGAN network. Then you could do an InfoGAN style of transformer, which would enable a whole different attention mechanism to work.
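To make the gradient-free route concrete, here is a tiny (1+lambda)-style evolution strategy fitting a non-smooth objective where a gradient would be unhelpful. It is a toy hill climber in the spirit of the discussion, not the real CMA-ES, which adapts a full covariance matrix as it searches; the objective and all parameters are invented for the example.

```python
# A tiny (1+lambda) evolution strategy: mutate the weight vector with
# Gaussian noise and keep the best candidate found so far. A toy in the
# spirit of the discussion, not the real CMA-ES algorithm.

import random
random.seed(1)

def loss(w):
    # Non-smooth objective where vanishing gradients would not matter:
    # fit |w0*x + w1| to |2x - 1| at a few sample points.
    pts = [-1.0, 0.0, 0.5, 1.0]
    return sum(abs(abs(w[0] * x + w[1]) - abs(2 * x - 1)) for x in pts)

def evolve(w, sigma=0.3, lam=8, gens=300):
    best = loss(w)
    for _ in range(gens):
        for _ in range(lam):
            child = [wi + random.gauss(0, sigma) for wi in w]
            c = loss(child)
            if c < best:      # greedy selection: keep improvements only
                w, best = child, c
    return w, best

w, best = evolve([0.0, 0.0])
```

Nothing here ever asks for a derivative, which is Jim's point: the same loop trains any parameter vector you can score, however non-differentiable the objective.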
Jim: Well, yeah, this is just highlighting a bigger problem, which is that there’s so much low-hanging fruit on traditional transformers and Nvidia hardware that 98% of the people are working in this, call it the status quo world, and we need more people working on these alternatives. Now, this was a little bit nerdier deep dive than usual for the Jim Rutt Show. I’ve enjoyed the hell out of it; I hope you all have too. Thanks, Ben.
Ben: All right, thanks a lot, Jim. It’s fun to get to dive into some of the inner workings because I mean, in the end, that’s what’s going to make the AGI happen.
Jim: Yeah. I’m particularly excited to get my hands on this thing once you declare it’s worth looking at, though you know my lack of patience for software that doesn’t work. When the code and the graph are the same, that’s what excites me. I want to get my head around that. I think there could be dangerous things done with that.
Ben: Live dangerously, as Nietzsche said.
Jim: Alrighty, thanks again, Ben, and we’ll have you back soon.
Ben: All right. Bye-bye.