The following is a rough transcript which has not been revised by The Jim Rutt Show or Matt Welsh. Please check with us before using any quotations from this transcript. Thank you.
Jim: Today’s guest is Matt Welsh. He is a co-founder and CEO of Fixie.ai, a start-up that is using AI to rethink how the world builds software systems. Previously he was senior vice president of engineering at OctoML, a start-up developing technology to optimize machine learning models. I actually took a look at that site this morning as well. In fact, I got on their waiting list. Looks like they’re doing some pretty interesting things.
Matt: Yep.
Jim: And they’ve certainly raised a shitload of money so they’ll be able to do it. Previously he’s held senior engineering and engineering management roles at Google, Apple, and Xnor.ai, and way back when, he was a professor of computer science at Harvard. So he probably does know a thing or two. So welcome, Matt.
Matt: Thanks, Jim.
Jim: Yeah, great. I really am looking forward to this conversation. Those who want to learn more about Matt, check him out at mdw.la. I don’t think I ever knew what the ending of it was.
Matt: It is a strange domain. I just wanted to get something short, and it was available.
Jim: LA. What the hell is that? Latvia?
Matt: L-A is Laos.
Jim: Laos. Oh.
Matt: Apparently, and I have been to Laos, but that was not the reason that I got the domain.
Jim: That’s cool. You can also catch him @mdwelsh on Twitter. And I’ll just note in passing, at mdw.la he has an Apple IIe emulator, which is kind of fun to screw around with, as one of the alternate interfaces to what’s on that page. You don’t have to use it to get his CV and stuff. But I did just for fun. Kind of took me back to the good old days. My very first PC was an Apple II Plus, which was the Apple that came right before the e. I think the e had radical stuff like 80-character screens and stuff. Didn’t it?
Matt: Color.
Jim: Oh, okay. Well, mine had color. The Plus had color. That was different.
Matt: Oh, it did. That’s right.
Jim: Yeah. The difference between the Plus and the non-Plus was the Plus had color, and it had bitmap graphics.
Matt: Right.
Jim: It wasn’t just hard-coded character mode. You could play games and do stuff like that. Anyway, that was kind of fun. Today we’re going to start off with an essay that Matt wrote called The End of Programming, which he wrote for the Communications of the ACM back in January. But I’m sure, as usual, we’ll go all over the place. So, The End of Programming. Programming is obsolete. What do you really think?
Matt: Yeah, that’s it. The thing that got me thinking in this direction was anyone who’s played with ChatGPT has probably been amazed by what’s possible and amazed by how well it works in so many different situations. And the thing that got me thinking about this was the whole field of computing, up until, basically, a year ago, has been entirely about getting humans to instruct computers how to do stuff. And what we do is we have an idea, and then we write a program in some arcane language like C++ or Python or whatever it happens to be. And that just gives the computer, which we treat as a dumb machine, the exact instructions of what to do. Right? We say, “Do this. Increment this variable. Follow this path of instructions. Input this data.” So, when you start using things like ChatGPT, you start to notice that the language model actually can act as a computer, as a problem solver, if you give it the right instructions.
So, to give an example of that, you can give ChatGPT a job to do. You can say, “I have three stacks of cards. They’re organized like this. Please reorganize them into this other stack configuration,” and the model will give you step-by-step instructions about how it would do that.
Jim: Yeah, and it does other cool things like, “Hey, I got three potatoes, a pound of ground beef, and a jar of ketchup. Give me three recipes.” It’s actually quite good at things like that.
Matt: It’s amazing. And so some people think, and it’s easy to approach this saying, “Well, all it’s doing is parroting back something that was in its training data set, part of what was found on the Internet a few years ago, when they crawled all these Web pages and gave the model its training set.” So your pound of ground beef and potatoes and whatever, the recipe it comes up with is found on some website somewhere. But if you give the model something that is utterly new, a new problem it’s never seen before and certainly doesn’t have an answer on the Internet already, it’s actually demonstrating the ability to perform logical reasoning. And the thing that was surprising to me about this was, of course, this was not intentional. When these models were first developed and trained, they were trained to complete texts. They were trained to, “Fill in the next word in this sentence.”
Jim: And the weird thing is that’s actually all they’re doing.
Matt: Right.
Jim: That’s the head-scratcher. Right? And for people who want to learn more about this, I just recently finished reading Stephen Wolfram’s new book about ChatGPT, what it is and how it does it. I knew a fair bit about it, but he helped fill in some blanks. But he’s pretty crisp about the fact that it is still just filling in the next word, one at a time. Of course, it’s clever. It looks for probabilities in all kinds of interesting ways, but it’s really just filling out one word at a time, which makes it extraordinarily eerie that it’s able to do the job that it does.
Matt: Exactly, and the thing that’s really fascinating about it as well is even the people who developed and trained these models did not know when they did so that, by training the model to complete text, they were actually training it to think.
Jim: Think. Think. I don’t know about think-
Matt: Think.
Jim: … but put out artifacts that are like the result of thinking. Is there a distinction? I don’t know.
Matt: I don’t know that there’s a distinction. That’s a deeper philosophical question, right, that-
Jim: Well, it’s a big one, though.
Matt: It’s a big one. The philosophical question that people have had for centuries is, “If I have a machine that acts like it’s thinking, is it in fact thinking?” And we don’t know. We don’t know. But at least from an end user perspective, in some ways I don’t really think I need to care about the answer to that question. If I can treat the model as though it’s got these reasoning abilities, that’s good enough.
Jim: Yeah. And it actually isn’t all that great at deep formal logic. You can easily get it to go off track with complicated puzzles. But at things that typical humans are good at, it’s very, very good. In particular, I have to say, I was quite shocked, the first time I used ChatGPT-3.5 as a programming aid, how good it is at programming. That’s really what we’re here to talk about today. And of course, it was trained on a lot of code. I found it’s not so good, at least 3.5 isn’t so good, for writing long programs, but it’s great for writing a 25-line function, even a pretty complicated one.
Matt: Yeah, that’s right. But I see this as a little bit of like, in the short term, we’re seeing already people leveraging these AI models to write code for them. But I actually think what ends up being really interesting is when you get beyond that, and you just skip the code entirely, and you just ask the model to solve your problem. And so the model is the code. The model is the computer, in some sense. It starts to solve problems in a very direct way.
So I can say to the model, “I have the following data. I want you to pull data from here, here, and here, and then collate it in the following way and generate a report” or generate an email or whatever it happens to be. And you don’t have to write any code. The model doesn’t need to spit out any code. There’s no code. It’s just the model doing the work. And to me, that is a complete revolution in terms of how we think about computing. It’s not about this priesthood of people who know how to write software translating their ideas into computer language. You’re just using the computer directly.
Jim: Yeah. And as you pointed out, this idea of taking something that’s in your head and then having to run it through a computer language to get it instantiated on your computer is quite difficult, and it truthfully ain’t much easier now than it was way back yonder. I think I wrote my first program in 1967 in Interactive BASIC on a UNIVAC 1108. Then, when I was a freshman in college, freshman and sophomore, ’71, ’72, I wrote a fair bit of numeric simulation work for the astronomy department in FORTRAN IV. It was fairly ugly stuff, but in truth, not much uglier than C++. Right?
Matt: Oh, it hasn’t improved. I went through the history of programming languages, and if you look back at ALGOL in 1968, all the way to something like Rust in 2018, being the most modern and cutting-edge programming language, frankly, writing computer programs has not gotten easier. It has not. The languages have become more powerful, but they’ve become a lot more complex as a result. And if you get to a place where you’re not coding at all, coding is not your goal… Code was a means to an end, but if we can shortcut that completely and just use language models as the actual computational engine, I think it radically changes how people do this kind of stuff.
Jim: Yeah, now, when you talk about that, using the language model as the computational engine, that seems to be one level of things. It strikes me there’s another level. So I suspect your Fixie.ai… I actually spent a little time plunking around on it today… must be doing some of this, which is not just feeding a question to a language model and then coming back with an answer, but, in some sense, orchestrating how it’s presented, decomposing the problem and, actually, even letting the language model decompose the problem, et cetera. As-
Matt: Right.
Jim: I think of this as probably going to be the sweet spot for a while, is you write the software behind the scenes so that the customer doesn’t have to, and it, essentially, uses some relatively simple code to interact perhaps iteratively and multiple times with the language model to find the answer to the question or even bring in data from external sources, package it up and do that. I created a little chatbot, I don’t know, a month and a half ago for my podcast, where I downloaded all the transcripts into a semantic vector space, put a query front-end on it, took the results of that, stuffed them into the context window for GPT, and had it write the output. And it works surprisingly well. If anyone wants to try it out, chat.jimrutt.com. Butt-ugly interface. It’s about four lines of Flask, but it does work. Right?
Matt: Now we got RuttGPT as the next big thing in the field. Right?
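What Jim describes is the now-standard retrieval-augmented pattern. As a rough illustration only, here is a minimal sketch of the indexing half, chunking the transcripts and storing their embeddings; the file layout, chunk size, and embedding model are assumptions rather than details from Jim's actual setup, and it assumes the pre-1.0 openai Python client. The query side is sketched after Matt's description further down.

```python
# Minimal sketch of the indexing step Jim describes: chunk the transcripts,
# embed each chunk, and save the vectors so a query front-end can search them.
# Assumes the pre-1.0 `openai` Python client; file names, chunk size, and the
# embedding model are illustrative assumptions, not Jim's actual setup.
import glob
import json

import numpy as np
import openai

EMBED_MODEL = "text-embedding-ada-002"  # assumption: any embedding model works


def chunk(text, size=1500, overlap=200):
    """Split a transcript into overlapping character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]


def embed(texts):
    """Return one embedding vector per input string."""
    resp = openai.Embedding.create(model=EMBED_MODEL, input=texts)
    return [d["embedding"] for d in resp["data"]]


chunks = []
for path in glob.glob("transcripts/*.txt"):        # hypothetical directory
    with open(path) as f:
        chunks.extend(chunk(f.read()))

vectors = np.array(embed(chunks), dtype=np.float32)
np.save("transcript_vectors.npy", vectors)         # the "semantic vector space"
with open("transcript_chunks.json", "w") as f:
    json.dump(chunks, f)
```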
Jim: Yeah. So, at Fixie, what are you guys doing along that continuum from typing a question into ChatGPT to some more complex orchestration between the user and the LLMs?
Matt: Yeah. The core thing is that these language models by themselves don’t know anything about live data. Right? They don’t know anything about a company’s own data. They were trained on a massive data set culled from the Web, but any company that wants to use this language model tech needs to have the ability to interface the model to their own database or their own knowledge base or their own APIs or whatever it happens to be. And so what we’re doing at Fixie is we’re trying to make that just really easy for companies to do.
So if, say, you’re a company like an insurance company, and you want to have a chatbot on your website that lets customers ask questions about the claims process or even check on a specific claim or update some billing information, whatever it happens to be, today, to build that using large language models is definitely possible, but it’s a lot of work. And you have to pull together a bunch of different pieces, the models, vector databases, the data sources, the APIs, all the Web framework stuff to hook it into your application. What we’re doing at Fixie is just making that whole process extremely streamlined so that you don’t need to be an AI expert or a language model expert to be able to build that. We’re seeing a lot of interesting applications of this stuff. Customer support is certainly a huge one, people who want to do things like business reporting or checking communications to see if they’re compliant with regulatory constraints. There are just so many applications of this stuff that the world is just waking up to.
Jim: Yeah. I’ve been advising an early-stage venture fund that’s going to invest in this stuff. And one of the things I keep pointing them to is complex customer service.
Matt: Yeah.
Jim: That would be perfect. It’s not quite life and death. I’m not sure I’d want language models signing off on my legal compliance for the SEC quite yet, maybe do a first pass, but for complicated customer service, I guarantee they’ll do a better job on average than the $22-an-hour nimnuls they have on the phones these days at phone companies and credit card companies and places like that.
Matt: Yeah, I think that there is a huge opportunity there, but also one that doesn’t necessarily fully replace the human customer support agent, but at least automates a lot of the manual stuff that they’ve got to do. One of our customers that’s building with Fixie is finding that their human customer support agents spend a lot of their time manually doing things like looking up the order history or tagging the ticket with the right category. All these things could be automated. And you still often want to have a human at least on that final quality control step and making sure that they’re able to follow up with a human when necessary. But if you can streamline the manual stuff that they’re doing, these people just end up being a lot more productive.
Jim: I’m kind of personally curious. When you do something like take a company’s knowledge base and integrate it with the language models, are you using things like latent semantic databases, or are you doing things like fine-tuning models or some combination of the two, or is there some other stuff that you’re doing?
Matt: Yeah, generally, it’s a combination of both. In many cases, the language model, the base language model like GPT-4 is really good enough by itself at ingesting information from, say, your documentation on your website or your knowledge base, knowledge base articles, and then answering questions from it. You don’t need to fine-tune the model, so you can use the same model, but just give it the right context in real time. And for that, you use things like a vector database, where you would take the user’s question, run it through the model to get the vector embedding that represents that user’s question, look it up in the vector database to match it against all the documents that you’ve stored there, fetch those documents, feed the documents back through the language model to generate an answer. Okay? That’s a pretty common pattern.
In some cases, though, we’ve also found it is helpful to fine-tune the model, which basically puts the model itself through another round of training that changes the model’s weights, based on the knowledge of the new training data. And that ends up being necessary in some situations, but my prediction would be, in the future there’s going to be less need for that, given how powerful the base models are becoming.
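Matt's step-by-step description of the vector-database pattern maps almost directly onto code. Here is a minimal sketch of that query path, continuing the hypothetical setup above: brute-force cosine similarity stands in for a real vector database, and the model names and prompt wording are assumptions for illustration.

```python
# Minimal sketch of the retrieval pattern Matt walks through: embed the user's
# question, find the closest stored documents, and feed them back to the model
# as context. Assumes the pre-1.0 `openai` client; prompt wording is invented.
import json

import numpy as np
import openai

vectors = np.load("transcript_vectors.npy")
chunks = json.load(open("transcript_chunks.json"))


def answer(question, k=4):
    # 1. Embed the user's question.
    q = openai.Embedding.create(
        model="text-embedding-ada-002", input=[question]
    )["data"][0]["embedding"]
    q = np.array(q, dtype=np.float32)

    # 2. Cosine similarity against every stored chunk (fine at small scale;
    #    a real vector database does this approximately and much faster).
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(chunks[i] for i in np.argsort(-sims)[:k])

    # 3. Feed the retrieved documents back through the language model.
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided excerpts. "
                        "If the excerpts don't contain the answer, say so.\n\n"
                        + context},
            {"role": "user", "content": question},
        ],
    )
    return resp["choices"][0]["message"]["content"]
```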
Jim: Interesting. Now so far, for the public, OpenAI has not offered fine-tuning on GPT-4, though I have heard through the grapevine that Microsoft Azure may be offering it soon. Do you know where that stands?
Matt: I haven’t heard the latest, but you know what? It would not surprise me if in the next few months, anybody could fine-tune GPT-4 as well.
Jim: Yeah, that would be for certain, very precise domains, thinking like regulatory or legal or-
Matt: Legal, medical perhaps, you would want to do that. And then, there’s tons of open source models coming out all the time as well. There’s going to be an explosion of these, I think. We’re going to see dozens or possibly hundreds of new models coming out all the time, and people are going to be trying to figure out, “Which model do I use for which situation?” It’s not always going to be clear. There’s going to be trade-offs in quality. The tone of voice of the models could be different. You also end up with just what’s the economic cost of running the models? So there’s going to be an interesting kind of explosion here to worry about.
Jim: Yeah, in fact, I had the guy from Stability AI on recently to talk about their work on truly open source large language models, not just like Facebook’s LLaMA, where you have the weights, but ones where you actually have the software, the corpus, the whole deal. That will be really interesting, and they’ll be available for fine-tuning and the reinforcement learning and all that good stuff. I know that probably scares the hell out of the OpenAI guys, but it’s coming.
Matt: I think it’s coming. And I think the genie’s out of the bottle, so to speak, already. The base technology to develop and train these models has been out there. It’s all published in research papers. There’s no real secret sauce. OpenAI was the first to kind of really throw all the resources you needed at it to make it work really well, and they’ve done a fantastic job at packaging it to make it easy to consume. ChatGPT was little more, really, than a user interface on top of an existing set of models, but it changed the way people think about this stuff, for sure. But the challenge for them will end up being, can they remain competitive in a world where anybody can get their hands on one of these models and fine-tune it and get even better performance without their help?
Jim: Yeah, it is going to be an arms race between size of model, techniques that are used, quality of the corpus, et cetera. For instance, the Stability guys have said that they think that their 67-billion-parameter model will outcompete or outperform GPT-3.5 at about 170, is it? Something like that.
Matt: Mm-hmm.
Jim: And the LLaMA model does surprisingly well also on considerably smaller models. So that will be part of the dimensionality of competition. One thing I’m sure hits home with you at Fixie is the work I’m currently doing on the scriptwriter helper that I talk about on the show once in a while… We’re building a software program, in cooperation with a professional scriptwriter, to have a series of iterations where you can start literally with a three-sentence description of the movie and go through a series of automated processes and have a hundred-page script come out the other end. You’re not going to get a great script that way. And I also have a bunch of places where you put notes in, just the way a producer would in a movie. In fact, I was just fooling with that yesterday. “Oh, I want the romantic scene to be steamier.” Just write that in, and it’ll rewrite the synopsis, and then you feed the synopsis through the scene-maker and then… Oh, it’s crazy how it works. But where I’m going with this is the limitation that these models are slow. Right?
Matt: Mm-hmm.
Jim: If I were to use GPT-4, it would take over an hour to do a single pass to write the film script. I prototype it on GPT-3.5, which might take 15 minutes. In your own work, what are you guys finding about the annoying slowness of these things?
Matt: Well, I’m laughing because you’re complaining it’s going to take an hour to write a whole movie script. I would say you should be just amazed that it’s just that fast. But in any event, yeah, latency is definitely a concern for many applications. We’ve heard this from lots of people. And I’ll do demos for people of Fixie, and it’ll be 30 seconds and we’re still waiting for the answer to come back, and that’s kind of not a great experience. We’re so used to things being instantaneous, especially if you’re modeling it after interacting with a human. The thing, and my take on this, is that the performance of this stuff is only going to get better over time. And we’re seeing new techniques and new technologies come along to optimize the performance of these models all the time.
My colleagues at OctoML, where I used to work, they recently released the ability to run a large language model on an iPhone. And that’s unbelievable because just a couple of years ago, if you had said, “Hey, how would you run one of these things on a phone?” I’d say, “No way. The model’s too big. The processors are too slow. It’s going to go through your battery in two minutes if you tried it.” But they’ve managed to optimize the performance to make that workable. So I just think we’re going to see this improve dramatically over time. And a year from now, we’re not going to be saying to ourselves, “Man, these things are so slow.” I expect that, kind of like remembering the early days of dial-up Internet, we’ll think back on today as the moral equivalent of the dial-up Internet days.
Jim: Yeah. What do you think about eventual hardware optimizations? Today the stuff’s running on, essentially, gigantic GPUs.
Matt: Yep.
Jim: Is there any reason to think that there may be some ASIC-style solutions or some of these big old megachip solutions or some way to really bust through a couple of orders of magnitude on throughput?
Matt: Oh, absolutely. And in fact, a friend of mine, who happens to be the son of our CTO at Fixie, has just started a company called Etched.ai, and they’re building a custom-built chip for accelerating large language models.
Jim: Ah.
Matt: They think they can get something like 100x performance increase out of a single chip. You can replace a rack of GPUs with one of their chips. This is going to be a huge, huge, huge change in the way that these things are used. Why do we not have that yet? It’s because these language models have only existed for a bit of time now. They’re only now widespread enough, and they’ve just become large enough, that it makes sense to invest in building custom hardware for them, just like with GPUs. We didn’t have GPUs for many, many, many years, and it wasn’t until computer graphics was really demanding that there was a market for it. So yeah, I’m super excited to see what people come up with, when they want to build out their own chips to run these models.
Jim: Yeah. And it’ll be kind of an ecosystem co-development, where if there are chips, people will start using technologies to build models supported by the chips, right, [inaudible 00:23:31]-
Matt: Correct.
Jim: … which then gives an incentive to build bigger and faster chips that support the same interface.
Matt: And that’s exactly what NVIDIA’s done with their GPUs. I mean, today it’s kind of funny to even call it a graphics processing unit because they’re primarily being used for things like running AI models. But knowing what capabilities are in an NVIDIA chip has dramatically influenced the way that people design these models. And they know which numerical operations are going to be fast and which ones are not going to be fast. And so they tend to build the models for the hardware, which is the right approach. So yeah, I think we’re going to see a very interesting diversification of the types of hardware that people develop for this kind of stuff.
Jim: Yeah, that’ll be very interesting. Another little technical question I’m kind of personally curious about. With my scriptwriter, I have to do a lot of parsing of the stuff that comes back from the LLMs, and so I try to get it to write JSON back at me. And it works 95% of the time. And then, even funnier, I have a special system prompt to fix JSON. So I’d say 90% of the time, if I send it one that isn’t right, it fixes it correctly. It would seem to me that would be a really nice extra feature for them to build in, is to have some structured output mechanisms. Presumably you guys have to deal with this all the time, when you’re doing these orchestrations.
Matt: We do, but interestingly, the way that we have approached it is to say, “Let’s make English the lingua franca that these software components communicate with.” So, in the past, you’re right, things like JSON or YAML or whatever formats end up being really important if you’re passing the data downstream to something that needs to have it in some structured format… We are kind of going maximalist on the language model as the primary way that you operate on data. And what we’ve found is the language models are just so good at taking data in whatever format you give it. You can give it to them as just a list. You can give it to them as a CSV. You can give it to them as JSON. You can give it to them as just a table in some format. And it doesn’t care. It’s perfectly fine in any old way that you want to give it the data. It’s able to deal with it. And so we’ve said, “Hey, let’s just use that and trust that it’s going to work.”
Now, if you’re sending the data outside of your system, you’re going to have to format it into something. And I do think having a specialized formatter that can send the data out from a language model as JSON or whatever format you want does make a lot of sense.
Jim: It is true. I mean, I quickly learned that you just talk to it in language. Right? It’s the receiving part where having it in a structured form makes a big difference. In my program, I create a bunch of intermediate stuff and reuse it, like I stick the character descriptions in whenever there’s dialogue to be written. So there’s a lot of back and forth. And I, at first, just spent a lot of time writing parsers. Then I said, “Let me see if I can get it to talk JSON.” And the answer was, “Yeah, sort of, probably most of the time.” Right? And it’d be really nice if they built that in as a feature set.
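A rough sketch of the retry loop Jim is describing: ask for JSON, try to parse it, and on failure hand the broken output back to the model with a "fix this JSON" system prompt. The prompt wording, model name, and retry count are assumptions, not Jim's actual prompts.

```python
# Sketch of the JSON-repair pattern Jim describes: parse the model's output,
# and on failure ask the model itself to fix the JSON. Prompt wording and the
# number of retries are assumptions; assumes the pre-1.0 `openai` client.
import json

import openai


def chat(system, user, model="gpt-4"):
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp["choices"][0]["message"]["content"]


def get_json(task_prompt, retries=2):
    raw = chat("Respond with valid JSON only, no prose.", task_prompt)
    for _ in range(retries + 1):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # The "fix JSON" step: hand the broken output back to the model.
            raw = chat("You repair malformed JSON. Return only corrected, "
                       "valid JSON with the same content.", raw)
    raise ValueError("model never produced parseable JSON")
```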
Oh, speaking of which, back onto programming, this isn’t quite all the way to your vision of using language as a programming tool, but I have found, writing this code, that it puts you in a very odd mindset because part of your brain is doing traditional programming, where I’m talking to my computer to make it do what I want it to do, but the other part of what I’m doing is creating prompts and prompt templates and programmatized templates that are going to talk over a wire to an alien. And no telling what the alien might say. Right? And so I’m constantly editing the prompts and the prompt templates. Then I’m also editing the code. And they’re two very different mindsets.
Matt: It’s true that there’s an impedance mismatch, in some sense, between instructing a computer explicitly in code and, effectively, asking it politely to do something for you using a language model. I’m reminded a little bit of a programming language… I think it was from the ’70s… called INTERCAL that was a bit of a parody programming language. And INTERCAL had a feature where you needed to say the word “please” enough times for the program to actually run. If you didn’t say “please,” it wouldn’t do the thing. And so I find myself, when I’m talking to a language model in a programming context, often saying things like, “Please do this” and “Thank you,” which is interesting.
So yeah, we run into this all the time. And it’s often hard to know what is your mental model about what is the language model capable of doing, and what will it comply with in some sense. If it’s given a set of instructions, how explicit do those instructions need to be? How detailed do they have to be? How complicated can you push it? We’re learning how to do all this stuff all the time, and it’s not always clear what belongs in code and what belongs in English. Ultimately, we’re going to get to a place where, I think, it’ll almost all be in natural language. But we’re not there yet.
Jim: And at least with the OpenAI engines, you also have the distinction between the system prompt and the user prompt. And I found with GPT-4, the system prompt is really good, really powerful, but in GPT-3.5, less so. And so, annoyingly, I have to write switch statements to send stuff to the user prompt, if we’re using GPT-3.5, and to the system prompt, some of the setup stuff, if we’re using GPT-4, which is kind of curious and annoying. So there’s a lot of this what I call “cultural learning” that we’re all doing because, of course, this stuff isn’t documented, and it’s not even documentable probably in principle because the guys have no idea what these things can actually do, and we’re just sort of exploring and finding what works and what doesn’t.
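The switch Jim mentions amounts to deciding which message role carries the setup text. A minimal sketch, with the routing rule taken from his anecdote rather than any official guidance, and the model names as assumptions:

```python
# Sketch of the switch Jim describes: GPT-4 follows system-prompt setup well,
# so route the setup text there; for GPT-3.5, fold it into the user message.
# The routing rule reflects Jim's anecdote, not documented behavior.
def build_messages(setup, request, model):
    if model.startswith("gpt-4"):
        return [{"role": "system", "content": setup},
                {"role": "user", "content": request}]
    # GPT-3.5: prepend the setup to the user prompt instead.
    return [{"role": "user", "content": f"{setup}\n\n{request}"}]
```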
Matt: I’m not even sure… Right… if I wanted to write the operator’s manual for GPT-4, what I would even put in that book, right, is exactly how do you ask it to do certain things. What’s the right approach? And there is a lot of people writing about that and thinking about that and trying to figure out how to use this new technology. I think of it like this alien technology that just landed in our backyards, and we’re all just kind of poking at it and prodding it and trying to figure out what it can do.
Jim: And of course, talk about skyhooks. I have used it to write prompts for itself. And it’s actually pretty good at that. It will take a short thing and make it more detailed and verbose. And then it will also, oddly, if you run out of context space, you can at least get GPT-4 to do lossless compression on your prompts. A prompt, for instance, in my case, is the long-form synopsis of the movie that I want to use in various places as a system prompt primer for writing scenes and describing characters and stuff. And so I have it create a lossless compressed version, which cuts its size by about 60%, which is pretty amazing.
Matt: Oh, wow. That’s amazing, yeah, thinking about the model itself kind of helping you tailor the inputs to itself.
Jim: Yeah, I call that skyhooking, skyhook.
Matt: I love it. I love it.
Jim: Let it be the skyhook, right? You really got to get your head into that to maximize the ability to deal with these aliens. Right? They are. They’re aliens. I’d love to get your thoughts on this as well.
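A sketch of the "skyhook" trick Jim describes, asking the model to compress a long synopsis so it takes up less context space. The instruction wording is an assumption, and despite Jim calling it lossless, it is worth spot-checking what actually survives the compression.

```python
# Sketch of the "skyhook" Jim describes: use GPT-4 to compress a long prompt
# (here, a movie synopsis) into a shorter version to save context space.
# The instruction wording is an assumption; spot-check what actually survives.
import openai


def compress_prompt(text, model="gpt-4"):
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Compress the following text as much as possible while "
                        "preserving every fact, name, and plot point. The result "
                        "only needs to be readable by you, not by a human."},
            {"role": "user", "content": text},
        ],
    )
    compressed = resp["choices"][0]["message"]["content"]
    print(f"{len(text)} chars -> {len(compressed)} chars")
    return compressed
```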
One of the things I have gone on the record as saying is that this change from people who know how to deal with stacks and arrays and pointers and all that crap to people who are dealing with language may mean that we want to hire very different people. This may be a return to the tech marketplace for high-end liberal arts grads, for instance, because people who are really good at thinking how to write and how to express and how language really works… I would love to find a poet, a person who writes epic poetry, something like that, and say, “All right, let’s see if we can turn you into a prompt engineer” or, actually, something more than a prompt engineer because it kind of annoys me when I hear the term “prompt engineering,” because the art of massaging these things is more than just the prompts. It’s some larger gestalt, which we’re just trying to figure out. What have you guys been finding about the kind of people that are good at this kind of work?
Matt: Yeah. Well, what’s interesting is it’s unclear what it means to be good at it even. And we’re learning that too. Right? I wouldn’t even call it yet prompt engineering. It’s more like prompt voodoo, where we’re just kind of poking and figuring out what works. And there’s not a lot of methodology around it yet. Primarily because this thing is such a massive black box, with so many complexities internally, it’s really unclear what one needs to do to get it to work well. So I do think that there’s a fair bit of cargo cult science around this right now, and we’ll see that hopefully mature over time into more of a rigorous methodology that you might be able to call engineering.
The thing that excites me tremendously about this is the idea that these language models could, potentially, open up the field of computing to vastly more people than are currently able to access it. So if you imagine people that are not even necessarily literate being able to instruct a language model and have a positive outcome using it to solve a business problem or answer a question or build a tool for them, all of these things, in my mind, are things that anyone with an appropriate kind of level of language command, which most people have, should be able to do. And so then the question is, what happens to our priesthood of programmers, so to speak, as soon as anyone is able to go directly to the computer and get it to do the thing without having to go through these ridiculously highly-paid engineers like me?
Jim: Yep, indeed. And as somebody who managed armies of engineers in the past, getting rid of a bunch of them looks very attractive, tell you [inaudible 00:34:08]. But sort of doing that correctly is going to be an interesting task because it’s probably going to be a whole new way that we manage projects. Right? You’re still going to need people to conceptualize the flows and the data sources, quality control, what are the actual limits of hallucination we can tolerate, et cetera. So there’s going to be a whole new set of product manager-type people, probably, that are going to become very important in actually creating useful results from these platforms. Something like that. You guys thought about that much or seen much of that in your own work?
Matt: It’s definitely something that we’ve been thinking about. And I think the uncertainty comes from the fact that we don’t know whether these AIs, say in the next five to 10 years, are going to be seen as thought leaders or as automata. In other words, one could imagine treating these AI models similar to the accountants of the 19th century. Right? The rooms full of people keeping the books and doing calculations by hand and jotting them down into notebooks. You could employ language models in that way and then still have warm-blooded humans being your thought leaders and the ones who are kind of defining the strategic process. But what’s uncertain to me is the extent to which these models might actually become powerful enough to do the thought leadership as well.
And to take a kind of extreme example of that, imagine a government run by language models or run by AI, right, operating very autonomously in defining policy or analyzing data and coming up with the best approach to solving some problem. It’s very scary to think about, of course. It’s very frightening to think about, but it kind of reduces many of these things that we currently have so many debates about in our society from something that’s a question of morals and ethics and cultural background to a very pragmatic kind of thing. And so the question will be how far we want to employ these models and how much control we effectively give them over how we do things. And yeah, we don’t know where this is going to go. I think it all comes down to what is the ultimate limit of the capabilities of this technology.
Jim: Yeah, that will be part of it, but there will also be some domains where, at least as long as it’s limited to feedforward large language models, it won’t really be able to… It’s hard for me to even see GPT-5 coming up with some definitive answer on how abortion should be regulated, for instance. I mean, that’s still a value question based on the traditions people come from, and there is no right or wrong answer per se. There’s-
Matt: Correct. Right, right.
Jim: And lots of other things. What should be the share of GDP that goes to the government versus the private sector? These are value-laden choices that, well, even-
Matt: Yes.
Jim: Even GPT-4 will give you a really nice write-up on things like this and how to frame your thinking about it. You can even ask it, “If I’m thinking about designing a new socioeconomic-political system, what are 25 things I ought to consider?” And it will do a really nice job of doing it. And then you can even say, “Compare the current status quo to theoretical Marxism to” some other radical idea, and it will build a three-column table, and it will fill in all the cells for all 25 aspects. It’s actually quite impressive in its ability to do that, but it still doesn’t have the ability to tell you the answer.
Matt: Well, and here, I think, you’ve kind of pointed at one of the biggest risks, of course, which is what are the language models emitting? They’re not necessarily emitting anything that would constitute an original thought or an opinion. It’s based on the training data. Right? And so if you ask a language model a deeply moral question, its response is almost certainly going to be biased by whatever was in the training data set, which of course, as we all know, is heavily biased towards what information is on the Internet in certain languages, primarily in English. Right? And who’s writing that stuff is also people who are coming from very well-off countries and with a particular level of computer literacy and so forth that we’re able to publish on the Internet. So we kind of have an interesting situation now, where the language model’s ideas, so to speak, are really based on just a very, very, very tiny fraction of human concepts and human knowledge. And I wonder whether building the foundation of a society around that sort of cherry-picked, if you will, sample of human thought is going to be the right thing or not.
Jim: Yeah, that’s an interesting question. And of course, it’s partially solving itself as the whole world comes onto the Internet, though you do have a question on, for instance, let’s suppose you did want to build a model for global governance. You would want, I think, to be fairly careful that you had a balance of different languages and different cultures and different perspectives and things, which at one level, would be fairer but maybe harder for it to actually come to any decisions about anything.
Matt: No, I think that’s right. I think you probably end up more with different models for different kinds of societal values and norms and traditions. Right? I mean, it would not surprise me at all to learn that, for certain parts of the world, it would make sense to have the equivalent of ChatGPT but trained predominantly on text that was in the language and from the tradition of the audience for which it is intended. It’s a moral conundrum, of course, because we worry a lot about technology being used to reinforce certain ways of thinking, and AI is just opening up a Pandora’s box of hard questions here.
Jim: Yep, that is for sure. Of course, then the other issue, those of us who’ve played with this stuff a fair bit understand that even GPT-4 still hallucinates a fair bit. Right? You can’t actually trust its factual answers. I, for instance, have a test question I use on all these models when I try them out, which is, “What are the 10 most prominent guests from The Jim Rutt Show?” because it knows about The Jim Rutt Show. Most of the models do. And some of them are quite good, like GPT-4 will typically get nine out of 10 people who have actually been on my show previous to October 2021. GPT-3.5 gets about 5.5 out of 10 correct across repeated trials. Bard gets 1 out of 10 right, which is quite interesting. I just found by trial and error that it’s a pretty good probe because it’s one of those things where there’s just enough on there that it could know. But it doesn’t always know.
And of course, there’s all kinds of things it’ll hallucinate on. It’s famous for making up scientific papers that didn’t exist, even making up fake URLs. That’s going to be an interesting issue, and I know people are working on self-assessment, where perhaps GPT-4.5 or 5 will have the ability to essentially run some form of estimator on how good its answer is. For instance, we know if you ask it to give you a short bio of George Washington, it’ll be mostly correct. If you ask it to give you a short bio of Jim Rutt, it’ll be about 70% garbage, as it turns out.
Matt: No comment. But I think that’s exactly right, and I think what we’re finding at Fixie is using language models effectively really comes down to a couple of things. I think, first of all, grounding it in real data and real hard facts, so the model can be aware of those facts when it’s responding. And then, of course, you need to put kind of guardrails on it because, as you and many others have seen, the language models, they were trained to complete text, and they will happily complete text even if it doesn’t make any sense. They tend not to have a filter, if you will, on their own thinking. And so, as long as a response is statistically plausible from the model’s perspective, even if it violates factual knowledge or logical reasoning, there’s nothing really stopping the model from emitting it.
So we think that a good way of addressing that is you don’t necessarily want a fully general-purpose model in all situations. In other words, GPT-4 is, in some sense, way too powerful to answer questions about Allstate’s insurance policy, right, because it can also generate 14th century German literature or something like that at the same time. You don’t need it to do all of those things. What you need it to do is stick to the facts, to know when it’s wrong, to know when it’s unsure. One thing we’ve found a lot is if you’re not careful, the models will just happily answer questions that they should have no reason to have an answer for. They should say, “I don’t know,” but they don’t know that they don’t know. And so that self-introspection needs to be something that we train the models for effectively. And yeah, that’s going to be a very important thing going forward.
Jim: Yeah. The introspection is going to be particularly useful because one could see, in principle, that there ought to be some way to sort of do meta-analysis on the weights that are firing to see how strong the signal is, versus how much entropy there is in the firing, and say, “All right, too much entropy, not enough convergence here. Here’s the answer, but 80% chance it’s bullshit.” Right?
Matt: I wish I had that model for myself. Most things that I say could use that kind of filter.
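There is no public way to do the meta-analysis on weights that Jim imagines, but a crude stand-in is to look at the per-token log probabilities that the pre-1.0 openai completions endpoint can return: a consistently low average is at least a hint that the model is guessing. A sketch, with the threshold and model name as arbitrary assumptions:

```python
# Crude stand-in for the "entropy check" Jim imagines: average the per-token
# log probabilities of a completion and flag low-confidence answers.
# Uses the pre-1.0 `openai` completions API; the threshold is arbitrary.
import math

import openai


def answer_with_confidence(prompt, threshold=-1.5):
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=200,
        logprobs=1,
    )
    choice = resp["choices"][0]
    token_lps = [lp for lp in choice["logprobs"]["token_logprobs"]
                 if lp is not None]
    avg_lp = sum(token_lps) / len(token_lps)
    shaky = avg_lp < threshold   # low average probability -> be suspicious
    return choice["text"], math.exp(avg_lp), shaky
```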
Jim: Yeah, though we’re better at that than GPT. And here’s actually something that I thought about when I first started experimenting with it and seeing hallucinations. We know when we’re lying. And in fact, cops… I’m from a police family. My dad was a career police officer. My brother was career federal law enforcement. I’ve known lots of cops. They can generally tell when you’re lying because there are various tells. And your eyes look around in a different way. Very specifically, people put more detail in wrong places than they should. The thing about these LLMs which makes them more persuasive is they lie very fluidly because they don’t know they’re lying. Right? And they have extraordinarily good mastery of syntax and even semantics, so it comes out sounding every bit as authoritative and structurally indistinguishable. There are no tells that it’s just making this shit up. And that makes them, in some ways, even more dangerous.
Matt: Yeah, I think it’s a good point that the models just don’t know when they’re being factual or not. And I think it’s an interesting question how we know that ourselves, as people, because people can certainly convince themselves that something is true that’s not, in fact, true and state it as though it is, in fact, a fact when it’s not. This ends up just being a really fascinating area of research, I think, for the development of these models, and I think it’s going to be essential to take that step beyond the auto-completing of text and think of these models as really active participants in our society. And we need to hold them to a similar standard that we expect others to follow. Right?
We always joke about politicians lying, but the reason that you put a politician in place to do something is, ideally, to get that person to actually take good action on behalf of the set of people that they’re representing and to tell the truth about what’s happening. And we don’t know how to think about AI models. They don’t have a moral compass, necessarily. And so I think we’re just going to have to evolve our understanding of that and the ways in which we integrate them into society.
Jim: Yeah. At least, we’d have to go a long way to get them to the point of having a moral compass. So we just have to assume that they don’t have a moral compass in the same way we don’t assume our general ledger program has a moral compass. Right? We know our general ledger will enumerate the legumes accurately if we input good data, but it will not give us any good advice per se, and has no moral opinion. And I do find that, unfortunately, the LLMs are so good with language, and language is the human superpower. And so people overreact to these LLMs and tend to anthropomorphize them a lot, I think, like the famous guy from Google who thought that LaMDA was sentient or conscious or something. I do know a fair bit about the science of consciousness, and the probability of a feedforward model being conscious is extremely small, I would say.
Matt: I think that’s generally right. But most people do not know as much as you and I do about this space. And that’s where it’s going to get really tricky as we start integrating these in deeper ways into our society. And we’re already at the point where you have a video call with someone, and you don’t know whether they’re a human or not. Right? You have a phone call, you don’t know if you’re talking to an actual human. And if you can’t make that assumption, then it seems to kind of tear at something fundamental about the fabric of society, that people need to have that fundamental ability to trust that the entity that they’re communicating with shares certain basic philosophical characteristics with them, in terms of morality and trust and so forth. And I don’t know what will happen to society if we just don’t pay attention to that and we say, “Well, let’s just use these language models everywhere, and they seem to do a good enough job. It’s fine.” Will there be an erosion of trust between people that follows from that?
Jim: Yeah. We may need to get some out-of-band signaling to be able to assert that this is an actual person on the other end, for instance.
Matt: Yeah, well, prove you’re a human.
Jim: Yeah, exactly.
Matt: Next time I go on a podcast, I’m going to ask my host to do a CAPTCHA test for me.
Jim: Yeah. Yeah, that’s good. Not a bad idea. But I suspect there’ll be platforms that will do that, though, again, how you make that ironclad, I’m not quite sure. But it certainly seems like we’re going to need an out-of-band signal, not just the stream that we’re interacting with, to be able to provide attestation to what this is, where it came from, and may well need something similar for video artifacts, some way to signal outside of the video artifact itself that this was created by an actual person at this location at this time, and these are the people involved, and this has been certified by a video notary or something that is trustworthy or believed to be trustworthy. I can see it. There’s going to have to be an arms race around trust as we can’t naively trust these artifacts. Right now, we still can. I’ve got a little bullshit Elon Musk emulator. I can do a Zoom, and it makes me look like Elon Musk, but man, is it crappy. But in a year or two, I’ll bet they won’t be so crappy.
Matt: Some would argue that Elon Musk is a bullshit Elon Musk emulator, but okay.
Jim: That’s like our politicians. It’s just that the threshold for AI politicians may be pretty low because the competition’s so bad. Right?
Matt: Right. Right.
Jim: Yeah. One final topic before we roll here. You guys must be doing some software development at Fixie. You can’t do this by just talking to LLMs. One thing I’ve found, and other people I talk to have too, is that using LLMs to help you code is gigantic. It’s a 3, 4, 5x improvement. I mean, yeah, Copilot’s sort of useful. It kind of helps. But man, asking GPT itself to write code or explain how to do a UI component, how to use a library you haven’t ever looked at before, it’s amazing. What’s your experience been with this?
Matt: Oh, absolutely that. So first of all, I mandate that everyone in the company use Copilot because it is an accelerant. It’s an important one. And I think, for me, the main use of Copilot is that if I’m not quite sure how to do something, I can put it in a comment, and then it’ll fill out the next few lines of code. And that means I don’t end up spending half an hour browsing the Web, looking for a solution and going off and goofing off on Reddit or something like that. Using ChatGPT to write code, though, that is really profound. Countless times it has gotten me out of a bind, where I have been beating my head against a brick wall on some problem. It’s some arcane combination of I’m using this library and this setup and this cloud vendor, whatever it is, and I cannot figure out the particular incantation that I need to put all of those pieces together to get the result that I need.
Now, in the old days, you would Google for your question and your problem and hope that someone had written a blog post or that there was a Stack Overflow post or something that was about exactly your thing. And then you’d kind of read it and say, “Oh, yeah, I can see how I can adapt this to my use case.” With ChatGPT, I literally can talk to it like I was talking to an expert. I can say, “I’m using this library and this framework and this tool, and I’m trying to do the following three things, and here is the code I’ve already written, and it doesn’t work. Tell me what to do.” And it gives you the answer. And it is amazing. It’s not always right, but even if it’s wrong, it often gets you past the conceptual gap that you might have had that was leading to the problem in the first place, that it gives me enough of a suggestion that I go, “Oh, right. That’s how I need to approach the problem. Thank you very much.”
But actually, eight times out of 10, I can just take the code it’s generating and paste it right in, and I’m done.
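The prompt Matt describes has a fairly fixed shape, name the stack, state the goal, paste the code and the error, so it is easy to wrap in a small helper. A sketch, with the wording and the helper's name as assumptions:

```python
# Sketch of the debugging prompt Matt describes: name the stack, state the goal,
# paste the code and the error, and ask what to change. Wording is an assumption;
# assumes the pre-1.0 `openai` client.
import openai


def ask_for_debug_help(stack, goal, code, error, model="gpt-4"):
    prompt = (
        f"I'm using {stack}.\n"
        f"I'm trying to: {goal}\n\n"
        f"Here is my code:\n{code}\n\n"
        f"It fails with:\n{error}\n\n"
        "Tell me what's wrong and show the corrected code."
    )
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]
```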
Jim: Yeah. Exactly. That’s exactly the class of use case I found that ChatGPT is just phenomenal for. Here’s my hypothesis on why. Even though guys like you and I have a whole lot of knowledge about computer stuff in our head, the density of our knowledge differs by domain. Right? So the language we know the best is at our fingertips, at least it is when you’re doing it every day. And one of the reasons it helps me so much is I typically don’t code every day anymore. I might go six months without writing any code. Then you sit down and go, “Oh, now, all right, does Python need semicolons? No. C# does, though. God damn it.” And then you switch back and forth between the two.
But the LLMs, for the same reason that they can lie so smoothly, their knowledge is fairly uniform. They know a lot of stuff really well. And so just those things where you have to know three libraries and a communications protocol at the low-expert level to solve the problem, very few people have low-expert knowledge in everything, right, and humans have to specialize, and they do. But the LLMs don’t. They got everything. So I think that’s why they’re-
Matt: They’ve got a high-expert or medium-expert knowledge in all the things, and that ends up being really powerful. Right? I don’t even know how to think about that in terms of how we hire programmers and how do we develop software. Right? There’s a real possibility that today you could have a dev shop, say, anywhere. It could be the United States. It could be in Southeast Asia. It could be anywhere, where the programmers that have been tasked to build the software solution are primarily leveraging something like ChatGPT to do the work, and they’re just taking the answers and massaging them into the final form. It would be a fantastic business model to have that. In the future, of course, we’re just going to have the machines writing all the code and no humans involved at all. But in the short term, there’s probably a lot of money to be made by having a very small staff of humans with ChatGPT behind them.
Jim: Yeah. And then we’ll also emphasize the data cleaning problem, right, and how to do that with ChatGPT. And I’ve heard of clever ideas, again, from the corporate world where corporations have willy-nilly built relational databases over the last 30 years, and they have no fucking clue what they got. Say, “Cruise all the code that I have, and create a knowledge base of all the schemas that have ever been created in this company, and then create code to test which ones still exist.” That would be hilarious. Could GPT-4 do that? I don’t know, maybe if you segmented the problem properly, because its look-ahead is not real far. Like in scriptwriting, getting it to write a scene is fine. Getting it to write a whole movie without decomposing the problem into components, not so good. More or less the same with code: tell it you want to write a 300-line program, well, that’s a pretty sporty thing to do. Get it to write a 30-line function, it’s great at that.
So how you orchestrate the models to solve these bigger problems will be interesting. But of course, the models will get a bigger span as they get better. So it’s really going to be interesting here.
Matt: Yeah, exactly. I don’t even know how to think about a future in which these models just stop having the kind of limitations they have today, because I have a feeling that we’re just seeing the tip of the iceberg here. And as stunning and powerful as ChatGPT has been, what’s coming next is going to make our mouths drop open, in terms of what it can do. And it’s going to change how we think about what we can expect out of machines. I mean, today people are getting used to this idea that ChatGPT could write a movie script for you. Well, yeah, sure. But even six months ago, that was not clear.
Jim: Yep. Yep. And if you really get into the details, the jump from 3.5 to 4 is pretty damn big. I mean, it’s a lot more reliable. There are a lot more things I would trust 4 with. There are very few things I’d actually trust 3.5 with, other than as a helper. And it’s okay at writing short functions or writing a business letter or something, but much more than that, there’s a lot of noise. 4 has gotten rid of a lot of the noise. And as you say, what will 5 be like? Ah. Very interesting. One last question before we go. You used to be a professor, back in the day. If a kid were 17 years old and had just been accepted to Harvard or my alma mater, MIT, the other end of Mass Ave, what would you suggest that they focus on if they wanted to participate in this industry in the years ahead?
Matt: Yeah. It’s a tough question, and I’m asked this a lot, and I don’t even know. I’ve got a couple of kids of my own, and I’m thinking about what the future holds for their academic careers, if they end up going to college wanting to be involved in AI and computing. I think we’re already seeing that traditional computer science programs have not yet caught up to what’s going on here. And I think it will take some time for them to catch up. The concept of AI as taught in a lot of computer science programs is still quite antiquated, and so many of the other things that we spend a lot of time teaching people about end up becoming irrelevant in a world where these language models could be used to just automate so much that we do by hand today.
I think for now, I would say sticking with the traditional discipline of computer science makes a lot of sense. There’s not a good replacement for it. I would probably not jump on the bandwagon of people who claim that they’ve got a new prompt engineering degree or something like that. That feels very shortsighted and not quite setting people up for success. We saw a lot of that happen in the late ’90s, when the Internet became a thing, and all kinds of new programs came along that purported to train people to be experts in this new wave of technology of the Internet and how to use it in all kinds of ways. But those did not, in my opinion, have the right level of impact on the way people’s careers were going to shake out. So it’s a hard question. I don’t know the right answer. I would say, for now, just grab the bull by the horns and see where it takes you. But nobody knows where it’s going.
Jim: Yeah, and it’s moving so fast, you’re probably going to have to do, like most of us have done through our whole career, a hell of a lot of self-education. Get your hands dirty. Jump in. I tell kids, when they ask me that, I say, “Yeah, get yourself an API subscription to OpenAI and just start slinging code.” And [inaudible 01:00:10].
Matt: Yeah, that’s it. Exactly.
Jim: Just see what it does because nobody knows. We didn’t, which is so exciting about this. This reminds me of PCs back in 1979, right, when almost any idea was likely to be a good one because the space hadn’t been filled out at all. So I think we’re in a time like this. Anyway, Matt Welsh, I really want to thank you for an extraordinarily interesting conversation today.
Matt: Thanks a bunch, Jim.