Transcript of Currents 033: Connor Leahy on Deep Learning

The following is a rough transcript which has not been revised by The Jim Rutt Show or by Connor Leahy. Please check with us before using any quotations from this transcript. Thank you.

Jim: Today’s guest looks like a really interesting guy. His name’s Connor Leahy. He is a machine learning researcher working on advanced general AI technology, things like unsupervised training, reinforcement learning, scaling, et cetera. Some of his greatest interests are in research towards AI alignment and strong AI. Welcome, Connor.

Connor: Well, thank you so much for having me.

Jim: Yeah, this’ll be fun. I mean, we got connected on Twitter. Somebody pointed him out and said, “Oh, here’s our deep learning absolutist. He can tell you if deep learning is the answer to everything.” And then as we did some back and forth, he goes, “Well, not quite so much as all that,” but I think we’ll have an interesting conversation. And as I’ve dug in, done my preparation for the show, I found all kinds of interesting things that Connor is involved in. And I think I’m just going to jump in, because we’re a little short on time, unfortunately, and see how far we can get. If we don’t get as far as I’d like to get, we’ll have him back for a part two.

Connor: I might just take you up on that.

Jim: No, right. So first let’s start with one of your projects that I just found utterly fascinating. EleutherAI. What is it?

Connor: EleutherAI, we describe ourselves nowadays as kind of a decentralized research collective of machine learning engineers interested in a certain flavor of ML research, I guess. So scaling large language models, large models in general, AI alignment, et cetera. It’s a specific flavor of research; in other words, a glorified chat server. I mean, it’s a bit more than that, at this point.

Connor: It kind of started as a bit of a joke between me and some friends. We were just hanging out and talking about GPT-3, which is this very large language model by OpenAI, and that was last year. We said, “Man, it’s kind of cool, wouldn’t it be fun to try to build our own?” And then a friend of mine was like, “This, but unironically,” and the rest was history. And so nowadays we’re a very active community of really, in my opinion, totally unbiased opinion, really, really great people interested in doing alignment research. We are most well-known for our ongoing GPT-Neo project, where we are attempting to build and train a full-size GPT-3 type model and release it as open source. That’s not the only thing we’re doing. We’re also interested in many other kinds of research projects. It’s become a really big thing. And, yeah, it’s just a really great bunch of people.

Jim: Yeah. It was particularly… We’ll get back and talk about alignment because that’s obviously a very important topic, but before I do that, why don’t we talk about GPT-Neo and GPT-NeoX, and why you decided to create them? You know, why isn’t this a waste of time, since OpenAI has already got GPT-3, et cetera? I think our audience generally knows what GPT-3 is, but it probably wouldn’t hurt to give 20 seconds on what it is.

Connor: Sure. GPT-3 is basically a very, very, very large model, a truly phenomenally large neural network model that is trained on the pretty simple task of just predicting the next token in a text. So it’s trained on huge dumps of text from the internet, and it’s supposed to predict text, given what it has seen so far. So you might give it a sentence like, “Hello, my name is,” and then it might predict the next word might be John or Mary or something, whatever would make sense in context. So this seems like a very simple, basic kind of thing. You can kind of see how this might be a fun little toy or something, but what’s really interesting about GPT-3… There’s a lot of interesting things about GPT-3.

Connor: I’m sure we’re going to talk about it, but basically, it turns out that just with this simple task, it was capable of learning quite a lot of very useful tasks, and you can use it for things like translation. You could use it to write stories. You can use it to summarize text. There’s lots of very interesting things you can do with these kinds of models. OpenAI was the first to build a model at this kind of size, and they’ve commercialized it in a private beta behind an API, which means you can send the model some text and it will complete that text, it will respond to that text, but you don’t have access to the model itself.
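The next-token objective Connor describes can be sketched with a toy count-based model. This is not GPT-3, which uses a huge neural network trained on hundreds of billions of tokens; the corpus and function names here are invented purely to illustrate the training task.

```python
from collections import Counter, defaultdict

# Toy illustration of the next-token objective GPT-3 is trained on:
# count which token follows which in a tiny "corpus", then predict
# the most likely continuation. GPT-3 does the same prediction task,
# just with a neural network instead of counts.
corpus = "hello my name is john . hello my name is mary . hello my name is john .".split()

following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(token):
    # Return the most frequent continuation seen in training.
    return following[token].most_common(1)[0][0]

print(predict_next("is"))  # "john" (seen twice after "is", "mary" once)
```

The surprise with GPT-3, as Connor goes on to say, is that scaling this same objective up yields translation, summarization, and story writing for free.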

Connor: So I personally think that these models are very, very important. When I saw the paper that came out around GPT-3, my jaw just dropped. I was just like, “Oh my God, holy shit. This is the future. This is crazy.” It’s unbelievable really. I know there’s a lot of skepticism, and it makes sense to be skeptical about big claims and stuff like this. And a lot of this is based on kind of just hunches. A lot of research is about taste, every researcher will disagree on how important something is.

Connor: But for me, just seeing that just by training this very large model on just a lot of text, it just suddenly was capable of doing very complicated tasks, that it was not specifically, engineered to do, seemed very, very important to me. And I think it is in a sense a… I guess, what was most shocking to me is just how much, just by making it larger… So previously, there was GPT-2, which is basically the same architecture, same [inaudible 00:05:00] design, very minor differences, just they have scaled it up a 100 times, a 100 X larger, and just suddenly its performance just kept improving, as a scaling law. I remember you were once part of the Santa Fe Institute, I think they’re very fond of scaling laws there as well, power laws.

Jim: Absolutely.

Connor: And yeah, so what I find most interesting, potentially, about GPT-3 is this discovery of these scaling laws: as you scale these models larger and your datasets larger, you see this very smooth increase in performance compared to the amount of extra resources you put into your models. And so I’m very interested in this. I think this is really important, and I have a lot of things I want to research about these kinds of models. I have many questions about them. How hard are these models to build, really? Do these laws really hold up? How do these work internally? The big question: how do neural networks even work? As far as I’m concerned, no one knows how neural networks work. It’s a miracle that they work at all. I find this still utterly perplexing.

Connor: And so it started as more of a, hey, this will be fun. Like, hey, let’s just goof around, let’s see how far we can get building this kind of thing. We didn’t really expect to get as far as we did. But we found that there were a lot of really interesting engineering challenges. We learned a lot of really valuable information from working with these kinds of things hands-on. I guess it all comes down to this belief of mine that these models are the future. Very large, unsupervised, pre-trained models are going to be the future of a lot of technology, especially in the AI field and elsewhere. And I also think there’s a lot of risks that come with this.

Connor: As you’ve already mentioned, this concept of alignment and security, something I take very, very seriously. I think there’s just very severe risks that come, potentially, with these kinds of technologies, in the very near term, I’m very sure. And I want more research to be done about these kinds of models.

Connor: OpenAI having their model locked behind closed doors makes it very difficult for many researchers to actually study: how do these things work? How do they learn? What biases do they have? How can they be misused? What kind of internal things are happening with them? My own research, which focuses on these kinds of security implications, is very much bottlenecked by the fact that a lot of the phenomena I’m interested in studying don’t really work with smaller models. They’re just too messy. They’re not good enough to show the kind of interesting characteristics that I’m interested in studying. So in a sense, it’s about unblocking, at least for me personally; there’s a lot of unblocking of low-resource academics who can’t possibly train a model like this themselves, allowing them to research this really important technology.

Jim: So many things to talk about. First, let me hop back up and talk about this issue of scaling laws. You’re right, at the Santa Fe Institute, where I’m still associated, though not nearly as much as when I actually lived out there, there’s a list of hundreds now of mostly, but not exclusively, social science power laws. And it turns out, when you look at them carefully, they’re usually not quite power laws. They’re maybe power law in the middle, but not at the beginning or the end. Have you found a power law in capability versus model size yet in this class of models?

Connor: Yes. So there are a number of papers on this, mostly from OpenAI, maybe all of them are from OpenAI, including scaling laws for neural language models, scaling laws for transfer. There was also scaling laws for, I think, generative modeling, something like that, these are not the exact names. I think Jared Kaplan is on all the papers. At least when I look for these papers, I just go to arXiv and put in Jared Kaplan and check out all of those. And basically, what we’ve seen is that, yes, there seems to be, at least in these regimes that we have studied, which is across several orders of magnitude, like four orders of magnitude or something of different sizes, a very strong connection between model size, the amount of compute you have, and the amount of data you have, and the final performance you will get.

Connor: And one of the most interesting things about these studies is that they found that as you have more and more compute, that you can put into your model, as your budget for compute increases, it becomes more and more efficient to train larger and larger models. It’s actually more efficient to train a really, really huge model a little bit than it is to train a really small model a lot, in many cases. There’s this interesting thing that as these models get larger, they, in a way, get more sample efficient. They do need more data, but the amount of data they need grows with like a slower power law than the model size that you want to train.

Jim: That’s interesting.

Connor: Yeah, there’s been a lot of really interesting research in this direction.

Jim: So briefly, is performance on language models, performance on some tasks, say, a power law in model size or in input… I guess the input datasets are the same size, so what is the other factor when you’re projecting it onto a power law distribution? Is it the number of weights in the model or-

Connor: There are laws for all the different… If you hold one constant, you can vary the others and see what kind of performance you get. Usually, the most useful one in practice is: given that you want to use a certain amount of compute, let’s say you have a 100 GPUs and you want to use them for two weeks, you can then estimate, okay, I have this and this many petaflops of compute I want to put into this. Then you can check the power laws: what is the optimum size of my dataset? How many samples do I need? And what is the optimum size of model to train for that many steps? Given your number of parameters, you can estimate how much compute, how many flops, you’re going to need to do one step on one piece of data, so that way you can convert between the different ones and find the ideal point that will give you the best performance.
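The budget reasoning Connor walks through can be sketched numerically. The exponents below are approximate values in the spirit of the OpenAI scaling-laws papers (Kaplan et al.), and the prefactors are purely illustrative, so treat the outputs as shapes of curves, not predictions.

```python
# Rough sketch of compute-optimal allocation, with illustrative constants.
def optimal_allocation(compute_pflops_days):
    # The scaling-laws papers suggest optimal model size grows roughly
    # like C^0.73 while the number of training tokens grows only like
    # C^0.27, so bigger budgets favor bigger models over more data.
    n_params = 1.3e9 * compute_pflops_days ** 0.73   # illustrative fit
    n_tokens = 2.0e10 * compute_pflops_days ** 0.27  # illustrative fit
    return n_params, n_tokens

# Training FLOPs are roughly 6 * parameters * tokens (forward + backward),
# which is how you convert a parameter count into a compute requirement.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

for budget in (1, 100, 10_000):  # petaflop/s-days
    n, d = optimal_allocation(budget)
    print(f"budget {budget:>6}: ~{n:.2e} params, ~{d:.2e} tokens")
```

A 100x larger compute budget makes the optimal model roughly 29x larger but the optimal dataset only about 3.5x larger, which is exactly the "data grows slower than model size" effect discussed below.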

Jim: Okay. The next one that you mentioned, I want to follow up on that too, very interesting, is that the sample size can actually go down as the model goes up. What do we know about the relationship between those two? And maybe actually it would help to describe what a sample size is, and why it’s relevant in this case? Sampling rate, I should say.

Connor: The best way to understand this, so these are just invented numbers, but basically, you can imagine a 100 times larger model only needs 10 times as much data to train to completion, so that’s kind of what we’ve seen. Of course, as the models get larger, you would want more data, but the amount of data you want grows slower, given a constant amount of compute you’re putting into your model, compared to the number of parameters you put into the model.

Connor: So as your models get larger and larger, you’re dominated more and more by the cost of the size of the model rather than the cost of data. So previously, we were in this regime where getting more data was usually the best bang for your buck. That was like the best way to improve your performance of the model. But at least with these scaling models, as we’ve seen currently, currently, often the best way to them is… So this depends on the quality of your data or whatever, but assume we have good data and we have the option of either getting more data, training on more data or just making them up larger, currently, it seems to be often the case that we should just train a larger model on the same amount of data.

Jim: Interesting. Because I remember for a long time, it was thought that the guys with the big data piles would win. But if we’re finding there’s a big asymptote there, perhaps not. And that gets me to my next question, on your site, you talk about something called the Pile, what is it? And is it big enough?

Connor: So the Pile is our curated dataset that we created for training GPT-Neo and successors, and we released it as open source with a paper, please cite us. It’s basically an 800 gigabyte collection of various data sources. Some of these are sources that have already existed, that other people have curated, some of them are of our own creation, filtering Common Crawl data and the like. Given our current calculations, this amount of text data, which we think is higher quality than, for example, what GPT-3 was trained on, should be enough to train a model up to like one to 10 trillion parameters, so that’s about one to two orders of magnitude larger than the current largest models.

Jim: And this is available to anyone who wants to download, open source, no gatekeepers, so that’s a good thing. And we’ll provide a link to it for those of you out there who want to get that. Hey, Carlos Perez, I’m talking to you, man, a friend of mine, who’s an AI researcher, who’s a little bit frustrated by the expense, frankly, of dealing with GPT-3, and also has some questions about what goes on behind the curtains. So you have this 825 gigabyte dataset, is it all English, by the way, or is it multiple languages?

Connor: We filter for English, so this is intended to be purely English. That was a design decision we made. We were also interested in multilingual data, but it was just so hard and we didn’t have enough native speakers. We wanted to have native speakers validate the dataset, so we’re not just producing garbage, and that was just not possible. We’re just a few people doing this in our spare time. We’re just hackers in a cave. So we decided to focus on English for the Pile. And yeah, that’s-

Jim: Okay, cool.

Connor: But probably there’s going to be some stuff in there that’s also not English that went through the filters.

Jim: So the next thing is you created your own model that aimed to replicate, at least approximately, the functionality of GPT-2 and 3. Talk to us a little bit about how you thought through that. As I was reading on your site, it sounds like you did not choose to exactly replicate how OpenAI did it, but you did a whole bunch of very interesting engineering thinking on perhaps a better, more efficient way to do it. Why don’t you take us through some of that thought process, that was really interesting.

Connor: Yeah. I mean, one of the reasons it’s different is just kind of the obvious: we don’t know all the details of GPT-3. We know most of the details, but for example, the dataset was never released, and some of the dataset is kind of mysterious. They just have this mysterious Books2, and no one knows what that dataset is, and OpenAI will not tell us. We don’t know where it came from or what’s in it. We have some suspicions of what it might be, but we don’t know for sure. So that’s one of the reasons we made the Pile, because we don’t know what the original dataset is. There’s also some fine details about how the model is trained, like what parameters were used, that, of course, we won’t be able to exactly replicate. But also in the time since GPT-3 has come out, we’ve been doing our own tests with smaller models…

Connor: One of the great things about scaling laws is that if something works for smaller models, then it usually means it also works for larger models. That’s a big boon. So we’ve been able to run a very, very large number of experiments with all kinds of different, crazy architectures. And for the most part, the result was always: it doesn’t make a difference, or it was worse than the original, the original model was very good. We have found a few things that we are very excited about, including rotary positional embeddings, which we wrote a blog post about. Actually, kind of a funny story: apparently, these were already very popular in the Chinese AI sphere, but they hadn’t made it to the West yet. And some of the people that we work with who speak Chinese found them on a blog, and said, “Hey, we should translate this and bring it to the Western audience, because this is a really, really cool technique.”

Connor: And so the original version of our code called GPT-Neo was built for TPUs those are Google’s custom homemade ML accelerator chips. We had very large access to a very large number of these in the beginning, but it turned out that it just wasn’t enough for like a GPT-3 size models. Google, through their TPU research cloud project program, which anyone can apply to for academic access to TPUs, they very graciously gave us a lot of access to a lot of these to run our experiments. Unfortunately, it was just not enough to train a full-size model.

Connor: So the interesting thing happened last December, when a cloud company called CoreWeave approached us, and they basically said, “Hey, we’re interested in having open source models like these. And we’re trying to get into the ML large-scale training space. How about we work together? You test our hardware and we provide the hardware to train a full-size GPT-3 model, let’s do this.” And they’ve been very great ever since; we’ve been working together with them to build our new codebase, GPT-NeoX. So there’s a bit of a misconception that these refer to different models. That’s not a 100% correct, it’s more that they’re two different codebases. The original Neo codebase was for TPUs, and this new NeoX codebase we built from scratch, well, not from scratch, it’s actually based on Nvidia’s Megatron code, a lot of Transformer puns, obviously, which we’ve been working on to optimize and scale to train GPT-3 size models on GPU clusters.

Connor: That was very difficult. Luckily, we have some great engineers who have been wrangling terrible Nvidia and Microsoft research code, libraries full of bugs, to get it to work. And right now, we’re at the point that our code is pretty stable. We’re pretty confident that it works as we expect it to. We’ve trained some smaller models with it. And we’ve basically been hit by the chip shortage; we’re just waiting for more GPUs to arrive and be installed.

Jim: Interesting. Now, both of these are built on top of TensorFlow, is that correct?

Connor: No. Neo, the TPU one, is based on Mesh TensorFlow, which is a library on top of TensorFlow. NeoX is based on PyTorch.

Jim: Oh, okay. That’s interesting. That’s a big change. I see a lot of people now moving, oddly, or maybe not oddly, from TensorFlow to PyTorch. Just give me a quick take, compare and contrast TensorFlow and PyTorch, since you’ve obviously gotten pretty deeply into both.

Connor: Use PyTorch, if you can.

Jim: That’s what I keep hearing.

Connor: Yeah. TensorFlow just has a lot of difficulties. It was one of the first things, it was a trailblazer in many ways, but PyTorch is just much easier to use in practice. There’s just a lot of very, very nice things. The way TensorFlow does it is you have to define a computation graph, and then it compiles the graph and runs the graph, so errors are really hard to trace back, and you can’t change the graph once you’ve loaded it. Whereas PyTorch is just like running code. Every time you run a line of code, it just executes that on the GPU, and if there’s an error, you see the error. If you want to interrupt the process, you can just interrupt it. It’s just a much, much more elegant way of dealing with these kinds of things.
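The define-then-run versus eager difference Connor describes can be sketched without either framework. The toy Python below stands in for TensorFlow 1.x-style graphs and PyTorch-style eager execution; it is not real TensorFlow or PyTorch code, just the debugging experience in miniature.

```python
# "Graph mode": you first build a description of the computation,
# here a list of ops, with a bug (divide by zero) hidden inside it.
graph = [("mul", 3), ("add", 1), ("div", 0)]

def run_graph(x, graph):
    # Errors only surface later, when the whole graph executes,
    # far from the line where the bad op was defined.
    for op, arg in graph:
        if op == "mul":
            x = x * arg
        elif op == "add":
            x = x + arg
        elif op == "div":
            x = x / arg  # ZeroDivisionError raised here, not at definition
    return x

# "Eager mode": each line runs immediately, so a bad op fails exactly
# where you wrote it, and you can interrupt or inspect at any point.
def eager(x):
    x = x * 3
    x = x + 1
    return x

print(eager(2))  # 7
```

The stack trace you get from `run_graph` points into the generic interpreter loop, not your model definition, which is exactly the trace-back pain Connor mentions.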

Connor: The thing is that to use TPUs, you need TensorFlow, or now also JAX, which is a new thing from Google. PyTorch has TPU support, but it’s pretty terrible. So PyTorch is really only good on GPUs.

Jim: Well, again, since I got a live person here who has worked with both of them, the compare and contrast between the T-chips, the Tensor chips that Google has, and GPUs. I mean, Google made… I actually got access to a little bit of that experimental program when it first came out, and I flung up some simple stuff and played with the Tensor chips. I wasn’t working them hard enough to see if there was really any functional difference. Since you’ve worked on a cutting-edge, performance-driven project, what’s your take on GPU versus Tensor Processing chip?

Connor: So I would describe TPUs as black magic, dark magic. You have to sell your soul, your sanity, and in return you gain great power. TPUs are a pain to deal with in many situations. There’s a lot of weird edge cases. There’s a lot of finicky things, like all your tensors have to have a constant size, you can’t change the size of things. These are all things you can work around, they’re just a real pain. But if you do do that, now that we’ve used them, we must say, TPUs do have fantastic performance. They have these wonderful interconnects. I would say the biggest bottleneck for training very, very large models is actually the interconnect speed between chips, because you have to split your model between many chips, and the cables, the connections between them, are extremely important, and TPUs really, really have fantastic interconnect.

Connor: So if you have a model, a GPT model or something that can scale, it’s very easy to scale from one TPU to a 1,000 TPU, which is a click of a button. That is very, very nice. And you can get very, very good performance out of them, especially at price per dollars. But in practice, if you want to do anything, if you’re not at Google and can’t call up a Google engineer to help you or something, you will be much slower than working with GPUs. GPUs just allow you to try much more complicated or interesting experiments. You can iterate much faster. It’s just a very, very pleasant experience.

Connor: Nowadays GPUs are a very mature ecosystem. It’s just really easy to work with them. So, if you value your sanity, use GPUs. If you have people who are willing to give up some of their sanity, TPUs do give you very good performance.

Jim: Of course, you have the advantage of having free GPU from CodeWeave, is that the name of the company?

Connor: CoreWeave.

Jim: CoreWeave. CoreWeave, we’ll put a link to them on the site too, give them a call out. They’re giving you all the GPU you can eat. Do you have a sense, if you had to pay for your processing, whether it would be feasible to continue to use GPUs versus TPUs?

Connor: Oh, God, no. Eleuther has no budget. We function on no budget. We don’t raise money, and we, actually, don’t even accept donations. It’s part of our philosophy that we don’t accept monetary donations, because we don’t want weird incentives. We accept only labor and compute donations. And it’s crazy. Eleuther is really a very, very strange scenario, in that we have a bunch of people doing this in their free time with no budget, but we work on a computational budget that must be the equivalent of millions, at this point.

Jim: But you don’t know or care, so you really can’t say whether, if someone were doing this as a commercial project, it would be feasible to do it on GPUs versus TPUs, I guess is my question.

Connor: I mean, doing these kinds of projects is extremely expensive either way. I think, if you’re at the order of magnitude where you’re considering affording something like this, it probably doesn’t really make a difference. I don’t know exactly how much of a difference it would make, at this point, between TPUs and GPUs. GPT-3 is just so large, you’re going to be talking about millions, no matter what you use.

Jim: Of course, there is a difference between millions and tens of millions. Us business guys pay attention to stuff like that; a business that might be perfectly feasible at a cost of five million might not at all be feasible if it cost 50 million.

Connor: That’s true, yes.

Jim: Yes. All right. Okay. Bottom line is, you don’t know.

Connor: Yeah.

Jim: So let’s get down now to GPT-Neo, which is the only part you have released so far, as I understand it. And according to your site, on March 21st, 2021, you released the two point seven billion parameter model trained with your GPT-Neo software. Could you put that in context of GPT-2 and GPT-3? Where does it fall on that continuum?

Connor: So the largest GPT-2 model is about one point five billion parameters, and ours, at two point seven, is not quite twice as large as the largest GPT-2 model. The largest GPT-3 model is 175 billion, so as you can see, that’s about a 100 times more, not exactly a 100 times more, but much larger. So ours is not the absolute biggest publicly released model, but it’s up there. I think there’s like one or two models that are publicly released that are larger than ours, at the moment, but I have reservations about some of those. I feel like some of those are not particularly good, maybe I’m biased.

Jim: Now I saw some benchmarking you’d done on particular tasks and it seemed like GPT-Neo could hang in there, even with GPT-3, on at least some class of task. What have you learned about benchmarking your model versus GPT-3?

Connor: I’ve learned that benchmarking is really hard. Benchmarking is really quite difficult. And, honestly, in my personal opinion, I feel like we as a field, as a science discipline, haven’t yet figured out what the right way to evaluate these models is. On some tasks, the difference between the smallest GPT-3 and the largest GPT-3 is like four percentage points or something. But there’s a huge difference in subjective performance in using the smallest or the largest model. If you use the smallest GPT-3 model… I mean, disclaimer, we don’t actually know how large the OpenAI models are. We’ve asked them multiple times, but they refuse to tell us what size they actually are.

Connor: We have some guesses how large they actually are; we think that the smallest model is probably quite small. We do think the largest model is probably a full-size model, but it’s really hard to tell because OpenAI won’t tell us what the actual sizes are. OpenAI has also been kind of stingy about explaining how they did some of their evaluations, so it can be pretty hard to replicate those kinds of things. Yeah, evaluations are good. They should be done. It’s useful to have these kinds of benchmarks, but I wouldn’t read too much into them at the same time. It’s more interesting to see a trend, like a power law or whatever, that as the models grow larger all of these tend to increase, rather than focusing on, oh, it went up by exactly zero point five percent, that means X.

Connor: Our model, also, for example, destroys GPT-3 on certain tasks. Our model was trained on a large amount of academic mathematical texts and code. So our model annihilates GPT-3 when it comes to math and coding type tasks.

Jim: Interesting. Now, when I fooled around with GPT-3, a little bit, as I recall, actually, I had somebody else doing it on Zoom, so I didn’t really have… I wasn’t quite hands-on. God damn, OpenAI has not approved multiple requests from me, and I have a really interesting research project. Maybe we can do it on GPT-Neo instead, and we’ll talk about that a little bit. But as I recall, how GPT-3 worked, was you set up a primer block of text, and GPT-3 had a pretty short, small limit on the size of that priming text, and then you put your query in, and then you got your generated output. Is that approximately correct on how these models work?

Connor: Well, actually, what you describe as context and prime are actually the same thing. So you can imagine these models have a window they can see, which is 2048 tokens. A token is not quite a word, not quite a single character, something in between, and the model can see inside of this window. So you can fill up as much of the window as you want with your prompt, and then you can have it complete as much as fits into that window, basically.

Jim: I gotcha. Okay, so that’s actually… You say you’re at the same 2048 limit?

Connor: Yep.

Jim: Because that struck me that that was a big bottleneck, and if you could make that limit, for instance, a million, it could be much more interesting.

Connor: The problem is that the cost of training a transformer is quadratic in the size of its window.

Jim: Ah, okay, so that’s a very key engineering decision.

Connor: Yes, exactly. The 2048, in my opinion, is actually already pretty good. You can do a lot of very good things with that size. It’s also a problem with training data. There’s very little training data that’s a million tokens long, other than maybe books. And you can do a lot with these smaller token windows if you’re clever about it. The way I think about it, us humans, we also only have short-term memory, which is seven plus or minus two objects or whatever. I can’t hold a 20-step derivation in my short-term memory. It feels like we’re due for some kind of scientific breakthrough that’s going to make these models use long-term memory in some kind of way, I don’t know, some kind of scratch pad or some kind of memory tokens or something.

Connor: Not currently, but I expect something like that to emerge in the near future, because that quadratic bottleneck is a pretty big problem when you get to these larger sizes.

Jim: Interesting. So let’s now move to, and this may be a little tricky, but I think you’re up to it, give the audience an example of something you might do with one of these training models using this size 2048 tokens. What would be your prompt and what might be your output? What’s an application?

Connor: Here’s a great example. A good friend of mine is working on a video game using GPT-3, Codename: Electric Sheep is the name. And the idea is that it’s a dream simulator. So you lay down in bed and you enter a word for what you want to dream about. You could say, “I want to dream about a beautiful forest,” “I want to dream about a cyberpunk dystopia,” whatever you want to dream about, and then what he does is he queries the GPT model with prompts to generate NPC speech and world descriptions. So, for example, I don’t know if this is the exact prompt he would use, but just to give you a feeling for it, he might prompt the model with, “Wow, I really sure do like, topic,” and then the AI might continue that by giving you a whole explanation of why it likes it or what it thinks about it or whatever.

Connor: So this is something I haven’t yet had the time to really talk about, but really the most… One thing about using GPT-3 yourself is that it is really a very different experience, in my opinion, from using other AI systems or other computer systems. In that, very often you can just ask it to do things and very often, not always, but very often, it will work.

Connor: So for example, it used to be that with GPT-2, to generate… One of the tasks we needed for the video game was to generate names for the NPCs. So with GPT-2, you had to try all these complicated prompts, where you gave it a bunch of example names in a list format, and you hoped it would continue the list format, but that didn’t always work or whatever. But now you can, literally, write, this is a video game about dreaming, in this dream it’s about X, here are the following characters, colon, and it will just pretty reliably output a list of pretty relevant character names that you could just parse right out of there.
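A minimal sketch of the prompt-and-parse loop Connor describes. The prompt template and the completion here are hypothetical, a canned string stands in for the real GPT-3 call, but the shape of the trick, end the prompt with a cue and parse the list out of the continuation, is the same:

```python
def build_npc_prompt(theme: str) -> str:
    # Hypothetical prompt in the style described: state the game context,
    # then end with a cue the model will naturally complete as a list.
    return (
        "This is a video game about dreaming. "
        f"This dream is about {theme}. "
        "The following characters appear:"
    )

def parse_names(completion: str, limit: int = 5) -> list[str]:
    # Pull one name per line out of the model's raw completion,
    # stripping list bullets and blank lines.
    names = [line.strip("-• ").strip() for line in completion.splitlines()]
    return [n for n in names if n][:limit]

# With a real GPT-style API you would send build_npc_prompt(...) and parse
# the returned text; here we parse a made-up completion instead.
fake_completion = "\n- Mira the Lantern-Keeper\n- Old Fennick\n- The Hollow King"
print(parse_names(fake_completion))
# ['Mira the Lantern-Keeper', 'Old Fennick', 'The Hollow King']
```

The contrast with the GPT-2 era is that the cue alone ("the following characters appear:") does the work that few-shot example lists used to do, though a parser like this is still needed since the model returns free text.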

Jim: That is pretty amazing. And as you discussed earlier, GPT-2 and GPT-3, architecturally, are quite similar, it’s just a difference of a much bigger dataset, or not just a bigger dataset, but a bigger model with many more parameters. So essentially, more richness of whatever the hell it is that goes on in the interface between language and cognition, a better statistical representation of it in the GPT-3 model than in GPT-2, is that approximately correct?

Connor: Yeah, exactly. The person who connected you, that said I was a maximalist, he was probably talking about some of my thoughts and some of the things I’ve said in the past about GPT-3 and such. So I think people very much underestimate how powerful GPT-3 is, and how different GPT-3 is from other technologies. I really think this is an incredibly important discovery, just the scientific discovery that this is possible. That you could just scale these things up and they just gain all of these weirdly human characteristics. Here’s one of my favorite examples, two friends of mine wrote a paper where they tested… So in the original GPT-3 paper, there was a translation task, where they were just translating English into French, and the way they tested this was with a prompt that was like, English: English sentence, French: and then the model would fill in a French sentence.

Connor: My friends found that they could get statistically significantly better performance, like several percentage points, not an insignificant amount, by instead making a prompt like, the following sentence, sentence, is translated by the masterful French translator as, and then that. Putting in the “masterful” is really important, if you don’t do that, the performance degrades. Very often with GPT-3, if you just tell the model, you’re a super intelligent, very helpful model that wants to help humans, it makes it nicer and gives you more useful answers. And my explanation for that is, if you think about the task the model was trained on, it wasn’t necessarily trained on being correct or giving the correct answer. It was trained to, basically, emulate the median internet user, or to predict, or to role play, to LARP as whatever is given in the text.

Connor: So if you tell it, “Please role play as something super intelligent,” in a way, it will then try to simulate something more intelligent than the median internet user. And I find this absolutely fascinating.

Jim: That is really interesting, and it actually fits in well with my proposed research project, which I’m going to run by you to see what you think. What I proposed to OpenAI was, my home field in science is evolutionary computation, and you’ve pointed out that these changes in the prompts, even subtle and non-obvious ones, make quite a difference in the output. And so, what I proposed was using a simplified form of genetic algorithm to explore prompt space, and then use Mechanical Turk as a validator on the output. Take an example like you gave, any of these examples, fairly simple ones, and then just basically iterate between GPT-3 and Mechanical Turk, and use GAs to select the genes, essentially being the prompt, and see if that converged to better results. Does that make sense to you?
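The loop Jim sketches can be written down in a few lines. This is a toy: the gene phrases are made up, and the hard-coded fitness function stands in for the expensive scoring step (query GPT-3 with the prompt, have Mechanical Turk workers rate the output), but the select-crossover-mutate structure is the real genetic algorithm:

```python
import random

random.seed(0)

# Hypothetical "genes": phrases that get concatenated into a prompt.
GENE_POOL = ["masterful", "helpful", "translator", "please",
             "carefully", "robot", "random", "blue"]
# Stand-in fitness: in the real proposal this would be Turk ratings
# of GPT-3's output; here certain phrases are simply "good".
GOOD = {"masterful", "translator", "carefully"}

def fitness(prompt: list[str]) -> int:
    return sum(1 for gene in prompt if gene in GOOD)

def evolve(pop_size: int = 20, genes: int = 3, generations: int = 30):
    pop = [[random.choice(GENE_POOL) for _ in range(genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # selection: keep top half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, genes)      # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:             # occasional mutation
                child[random.randrange(genes)] = random.choice(GENE_POOL)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```

Because the real fitness evaluation is a zeroth-order, black-box signal (human ratings, no gradients), this is exactly the setting where evolutionary search applies, which is Connor's point in the next turn.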

Connor: It does. It does make sense to me. So there’s actually a lot of research… I’ve actually worked on something kind of similar to that for my own projects. So there’s interesting… One of the reasons, again, I want direct access to a GPT model is because what you describe is a zeroth-order optimization method, evolution is zeroth order, you don’t have any gradients.

Jim: Yeah, yeah. The weakest but most general of all methods, right?

Connor: Exactly. Like, you can do a lot with evolution that you can’t really do with gradient-based methods, but, of course, it’s also much harder to optimize. So there’s been a lot of very interesting work in the last few months on continuous prompt programming, where you don’t give it words… So the way GPT-3 works is, you have these tokens and each token is then put into an embedding space, so it’s transformed into a vector.

Connor: And this vector, which kind of encodes what that token means, is then the input to the actual model. And what they found is that if you have a task, you can backprop through the model to find token embeddings that are in between words, they’re not real words, they’re just these abstract concepts of something, and you can use those to get the model to perform significantly better on different tasks. There’s something called prefix-tuning and such. I could leave some papers if you’re interested in this.
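A stripped-down illustration of that idea, not any real model's API, just the principle: freeze a toy linear "model" over concatenated embeddings, and run gradient descent only on a continuous prompt vector. The prompt that comes out is a point in embedding space "in between words", which is the same shape of trick as prefix-tuning:

```python
import random

random.seed(0)
d = 8
# Frozen toy "model": a linear map over the concatenation
# [prompt embedding ; input embedding]. Stands in for a real network.
w_prompt = [random.gauss(0, 1) for _ in range(d)]
w_input = [random.gauss(0, 1) for _ in range(d)]
x = [random.gauss(0, 1) for _ in range(d)]   # fixed input embedding
target = 1.0                                  # desired model output

prompt = [0.0] * d                            # learnable soft prompt
lr = 0.02
for _ in range(1000):
    out = sum(w * p for w, p in zip(w_prompt, prompt)) \
        + sum(w * xi for w, xi in zip(w_input, x))
    grad_out = 2 * (out - target)             # d(squared error)/d(out)
    # Backprop only into the prompt; the model weights stay frozen.
    prompt = [p - lr * grad_out * w for p, w in zip(prompt, w_prompt)]

final = sum(w * p for w, p in zip(w_prompt, prompt)) \
      + sum(w * xi for w, xi in zip(w_input, x))
print(round(final, 3))  # close to the target of 1.0
```

This is the first-order version of prompt search: it needs white-box access to the model to compute gradients, which is exactly why Connor contrasts it with the gradient-free GA approach, and why API-only access to GPT-3 rules it out.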

Jim: Oh, yeah, that’d be very interesting.

Connor: Yeah. So if you have access to the model and the task you want it to perform well on, you can use first-order optimization, so you can backprop through the model to get these continuous representations that can give you really good performance. I, personally, also work on, I’m very interested in, using reinforcement learning and stuff like this to learn human preferences. So something I’m interested in is training… So this is based on work done, the last author was Paul Christiano at OpenAI, where basically he trained a model on human-labeled summaries of texts, so there was text, it was summarized, and humans labeled whether the summaries were good or bad.

Connor: He trained the model to predict whether humans would like or dislike a certain summarization. And then he used that model as a reward signal for a reinforcement learning algorithm to fine-tune a GPT-2 model to produce better summarizations. And the final model actually outperformed the human benchmark.
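A toy sketch of the reward-model half of that pipeline. Everything here is a stand-in: two made-up features per summary instead of GPT-2 representations, and logistic regression instead of a neural network, but the shape is the same: learn to predict the human like/dislike label, then expose a score an RL fine-tuner could maximize as its reward.

```python
import math
import random

random.seed(1)

# Toy dataset: (features of a summary, human label). Features are
# hypothetical "coverage" and "fluency" scores in [0, 1];
# label 1 = human liked the summary, 0 = disliked.
data = [((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.2, 0.3), 0),
        ((0.3, 0.1), 0), ((0.7, 0.9), 1), ((0.1, 0.2), 0)]

# Train a logistic-regression reward model by stochastic gradient descent.
w = [0.0, 0.0]
b = 0.0
lr = 0.5
for _ in range(2000):
    for (f1, f2), label in data:
        z = w[0] * f1 + w[1] * f2 + b
        pred = 1 / (1 + math.exp(-z))   # sigmoid: P(human likes it)
        err = pred - label              # gradient of the log loss w.r.t. z
        w[0] -= lr * err * f1
        w[1] -= lr * err * f2
        b -= lr * err

def reward(features):
    # This learned score is what the RL algorithm would maximize
    # when fine-tuning the summarizer.
    z = w[0] * features[0] + w[1] * features[1] + b
    return 1 / (1 + math.exp(-z))

print(round(reward((0.9, 0.9)), 2), round(reward((0.1, 0.1)), 2))
```

The actual work Connor refers to used a GPT-2-based reward model and PPO for the RL step; the point of the sketch is only the two-stage structure, preference predictor first, then policy optimization against it.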

Jim: Wow, that’s really interesting. My interest has always been in search techniques that allow the power of the back end to be searched at some high level, on a high order, whether it’s GAs or GPs or other funky… particle swarm optimizations, or other forms of algorithmic search in a space like this. And I think this is very interesting that people are starting to do this. Because it seemed obvious to me, it’s the kind of thing that ought to have been done, but none of the early papers were doing this at all.

Connor: Yeah, absolutely. That’s what I mean when I say, I think we haven’t even scratched the surface of what these models are capable of or what the correct way to prompt or control these models really is. I think we’re very much in the early stages of understanding how to use and harness what these models actually learn.

Jim: And then, as I understand it, this first-order optimization you talked about, where you actually use the abstract token space, rather than words, is something one could not do on GPT-3. They don’t give you any access to it, but one could do it on GPT-Neo.

Connor: Exactly.

Jim: So there’s a sales pitch for GPT-Neo, that if you want to work in abstract tokens rather than actual words, then you need actual access to the model. Which actually, I’m going to go on a little rant here, this isn’t the first time I’ve run into this. God damn OpenAI, when they first started out, it was all going to be open source, give it to the world, gift economy. Various billionaires staked this damn thing, and it was going to be given to everybody. Well, guess what? They very quickly turned tail on that one, and have God knows what kind of bizarro deal with Microsoft, and, what’s up with all that?

Jim: And do you have an opinion on what’s going on at OpenAI? And why they have defected against the world on their original commitment to be all open source?

Connor: Oh boy, do I have opinions.

Jim: And that’s why you’re on the Jim Rutt Show. You can say whatever the fuck you want, right?

Connor: I appreciate it. I appreciate it, but honestly, yeah, as much as I have opinions, in a way they might not be as inflammatory as you might hope them to be.

Jim: Ah, shit.

Connor: I have a lot of friends at OpenAI. I think a lot of the people at OpenAI… I was very like… You know some of my history and stuff, I’ve had my disagreements with them. I’ve had very severe disagreements with some of these people on what their policies are and whatever. But I must say that once these people actually give me the time to talk to them, and I’ve heard their arguments for what they’re doing and why they’re doing it, I’ve been surprised by how often I was like, “Ah, okay. I actually kind of see where you’re coming from here.”

Connor: So, basically, here has been my timeline of how I feel about OpenAI. When OpenAI first did the GPT-2 thing, I was like, this is incredibly silly. I think this is stupid. This is when I was micro-infamous for like two weeks for creating a version of GPT-2 and threatening to release it. And I wrote a terribly long, cringey manifesto about why I thought this was a good idea. I decided not to, because I actually heard some of their arguments. And basically, I think there, they had a very good argument.

Connor: Their argument was basically just, yeah, GPT-2 is probably not dangerous, but someday some model will be dangerous. And it’s a good idea to have these discussions now. How should we think about responsible disclosure? How should we think about deployment and such? So in a way, I felt like an asshole, because I was shaming these people who have reasonable security concerns. I think they were wrong, I had object-level disagreements with them, but it was a reasonable thing to be concerned about, so in that case I would have called myself… I think I was being the dick in that situation.

Connor: Things have changed slightly, in my opinion, since then. So I think that was… I’ve talked to OpenAI people who have also said, “Yeah, we hate the name. We should have never used this name, because…” I think it was never the intention for it to be purely open source, if I understand this correctly, that was never actually the goal. The goal was always, I don’t know if you’ve read the OpenAI charter, and whether or not it’s crazy or cringey, to each their own, but they take the things that they say, at least the people I’ve talked to take it very seriously, that AGI will come, it will change everything. This is going to be bigger than the Industrial Revolution.

Connor: And as much as I want to rag on OpenAI, they do have the charter. They do have this weird corporate Windfall Clause. A lot of people say the Windfall Clause is kind of silly. For those who don’t know, the Windfall Clause says that after OpenAI has paid, I think, 100X returns to their shareholders, the rest goes to the non-profit, so it’s kind of like a capped return. Of course, the cap is ludicrously high, it’s like 100X or something.

Connor: But if you take the argument seriously, that this could be bigger than the Industrial Revolution, which you may or may not take seriously, then it is at least a sign of good faith in a way for them to have a thing like this. I must say though, that in the last year, ever since the GPT-3 thing, I’ve had much more severe concerns about OpenAI’s direction.

Connor: So I don’t know if you’ve heard, but recently there was an exodus where a lot of their top people left OpenAI. A lot of them I knew. Basically, all the people I really liked left, not all of them, but most of the people I really looked up to left. Especially a lot of the safety team and such, which I’m not happy about. I’ve also just had some interactions with people like Sam Altman, where I found that they have opinions that I don’t really agree with, or I find their opinions worrying in that regard.

Connor: So overall, currently, I’m at the point where I basically don’t trust OpenAI anymore. I think they’ve gone way too hard commercial with this GPT-3 thing. And even if you disregard the commercial thing, if you take these safety things seriously, so let’s assume I believe everything OpenAI says, “AGI is coming, we’re going to build it. It’s going to change the entire world and make a quadrillion dollars” or whatever. Let’s just assume these things are true. If they take these things seriously, and they take these arguments about safety seriously, in my personal opinion, the most dangerous thing that they did is just to tell the world that scaling was possible.

Connor: Because before that, scaling was widely considered to be a joke. Everyone was like, “This is a dead end. Don’t do this,” whatever. “We need smarter architectures, not bigger models,” or whatever. But now that OpenAI has shown, with GPT-3, that scaling is at least one of the paths forward, and it’s an easy path, you just need more GPUs, just buy more GPUs, anyone can do that. In many ways, they’ve accelerated a race towards more and more powerful AI technology, out of control, with a lack of concern for safety research. So I find there’s a lot of contradictions in the communications from OpenAI. A lot of my favorite people left. Yeah, I’m very not happy with the current state of what they’ve been doing. And I do fear what they might be doing in the future.

Jim: Yeah, rant off more or less. So I’ve always wondered if the GPT-3 isn’t a head fake to get people going down a different road from the real answer. Well, maybe we’ll talk about that in our next episode. So that’s a perfect place to pivot and this’ll be sort of our last topic. And I would like to get you back to talk about all the other things on my topic list here. So I’ll ask my guy, George, to hook us up sometime in June to continue this conversation. But let’s now pivot to the idea of alignment and AI safety.

Connor: Yeah. So alignment, or AI alignment, kind of describes a certain niche within the wider world of AI safety and AI ethics. So I would describe it as a certain flavor. It’s like, there’s a certain type of person involved with it, certain founder effects. And it goes back to people like Nick Bostrom, Eliezer Yudkowsky and stuff like that, kind of the founders of these ideas. And basically, the way I would like to explain it is just, I would like to ask all people working on AI, what happens if you succeed? What happens if we succeed? If it just works, we have superhuman intelligence. It’s a billion times smarter than a human. It can solve any problem. What then? How can we control such a thing, if it’s smarter than us? Obviously, it can trick us into doing whatever it wants.

Connor: We already can’t control our governments, our economic systems, our computer programs. Just think about our computer safety, we have bugs and software flaws everywhere. Now, if you have a bug in your software, okay, maybe your browser crashes or something, but what if it controls the entire economic system of the planet, or it’s curing cancer or whatever?

Connor: So basically, alignment is this question of looking just a little bit into the future, into these really powerful AI systems that are as smart as or smarter than humans. And just this question of, how would you even get them to do good things and not bad things? What does it even mean? How would you formalize this idea? How would you stop them from doing something very dangerous, potentially? If they’re very powerful, they might well be able to do very, very dangerous things.

Jim: Interesting. Yeah. And so, how does that relate to GPT-3? Which at one level, I mean, it’s very impressive, but it doesn’t look like it’s going to turn the universe into paperclips anytime soon.

Connor: Ah, you bring up the paperclip maximizer, a classic, of course.

Jim: And we know it’s misleading, but nonetheless, there’s different kinds of risks, and GPT-3 is nowhere close to an obvious existential risk, like the theoretical paperclip maximizer that Eliezer and others love to talk about. I’ve had some interesting conversations with Eliezer back in the old days, back before they became MIRI, and he’s definitely an interesting dude, but I… Anyway, another conversation for another day. So what are the risks around things like GPT-3, or the misalignments with the good of humanity?

Connor: So there are two prongs to this thing. There are the concrete things that GPT-2, 3 could do, stuff like misinformation, biased language, use for troll farms, whatever. Personally, I think you’ve seen my extremely long Medium posts about those kinds of stuff. I think these are real concerns. These are real, but they’re not existential. I don’t think we’re going to paperclip ourselves by having mean Twitter bots. I don’t think it’s good that these things happen. And I think we should figure out ways to address these problems, but they’re not what I personally work on. So I’m more concerned about existential risk.

Connor: So the reason I’m interested in GPT-3 and these types of models from an alignment perspective is, I think there really genuinely is an appreciable probability, I personally put like a 10 to 30% chance, that the first transformative AI, so like a real paperclip-level threat, will just be a very, very large, trained-on-GPUs, made-with-PyTorch, pre-trained-on-infinite-data system of this kind. So there’s a lot of reasons why I think this might be the case, again, like 10 to 30%. Of course, who knows, unknown unknowns, who knows what’s coming down the line.

Connor: But for the first time in my life, I look at these transformer-based models, these large scaling models, and I see, as far as I’m concerned, a direct path to AGI. If these scaling laws hold and if they work for different modalities. Whether or not text is enough for AGI is kind of controversial, but we’ve already seen that these scaling laws also work for images, they work for math, they work for sound, video. They work for all of these different modalities. You can use the same architecture to combine all of these different things.

Connor: And it seems to me that… I also have some hunches about how this relates to… The brain also does something like unsupervised learning, in the neocortex, it seems to be really important. In a way, a human brain is just a scaled-up chimp brain. It’s all the same parts, it’s just three times as large. So whether or not that’s true, it’s not super central to the argument, but I have a pretty strong hunch that there is a connection there. And it seems to me, seeing how between GPT-2 and GPT-3 we have these really interesting, seemingly intelligent things emerge, of course, people would never say, “Oh, GPT-3 is sentient,” or something, of course not.

Connor: But then again, you can talk to it. You can talk to it like a person and it works pretty well. It is kind of weird. If someone from the future came to you and gave you a black box, and the black box just talked to you like a person, you would also be like, “Hmm, this seems concerning. This seems like something we might want to research more carefully.” And as we’ve seen, the difference between GPT-2 and GPT-3 is really hard to quantify. We’ve talked about this with the evaluations. It’s really hard to give like an… I couldn’t predict ahead of time what GPT-3 might be capable of. In the same way, if someone trained a trillion parameter, 10 trillion, quadrillion parameter model, I don’t know what it might be capable of.

Connor: I think there’s truly a non-zero probability, if you train like a quadrillion parameter model or something, and it has such a good simulator that you tell it, you are a paperclip maximizer, your actions are, colon, that it might very well just do that, if you’d unhook [inaudible 00:50:49]. Of course, no one would actually do that. But I feel like this is an interesting test bed to start to research these generalized models of complex data and how we might be able to align them, because they’re also not yet dangerous.

Connor: I don’t think GPT-3 is going to paperclip the universe. I think that’s good, because I think that means we can do kind of stupid experiments with it, that we might not want to do with an actual AGI.

Jim: Interesting. Well, I think we’re going to wrap it up there, because this is a natural branch to episode two, where I’m going to come back and say, “Ah, here are several reasons why I don’t think GPT-X is actually AGI.” And that was the original reason I think we were connected. And I’m really looking forward to the next conversation, and I also want to dig into and get your updated views on your Counting Consciousness articles, which I found… I was, holy shit, when I was reading those things this morning, actually. I’d love to talk about those. And they were written a while ago, I mean, what was it? Two years ago, that’s ancient history today. I’d love to get your perspective on those things as well. So Connor Leahy, I want to thank you for an excellent episode of The Jim Rutt Show, and look forward to having you back.

Connor: Well, thank you so much. It was great.

Jim: Indeed, I really did enjoy it.

Production services and audio editing by Jared Janes Consulting. Music by Tom Muller at