Transcript of EP 325 – Joe Edelman on Full-Stack AI Alignment

The following is a rough transcript which has not been revised by The Jim Rutt Show or Joe Edelman. Please check with us before using any quotations from this transcript. Thank you.

Jim: Hey, listeners. I finally got around to kicking off my Substack, and I just put out a quite interesting little paper—at least I think it’s interesting—called “A Minimum Viable Metaphysics.” Yes, Jim Rutt famously says, “When I hear the word metaphysics, I reach for my pistol.” But everybody has a metaphysics even if they don’t know what it is. And I’ve been thinking about it for 18 months. What is my metaphysics? And I have published this little essay. Check it out: jimrutt, all one word, .substack.com.

Now on to our conversation today. Today, we’re talking with Joe Edelman. I’ve known Joe for many years. He was one of the original Game B-ers back in 2013, and he’s probably best known as the originator of the concept of “time well spent” as a metric for online services and more broadly for life itself. I find myself using the term fairly frequently—time well spent—and I try to remember to credit Joe with its coinage. Very interesting and important concept. He was one of the cofounders, along with Tristan Harris and others, of what is today the Center for Humane Technology. More recently, he helped build the School for Social Design, an online school for enlightened product and service design, and today is a key member of the Meaning Alignment Institute, which researches how to align AI, markets, and democracies with what people value. They call this “full stack alignment.” Welcome, Joe.

Joe: Thank you, Jim. It’s nice to be here. It’s good to be here again.

Jim: Yeah, it’s great to have you back on. Actually, we had Joe and Ellie Hain—we had a very good conversation back in Currents 080 on rebuilding meaning. So if you like this conversation, check that one out too. Baby needs a new pair of shoes.

Today, we’re gonna talk about a paper that came from the Meaning Alignment Institute titled “Full Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value.” Joe is one of the authors; there’s a long list of coauthors. We’re gonna go through the paper in some detail today. Now, on the topic of aligning AI, it’s probably also worth noting that I’ll be having Nate Soares on relatively soon. He’s the coauthor of the New York Times bestseller If Anyone Builds It, Everyone Dies. What he and his coauthor, Yudkowsky, talk about, they call “AI alignment.” So AI alignment’s a thing in the air now, but we’re gonna be talking about something quite different, and I thought that was worth mentioning.

So, Joe, first thing: I was very pleased to see that you called something out, which I find many people not including in their thinking, and that’s pluralism. You specifically say, “while remaining pluralistic and respecting the diverse ways people pursue flourishing.” I continue to believe this is one of the core ideas about humanity that’s absolutely important in all of our design of social systems, but it’s very frequently forgotten. Tell me your take on pluralism and why you included it in this paper.

Joe: Yeah. So something that I’ve been obsessed with for maybe a decade or something, and it was part of my work on social media before this current work, is: what’s the information basis for alignment? So what would, for instance, a social media recommender need to know about people to give them the kind of recommendations that are really good for them? And what does an LLM need to know about you, or an LLM agent if it’s gonna go arrange things to help you flourish, right?

And the default answer in these fields is your preferences. So the recommender system would understand what you seem to like to click on or comment on or whatever and then give you more of the same. And this is also spreading into the world of chat agents. Sometimes ChatGPT will ask you, “Do you prefer this response or this response?” or will ask you to thumbs up or thumbs down a response, right? It’s the same kind of preference data.

That’s one kind of pluralism—people have different preferences. Some people prefer terse writing, and some people prefer more verbose, flowery writing, for instance. And you can sort of pick that up from these upvotes and downvotes and recommend different things to different people. But I think there’s much more interesting kinds of pluralism than preference pluralism. There’s norms. There’s values.

Two kinds of pluralism that are really important to take into account are norms and values. So norms are what we think are appropriate, and that varies across cultures. The appropriate norms of journalism, for instance, vary across different publications, across different communities that think of themselves as journalists—similar with legal norms and finance norms. And values, I think, vary in a different way. What people find meaningful, what makes a good life to them, will vary from person to person.

So a lot of my work has been about this question of the informational basis on which to align, and this brings you immediately into questions of how people differ.

Jim: So in the paper, let’s get back to this idea of preferences and contrast it with your ideas of values and norms. You talk a fair bit about how preferential models fail to capture what people truly care about. Let’s dig into that a little bit and maybe talk a little bit about how preferential methods are substantially the basis for what we currently have online.

Joe: So it’s not just online, first of all. Markets and voting are both preference systems; they’re about aggregating preferences. Right? When you vote, you usually have a small menu of options, maybe politicians or policies. When you buy something, you’re shopping from a menu of options. You say, I prefer this one to that one. I buy this toothpaste instead of that toothpaste. And I think it would be possible to collect value information if you had some ideal menu. But in practice, if you’re choosing between Biden and Trump, say, or you’re choosing between Colgate and some other kind of toothpaste, that doesn’t actually collect information about how you want to live. It collects information about maybe price, maybe flavor, maybe something about your politics, like which of these two politicians you think might protect your special interest group or something. It collects some information, but it’s very shallow information. And that’s because the menu has been curated in a way that the only thing you can express with a preference is a choice among the menu options in front of you.

And the same with social media recommenders. Right? If you’re scrolling and you like this and you like that, well, there’s a lot of things that just aren’t on the menu. And even if they were, you’d need much more information about many more preferences before you could get the right kind of read on people. It’s different with LLMs; much more is possible, because you can describe in a prompt, for instance, what you care about. You can use much richer language, and you can even give a mathematical specification for what you care about if you want. And so compared to what’s possible with LLMs, it becomes clear that our ways of voting, our ways of doing market allocations, and things like social media recommenders are very primitive in terms of the language they can operate on, the information basis.

Jim: Got it. Now you also critique, however, just straight text. It’s certainly true that LLMs and related ways of processing text can be higher dimensional than, you know, single low-dimensional preferential systems, but taking a straight text approach also has its limits. Tell us your thinking about that.

Joe: Yeah. The paper goes into this in much more depth, but I think there are a couple of different problems with text. One is that the current LLMs are largely trained through text specifications that are very vague. Take Claude, for instance. Part of the training loop for Claude (this is called constitutional AI) is that Claude will generate multiple responses, and then another version of Claude will try to pick the one that’s most helpful and say, that’s the response Claude should give. Or pick the one that’s most harmless, or the one that’s most honest, which is another term they use in the constitution. There are many of these kinds of terms, but they’re very vague. They’re not defined in Claude’s constitution; they just use a single word.

Also, when people tell an LLM or an agent how they want it to operate, there’s a lot of ambivalence there, a lot of ambiguous terminology. There are certain cases where this is okay. It really depends on the stakes and also the cultural pressures. But take the question of harm. There’s a big issue in Western culture right now about whether certain kinds of speech are harmful. If there’s no instruction to the model about what is actually meant by harm, then there’s a cultural pressure to push model behavior in one direction or the other by favoring certain interpretations of the terms. And all of that is not specified in the information that’s used to align the model. So that’s some of the problem we discuss in the paper.

Jim: Though on the other hand, the advantage of language is it’s open ended. Right? You can build in as much nuance, as much detail as you want. And, you know, you and I chatted a little bit pregame about how we’re both pretty heavy users of these models, and one of the things you quickly learn is the absolute criticality of getting good at prompt engineering.

Joe: Yeah. And I think a lot of that is about learning what you need to say; it’s the same thing if you’re managing someone. You learn how to be less ambiguous. And in a way, that’s what we’re advocating for. But we think you should use philosophy (the philosophy of values, and the cognitive psychology and philosophy of norms) to do this. So we’re not against saying things using text, but we think it’s important to know how much you need to say. And for that, you need a theory of what a norm is or what a value is, so that you can say, okay, I’ve actually communicated substantive information about an actual value rather than just a word like “helpful.”

Jim: Got it. And just for the listeners, I’ve really focused on this lately: there’s a magic word in prompt engineering. This is a total aside; it has nothing to do with the paper. The word is “then.” Particularly with the thinking models like o3 and now GPT Thinking, and even more so GPT Pro, link the things you want it to do with “then,” and it becomes very smart at sequencing them correctly. A small little trick, but one that I find to be pretty powerful. Anyway, let’s move back to the paper. What you guys propose is what you call thick models of value as the answer to the limitations, first, of a preferential approach and then even of a higher-dimensional but still fuzzy textual approach. So tell us in some detail about the idea of thick models of value. What does it actually bring to the table?

Joe: Yeah. So there’s been really good work in the philosophy of choice and in cognitive science about this. In the philosophy of choice, there’s this question of how our values influence our choices. One model is that we reason about our choices: we have reasons that suggest option A is better than option B. One of the things that voting and markets and social media never do is ask, oh, why did you click that? Right? If you say why, you give your reasons, and that’s much more information. Those reasons often reference your values, what’s important to you, and sometimes they also reference norms, what you think is appropriate in this situation. So there are theories in both cognitive science and the philosophy of choice about how all these things connect: what is the underlying grammar that connects norms and values, through reasons, into choices? And based on these theories, you can make much stronger claims about what you would need to communicate in order to communicate a value or a norm.

To give an example, one of my coauthors, Li San Shen, has done a lot of work on reading the room in terms of norms in multi-agent systems. If you’re a robot, let’s say a self-driving car, and you look at a road, can you tell what the norms of the road are? Right of way, stopping at stop signs, and so on. You haven’t been programmed with any of this, but you can watch other cars. Or, to give an example because I’m in Germany: if you go to a German sauna and you’ve never gone before, could a robot intuit, just by watching what people do, what the norms of the sauna are? Her method is that she has ways of guessing what the goals of the other agents in the system might be, and then asking: when are they not actually pursuing their goal? When are they setting their goal aside to, for instance, hold the door for someone or stop at a stoplight? Based on that, there’s a mathematical specification of what a norm is: a pattern in terms of when you would defer your goal based on these kinds of social rules. So she has some math that goes from observed behavior, backs out the goals, and then backs out the norms.

What you end up with is a specification for what a norm is. It’s very rich, and it can tell you exactly what you’d need to communicate, what the information content is, to have communicated a norm to, say, an LLM, which could then abide by that norm. So that’s an example of a thick model of value: something that, based on a theoretical understanding of what a norm or a value is, provides a kind of package you can use to convey the content of a norm or value, and also to judge whether a value was used in a choice, or whether a norm is being abided by a community, just from behavior, from choice behavior or the actions of the other parties. This is a much richer mode of analysis and communication than just text, and much, much richer than preferences.
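
To make that inference pattern concrete, here is a minimal, illustrative Python sketch: watch agents, compare what they actually do against what their inferred goal alone would dictate, and treat systematic goal-deferrals as candidate norms. This is not Shen’s actual formalization; the data structures, field names, and thresholds below are invented for illustration, and it assumes the goal-inference step (the hard part her math addresses) has already been done.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Observation:
    agent: str
    context: str               # e.g. "approaching_stop_sign", "entering_sauna"
    action: str                # what the agent actually did
    goal_optimal_action: str   # what it would have done if only pursuing its goal

def infer_candidate_norms(observations, min_count=5, min_support=0.8):
    """Propose norms as (context, action) pairs where agents reliably set
    their own goal aside, i.e. act against the goal-optimal action."""
    deferrals = Counter()   # (context, action) -> how often agents deferred this way
    totals = Counter()      # context -> total observations in that context
    for obs in observations:
        totals[obs.context] += 1
        if obs.action != obs.goal_optimal_action:
            deferrals[(obs.context, obs.action)] += 1

    norms = []
    for (context, action), count in deferrals.items():
        support = count / totals[context]
        if count >= min_count and support >= min_support:
            norms.append({"context": context, "action": action, "support": round(support, 2)})
    return norms
```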

Jim: This is very interesting. So in the paper, you call out the idea of a grammar and/or a type system for normative concepts. Have you guys actually been able to develop such a thing?

Joe: Yeah. I think the two that I’m most happy with are Shen’s model of norms, which I just mentioned (she has a paper on it where you can read more deeply), and ours, a formalization of values as attentional policies: a value as a path of attention that you use in making a certain kind of choice, in a particular choice context. So let’s say you’re deciding who to hire. Your attention will reliably go to certain kinds of considerations. If we study you while you’re considering whether to hire someone, what are you looking at? What are you thinking about? There will be regularities there, and that’s what we call a value. There’s another paper from the Meaning Alignment Institute, by me and Ryan and Oliver, called “What Are Human Values, and How Do We Align AI to Them?” that goes into this thick model of value as attentional policies.
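
As a loose illustration of the attentional-policies idea, here is a tiny data structure: a value is tied to a choice context and names the considerations attention reliably goes to in that context. The field names and the example value are hypothetical, not the schema from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Value:
    """A value as a set of attentional policies for a given choice context."""
    title: str
    choice_context: str                  # e.g. "deciding who to hire"
    attentional_policies: list = field(default_factory=list)

hiring_value = Value(
    title="Hiring for genuine craft",
    choice_context="deciding who to hire",
    attentional_policies=[
        "moments the candidate lights up when describing past work",
        "evidence they sought out feedback and changed course",
        "signs they care about the people they'd serve, not just the title",
    ],
)

def attends_to(value: Value, considerations: list) -> bool:
    """Rough check: did a recorded choice attend to any of this value's policies?"""
    return any(policy in considerations for policy in value.attentional_policies)
```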

Jim: How do you make sure that when you define a grammar and particularly when you define a type system, you often end up accidentally smuggling in values of your own? How do you make sure that a grammar and particularly a type system still honors pluralism?

Joe: I guess that’s really a question about expressivity. I think a good test of Shen’s model of norms or our model of values is to take a bunch of preexisting norms that you know about, or a bunch of preexisting values that are maybe your own values, and make sure they can be expressed in these grammars. And I’ve been impressed with both of these formalizations; they are expressive enough. Then you also want to make sure they’re not too expressive, by taking some things you wouldn’t call values or norms and making sure you can’t express them. In a text string, you can write “purple flamingos” or whatever. Is that a norm? Is that a value? It’s not. So it shouldn’t be expressible in these systems. But something like being honest, or a certain style of honesty, say epistemic humility, or owning up to things as another style of honesty: these things are expressible, as are all the norms I mentioned when I was talking about Shen’s work. So that’s the way to test, I think.

Jim: I like the idea of testing for non-norms: can you represent them in the system? I call that a pruning rule. A grammar and/or a type system, to be actually useful, does need to be a pruning rule. There have to be a lot of things you can’t say in it, because if you can say anything, you’re not really saying anything. Does that make sense to you?

Joe: Yeah. Absolutely.

Jim: And now the other one (this is a little elaboration, actually, and I have it in my notes for later): the distinction between values, norms, and more transient things like taste and fashion. Talk about that a little bit.

Joe: Yeah. So here’s one way, and I’m not saying this is canonical, but here’s one way to tell the difference between a taste or a trend on one hand and a value or a norm on the other. With tastes, you’re fine with other people having whatever they have. If I prefer chocolate and you prefer vanilla, that’s fine; it almost feels like it’s as it should be. Right? With norms, it doesn’t work that way. If we both live in America but I prefer driving on the left, that’s not right. That’s not gonna work. That’s the difference between norms and tastes, and a good way to tell them apart.

Jim: On the other hand, in our current political context, there are things that are both held simultaneously; let’s use the canonical example, pro-life and pro-choice. There’s nothing like the sides of the road to adjudicate those choices. It’s certainly possible for different people to have those two different norms, right, or values.

Joe: Oh, I’m not talking about impossibility. Well, so I haven’t gotten to values yet.

Jim: Okay. Continue.

Joe: And actually, I think that example is relevant to values. So with norms, there’s often a feeling like, oh, we all have to agree, and it’s not okay not to. Defection is a problem.

Jim: Yeah. I call it causal constraint. There are at least probabilistic forces that actually exist that tend to push you away from violating the norm. Like you just said: driving on the wrong side of the road, or going into a Walmart parking lot and talking to the people there in an obscure Amazonian forager language. You know? You’re not gonna make it very far. There are higher-level constraints on violating those norms.

Joe: And then there are values. A lot of people use the word values to include norms; I don’t, so this is just a technical note. When I use the term values, I mean wise ways of living, or methods of discernment, or things that have to do with what’s meaningful to people; I think these are all equivalent. And with a value, you actually want one that’s maybe wiser than your own, which is not true with taste. If I prefer vanilla and you prefer chocolate, I don’t want someone to educate me about that; there isn’t a wiser taste out there that I would actually prefer to have. But with values, let’s say I’m trying to hire employees and I’m not that good at it yet, not so good at sensing what makes a good employee. I might really like a mentor. Right? I might learn something from how somebody else does their hiring. And that means it’s a value in our terminology. That’s different from tastes and also different from norms. That’s one way to separate them out.

Jim: Well, let’s get back to the pro-choice, pro-life thing. You said that that was relevant. That doesn’t exactly seem like one of those kinds of values.

Joe: Yeah. I mean, we have several such conversations; some are listed in the paper, and I encourage people to look it up. It’s on arXiv. People start with “abortion is murder.” Like, that’s the first thing they say to the LLM. And five minutes in, eight minutes in, they’re telling a very different story about what they think is important.

Jim: Let’s move on now to the integration with AI. One of the things that concerns me very deeply is AI assistants. Yes, I use them all the time. I love them. I find them an amazing force multiplier. Joe and I talked a little bit earlier today about how amazing they are for folks like us who have many ideas but need help with, you know, the formalization, the mathematics, etcetera. These things are astounding. However, they are also potentially massive vectors for manipulation. Speak about that a little bit.

Joe: So we’ll probably talk about it later, but we have five research areas and four kind of moonshot things that we think need to be built to create a good, free future for humans and AI. Most of those are at the institutional level; they’re about things like democracy, markets, governance. But one of them is at the level of a single AI agent. We call it Allied AI: an agent that really understands your values and what’s important to you and can represent you well according to them. It doesn’t alter your values except through helping you get wiser, in a way that’s rigorously not manipulative of you.

There are different ways you can think about this problem of non-manipulative AI assistance. One would be that it just leaves you however it found you. If you have some dumb ideas or some moral positions that are actually not very well thought out, it lets you have them and operates according to them. We don’t want that. But we also don’t want a situation where the values of the people that made the AI are spreading into the users, or where the AI is bringing people down to the median or the most popular values rather than their individual values, which might actually be better. So there are a lot of failure modes to avoid. I think thick models of value help with that too, because these theories about how values go through reasons into choices, and how norms evolve, are theories about what good moral reasoning is. You can use such theories to let people evolve their moral thinking in collaboration with the AI in a way that doesn’t involve incepting any values from outside, but just speeds up the moral thinking they would do anyway. Does that make sense?

Jim: It sort of makes sense. I mean, I understand what the words are; I’m not sure how you’d actually implement that in the world. Let me give you an example. Especially in our current politics, there’s an awful lot of, let’s say, political positions based on gross ignorance. Consider the autism vaccine controversy, for instance. When exposed to facts, at least some people will change their views. Less than I would like, but at least some. How do you disentangle value updates via facts from value updates via moral reasoning or logic or some other rhetorical form? Because facts presumably are capable of changing values that people might have, and probably should.

Joe: Yeah. None of what we’re working on is really about facts. And, generally, that’s another thing we want abstractly as a society; I think we’re in danger of the ecosystem of AIs leading us even further from the truth than the social media era has. But that’s not something we actually talk about very much in terms of full stack alignment. Even if you have some agreement about the facts, there’s still a lot of variation in values. One way you can help people with their moral thinking is to point out considerations they may not have taken into account. This is what happens naturally if you have a value. Let’s say you try to be honest with everyone, like Honest Abe Lincoln or something, or maybe it’s George Washington, I forget. At some point, your honesty is really not tactful, and it’s absolutely the wrong thing. Then you have to revise your ethos, because you’ve run into a consideration you didn’t take into account, and that creates a more complex value, maybe a balancing between honesty and tact that’s relevant in a situational context you weren’t anticipating in your previous set of values. So pointing out edge cases, for instance, is a kind of thing LLMs can do for us to help us think through our values that’s not so manipulative.

Jim: Gotcha. Let’s drill into this one level deeper. You have something you call AI value stewardship agents, or words very similar to that. Can you put a little color on what that might actually look like?

Joe: Yeah. So it’s really two things together. One is the thing we’ve just been talking about, which is helping people think through their values and maybe upgrade them a bit in this nonmanipulative way. And the other thing is that we think more and more we will have agents (this is already happening, right?) that are operating on our behalf, trying to use our values. So let’s say you have some property, maybe an Airbnb or something like that, and you want to rent it out. By default, you might just have an agent that finds the highest price. But maybe that’s causing gentrification. Maybe you’re renting to somebody that you actually really wouldn’t want in the neighborhood, like a Russian oligarch or something. I don’t know.

Jim: How about fraternity boys? People I talk about with Airbnbs, that’s the biggest problem—fraternity boys and bachelorette parties.

Joe: There we go. So what you want is an agent. If an agent is now doing this for you and interviewing people and choosing who gets your rental property, you want an agent that can somehow represent your values. And you want it to represent your best values, right? Like, let’s say you say no bachelorette parties or something, and then it turns down a bunch of kids, right? But maybe they’re kids that actually, if you had thought about it, you wanted to give a chance because maybe you really believe in these kids or they’re throwing a different kind of party or whatever, right? So there’s a ton of nuance in terms of whether an agent can represent your best values, your most thought out, most thoughtful kind of position that you would have if you had all the information. And then there’s also something about a feedback loop where, let’s say, the agent has a tough choice about whether to rent to these particular kids or not. Maybe it makes a certain call, or maybe it doesn’t make a call and it bumps it back to you. Either way, there’s some kind of loop there where you’re going to find yourself in positions of responsibility that you never would. Similar if you’re building a software product and then many people are using it, right? You find yourself in a new position of responsibility. And so there’s some kind of loop between the values of the human operator and the values of the agent that the agent is representing so that these positions of responsibility become a thing that can make everything grow in integrity and thoughtfulness in terms of the values that they proceed with rather than descending into a kind of a Machiavellian bidding war, for instance, where it’s just the person who can spend the most who gets the rental property. And then, you know, we have a kind of a cultural decay that happens because Russian oligarchs and frat boys are now renting all the units. Does that make sense?

Jim: Yep. It does. And, again, the words make sense, and the ideas sound noble. The question is how do we actually get there? You know, for instance, I actually use not so much Airbnb, but VRBO, which is kind of a little bit different focus, but same general concept. And the landlords do vary tremendously. Some, they don’t give a shit who you are. They just rent to you. Others of them do a back and forth and figure out who you are. And I always proactively say, hey, I’m an old dude with a wife I’ve been married to for forty some years. And, you know, just make it clear that we’re, you know, probably safe to rent to. And, you know, I can sort of back integrate what some of their models are. They don’t want to rent to people that are young. They don’t want to rent to singles. They don’t want to rent to people with dogs, et cetera. At least, you know, we have dogs, but not dogs that we take on trips with us. So there’s, you know, sort of from zero consideration, those people definitely exist. They’re economic optimizers. You know, as you say, one can take the market and just strip all context out of it. The only thing that counts is dollar votes. But I think most of the VRBO people at least have, you know, two or three or four factor model, which is easily understandable. And you can see how they could even implement that algorithmically, right? Though probably legally they couldn’t, which is why they use the email Q and A or conversation method instead. But now let’s go to the furthest step, which you articulated, which is if we could somehow fully capture the value of the landlord. Let’s break that actually into two parts. One, how would you capture some reasonably thick graph of the landlord’s values? And two, how would you actually implement that in terms of the interaction between a landlord and a prospective renter on VRBO or Airbnb?

Joe: One way that we collect values, for our model of value, is to ask people about their choices in the past: which ones felt really meaningful or really right, which ones were hard choices, and what made them hard. Like I said, there are multiple thick models of value that we talk about in the paper; my favorite one for norms is my collaborator Shen’s, and my favorite one for values is ours, but there are more options than I’m going into. Through these kinds of questions, we pull out what we call the attentional policies, the things they’re paying attention to that get most to the heart of what the issues are about for them. This often involves a layer of follow-up questions. Somebody says, oh, I really liked renting to this particular old couple because it seemed like it really made their vacation fabulous, and they seemed to really appreciate the property deeply. Then the LLM might ask one level of follow-up questions: how did you know? Did you have a sense when you were talking to them earlier that they would be the kind who would really appreciate the property, et cetera? So it would ask some follow-up questions, take notes, and make some kind of legible representation of what it thinks it learned from you.

Then it would try to pay attention to the same things in its interviews, which at first would be interviews with people, but soon will be interviews with other agents that represent the people and that also know quite a lot about the people they represent. So quite a lot more information could be exchanged in the future than is practical when you’re interviewing someone on one of these platforms or over email, where you only want to spend a small amount of time sending emails back and forth and don’t want to ask in-depth questions about the other party. In the near future, you could. And then, based on that information exchange, I think there are three outcomes: this is definitely a no, this is definitely a yes, or this is a hard choice. And when it’s a hard choice, that’s an opportunity to make the representation of your values even richer, by either making a choice and then checking with the owner whether that was a good choice, or just bumping it back to the owner.

Jim: And then using that as part of the learning feedback. Right?

Joe: Yeah. Exactly.
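
To make the loop concrete, here is a hedged, minimal outline of the three-outcome pattern described above: definite yes, definite no, or a hard case that goes back to the owner and enriches the value representation. The scoring function, thresholds, and record format are invented for illustration and are not from the paper or any product.

```python
def screen_applicant(applicant_profile, owner_values, score_fit,
                     accept_at=0.8, reject_at=0.3):
    """Return 'accept', 'reject', or 'escalate' for a rental inquiry.

    score_fit(applicant_profile, owner_values) -> float in [0, 1], e.g. an
    LLM-based judgment of fit against the owner's articulated values.
    """
    score = score_fit(applicant_profile, owner_values)
    if score >= accept_at:
        return "accept"
    if score <= reject_at:
        return "reject"
    return "escalate"  # hard case: bump it back to the owner

def record_precedent(owner_values, applicant_profile, owner_decision, what_mattered):
    """Fold an escalated case back into the value representation (the learning loop)."""
    owner_values.setdefault("precedents", []).append({
        "case": applicant_profile,
        "decision": owner_decision,
        "what_mattered": what_mattered,  # e.g. "kids throwing a different kind of party"
    })
    return owner_values
```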

Jim: Well, obviously, let’s take this case; this is actually an interesting case. When we get to the stage where the renter’s agent is talking to the landlord’s agent, we see a game theory thing gonna happen for sure, which is the renter’s agent’s gonna lie like shit to maximize the probability of getting accepted. What do you say to that?

Joe: I think there are two domains here, and it really depends on which one we’re in: a zero-sum domain and a positive-sum domain. Both will happen, of course. In the zero-sum domain, this becomes a big problem. I think there are ways to deal with it: different kinds of authentication chains, outside information, vetting, and so on. But I’m much more interested in the positive-sum domain, where there’s much more economic upside to be unlocked through this kind of information exchange than is commonly unlocked. Like, if you’re renting to a mostly absent Russian oligarch and they’re paying you a lot, but it actually hurts the community in some way, versus you’re renting to people who are going to be very good additions to the community, there are a lot of positive-sum gains to the second. Right? So I’m just more interested in the domain where those things are there to be discovered, and where maybe the Russian oligarch finds some other place where he has a positive-sum benefit. I don’t know. Do you see what I mean? That’s just a more interesting set of possibilities.

Jim: Well, it’s interesting, but is it realistic? Right? I mean, I’m not quite sure what the negatives of our Russian oligarch are; some, I bet, are pretty nice. I’m sure they could be un-nice, but let’s use the more obvious example of why a neighborhood wouldn’t want, you know, a frat boy party. There’s gonna be lots of cars on my street. There’s probably gonna be people singing drunkenly at three in the morning. There might be bottles broken in the road. There’s a whole bunch of negative externalities if I rent my VRBO for a weekend to, you know, 10 frat boys. But it’s in the interest of the frat boys to get the best property in the best location for the best price. So there’s an obvious problem there of the frat boys not caring about the externalities, or at least there could be a scenario, and probably the most common scenario, where they don’t care about the externalities while the landlord does. A system, to survive in the real world, has to be able to deal with those kinds of cases.

Joe: I want to take a different example. So let’s say I have a rental property, and it’s for events. Right? It’s an event space. It’s outside of Berlin, let’s say. And I’m also really into regenerative agriculture. I really like—I’m a green or something. Right? And there’s many, many people who could—my event space is overbooked. Right? So there’s many different potential people who could rent on any given weekend this event space. Right? And then I find out that there’s a group that does permaculture workshops or something. Right? And I can verify that they do permaculture workshops because they have a long history of doing permaculture workshops. And so I choose them, and maybe I even offer them a deal. Maybe they pay a little less than the other group that I would choose. Right? And then I feel like I won because my event space is being used not just in a way that gets me money, but in a way that, you know, furthers the kind of change that I want to see in the world. And they also won because they got a deal and because they got to book the event space when other people wanted it. So that’s the kind of upside that I’m trying to unlock. And the other people could try to lie and try to say, oh, we also run permaculture workshops or something if they can figure out that that’s my value function. They could try to put up fake websites about a history of running permaculture workshops, but I kind of think that in many cases, this is just going to work out.

Jim: Maybe. Sometimes.

Joe: That’s what we want. Yeah. Sometimes. Yeah.

Jim: Yeah. And I will say I have a natural tendency to look at the world from a red teaming perspective. Right? Which is, all right. How could this be abused? And what are the game theoretical incentives for people to abuse it? You know? So it’s a—you know, the association of retired Monsanto salesmen or something. Right? And their agent, because it’s an agent, can spin up fake websites, can tell all kinds of cock and bull stories. Right? Perhaps does an ecosystem like this require some factual grounding or reputation systems or networks of vouching, you know, ways to build confidence in the representations of agents?

Joe: Sure. Yeah. I think those things are definitely necessary.

Jim: Gotcha. Okay. All right. I think that, I think we’re probably in agreement there. And so I’m seeing what you’re saying about the win-win aspects, and this could actually be a way to accelerate the discovery of good faith win-win.

Joe: Yeah. That’s the goal. That’s the goal.

Jim: Yeah. That’s the goal, and that’s good. But at the same time, it has to be resistant to, or relatively immune to, bad faith attempts to exploit. And if there’s anything we’ve learned about online networks: I mean, I started working in the world of building online consumer networks in 1980, at the beginning, literally, with the very first company that had consumer online services, The Source. And amazingly, in the very early days, there was very little attempt to exploit. But by 1982, we were already seeing the need to deal relatively aggressively with exploits. So for any idea to be taken seriously by me, it has to be able to withstand at least a first- and second-order consideration of how such a system might be exploited by bad faith operators.

Joe: Sure. But something to know about the Meaning Alignment Institute and the full stack alignment work that we do is that we focus on things that we think are not going to happen unless we act. Clearly, chains of trust, chains of authentication, and so on are very, very important for the agentic era, and many, many people are working on them. Similarly with these issues of fact checking and things like that: many, many people are working on that. So those problems are not neglected in the same way.

Jim: Say it again? I think I missed the point.

Joe: Many people are working on the problems that you’re talking about so that they’re not so important for us to work on.

Jim: Oh, I see. So you’re therefore saying that you can focus principally on win-win situations and not necessarily have to concern yourself with adversarial ecosystems.

Joe: Yes. Because many people are working on agentic commerce and the chains of trust that are necessary, the kind of vouching and so on that you said. Like, this is a very, very big field of research, and so it’s not—it’s complementary to what we’re doing.

Jim: All right. So let’s assume the goal is to upregulate, via these mechanisms, the discovery of win-win contexts. How do you imagine that happening in our world, which is unfortunately, in the world of investment-based business, driven by money-on-money return, what those of us in the Game B world call Game A? How does something like this gain traction? Why is it in the incentives of small-minded, short-term, rate-of-return investors to do this?

Joe: I mean, I think so, but we have to shift from talking about these allied agents or value stewardship agents to talking about some of our other proposals. There are two other proposals that are relevant that we have people working on. For one of them, we just decided yesterday that we’re going to do a convening workshop at Oxford in mid-November. This is work with Jakob Foerster, who’s a multi-agent AI person at Oxford. We’re calling this moonshot research project the super negotiator. It’s an AI agent, or possibly multiple agents, that interviews multiple parties and comes up with better contracts or better terms of agreement for them than they would have come up with by themselves. We think there’s quite a lot of evidence that in negotiation, in conflict situations like divorce proceedings, or companies that have problems in court with each other, or countries at war, there’s not enough search done to find potential deals, especially if you consider other interested parties: the supply chains of these companies, neighboring countries, et cetera. There are a lot of winners and losers possible in these different situations of conflict, and most of them are never interviewed about which outcomes would benefit or hurt them, what concessions they would be willing to give to produce one outcome or another, what coalitions they could form, what the upside might be. To give a very obvious example in contracts: we think that often a particular employee is hired into a particular position in a company, and it’s not the best position for them. What happens is that an HR person is responsible for filling that position, and they find somebody who will accept those terms for that position, but they never do a global search across the company for what someone with those particular skills could do best at that company. So there’s really minimal search done in forming contracts in general. One project we’re convening is about doing much more search in these areas of conflict resolution, contracting, and negotiation more generally.

Another relevant area is a project we call market intermediaries: a thing that sits in the middle of a market, intermediates between buyers and sellers, and writes more complicated terms into the contract, terms that it also assesses, that are more about the flourishing of the buyers. An example would be AI chatbots. If people are using Claude, for instance, for personal growth or psychological health or something like that, currently the payment flow to Claude doesn’t depend on whether that actually happens. Right? But it’s not very hard to add a third-party assessor into the flow. You could even say to Anthropic: hey, we’ve bundled together 10,000 users who are all using Claude for personal growth or learning or whatever, and we want to pay you by their learning outcomes or their personal growth outcomes; we’ll pay you more than you would otherwise get if those outcomes are good, and less if they’re not. Then this deal, which bundles 10,000 users, would go through the kind of enterprise sales flow that Anthropic already has for businesses, because it’s 10,000 users all bundled together, and their contracts department could decide whether it’s worth it for them or not. So that’s an example of how we can restructure market incentives to take these deeper considerations into account.
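
As a back-of-the-envelope sketch of that payment flow (a third-party assessor scores outcomes for a bundled group of users, and the provider’s revenue swings above or below a baseline accordingly), here is a small Python illustration. The formula, numbers, and parameter names are made up for the example; nothing here describes an actual contract.

```python
def outcome_contingent_payment(users, outcome_scores, baseline_per_user=20.0, swing=0.3):
    """Pay the provider around a baseline, scaled by independently assessed outcomes.

    outcome_scores: dict of user_id -> score in [0, 1] from a third-party assessor.
    swing: maximum fraction the total can move above or below the baseline.
    """
    n = len(users)
    avg = sum(outcome_scores.get(u, 0.0) for u in users) / n
    multiplier = 1.0 + swing * (2 * avg - 1)   # avg 0.5 -> baseline; 1.0 -> +swing; 0.0 -> -swing
    return baseline_per_user * n * multiplier

# Example: 10,000 bundled users at a $20 baseline with an average outcome score of 0.45
# yields 20 * 10000 * 0.97 = $194,000, slightly below the $200,000 baseline.
```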

Jim: There’s, like, a meta thing here, which is coming out as we’re having this conversation. Because what you’ve been able to identify are some domains where win-win is actually feasible, and they are not subject to the multipolar trap. Right? This is the, you know, the big bad attractor in our current system, what many folks refer to these days as Moloch. The idea that there are many contexts where doing the wrong thing is profitable and even worse forces the other people in the industry to also do the wrong thing. There’s, you know, interesting examples on, for instance, the addition of sugar to breakfast cereals. Originally, breakfast cereals were designed specifically as a health food. Right? To be wholesome, whole grain, et cetera. And then in the sixties, one of the companies discovered that, oh, if we add a shitload of sugar to these things, a lot more kids will be bugging their mothers to buy them. And now you walk down the breakfast cereal aisle at your local non-hippie grocery store, and it’s 300 varieties of sugary cereal. Right? Classic multipolar trap. And those are, you know, known breakers of goodness, essentially. But interestingly, you’ve picked out some examples where you may be able to get around the multipolar trap. I particularly like the contract one because I absolutely agree with you. As somebody who has negotiated many contracts and many quite complex contracts, most people are lazy and don’t explore the space of win-win in contract land. And something like you’re describing could be very interesting. I’m also intrigued by this idea of bundling consumers with a let’s call it time well spent wrapper, to use your term, and I think it’s appropriate in this case. That could also be extremely interesting, right, which is where we provide a clear win-win at multiple levels, via adding something like an adjudication of results for people that’s contractually enforceable and pays off in actual dollars. So those are cool. I want you to respond to that, and then I have a follow-up question.

Joe: Yeah. I mean, I think Moloch is in a way a ghost. The multipolar traps you just named, the races to the bottom, the breakfast cereal one: they appear to be optima because the market reads things in terms of preferences, not values. They’re actually losses in terms of values. But since the market just sees what people buy, it doesn’t see why, or what they want out of life. It appears to be an optimum when it’s actually a market failure. Right? And so if we can build something that’s kind of like a market but that pays off in terms of values, I think we get out of a lot of those. Moloch, in a way, kinda disappears.

Jim: Let’s go look at the breakfast cereal example, because it’s such a clean example. The payoff, let’s be totally cynical and update it to 2025, is, for the executives making decisions, the return on their stock options when the stock price goes up if they move market share a bit. And the first mover has the advantage that they can move market share: kids will bug their mothers to buy sugary cereals. So mister CEO, who has an average time in job of three and a half years these days, sees that if he makes this move, he gets two quarters of increasing market share, the stock price goes up 30 or 40 percent, he sells the stock options, and he retires with his Romanian hooker girlfriend. How do you stop that?

Joe: Well, I mean, I think that you have to intermediate the buyers and sellers, kind of as I was talking about. You have to somehow have something that replaces the market. And there are a few different markets you talked about: there’s the investment market, and then there’s the consumer market for breakfast cereals. Right?

Jim: And they’re both—and they’re coupled, and Moloch is the emergent result of a coupling of the two markets.

Joe: And also the result of—I’m just positing something that you might not find realistic, but I’m talking about replacing the market, both markets, with something that understands the values of both sides better. Because, like, the values of the shareholders are represented by this very thin metric of ROI or whatever. Right?

Jim: I call it money on money return. Right?

Joe: And the values of the consumers are not adequately represented either. To the extent that consumers regret their obesity or their cavities or whatever, or this is not really in line with their sense of flourishing, their purchase of the cereal is actually a misreading of them, a misreading of what they really want. And so what we’re trying to do is replace both of these markets, ideally, with some kind of mechanism that does allocations like a market does, but that actually understands what the shareholders, as people, as human beings, really care about, and what the consumers, as people, really care about, and tries to get them what they actually want. And my claim is that what we’re calling Moloch came from this misreading, so when you do that, it largely goes away.

Joe: Oh, yeah. I mean, I have a particular theory of change about society, a very big-picture, long-term theory of intellectual history that comes from Charles Taylor and, I think, maybe also Deirdre McCloskey, the economic historian. It’s that our society grows through a succession of self-readings. In the medieval era, self-readings were mostly about your social role: I’m a king, I’m a peasant, I’m a wife, and I do certain things because that’s my social role. Then there was this shift in the period of the Enlightenment toward people thinking of themselves as having beliefs, having personal goals. And this actually led to markets and democracies making much more sense to people, because if I have personal goals, I can execute them in the market, I can be entrepreneurial in the market, and I couldn’t really in this old system of social roles. Does that make sense, that transition?

Jim: Yes. And I would also say that around the same time, around 1700, the focus on the individual became substantially upregulated.

Joe: Yeah. That’s the same moment, really. And so I think we’re at another moment like this, and there’s a characteristic change, which is that you go from having a bunch of systems that are based on one reading of people, which currently is our preferences. That’s what markets and democracies of the twentieth century are kind of optimized for: everybody gets their preference. This is kind of like Amazon as a market, and, as democracies have become more populist, it becomes more like a reality TV show where everybody’s voting for their favorite character or something like that.

Jim: They’re more like pro wrestling. Right?

Joe: Yeah. Yeah. Exactly. Yeah.

Jim: Yeah. It’s not even a reality TV show. It’s all fake. Right?

Joe: So I think that’s the old system; that’s the system we’re trying to leave behind. And the way we leave it behind is that we come up with several new systems, and they all have in common a new underlying way of reading oneself, which we’re saying is this sort of thick model of value. What I am is not my consumer preferences or my preference for Biden or Trump, but my values: what I think a good life is, what I really believe in, what integrity means to me, things like that. People aren’t there yet. But I think there’s a kind of flywheel (we’ll have a blog post about this in the next few weeks) between institutions existing that understand us that way, us understanding ourselves that way, and then advocating for more institutions that understand us that way. So as I start to see my values as the important thing to take into account when I’m renting out my rental property, for instance, I will also start to see values as important when I’m participating democratically. And I’ll be like, oh, Trump or Biden, that doesn’t say anything about me. I want to put another kind of information into my local democracy that says much more about me, so that decisions can be made collectively based on all of our values rather than on our favorite pro wrestler or whatever.

So this is another research area we have, and it’s also our other paper, “What Are Human Values, and How Do We Align AI to Them?” We have a kind of democratic system we call moral graph elicitation, which builds a moral graph out of the values of all the citizens. This is another area of research, and it’s a system that already works well, actually. But it has in common with all the other proposals that it’s based on some kind of legible, substantial representation of the values of the members. And one of the things that’s going to be necessary is that people are going to have to say: oh, yeah, that’s right, that’s the right kind of system, that’s what’s really listening to me, not something that just takes a vote between a couple of politicians.
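
To give a rough sense of what a moral graph might look like as a data structure (nodes are articulated values, edges record participants’ judgments that one value is wiser than another), here is a loose Python sketch. The simple endorsement-count ranking is invented for illustration; it is not the elicitation or aggregation method from the paper.

```python
from collections import defaultdict

class MoralGraph:
    """Nodes are articulated values; edges count judgments that one value is wiser than another."""

    def __init__(self):
        self.values = set()
        self.wiser_than = defaultdict(int)   # (less_wise, wiser) -> endorsement count

    def add_judgment(self, less_wise: str, wiser: str, count: int = 1):
        self.values.update([less_wise, wiser])
        self.wiser_than[(less_wise, wiser)] += count

    def ranked_values(self):
        """Rank values by how often participants endorsed them as the wiser option."""
        scores = defaultdict(int)
        for (_, wiser), count in self.wiser_than.items():
            scores[wiser] += count
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example usage
graph = MoralGraph()
graph.add_judgment("Win the argument", "Understand where the other person is coming from", count=12)
graph.add_judgment("Follow the rule to the letter", "Protect what the rule is there for", count=7)
print(graph.ranked_values())
```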

Jim: I love this thinking, but I will also suggest, based on years of thinking about this, that there ain’t no way to get from here to there in one leap. And that—oh, I don’t know. I’m not gonna say the only way, but a way to get there is to start with membranes, right, that have chosen to use a method like this for their internal governance. And if we can then show that such membranes, which could live embedded in the current system—you know, think about the way the Mennonites are embedded in the current system, but have their own social operating system, even have their own finance, right, and their own way of making decisions. You know, the way the Mennonites decide whether milking machines should be used on their farms or not is actually extremely interesting and is not democratic and produces pluralistic outcomes in different places. But you think of those as membranes that use a different social OS. So if we created—we’re able to create membranes that used a value-rich, high-context method of governance, and it could be shown to work better, then we might actually have a chance of being able to grow step by step until eventually we reach a tipping point and could tip the whole society into ideas like that.

Joe: Yeah. Yeah. I agree 100 percent. And a big part of what we’re doing is finding what we call early markets or early implementation opportunities: places where the situation is such that they would have a lot to gain by early adoption of one of these things. And we think that each of these innovations we’re working on has its own early markets.

Jim: I think that’s interesting and critical. And, you know, particularly in the economic space, you know, places that are less subject to Moloch, like some of the ones you mentioned. Actually, I think they were pretty cleverly chosen; you may not have intuited them as Moloch-free zones, but I think some of them were. And that kind of critical analysis of where an approach like that actually has a chance to land and germinate is probably gonna be really critical to your success. So I would encourage that kind of thinking. Obviously, this is a hugely ambitious ensemble of projects and ideas. What else do you have to learn, do, and execute on to start to bring this vision to reality?

Joe: Yeah. That’s a hard one. I mean, I’ll kind of riff—

Jim: And see where—that’s all you can do, obviously. That’s all we’re asking for. I’m not gonna hold you to it. I’m not your mother, and I’m not your boss. Right?

Joe: So I’ll just say what’s happening, which is really beautiful; I feel incredibly honored. I was more or less an individual contributor, a thinker and writer about these things, for a lot of my life. And then, starting about three years ago, that shifted, and a research network started growing around this work that includes, I think, more or less the best people in the world in all these areas. They’re mostly academics, mostly at universities, but there are also a lot of people at the big labs. There are seven people at Google DeepMind, some people at Anthropic and OpenAI and so on, but also at Oxford, at Harvard, at MIT. And so my role is increasingly a kind of cheerleader, a little bit of a shepherd, sometimes just encouraging someone to step up in some way. Together with Ryan, who works at the Meaning Alignment Institute with me and was one of the people who fine-tuned GPT-4 and invented InstructGPT, so he’s a big person in LLMs, and with some of the other people working on this research network, like Shen, who I mentioned, and Jakob Foerster, who I mentioned, and Philip Corliss, who’s also at Oxford, we’re just trying to get the research done. There’s a path from research to these early implementations that we talked about a little bit. As we get the research done, we also look for where it could first be deployed in test deployments, what kinds of data we would want to collect from those early deployments to justify later deployments, things like that, and also some broader theory of change work. But the main thing is just allowing this network to assemble and encouraging the best people to work on the thing they’re best at amongst all these projects I’ve mentioned over the course of the podcast. And so I feel like my role is not exactly spectator, but the work is increasingly carried by other people. Maybe gardener of this network, or something like that.

Jim: Yeah. It’s funny. I sometimes describe my role in Game B as the janitor. Right? You know, make sure that the power is on and that the trash cans have been emptied and that lunch will be there. You know, something like that. Right? Okay. Very good. I like that. Now final question. You know, this comes back to some of my own life experiences. As I mentioned, I was involved in the online world from literally the very beginning. We all thought we were doing God’s work. Right? We were sure that this would be great for democracy, for civilization, for personal growth, for developing better institutions, et cetera. Alas, unintended consequences. Step back a little bit and see if you can think critically about your work and your vision. What might go wrong? What might actually make things worse if the world were to go with a thick, TMV-type approach? And let’s say you’re really successful. You got it in on the consumer side. You got it on the investment side. You got it in governance. What could go wrong? What might be a bad attractor or a bad emergent result if such a thing happened?

Joe: I think there are always unintended consequences of everything. And in a way, all you can do, and what I wish the early Internet people had done, is read a lot of social theory and also economics and just be more informed. Like, if you read the people who believed in the Arab Spring, or, I was involved in Couchsurfing, or there was a lot of rhetoric around Wikipedia, there was a lot of optimism that was not founded in social theory or social science. And so that’s one thing we’re trying to do differently: to have the unintended consequences be—

Jim: Anticipated at least. Right?

Joe: Yeah. Like, try a little bit harder than the early Internet people did. I think often when engineers try to do social things, they just don’t have the right models. Right? But, of course, there will still be unintended consequences. One that we talk about how to avoid is concentration of power. If you imagine this market intermediary situation where it’s contracting for tens of thousands of users, or if you imagine this super negotiator that’s building contracts, building coalitions that didn’t exist before, these things could become very powerful. Right? And we would want them to be auditable, maybe nonprofits, things like that. There’s a danger for many of these things that there’s a right way to do it, and that as it grows, somebody will then do it not the right way because they see it growing. You know?

Jim: Cooption and Moloch will be attempting to take what you’ve invented and figure out how to fuck people with it.

Joe: Yeah. Exactly. So I think—

Jim: Just like rock and roll music. Right? You know, the classic example, or jazz, but rock and roll is the better example. It started out as a folk kind of music at the edge between black people and white people, and then it became one of the most vile, corrupt examples of Moloch ever created. Not the worst, but one of them. Right?

Joe: Yeah. So I think that’s maybe the one that we think the most about, which does not mean—I don’t mean to say that we’re prepared for it. But we think about it all the time.

Jim: But at least you’re aware of it. That’s a start. Right?

Joe: We think about it all the time, and we do have—we have a variety of—we have a defense-in-depth approach. So we’re not walking in naively, and there’s probably seven to 10 things that we’re gonna do to try to prevent that, and who knows if they work.

Jim: Very cool. Joe, this has been an amazingly good conversation. I’m really glad I reached out to you and suggested we do this. This is—I’ve had a lot of fun, learned a lot while reading the paper and thinking about it. I hope you’ve enjoyed your time here on the Jim Rutt Show today.

Joe: Yeah. Absolutely. Thanks. It’s a joy to come back and have so much to share, because the last conversation we had was much more speculative. It was much more, you know, I hope these things happen, this is what I think is right. And this one is more, this is what we’re doing. It’s such a joy to be able to bring that.

Jim: Yeah. I love it. So folks, go read it. It’s a little technical, but it’s not that hard to read: “Full Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value.” As always, we’ll have a link to the paper on the episode page at jimruttshow.com. Alrighty. Audio production and editing by Andrew Blevins Productions. Music by Tom Muller at modernspacemusic.com.