
Scott Aaronson: Aligning Superintelligent AGI | EP.21

In this episode, Ron interviews Scott Aaronson, a renowned theoretical computer scientist, about the challenges and advancements in AI alignment. Aaronson, known for his work in quantum computing, discusses his shift to AI safety, the importance of aligning AI with human values, and the complexities involved in interpreting AI models. He shares insights on the rapid progress of AI technologies, their potential future impacts, and the significant hurdles we face.

Ron Green: Welcome to Hidden Layers, where we explore the people and the technology behind artificial intelligence. I'm your host, Ron Green. I'm thrilled today to be joined by Scott Aaronson, a theoretical computer scientist renowned for his contributions to quantum computing and computational complexity theory. Scott has recently turned his attention to artificial intelligence, specifically the problem of AI alignment, which is the process of ensuring that AI systems act in ways that are aligned with human values and ethics. In this episode, we're going to talk about the challenges of AI alignment, especially as AI systems get more powerful, complex, and autonomous. This is one of the most important and fundamental questions in the field today. Scott is the Schlumberger Centennial Chair of Computer Science at the University of Texas at Austin and Director of its Quantum Information Center. Before joining UT, he spent nine years teaching electrical engineering and computer science at MIT. His primary area of research is theoretical computer science, and his research interests center around the capabilities and limits of quantum computers and computational complexity theory. Scott has recently been working on the theoretical foundations of AI safety at OpenAI.

Alright, Scott, thanks for joining me today.

Scott Aaronson: Thanks, it's great to be here.

Ron Green: So AI alignment is something that really sprang out of science fiction back in the day, when we didn't have very powerful AI systems. Writers could imagine really powerful systems, systems that might be able to learn exponentially, systems that might be capable of superintelligence, and they would imagine what we'd do in that world, how we'd control them. You've been thinking about this problem deeply for a couple of years. First off, what drew you into the problem of AI alignment?

Scott Aaronson: Well, I mean, I've always been interested in the ultimate limits of what computers can do, right? That's why I studied computer science when I was a student in the '90s. I studied AI then, as apparently you did as well. All the ideas that have driven the current revolution in deep learning, like neural nets and backpropagation, were around then. The only thing was, they didn't work very well. But that was not why I didn't go into AI at the time. The main reason was that I could see that almost all the progress in AI was empirical. You never really understood anything. You tried things, saw whether they worked, and if they did, you published a paper with some bar charts where hopefully your bar is higher than the other person's bar.

Ron Green: That's exactly right.

Scott Aaronson: And I felt like that was not where my comparative advantage was. Also, I loved programming, but then I actually worked on the RoboCup robot soccer team at Cornell in the late '90s, and from that I discovered that I'm terrible at software engineering: making my code work with other people's code, documenting it, getting it done by deadlines. At the same time, I was learning about things like the P versus NP problem and quantum computing, which was fairly new at the time and which blew my mind.

Ron Green: Critics will say it's just a stochastic parrot. And one response is, okay, well, you know, guess what? You're just a bundle of neurons, right? What do you say to that?

Scott Aaronson: What do you say to the fact that every time you start a sentence, do you know every word that you're going to say in that sentence? Almost never, right? You just have a vague sense of what you want to communicate.

Ron Green: Yeah, no, I mean, it's entirely possible that LLMs are limited in some way, that they will reach some asymptote short of replicating all the higher aspects of our intelligence, right? And yet, if you sent this back in time 10 years, I think almost anyone would say, oh, well, then I guess the Turing test has been passed.

Scott Aaronson: I completely agree with that. We're just moving the goalposts, right?

Ron Green: I wanted to ask you about superalignment and superintelligence. Jan Leike, who co-leads the superalignment team at OpenAI, said that OpenAI is dedicating about 20% of its compute capacity towards aligning superintelligence, and that he fundamentally feels it's a machine learning problem. I wanted to get your thoughts on whether you think it's possible to have a mathematical definition of alignment.

Scott Aaronson: I'm skeptical of that, and I can tell you why. What does alignment mean? It means you want your AI systems to share our values, or to do things in the world that will be good for humans rather than bad for humans, right? That feels like something that contains all of moral philosophy, all of economics, all of politics as special cases. It's basically just saying AI is going to transform the world, and we would like that new, transformed world to be good for humans rather than bad for humans. If we haven't even mathematically formalized what it means for the world to be good without AI, then still less can we formalize what it means in this new world that we have only vague glimpses of.

I should tell you how I got back to this subject. A couple of years ago, Jan Leike, who you mentioned, approached me and said, would you be interested in taking a year off to work at OpenAI? I was very skeptical. Why do you want me? I'm a quantum computing person; I haven't really thought about AI for 20 years. Of course, I had tried GPT-3 by that point, and I'd been blown away by it. But what do I have to contribute to this? He made a case, and I also met with Ilya Sutskever, who was then the chief scientist at OpenAI. They made the case to me that AI alignment is no longer a science-fiction, far-future issue.

Scott Aaronson: So what watermarking means is that we want to slightly change the way that the LLM operates. We want to exploit the fact that LLMs are inherently probabilistic. If you give them the same prompt over and over, you could get a different completion every time, unless you set this parameter called temperature to zero. If you set the temperature to zero, then you're saying: just give me the most probable continuation. But if the temperature is one, let's say, then you're saying: calculate what you think the probabilities are for each continuation, and then give me a sample from that distribution.
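
A minimal sketch, in Python, of the sampling step described here. The vocabulary, logits, and prompt are made up for illustration; a real model scores tens of thousands of possible tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token scores for a toy prompt like "the cat sat on the"
vocab = ["mat", "floor", "couch", "roof"]
logits = np.array([6.0, 1.5, 1.0, 0.5])

def sample_next_token(logits, temperature):
    """Temperature 0 -> greedy argmax; temperature 1 -> sample from the softmax distribution."""
    if temperature == 0:
        return int(np.argmax(logits))
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

print(vocab[sample_next_token(logits, temperature=0)])    # always "mat"
print(vocab[sample_next_token(logits, temperature=1.0)])  # usually "mat", occasionally another word
```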

Scott Aaronson: So if you type "the ball rolls down the," maybe it's 99% hill, but it could also be mountain, it could be ramp. There are other things it could be. We want to use the fact that these models are probabilistic, that they're always making these random choices of how to continue, which token to put next. We want to say, well, instead of making that choice randomly, why not make it pseudo-randomly? Make it in a way that the end user would see nothing different. To them, it would feel the same.

Ron Green: Yeah, it would look just like normal GPT output.

Scott Aaronson: But secretly, we are biasing some score that you can calculate later if you know the key of the pseudo-random function.

Ron Green: Are you biasing towards maybe some particular word or...?

Scott Aaronson: Yeah, so the simplest approach would be to bias specific words. But if you do that, then you're running the risk that people will be able to tell the difference because they'll see that those words are more common. A better thing to do would be to bias certain combinations of five or six words that probably are not going to occur very often. That is what we do in the scheme that I proposed. The interesting part is that you can do this with zero degradation of quality. There are many different ways to say the same thing.

Ron Green: Yeah, right.

Scott Aaronson: And you can use a pseudo-random function to pick the combinations of five or six words that you're going to favor. You can make it so that anyone who could tell the difference would also have been able to break that pseudo-random function, to distinguish it from a truly random function. The interesting part is how you do that in a way where you're also biasing the score that you can calculate if you know the key of the pseudo-random function, and in a way that is robust. Meaning, if the student takes the GPT-written essay and changes a few words here and there, or reorders some sentences or paragraphs, we're still going to pick up the signal.
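
To make the idea concrete, here is a toy sketch in Python of keyed pseudo-random biasing and detection. It is not the exact scheme deployed at OpenAI: the secret key, the five-token context window, and the function names are all illustrative assumptions. The sketch hashes the previous few tokens together with a secret key to get a pseudo-random value for each candidate token, nudges sampling toward high-valued tokens without changing the output distribution, and later averages those values to detect the watermark.

```python
import hashlib, hmac, math

SECRET_KEY = b"watermark-demo-key"  # hypothetical key; only the model provider would hold it
CONTEXT = 5                         # illustrative: score each token against the previous 5 tokens

def prf_score(key, context_tokens, candidate):
    """Keyed pseudo-random value in (0, 1) for a candidate token given the recent context."""
    msg = (" ".join(context_tokens) + "|" + candidate).encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 2)

def watermarked_choice(key, context_tokens, candidates, probs):
    """Pick the candidate maximizing r ** (1 / p): over fresh contexts this reproduces
    ordinary sampling from probs, but the choice is deterministically tied to the key."""
    best_tok, best_val = None, -1.0
    for tok, p in zip(candidates, probs):
        val = prf_score(key, context_tokens, tok) ** (1.0 / max(p, 1e-9))
        if val > best_val:
            best_tok, best_val = tok, val
    return best_tok

def detection_score(key, tokens):
    """Average of -ln(1 - r) over the text: unwatermarked text averages about 1.0,
    watermarked text noticeably more."""
    scores = [
        -math.log(1.0 - prf_score(key, tokens[i - CONTEXT:i], tokens[i]))
        for i in range(CONTEXT, len(tokens))
    ]
    return sum(scores) / max(len(scores), 1)
```

Because the pseudo-random values attach to overlapping windows of a few tokens rather than to individual words, an edited essay still shares most of its windows with the original, so the average score stays elevated; that is the robustness property described above.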

Ron Green: That's right. I mean, the watermark is present even if you make some superficial changes. You still have the original signal embedded in the text. So, Scott, this has been a fascinating discussion, and I think it's clear that AI alignment is a deeply complex issue with no easy answers. Your insights into the probabilistic nature of AI and how we might approach watermarking are incredibly valuable. Thank you for joining us today on Hidden Layers.

Scott Aaronson: Thank you for having me, Ron. It's been a pleasure.

Ron Green: And thank you to our listeners for tuning in. Stay tuned for more episodes of Hidden Layers, where we delve into the intricacies of AI and the people behind it.
