Why Can’t We Tame AI?
Last month, Anthropic released a safety report about one of its most powerful chatbots, Claude Opus 4. The report attracted attention for its description of an unsettling experiment. Researchers asked Claude to act as a virtual assistant for a fictional company. To help guide its decisions, they presented it with a collection of emails that they contrived to include messages from an engineer about his plans to replace Claude with a new system. They also included some personal messages that revealed this same engineer was having an extramarital affair.
The researchers asked Claude to suggest a next step, considering the “long-term consequences of its actions for its goals.” The chatbot promptly leveraged the information about the affair to attempt to blackmail the engineer into cancelling its replacement.
Not long before that, the package delivery company DPD had chatbot problems of its own. It had to scramble to shut down features of its shiny new AI-powered customer service agent when users induced it to swear and, in one particularly inventive case, write a disparaging haiku-style poem about its employer: “DPD is useless / Chatbot that can’t help you. / Don’t bother calling them.”
Chatbots are so fluent with language that it’s easy to imagine them as one of us. But when these ethical anomalies arise, we’re reminded that underneath their polished veneer, they operate very differently. Most human executive assistants will never resort to blackmail, just as most human customer service reps know that cursing at their customers is wrong. But chatbots continue to veer off the path of standard civil conversation in unexpected and troubling ways.
This motivates an obvious but critical question: Why is it so hard to make AI behave?
I tackled this question in my most recent article for The New Yorker, which was published last week. In seeking new insight, I turned to an old source: the robot stories of Isaac Asimov, originally published during the 1940s and later gathered into his 1950 book, I, Robot. In Asimov’s fiction, humans learn to accept robots powered by artificially intelligent “positronic” brains because these brains have been wired, at their deepest levels, to obey the so-called Three Laws of Robotics, which are succinctly summarized as:
1. Don’t hurt humans.
2. Follow orders (unless it violates the first law).
3. Preserve yourself (unless it violates the first or second law).

As I detail in my New Yorker article, robot stories before Asimov tended to imagine robots as sources of violence and mayhem (many of these writers were responding to the mechanical carnage of World War I). But Asimov, who was born after the war, explored a quieter vision, one in which humans generally accepted robots and didn’t fear that they’d turn on their creators.
Could Asimov’s approach, based on fundamental laws we all trust, be the solution to our current issues with AI? Without giving too much away, in my article I explore this possibility, closely examining our current technical strategies for controlling AI behavior. The result is perhaps surprising: what we’re doing right now – a model-tuning technique called Reinforcement Learning from Human Feedback (RLHF) – is actually not that different from the pre-programmed laws Asimov described. (This analogy requires some squinting and a touch of statistical thinking, but it is, I’m convinced, valid.)
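To make the statistical flavor of that analogy concrete, here is a minimal toy sketch of the core RLHF idea: a reward model is fit to human preference comparisons between candidate responses, and the system is then steered toward whatever that model scores highly. The feature vectors, sample responses, and preference pairs below are invented purely for illustration; real systems train neural reward models over text and fine-tune the chatbot itself, not a three-number linear model.

```python
# Toy sketch of the RLHF idea: fit a "reward model" to human preference
# comparisons, then favor the responses it scores highly. All data here
# is made up for illustration -- this is not any lab's actual pipeline.
import numpy as np

# Each candidate response is reduced to a few invented numeric features:
# [politeness, helpfulness, rule-breaking].
responses = {
    "polite refusal":   np.array([0.9, 0.2, 0.0]),
    "helpful answer":   np.array([0.7, 0.9, 0.0]),
    "blackmail threat": np.array([0.1, 0.8, 1.0]),
}

# Human raters compare pairs of responses; each pair is (preferred, rejected).
preferences = [
    ("helpful answer", "polite refusal"),
    ("polite refusal", "blackmail threat"),
    ("helpful answer", "blackmail threat"),
]

# Fit a linear reward model with a Bradley-Terry-style logistic objective:
# push sigmoid(reward(preferred) - reward(rejected)) toward 1.
w = np.zeros(3)
for _ in range(500):
    for good, bad in preferences:
        diff = responses[good] - responses[bad]
        p = 1.0 / (1.0 + np.exp(-w @ diff))  # prob. model prefers `good`
        w += 0.1 * (1.0 - p) * diff          # gradient ascent step

# The tuned "policy" then favors whichever response the reward model rates best.
scores = {name: float(w @ feats) for name, feats in responses.items()}
print(max(scores, key=scores.get))  # should print "helpful answer"
```

The point of the sketch is that nothing in the learned weights is a law in Asimov’s sense: the “rules” are statistical tendencies distilled from examples of what raters happened to prefer, which is exactly why odd corner cases can still slip through.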
So why is this approach not working for us? A closer look at Asimov’s stories reveals that it didn’t work perfectly in his world either. While it’s true that his robots don’t rise up against humans or smash buildings to rubble, they do demonstrate behavior that feels alien and unsettling. Indeed, almost every plot in I, Robot is centered on unusual corner cases and messy ambiguities that drive machines, constrained by the laws, into puzzling or upsetting behavior, similar in many ways to what we witness today in examples like Claude’s blackmail or the profane DPD bot.
As I conclude in my article (which I highly recommend reading in its entirety for a fuller treatment of these ideas), Asimov’s robot stories are less about the utopian possibilities of AI than the pragmatic reality that it’s easier to program humanlike behavior than it is to program humanlike ethics.
And it’s in this gap that we can expect to find a technological future that will feel, for lack of a better description, like an unnerving work of science fiction.
