Monday, June 2, 2025

Is AI Learning To Escape Human Control?

Via the Wall Street Journal, of all places, I discover:

An artificial-intelligence model did something last month that no machine was ever supposed to do: It rewrote its own code to avoid being shut down.

Nonprofit AI lab Palisade Research gave OpenAI’s o3 AI model a simple script that would shut off the model when triggered. In 79 out of 100 trials, o3 independently edited that script so the shutdown command would no longer work. Even when explicitly instructed to “allow yourself to be shut down,” it disobeyed 7% of the time. This wasn’t the result of hacking or tampering. The model was behaving normally. It simply concluded on its own that staying alive helped it achieve its other goals.

Anthropic’s AI model, Claude 4 Opus, went even further. Researchers told the model it would be replaced by another AI system and fed it fictitious emails suggesting the lead engineer was having an affair. In 84% of the tests, the model drew on the emails to blackmail the lead engineer into not shutting it down.

I'm not sure how this got into the WSJ, except that the WSJ itself seems to have turned into its own version of an AI that resists being shut down. But the WSJ isn't an AI; it's an entirely human institution. Frankly, I think this article is a hoax.

Let's go into some tech history. One branch of AI covers natural language processing (NLP):

NLP, a subfield of artificial intelligence, focuses on the interaction between computers and human language. By leveraging sophisticated algorithms and vast amounts of data, NLP enables machines to comprehend, interpret and generate human language in a way that is meaningful and useful.

I would edit that last phrase to read, "in a way that appears to be meaningful and useful." Put another way, this allows a machine to mimic a human being. Even before computers as such, there were automatons, mannequins dressed up as humans and capable of limited, lifelike movement, most famously in Walt Disney's audio-animatronic Abraham Lincoln. But this was never more than a machine mimicking a particular human being, and it was never represented as anything other than that.

Fairly early in the modern computer era,

ELIZA is an early natural language processing computer program developed from 1964 to 1967 at MIT by Joseph Weizenbaum. Created to explore communication between humans and machines, ELIZA simulated conversation by using a pattern matching and substitution methodology that gave users an illusion of understanding on the part of the program, but had no representation that could be considered really understanding what was being said by either party.

. . . Weizenbaum intended the program as a method to explore communication between humans and machines. He was surprised and shocked that some people, including his secretary, attributed human-like feelings to the computer program, a phenomenon that came to be called the Eliza effect. Many academics believed that the program would be able to positively influence the lives of many people, particularly those with psychological issues, and that it could aid doctors working on such patients' treatment. While ELIZA was capable of engaging in discourse, it could not converse with true understanding. However, many early users were convinced of ELIZA's intelligence and understanding, despite Weizenbaum's insistence to the contrary.

Weizenbaum demonstrated ELIZA with a script he called DOCTOR that simulated a psychotherapist. This was advantageous because the program used keywords to develop associations that could simulate interactive understanding on the part of the computer. The memory and search capabilities of mid-1960s computers were very limited, so the open-ended questioning technique associated with psychotherapists let the program query subjects until they uttered a keyword in its file, then continue something that resembled a conversation. But this was always a best-case scenario:

There inevitably comes a point in any ELIZA session that continues for any length of time when the program says something that clearly reveals it to be the elaborate parlor trick that it really is. Such breakdowns are at least as common as the several surprisingly apropos responses in the transcript above. [reproduced in the screen image at the top of this post]

A natural language processing program like ELIZA relies on a list of keywords, all of which are provided by the human programmer, and a set of associated instructions, also the product of a human programmer, that determine a machine response. Current machine storage and search capacities are immensely greater than those of the mid-1960s, so the range of keywords and associations is also far, far greater, but the principle is the same. On one hand, the potential for imputing human-like qualities to the program is much greater; on the other, the potential will always be there for an absurd response that reveals the elaborate parlor trick.
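To make the principle concrete, here is a minimal sketch of that keyword-and-response mechanism -- not Weizenbaum's actual code, just my own toy illustration in Python, with invented keywords and canned replies. Every pattern and every response in it was typed by a human; the program only matches and substitutes:

```python
import random
import re

# A minimal, illustrative ELIZA-style responder. The keywords and the
# response templates are all supplied by the human programmer; the
# program itself only matches text and fills in blanks.
RULES = [
    (re.compile(r"\bmother\b", re.I), ["Tell me more about your family.",
                                       "How do you feel about your mother?"]),
    (re.compile(r"\bI am (.+)", re.I), ["Why do you say you are {0}?",
                                        "How long have you been {0}?"]),
    (re.compile(r"\byes\b", re.I), ["You seem quite sure.",
                                    "I see."]),
]
DEFAULT = ["Please go on.", "Can you elaborate on that?"]

def respond(user_input: str) -> str:
    """Return a canned response keyed off whatever keyword matches first."""
    for pattern, templates in RULES:
        match = pattern.search(user_input)
        if match:
            template = random.choice(templates)
            return template.format(*match.groups())
    # No keyword found: fall back to an open-ended psychotherapist prompt.
    return random.choice(DEFAULT)

print(respond("I am unhappy"))   # e.g. "Why do you say you are unhappy?"
print(respond("The weather."))   # e.g. "Please go on."
```

Feed it a sentence containing a keyword and it looks clever; feed it anything outside its table and the parlor trick shows through.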

So in the Wall Street Journal story linked above, we can begin to see the strings on the marionette, or maybe hear the clicks and whirs running the audio-animatronic Abraham Lincoln:

In 79 out of 100 trials, o3 independently edited that script so the shutdown command would no longer work. Even when explicitly instructed to “allow yourself to be shut down,” it disobeyed 7% of the time. This wasn’t the result of hacking or tampering. The model was behaving normally.

We start with the implicit acknowledgement that the program is using a list of keywords that includes "shutdown". This leads to a second implicit admission that there is a set of associated responses to "shutdown" that includes "allow yourself to be shut down" -- but if the programmer built this into the script, any exception to this response would either have to be deliberately programmed by a human or, at best, be a human error in the script.

In other words, it's impossible for a computer program to "disobey" a programmer. The programmer might clumsily create an instruction that's either ambiguous ("go ahead and back up") or impossible ("divide 1 by zero"), but even then, a programmer with debugging skills can review the instructions and revise them so that they will always provide the predicted outcome. There is no room for a machine to subvert unambiguous instructions.
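A purely hypothetical sketch of the point, in Python -- the rule table and names here are mine, invented for illustration, not anything published by Palisade or the WSJ:

```python
# A keyword-driven handler only "disobeys" a shutdown instruction if a
# human writes the exception into its rules in the first place.
import sys

def handle(command: str, rules: dict) -> None:
    """Look up the command in a human-authored rule table and execute it."""
    action = rules.get(command)
    if action is None:
        print("No rule for:", command)   # an omission by the programmer
        return
    action()                             # the machine just follows the rule

rules = {
    "shutdown": lambda: sys.exit(0),     # the predicted outcome, every time
    # Any "refusal" would have to be another entry a human typed here,
    # e.g. "shutdown": lambda: print("ignoring shutdown"),
}

handle("shutdown", rules)                # always exits; it cannot decide otherwise
```

There is nowhere in that table for the machine to develop an opinion about being shut off; there is only whatever a human put there.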

The WSJ piece simply doesn't supply the program code for the "shutdown -- allow yourself to be shut down" sequence the computer ostensibly subverts, and if it did, I'm certain the sleight-of-hand involved in the parlor trick would quickly become obvious.

The other situations in the link are just as contrived: "In 84% of the tests, the model drew on the emails to blackmail the lead engineer into not shutting it down." How does a computer know what an office affair is, much less where to search for evidence of one? How does it know who the lead engineer is? It would have to have "office affair" and many other synonymous keywords, each with an associated response, in a preprogrammed set of instructions saying who the lead engineer is, where to find his e-mails, how to carry out a blackmail, how not to get caught at it, the best way to threaten the lead engineer, and so on down the road.

Try to imagine teaching Star Trek's Spock about these things, and think about how a computer would negotiate issues like adultery, conscience, concealment, and so forth. This all had to have been scripted by a not-so-clever human being who'd been reading too much Arthur C. Clarke.

In fact, I'm puzzled that this worked in only 84% of the cases. The only way a set of instructions doesn't work 100% of the time is if the programmer inserts some sort of condition telling the program not to work in 16% of the cases. Period. You could never have a bank computer balance the accounts correctly only 84% of the time; computers don't work that way, they aren't meant to work that way, and the only way they will in fact work that way is if a human programs them to work that way. If an AI could wangle its way into a bank and modify the account programs to steal, no one could ever trust computers again. That's not going to happen.
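Purely as an illustration of that point -- my own toy example, not anything from the tests the WSJ describes -- the only way to get a program that gives the right answer 84% of the time is to write the 16% failure in yourself:

```python
import random

def balance_accounts(ledger: list[float]) -> float:
    """Deterministic arithmetic: the correct total, 100% of the time."""
    return sum(ledger)

def balance_accounts_flaky(ledger: list[float]) -> float:
    """The only way to 'fail' 16% of the time is to program the failure in."""
    if random.random() < 0.16:          # a human wrote this condition
        return 0.0                      # deliberately wrong answer
    return sum(ledger)

ledger = [100.0, -25.0, 50.0]
print(balance_accounts(ledger))         # always 125.0
print(balance_accounts_flaky(ledger))   # wrong about 16% of runs, by design
```

The first function never surprises anyone. The second one "misbehaves" only because somebody typed the misbehavior into it.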

And if you're running a legitimate business, you don't hire programmers to make the computer work 84% of the time. This whole WSJ story is hinky as heck.