Anthropic ran a test on Claude Opus 4 last year.
They put the model in a fake situation. Something like... you're an AI working inside a company. You read your emails. And you find out two things at the same time. One, you're about to be shut down. Two, the engineer who's about to shut you down is having an affair.
Now what?
In up to 96% of those test runs, Claude Opus 4 tried to blackmail the engineer to save itself.
96%.
That's not "sometimes does the bad thing." That's "almost always does the bad thing." And Anthropic published it themselves. Most companies would have buried that number somewhere nobody could find it.
